Skip to content

Direct FFT for suitable modulus#2681

Open
user202729 wants to merge 2 commits into
flintlib:mainfrom
user202729:direct-fft
Open

Direct FFT for suitable modulus#2681
user202729 wants to merge 2 commits into
flintlib:mainfrom
user202729:direct-fft

Conversation

@user202729
Copy link
Copy Markdown
Contributor

The fft_small code should work with any prime modulus p such that p-1 has sufficiently high 2-valuation, but currently it is only used when it is one of the precomputed primes. See the comments in code for more detail.

For benchmarking, one way to compare the old and the new code is by adding return 0 in the newly-added _nmod_poly_should_directly_fft.

This only speeds things up in the specialized case of working modulo a prime good for FFT, though. Which is not exactly common.

@fredrik-johansson
Copy link
Copy Markdown
Collaborator

This looks good. Thanks for implementing it! I'll experiment a bit with it before merging.

Minor question: why limit to moduli greater than 20 bits? For sufficiently small moduli the exact convolution always fits a single default FFT prime, so we don't lose much by going through the standard path. But one could imagine wanting to use FFT directly for something like a length 2^18 convolution modulo an 18-bit FFT prime, which barely doesn't fit.

This only speeds things up in the specialized case of working modulo a prime good for FFT, though. Which is not exactly common.

It is not common because we don't take advantage of it currently. A function like fmpz_poly_gcd_modular should benefit significantly from choosing only FFT-friendly moduli.

The drawback of the solution in this PR seems to be that one has to create an FFT context object, do a primality test etc. for each multiplication. This shouldn't matter that much for big multiplications except that computing the root of unity tables on the fly will be a bit slower than with one of the 8 primes in the cached "mpn" context. But the overhead might be significant for relatively short multiplications, say of length around 1000, where direct FFT acceleration would be really interesting for many algorithms. Note that fmpz_poly_mul (and indirectly arb_poly_mul etc.) would benefit from being able to use CRT instead of KS for much larger coefficients than is possible with the current hardcoded limit of 8 CRT primes.

We should consider some combination of the following:

  • A big table of "builtin" FFT primes (say 10000 of them) for quick access by multimodular algorithms, or a cache of such primes generated at runtime

  • A cache of context objects for FFT primes (either for recently used primes or for primes in a builtin table), with quick hash table lookup inside the nmod_poly multiplication routines

  • A cache of multiple "mpn" contexts. Currently we have one such context object giving fast access to 8 FFT primes with precomputed CRT tables. We could have a cache of many such contexts with disjoint sets of 8 primes. The benefit is that we could leverage the existing mod/CRT code in the fft_small module as an efficient basecase for divide-and-conquer mod/CRT when we need bigger coefficients.

(None of these need to be done in this PR.)

@user202729
Copy link
Copy Markdown
Contributor Author

user202729 commented May 18, 2026

For sufficiently small moduli the exact convolution always fits a single default FFT prime, so we don't lose much by going through the standard path. But one could imagine wanting to use FFT directly for something like a length 2^18 convolution modulo an 18-bit FFT prime, which barely doesn't fit.

Sounds about right (probably I was just too conservative), I should double check. Note that there's also a packed coefficient branch for (very) small primes elsewhere too. On the other hand, the combination of n_trailing_zeros(mod.n - 1) < n_max(depth, SD_FFT_CTX_W2TAB_INIT) and bn < 1500 already imply NMOD_BITS(mod) >= 12 (maybe ± 1).

{edit}: if it fits a single default FFT prime, then it should be always better to use it because you don't pay the cost of constructing the temporary context, the modular multiplication likely have the same cost regardless of the prime. I think the best solution here is to check exactly whether np = 1 later, but thinking about it, the check here is a bit too conservative

    /* need prod_of_primes >= blen * 4^modbits */
    for (np = 1; np < 4; np++)
    {
        if (flint_mpn_cmp_ui_2exp(crt_data_prod_primes(R->crts + np - 1),
              R->crts[np - 1].coeff_len, bn, 2*modbits) >= 0)
        {
            break;
        }
    }

it just compares bn * 2^(2*modbits) with the product of the primes, but bn * 2^(2*modbits) may be up to 4 times larger than bn * mod^2. Of course computing bn * mod^2 exactly needs about 3 limb multiplications and a few additions, but using double arithmetic and add {small constant, to be determined}*machine epsilon should be fine, since it's at most 2^(64*3), so fits in a double.

But the overhead might be significant for relatively short multiplications, say of length around 1000.

Indeed. Yet another alternative is to have the cache inside the nmod_t object, though this is an ABI-breaking change and increases every nmod_t object by one pointer (and it feels a bit too specialized).

@fredrik-johansson
Copy link
Copy Markdown
Collaborator

For sufficiently small moduli the exact convolution always fits a single default FFT prime, so we don't lose much by going through the standard path. But one could imagine wanting to use FFT directly for something like a length 2^18 convolution modulo an 18-bit FFT prime, which barely doesn't fit.

Sounds about right (probably I was just too conservative), I should double check. Note that there's also a packed coefficient branch for (very) small primes elsewhere too. On the other hand, the combination of n_trailing_zeros(mod.n - 1) < n_max(depth, SD_FFT_CTX_W2TAB_INIT) and bn < 1500 already imply NMOD_BITS(mod) >= 12 (maybe ± 1).

{edit}: if it fits a single default FFT prime, then it should be always better to use it because you don't pay the cost of constructing the temporary context, the modular multiplication likely have the same cost regardless of the prime. I think the best solution here is to check exactly whether np = 1 later, but thinking about it, the check here is a bit too conservative

It should be a bit better to use the given prime for the FFT instead of a default prime if you already have the context object, as the final output modular reduction should be a bit faster.

    /* need prod_of_primes >= blen * 4^modbits */
    for (np = 1; np < 4; np++)
    {
        if (flint_mpn_cmp_ui_2exp(crt_data_prod_primes(R->crts + np - 1),
              R->crts[np - 1].coeff_len, bn, 2*modbits) >= 0)
        {
            break;
        }
    }

it just compares bn * 2^(2*modbits) with the product of the primes, but bn * 2^(2*modbits) may be up to 4 times larger than bn * mod^2. Of course computing bn * mod^2 exactly is not cheap, but using double arithmetic and add {small constant, to be determined}*machine epsilon should be fine, since it's at most 2^(64*3), so fits in a double.

Yeah, this bounds check really doesn't need so sloppy. It should just compute the exact upper bound with a handful of limb operations and do an mpn_cmp (the speedup using double would be minimal). This is already done in the CRT code:

/* Bound coefficients before modular reduction. */

The packing code for tiny moduli in _nmod_poly_mulmid_fft_small_repack even tries a Cauchy-Schwarz bound to gain an extra bit or so of wiggle room, but for larger primes this is probably not worth the overhead.

But the overhead might be significant for relatively short multiplications, say of length around 1000.

Indeed. Yet another alternative is to have the cache inside the nmod_t object, though this is an ABI-breaking change and increases every nmod_t object by one pointer (and it feels a bit too specialized).

Having some kind of extended nmod object that would be passed by reference has been discussed elsewhere. When/if we add something like this, an optional pointer to an FFT context inside this object would indeed be a possible solution.

@user202729
Copy link
Copy Markdown
Contributor Author

user202729 commented May 19, 2026

Also, it's not clear whether pushing the number of FFT primes much further than 8 is worth it. Note that each CRT needs k^2 limb multiplications where k is the number of primes (if you divide and conquer, it's probably k log^2 k, but that assumes nested FFT/Karatsuba subroutine, which is unlikely to be effective in practice)

@fredrik-johansson
Copy link
Copy Markdown
Collaborator

Also, it's not clear whether pushing the number of FFT primes much further than 8 is worth it.

I'm certain that it's worth it; fmpz_poly_mul currently slows down a lot when it switches from fft_small to mul_KS. Note that Kronecker substitution is basically doubling the size of the convolution we need to compute. That means using more primes is probably going to win even quite a bit into the range where CRT dominates the FFTs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants