Direct FFT for suitable modulus by user202729 · Pull Request #2681 · flintlib/flint

user202729 · 2026-05-16T06:22:03Z

The fft_small code should work with any prime modulus p such that p-1 has sufficiently high 2-valuation, but currently it is only used when it is one of the precomputed primes. See the comments in code for more detail.

For benchmarking, one way to compare the old and the new code is by adding return 0 in the newly-added _nmod_poly_should_directly_fft.

This only speeds things up in the specialized case of working modulo a prime good for FFT, though. Which is not exactly common.

fredrik-johansson · 2026-05-18T08:10:09Z

This looks good. Thanks for implementing it! I'll experiment a bit with it before merging.

Minor question: why limit to moduli greater than 20 bits? For sufficiently small moduli the exact convolution always fits a single default FFT prime, so we don't lose much by going through the standard path. But one could imagine wanting to use FFT directly for something like a length 2^18 convolution modulo an 18-bit FFT prime, which barely doesn't fit.

This only speeds things up in the specialized case of working modulo a prime good for FFT, though. Which is not exactly common.

It is not common because we don't take advantage of it currently. A function like fmpz_poly_gcd_modular should benefit significantly from choosing only FFT-friendly moduli.

The drawback of the solution in this PR seems to be that one has to create an FFT context object, do a primality test etc. for each multiplication. This shouldn't matter that much for big multiplications except that computing the root of unity tables on the fly will be a bit slower than with one of the 8 primes in the cached "mpn" context. But the overhead might be significant for relatively short multiplications, say of length around 1000, where direct FFT acceleration would be really interesting for many algorithms. Note that fmpz_poly_mul (and indirectly arb_poly_mul etc.) would benefit from being able to use CRT instead of KS for much larger coefficients than is possible with the current hardcoded limit of 8 CRT primes.

We should consider some combination of the following:

A big table of "builtin" FFT primes (say 10000 of them) for quick access by multimodular algorithms, or a cache of such primes generated at runtime
A cache of context objects for FFT primes (either for recently used primes or for primes in a builtin table), with quick hash table lookup inside the nmod_poly multiplication routines
A cache of multiple "mpn" contexts. Currently we have one such context object giving fast access to 8 FFT primes with precomputed CRT tables. We could have a cache of many such contexts with disjoint sets of 8 primes. The benefit is that we could leverage the existing mod/CRT code in the fft_small module as an efficient basecase for divide-and-conquer mod/CRT when we need bigger coefficients.

(None of these need to be done in this PR.)

user202729 · 2026-05-18T11:27:37Z

For sufficiently small moduli the exact convolution always fits a single default FFT prime, so we don't lose much by going through the standard path. But one could imagine wanting to use FFT directly for something like a length 2^18 convolution modulo an 18-bit FFT prime, which barely doesn't fit.

Sounds about right (probably I was just too conservative), I should double check. Note that there's also a packed coefficient branch for (very) small primes elsewhere too. On the other hand, the combination of n_trailing_zeros(mod.n - 1) < n_max(depth, SD_FFT_CTX_W2TAB_INIT) and bn < 1500 already imply NMOD_BITS(mod) >= 12 (maybe ± 1).

{edit}: if it fits a single default FFT prime, then it should be always better to use it because you don't pay the cost of constructing the temporary context, the modular multiplication likely have the same cost regardless of the prime. I think the best solution here is to check exactly whether np = 1 later, but thinking about it, the check here is a bit too conservative

    /* need prod_of_primes >= blen * 4^modbits */
    for (np = 1; np < 4; np++)
    {
        if (flint_mpn_cmp_ui_2exp(crt_data_prod_primes(R->crts + np - 1),
              R->crts[np - 1].coeff_len, bn, 2*modbits) >= 0)
        {
            break;
        }
    }

it just compares bn * 2^(2*modbits) with the product of the primes, but bn * 2^(2*modbits) may be up to 4 times larger than bn * mod^2. Of course computing bn * mod^2 exactly needs about 3 limb multiplications and a few additions, but using double arithmetic and add {small constant, to be determined}*machine epsilon should be fine, since it's at most 2^(64*3), so fits in a double.

But the overhead might be significant for relatively short multiplications, say of length around 1000.

Indeed. Yet another alternative is to have the cache inside the nmod_t object, though this is an ABI-breaking change and increases every nmod_t object by one pointer (and it feels a bit too specialized).

fredrik-johansson · 2026-05-18T12:23:07Z

For sufficiently small moduli the exact convolution always fits a single default FFT prime, so we don't lose much by going through the standard path. But one could imagine wanting to use FFT directly for something like a length 2^18 convolution modulo an 18-bit FFT prime, which barely doesn't fit.

Sounds about right (probably I was just too conservative), I should double check. Note that there's also a packed coefficient branch for (very) small primes elsewhere too. On the other hand, the combination of n_trailing_zeros(mod.n - 1) < n_max(depth, SD_FFT_CTX_W2TAB_INIT) and bn < 1500 already imply NMOD_BITS(mod) >= 12 (maybe ± 1).

{edit}: if it fits a single default FFT prime, then it should be always better to use it because you don't pay the cost of constructing the temporary context, the modular multiplication likely have the same cost regardless of the prime. I think the best solution here is to check exactly whether np = 1 later, but thinking about it, the check here is a bit too conservative

It should be a bit better to use the given prime for the FFT instead of a default prime if you already have the context object, as the final output modular reduction should be a bit faster.

    /* need prod_of_primes >= blen * 4^modbits */
    for (np = 1; np < 4; np++)
    {
        if (flint_mpn_cmp_ui_2exp(crt_data_prod_primes(R->crts + np - 1),
              R->crts[np - 1].coeff_len, bn, 2*modbits) >= 0)
        {
            break;
        }
    }
it just compares bn * 2^(2*modbits) with the product of the primes, but bn * 2^(2*modbits) may be up to 4 times larger than bn * mod^2. Of course computing bn * mod^2 exactly is not cheap, but using double arithmetic and add {small constant, to be determined}*machine epsilon should be fine, since it's at most 2^(64*3), so fits in a double.

Yeah, this bounds check really doesn't need so sloppy. It should just compute the exact upper bound with a handful of limb operations and do an mpn_cmp (the speedup using double would be minimal). This is already done in the CRT code:

flint/src/fft_small/nmod_poly_mul.c

Line 392 in 2196bbe

{

flint/src/fft_small/nmod_poly_mul.c

Line 587 in 2196bbe

/* Bound coefficients before modular reduction. */

The packing code for tiny moduli in _nmod_poly_mulmid_fft_small_repack even tries a Cauchy-Schwarz bound to gain an extra bit or so of wiggle room, but for larger primes this is probably not worth the overhead.

But the overhead might be significant for relatively short multiplications, say of length around 1000.

Indeed. Yet another alternative is to have the cache inside the nmod_t object, though this is an ABI-breaking change and increases every nmod_t object by one pointer (and it feels a bit too specialized).

Having some kind of extended nmod object that would be passed by reference has been discussed elsewhere. When/if we add something like this, an optional pointer to an FFT context inside this object would indeed be a possible solution.

user202729 · 2026-05-19T00:43:57Z

Also, it's not clear whether pushing the number of FFT primes much further than 8 is worth it. Note that each CRT needs k^2 limb multiplications where k is the number of primes (if you divide and conquer, it's probably k log^2 k, but that assumes nested FFT/Karatsuba subroutine, which is unlikely to be effective in practice)

fredrik-johansson · 2026-05-19T13:05:15Z

Also, it's not clear whether pushing the number of FFT primes much further than 8 is worth it.

I'm certain that it's worth it; fmpz_poly_mul currently slows down a lot when it switches from fft_small to mul_KS. Note that Kronecker substitution is basically doubling the size of the convolution we need to compute. That means using more primes is probably going to win even quite a bit into the range where CRT dominates the FFTs.

user202729 added 2 commits May 16, 2026 14:54

Add some documentations and simplify s1worker_func slightly

3b91c60

Direct FFT for suitable modulus

a2355b6

user202729 force-pushed the direct-fft branch from 5b9d7bb to a2355b6 Compare May 16, 2026 07:54

user202729 mentioned this pull request May 19, 2026

FFT context initialization speedup #2690

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Direct FFT for suitable modulus#2681

Direct FFT for suitable modulus#2681
user202729 wants to merge 2 commits into
flintlib:mainfrom
user202729:direct-fft

user202729 commented May 16, 2026

Uh oh!

fredrik-johansson commented May 18, 2026

Uh oh!

user202729 commented May 18, 2026 •

edited

Loading

Uh oh!

fredrik-johansson commented May 18, 2026

Uh oh!

user202729 commented May 19, 2026 •

edited

Loading

Uh oh!

fredrik-johansson commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

user202729 commented May 16, 2026

Uh oh!

fredrik-johansson commented May 18, 2026

Uh oh!

user202729 commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fredrik-johansson commented May 18, 2026

Uh oh!

user202729 commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fredrik-johansson commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

user202729 commented May 18, 2026 •

edited

Loading

user202729 commented May 19, 2026 •

edited

Loading