Direct FFT for suitable modulus #2681
This looks good. Thanks for implementing it! I'll experiment a bit with it before merging. Minor question: why limit to moduli greater than 20 bits? For sufficiently small moduli the exact convolution always fits a single default FFT prime, so we don't lose much by going through the standard path. But one could imagine wanting to use FFT directly for something like a length 2^18 convolution modulo an 18-bit FFT prime, which just barely fails to fit.
It is not common because we don't take advantage of it currently. A function like
The drawback of the solution in this PR seems to be that one has to create an FFT context object, do a primality test, etc. for each multiplication. This shouldn't matter that much for big multiplications, except that computing the root of unity tables on the fly will be a bit slower than with one of the 8 primes in the cached "mpn" context. But the overhead might be significant for relatively short multiplications, say of length around 1000, where direct FFT acceleration would be really interesting for many algorithms. Note that
We should consider some combination of the following:
(None of these need to be done in this PR.)
Sounds about right (probably I was just too conservative), I should double check. Note that there's also a packed coefficient branch for (very) small primes elsewhere too. On the other hand, the combination of
{edit}: if it fits a single default FFT prime, then it should always be better to use it, because you don't pay the cost of constructing the temporary context, and the modular multiplications likely have the same cost regardless of the prime. I think the best solution here is to check exactly whether it just compares
Indeed. Yet another alternative is to have the cache inside the
It should be a bit better to use the given prime for the FFT instead of a default prime if you already have the context object, as the final output modular reduction should be a bit faster.
Yeah, this bounds check really doesn't need to be so sloppy. It should just compute the exact upper bound with a handful of limb operations and do an
flint/src/fft_small/nmod_poly_mul.c Line 392 in 2196bbe
flint/src/fft_small/nmod_poly_mul.c Line 587 in 2196bbe
The packing code for tiny moduli in
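To make the "exact upper bound" idea concrete, here is a minimal sketch, not the FLINT code. It assumes inputs reduced to [0, n-1], so each coefficient of the exact integer product is a sum of at most min(len1, len2) terms, each at most (n-1)^2. The function name `exact_product_fits` and the `limit` parameter (standing in for whatever a single FFT prime can hold) are hypothetical, and `unsigned __int128` is a GCC/Clang extension used here in place of explicit limb operations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of FLINT): returns 1 if every coefficient
   of the exact integer product of two polynomials with coefficients in
   [0, n-1] and lengths len1, len2 is strictly below `limit`, else 0. */
static int exact_product_fits(uint64_t n, uint64_t len1, uint64_t len2,
                              uint64_t limit)
{
    assert(n >= 1 && len1 >= 1 && len2 >= 1);

    if (limit == 0)
        return 0;

    /* Each output coefficient is a sum of at most min(len1, len2) products. */
    uint64_t terms = len1 < len2 ? len1 : len2;

    /* Largest single product, computed exactly in 128 bits. */
    unsigned __int128 c = (unsigned __int128)(n - 1) * (n - 1);

    /* terms * c <= limit - 1, tested by division so nothing can overflow. */
    return c <= (unsigned __int128)(limit - 1) / terms;
}
```

With a hypothetical limit of 2^50, a length 2^18 product modulo an 18-bit modulus is rejected by this test, while a length 1000 product with the same modulus is accepted.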
Having some kind of extended
Also, it's not clear whether pushing the number of FFT primes much further than 8 is worth it. Note that each CRT needs
I'm certain that it's worth it;
The `fft_small` code should work with any prime modulus `p` such that `p-1` has sufficiently high 2-valuation, but currently it is only used when it is one of the precomputed primes. See the comments in code for more detail.
For benchmarking, one way to compare the old and the new code is by adding `return 0` in the newly-added `_nmod_poly_should_directly_fft`.
This only speeds things up in the specialized case of working modulo a prime good for FFT, though. Which is not exactly common.
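As an illustration of the condition that `p-1` has sufficiently high 2-valuation, the following sketch computes that valuation and compares it against a requested power-of-two transform depth. Both function names are hypothetical and this is not what `_nmod_poly_should_directly_fft` does internally. Primality is deliberately not checked here and would need a separate test.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the largest v such that 2^v divides p - 1. */
static unsigned two_valuation_pm1(uint64_t p)
{
    assert(p >= 2);            /* ensures p - 1 is nonzero, so the loop ends */

    uint64_t m = p - 1;
    unsigned v = 0;

    while ((m & 1) == 0)
    {
        m >>= 1;
        v++;
    }
    return v;
}

/* Hypothetical helper: a 2^depth-th root of unity can exist modulo a
   prime p only if 2^depth divides p - 1. Primality is NOT checked. */
static int supports_fft_depth(uint64_t p, unsigned depth)
{
    return two_valuation_pm1(p) >= depth;
}
```

For example, 998244353 = 119 * 2^23 + 1 has valuation 23, so it passes for depth 20 but not for depth 24.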