At Sage Days 6, for fun I wrote a more cache-friendly version of NTL's underlying FFT routine. NTL uses this routine for polynomial arithmetic (multiplication, division, gcd, etc) with polynomials of high degree with smallish coefficients.
The new code might be faster than the code shipped with NTL on some architectures. On some machines it might be slower, especially if the thresholds are set incorrectly. You have been warned.
I compared the speed of the new routine and the original routine for various transform lengths on an AMD Opteron (sage.math, thanks to William Stein for supplying this machine), a Core 2 Duo (my laptop), and a PowerPC G5 (a single-core desktop). The Opteron is a 64-bit machine with a 64KB L1 cache and a 1MB L2 cache. The Core 2 Duo is 64-bit with a 4MB L2 cache, but I only built NTL in 32-bit mode. The G5 is 64-bit with a 512KB L2 cache, but I only built NTL in 32-bit mode here too (linker troubles which I don't have time to fix...).
The graphs below show a normalised measure of the time per FFT (the time is divided by n log(n), where n is the transform length) on all three machines. On the Opteron, the new code starts winning at n = 2^17, which is right at the L2 cache boundary; similarly on the G5, where the crossover is somewhat beyond the L2 boundary. On the Core 2 Duo, the new code starts winning at n = 2^21, which is also around the L2 boundary. The improvement is much more impressive on the Opteron and the G5 than on the Intel chip; I don't know why. Unfortunately I didn't see any improvement at the L1 boundary on any of the machines, which is what I had been hoping for. Note that the "new code" is actually using the original FFT below the threshold, which is why the graphs hug each other at those points.
This code is released under the GPL V2 or higher.
To use it, do the following:
There is a flag NTL_CACHE_FRIENDLY_FFT in the new FFT.h. If you clear it, the code compiles exactly as Shoup's original code. If it is set, the new code gets substituted. The new code calls the old code for any transform of length less than 2^k, where k is NTL_CACHE_FRIENDLY_FFT_THRESHOLD (17 by default).
A test module test.cpp is included in the tarball, which tests the code (checks that it gives the same results as NTL's original code on random data), and profiles it against the original code.
There is a CacheFriendlyBitReverseCopy function that does the same thing as BitReverseCopy, but in a cache-oblivious manner.
After that, it's basically Bailey's four-step algorithm. First do a bunch of small transforms, then matrix transpose plus some twiddle factors, then another bunch of small transforms, then transpose back. The transpose operations are implemented in a cache-oblivious manner (well, except for the basecase, whose size can be adjusted via TRANSPOSE_THRESHOLD).
Note that the FFT as a whole is definitely not cache-oblivious, since it's only a two-dimensional decomposition. This is reasonable given the maximum size of transform that NTL can handle anyway.