At Sage Days 6, for fun I wrote a more cache-friendly version of NTL's underlying FFT routine. NTL uses this routine for polynomial arithmetic (multiplication, division, GCD, etc.) on polynomials of high degree with smallish coefficients.

The new code *might* be faster than the code shipped with NTL
on *some architectures*. On some machines it might be
*slower*, especially if the thresholds are set incorrectly.
You have been warned.

I compared the speed of the new routine and the original routine for various transform lengths on an AMD Opteron (sage.math, thanks to William Stein for supplying this machine), a Core 2 Duo (my laptop), and a PowerPC G5 (a single-core desktop). The Opteron is a 64-bit machine with a 64KB L1 cache and a 1MB L2 cache. The Core 2 Duo is 64-bit with a 4MB L2 cache, but I only built NTL in 32-bit mode. The G5 is 64-bit with a 512KB L2 cache, but I only built NTL in 32-bit mode here too (linker troubles which I don't have time to fix...).

The graphs below show a normalised measure of the time per FFT (the time is
divided by n log(n), where n is the transform length) on all three machines.
On the Opteron, the new code starts winning at n = 2^{17}, which is
right at the L2 cache boundary. The crossover on the G5 is similar, though somewhat
beyond its L2 boundary. On the Core 2 Duo, the new code starts winning at
n = 2^{21}, which is also around the L2 boundary. The improvement
is much more impressive on the Opteron and the G5 than on the Intel chip;
I don't know why. Unfortunately I didn't see any
improvement at the L1 boundary on any of the machines, which is what I had been hoping for.
Note that the "new code" is actually using the original FFT below the threshold,
which is why the graphs hug each other at those points.

This code is released under the GPL V2 or higher.

To use it, do the following:

- Download and unpack the NTL 5.4.1 package from the NTL website.
- Replace the files `src/FFT.c` and `include/NTL/FFT.h` with the corresponding files in this tarball.
- Build and use NTL as usual (see instructions on Shoup's site).

There is a flag `NTL_CACHE_FRIENDLY_FFT` in the new FFT.h. If you
clear it, the code compiles to exactly Shoup's original code.
If it is set, the new code is substituted. The new code calls the old code
for any transform of length less than 2^{k}, where k is
`NTL_CACHE_FRIENDLY_FFT_THRESHOLD` (17 by default).

A test module `test.cpp` is included in the tarball, which tests the
code (checks that it gives the same results as NTL's original code on random
data), and profiles it against the original code.

There is a `CacheFriendlyBitReverseCopy` function that does the same
thing as `BitReverseCopy`, but in a cache-oblivious manner.

After that, it's basically Bailey's four-step algorithm. First do a bunch
of small transforms, then matrix transpose plus some twiddle factors, then
another bunch of small transforms, then transpose back. The transpose
operations are implemented in a cache-oblivious manner (well, except for the
basecase, whose size can be adjusted via `TRANSPOSE_THRESHOLD`).

Note that the FFT as a whole is definitely not cache-oblivious, since it's only a two-dimensional decomposition. This is reasonable given the maximum size of transform that NTL can handle anyway.

Some things that could still be done:

- Make it even faster! This was hacked together pretty quickly, so it's not all that super-optimised.
- The `CacheFriendlyBitReverseCopy` routine uses way too much memory. It should be possible to do this with one small lookup table and some bit-fiddling.
- Write some automatic tuning code which tries various transform sizes, figures out where (or if!) the cache-friendly version starts winning over the existing code, and sets `NTL_CACHE_FRIENDLY_FFT_THRESHOLD` accordingly. This tuning code should go into NTL's build scripts.
- Throw the code into Sage.

Enjoy!
