djbfft is the fastest available code for small power-of-2 complex DFTs
on a Pentium. It's also reasonably fast on other machines.

djbfft does a recursive in-place split-radix decimation-in-frequency
FFT, with precomputed roots of unity, using my ``3 to -1'' improvement
to chop the number of root loads in half. One split-radix pass fits
nicely into the Pentium's 8 floating-point registers. For machines with
more registers it would be better to do two passes at once.

djbfft does not yet attempt to limit cache misses. For large transforms
the number of simultaneous passes should be matched to the details of
the memory hierarchy, as Gentleman and Sande suggested.

djbfft will work on any UNIX system. Currently all the code is written
in C, tuned to produce reasonable results under gcc on a Pentium.

This version of djbfft supports both single and double precision. It has
no special support for real data.
