djbfft is the fastest available code for small power-of-2 complex DFTs
on a Pentium. It's also reasonably fast on other machines.

djbfft does a recursive in-place split-radix decimation-in-frequency
FFT, with precomputed roots of unity, using my ``3 to -1'' improvement
to chop the number of root loads in half. One split-radix pass fits
nicely into the Pentium's 8 floating-point registers. For machines with
more registers it would be better to do two passes at once.
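
As a rough illustration of the structure described above, here is a
minimal sketch of a recursive in-place split-radix
decimation-in-frequency pass in plain C. It is not the djbfft code and
makes no attempt at Pentium scheduling. The names cpx, sr_fft, and
sr_order are hypothetical. It also assumes one reading of ``3 to -1'':
the third sub-transform computes outputs 4k-1 instead of 4k+3, so its
twiddle factor is the conjugate of the root already loaded for outputs
4k+1, and no second root has to be fetched.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double re, im; } cpx;   /* hypothetical complex type */

/* In-place split-radix DIF transform of n points, n a power of 2.
   roots[m * rstride] must hold exp(-2*pi*i*m/n) for 0 <= m < n/4.
   Results are left in split-radix order; see sr_order below. */
static void sr_fft(cpx *a, size_t n, const cpx *roots, size_t rstride)
{
    size_t q, m;
    assert(n != 0 && (n & (n - 1)) == 0);
    if (n == 1) return;
    if (n == 2) {
        cpx t = a[0];
        a[0].re = t.re + a[1].re;  a[0].im = t.im + a[1].im;
        a[1].re = t.re - a[1].re;  a[1].im = t.im - a[1].im;
        return;
    }
    q = n / 4;
    for (m = 0; m < q; m++) {
        cpx w = roots[m * rstride];          /* the only root load */
        cpx x0 = a[m], x1 = a[m + q], x2 = a[m + 2 * q], x3 = a[m + 3 * q];
        double dr = x0.re - x2.re, di = x0.im - x2.im;   /* d = x0 - x2 */
        double er = x1.re - x3.re, ei = x1.im - x3.im;   /* e = x1 - x3 */
        double ur = dr + ei, ui = di - er;               /* u = d - i*e */
        double vr = dr - ei, vi = di + er;               /* v = d + i*e */
        /* inputs for the even outputs 2k */
        a[m].re = x0.re + x2.re;      a[m].im = x0.im + x2.im;
        a[m + q].re = x1.re + x3.re;  a[m + q].im = x1.im + x3.im;
        /* inputs for outputs 4k+1: u * w */
        a[m + 2 * q].re = ur * w.re - ui * w.im;
        a[m + 2 * q].im = ur * w.im + ui * w.re;
        /* inputs for outputs 4k-1: v * conj(w), reusing the same root */
        a[m + 3 * q].re = vr * w.re + vi * w.im;
        a[m + 3 * q].im = vi * w.re - vr * w.im;
    }
    sr_fft(a, n / 2, roots, rstride * 2);
    sr_fft(a + 2 * q, q, roots, rstride * 4);
    sr_fft(a + 3 * q, q, roots, rstride * 4);
}

/* Record the output order: after sr_fft on total points, position p
   holds frequency idx[p]. Call as sr_order(idx, total, 0, 1, total). */
static void sr_order(size_t *idx, size_t n, size_t base, size_t step,
                     size_t total)
{
    if (n == 1) { idx[0] = base % total; return; }
    if (n == 2) {
        idx[0] = base % total;
        idx[1] = (base + step) % total;
        return;
    }
    sr_order(idx, n / 2, base, 2 * step, total);
    sr_order(idx + n / 2, n / 4, (base + step) % total, 4 * step, total);
    sr_order(idx + 3 * (n / 4), n / 4, (base + total - step) % total,
             4 * step, total);
}
```

In this sketch the caller fills the table once, with roots[m] =
(cos(2*pi*m/n), -sin(2*pi*m/n)) for 0 <= m < n/4, and passes rstride =
1 at the top level. The smaller sub-transforms index the same table
with a larger stride, so no roots are recomputed during the transform.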

djbfft does not yet attempt to limit cache misses. For large transforms
the number of simultaneous passes should be matched to the details of
the memory hierarchy, as per Gentleman-Sande.

djbfft will work on any UNIX system. Currently all the code is written
in C, tuned to produce reasonable results under gcc on a Pentium.

This version of djbfft is simply a proof-of-concept implementation. It
is restricted to double precision, has no special support for real data,
and doesn't even include an inverse DFT.
