djbfft is the fastest available code for power-of-2 complex DFTs on a
Pentium or Pentium MMX. It's also reasonably fast on other machines.

djbfft does a recursive in-place split-radix-2/4 decimation-in-frequency
FFT, with precomputed roots of 1, using my ``3 to -1'' improvement to
chop the number of root loads in half. One split-radix pass fits nicely
into the Pentium's 8 floating-point registers.
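
As a rough illustration of one such pass, here is a minimal sketch in
plain C. It is not djbfft's code or interface: the names cplx and
sr_pass, the freq[] bookkeeping array, and the layout of the roots
table are all hypothetical. It also assumes that ``3 to -1'' means
computing outputs 4m-1 in place of 4m+3, so that the second twiddle
factor is the complex conjugate of the first and one table load serves
both. The caller supplies roots[j] = exp(-2 pi i j/N) for 0 <= j < N/4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a double-precision complex type. */
typedef struct { double re, im; } cplx;

/* Sketch only, not djbfft's interface.
 * Transforms x[0..n-1] in place as one piece of a size-N transform.
 * roots[j] must hold exp(-2*pi*i*j/N) for 0 <= j < N/4.
 * Local output m of this piece is global output f0 + (N/n)*m mod N.
 * On return, x[p] holds X[freq[p]], where
 * X[j] = sum over k of x[k] * exp(-2*pi*i*j*k/N). */
static void sr_pass(cplx *x, size_t *freq, size_t n, size_t N,
                    size_t f0, const cplx *roots)
{
    size_t step, q, k;

    assert(n >= 1 && N >= n && N % n == 0);
    step = N / n;
    q = n / 4;

    if (n == 1) { freq[0] = f0 % N; return; }
    if (n == 2) {
        cplx a = x[0], b = x[1];
        x[0].re = a.re + b.re; x[0].im = a.im + b.im;
        x[1].re = a.re - b.re; x[1].im = a.im - b.im;
        freq[0] = f0 % N; freq[1] = (f0 + step) % N;
        return;
    }

    for (k = 0; k < q; k++) {
        cplx w = roots[k * step];  /* one load gives both w and conj(w) */
        cplx a0 = x[k], a1 = x[k + q], a2 = x[k + 2 * q], a3 = x[k + 3 * q];
        double br = a0.re - a2.re, bi = a0.im - a2.im;  /* b0 = a0 - a2 */
        double cr = a1.re - a3.re, ci = a1.im - a3.im;  /* b1 = a1 - a3 */
        double ur = br + ci, ui = bi - cr;              /* u = b0 - i*b1 */
        double vr = br - ci, vi = bi + cr;              /* v = b0 + i*b1 */

        /* Sums feed the half-size transform (even outputs). */
        x[k].re = a0.re + a2.re;     x[k].im = a0.im + a2.im;
        x[k + q].re = a1.re + a3.re; x[k + q].im = a1.im + a3.im;
        /* u*w feeds the quarter-size transform for outputs 4m+1. */
        x[k + 2 * q].re = ur * w.re - ui * w.im;
        x[k + 2 * q].im = ur * w.im + ui * w.re;
        /* v*conj(w) feeds the quarter-size transform for outputs 4m-1. */
        x[k + 3 * q].re = vr * w.re + vi * w.im;
        x[k + 3 * q].im = vi * w.re - vr * w.im;
    }

    sr_pass(x, freq, n / 2, N, f0, roots);
    sr_pass(x + 2 * q, freq + 2 * q, q, N, (f0 + step) % N, roots);
    sr_pass(x + 3 * q, freq + 3 * q, q, N, (f0 + N - step) % N, roots);
}
```

A full transform of size N is sr_pass(x, freq, N, N, 0, roots). The
sketch leaves outputs in the permuted order that an in-place
decimation-in-frequency transform naturally produces, and records that
order in freq[] only so that the result can be checked against a direct
DFT. Tuned code would drop the bookkeeping and unroll the small cases.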

The djbfft code is structured to support future optimizations. The
internal fft_twopass() interface is designed to allow two simultaneous
split-radix passes on machines with enough floating-point registers.
This will also improve cache behavior for large transforms.

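djbfft includes real4, real8, complex4, and complex8 convolution code;
the 4 and 8 are the size in bytes of each floating-point number, i.e.,
single and double precision. Real convolution is nearly twice as fast
as complex convolution.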
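
For reference, the following sketch spells out what an FFT-based
convolution routine is expected to compute, assuming that
``convolution'' here means cyclic convolution of two length-n
sequences. The function name cyclic_conv_ref is hypothetical and is not
part of djbfft. The loop takes n^2 multiplications, so it is useful
only as a correctness check against the fast code.

```c
#include <assert.h>
#include <stddef.h>

/* Reference cyclic convolution of two real sequences of length n:
 * out[k] = sum over j of a[j] * b[(k - j) mod n].
 * out must not overlap a or b. */
static void cyclic_conv_ref(const double *a, const double *b,
                            double *out, size_t n)
{
    size_t j, k;

    assert(n > 0);
    for (k = 0; k < n; k++) {
        double sum = 0.0;
        for (j = 0; j < n; j++)
            sum += a[j] * b[(k + n - j) % n];  /* index k - j, wrapped */
        out[k] = sum;
    }
}
```

A fast routine computes the same result by transforming both inputs,
multiplying pointwise, and transforming back, with whatever scaling
convention the library documents.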

djbfft includes four test tools: accuracy, accuconv, speed, and
speedconv. On a Pentium the speed and speedconv programs automatically
use RDTSC to report exact cycle counts.
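
The following sketch shows one way to take cycle counts of this kind.
It is an assumption about the general technique, not a copy of the
speed tools: the names ticks, min_ticks, and sample_work are
hypothetical, the __rdtsc() intrinsic from <x86intrin.h> is assumed to
be available (gcc or clang on x86), and other machines fall back to
clock(), which counts in different units.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <x86intrin.h>
#define HAVE_RDTSC 1
#endif

/* Current tick count: the time-stamp counter where available,
 * otherwise processor time from clock(). */
static uint64_t ticks(void)
{
#ifdef HAVE_RDTSC
    return (uint64_t)__rdtsc();
#else
    return (uint64_t)clock();
#endif
}

/* Smallest tick count seen over `trials` runs of fn(arg). Taking the
 * minimum discards runs disturbed by interrupts or cold caches. */
static uint64_t min_ticks(void (*fn)(void *), void *arg, int trials)
{
    uint64_t best = UINT64_MAX, t0, t1;
    int i;

    assert(trials > 0);
    for (i = 0; i < trials; i++) {
        t0 = ticks();
        fn(arg);
        t1 = ticks();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}

/* Hypothetical sample workload: sums 1..n and stores the result in a
 * volatile so the compiler cannot remove the loop. */
static volatile long sink;
static void sample_work(void *arg)
{
    long i, n = *(long *)arg, s = 0;
    for (i = 1; i <= n; i++) s += i;
    sink = s;
}
```

For example, min_ticks(sample_work, &n, 10) with n set to 1000 times
ten runs of the sample loop and returns the fastest. On newer
processors the time-stamp counter may run at a fixed rate rather than
the core clock, so the counts are not necessarily exact core cycles
there.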

djbfft will work on any UNIX system. Currently all the code is written
in C, tuned to produce reasonable results under gcc on a Pentium.
