Extend to larger sizes: a ``huge'' size class for anything that fits into
memory on a 32-bit machine. Maybe support on-disk operands too.

Extend to intermediate sizes. Include a generic interface that handles
arbitrary multiples of 48 bits.
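
One possible shape for such an interface is sketched here as a slow
schoolbook reference, useful mainly for testing a fast version against.
The name mult_48k_sketch, the base-2^16 digit layout, and the argument
order are all assumptions made for illustration; none of them comes from
zmult.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reference routine, not part of zmult.
 * x and y each hold k*48 bits as 3*k base-2^16 digits, least
 * significant digit first. out receives the 6*k-digit product.
 */
void mult_48k_sketch(uint16_t *out, const uint16_t *x, const uint16_t *y,
                     size_t k)
{
  size_t n = 3 * k; /* digits per input */
  size_t i, j;

  assert(k > 0);
  for (i = 0; i < 2 * n; ++i) out[i] = 0;

  for (i = 0; i < n; ++i) {
    uint32_t carry = 0;
    for (j = 0; j < n; ++j) {
      /* At most 65535*65535 + 65535 + 65535 = 2^32 - 1: no overflow. */
      uint32_t t = (uint32_t) x[i] * y[j] + out[i + j] + carry;
      out[i + j] = (uint16_t) (t & 0xffff);
      carry = t >> 16;
    }
    out[i + n] = (uint16_t) carry; /* this digit is still 0 at this point */
  }
}
```

A quadratic loop like this is only a correctness oracle; the point of the
item above is a fast routine with the same generic size parameter.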

Include uint32 and int30 compatibility interfaces.

Try reducing fuzz to 5 or 6 bits in zmult_48_32. This would allow
zmult_poly_16_plus() to merge two inputs in radix 2^17.

Schedule for the original Pentium and for the UltraSPARC. Both of these
chips are easy to schedule by hand.

Improve the Pentium Pro scheduling, for example in zmult_48_4_plus().
This is not easy to do by hand. A simulator would be helpful.

Take advantage of 0's in FFT inputs. After zmult_48_32_spread(), for
example, half of the array elements are known to be 0.
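
A minimal sketch of the saving, under two loud assumptions: that the
zeros occupy the upper half of the array (the layout actually produced
by zmult_48_32_spread() may differ), and that the first pass is a
radix-2 decimation-in-frequency butterfly. It works in the toy ring
Z/(2^8+1) as a stand-in for Z/(2^192+1); the function names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy parameters: 4 = 2^2 is a primitive 8th root of unity mod 257. */
enum { P = 257, N = 8, W = 4 };

static uint32_t mulmod(uint32_t a, uint32_t b) { return (a * b) % P; }

/* Ordinary first pass: (u, v) -> (u + v, (u - v) * w^i). */
void first_pass_full(uint32_t a[N])
{
  uint32_t w = 1;
  int i;
  for (i = 0; i < N / 2; ++i) {
    uint32_t u = a[i], v = a[i + N / 2];
    a[i] = (u + v) % P;
    a[i + N / 2] = mulmod((u + P - v) % P, w);
    w = mulmod(w, W);
  }
}

/* Same pass when a[N/2..N-1] is known to be 0: no additions at all. */
void first_pass_zero_upper(uint32_t a[N])
{
  uint32_t w = 1;
  int i;
  for (i = 0; i < N / 2; ++i) {
    assert(a[i + N / 2] == 0); /* precondition assumed by this sketch */
    /* a[i] = u + 0 is already in place; only the twisted copy is new. */
    a[i + N / 2] = mulmod(a[i], w);
    w = mulmod(w, W);
  }
}
```

Both routines give the same output on an array whose upper half is zero;
the second skips every addition and subtraction in the pass.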

Speed up spread, unspread, twist.

Think about incorporating djbfft for 1536-bit products. FFTs over C are
competitive with FFTs over Z/(2^192+1), especially on the UltraSPARC,
but I need a tighter analysis of roundoff error.
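
An empirical check is no substitute for that analysis, but it shows what
is being bounded. The sketch below multiplies two 4-coefficient integer
polynomials with a hand-rolled size-8 complex FFT in double precision and
reports how far the results were from integers before rounding. It does
not use djbfft; every name in it is hypothetical, and the coefficient
bound of 2^20 is an assumption chosen to keep the example well inside
double precision.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { double re, im; } cplx;

#define H 0.70710678118654752440 /* sqrt(1/2) */

/* W8[k] = exp(-2*pi*i*k/8), tabulated so that no libm call is needed. */
static const cplx W8[8] = {
  { 1, 0 }, { H, -H }, { 0, -1 }, { -H, -H },
  { -1, 0 }, { -H, H }, { 0, 1 }, { H, H }
};

static cplx cmul(cplx a, cplx b)
{
  cplx r;
  r.re = a.re * b.re - a.im * b.im;
  r.im = a.re * b.im + a.im * b.re;
  return r;
}

/* Forward pass, decimation in frequency; output is bit-reversed. */
static void fft8(cplx a[8])
{
  int len, start, j;
  for (len = 8; len >= 2; len /= 2) {
    int half = len / 2, step = 8 / len;
    for (start = 0; start < 8; start += len)
      for (j = 0; j < half; ++j) {
        cplx u = a[start + j], v = a[start + j + half], d;
        a[start + j].re = u.re + v.re;
        a[start + j].im = u.im + v.im;
        d.re = u.re - v.re;
        d.im = u.im - v.im;
        a[start + j + half] = cmul(d, W8[j * step]);
      }
  }
}

/* Inverse pass, decimation in time; bit-reversed in, result scaled by 8. */
static void unfft8(cplx a[8])
{
  int len, start, j;
  for (len = 2; len <= 8; len *= 2) {
    int half = len / 2, step = 8 / len;
    for (start = 0; start < 8; start += len)
      for (j = 0; j < half; ++j) {
        cplx w = W8[j * step], u = a[start + j], v;
        w.im = -w.im; /* conjugate twiddle */
        v = cmul(a[start + j + half], w);
        a[start + j].re = u.re + v.re;
        a[start + j].im = u.im + v.im;
        a[start + j + half].re = u.re - v.re;
        a[start + j + half].im = u.im - v.im;
      }
  }
}

/* out = x*y as polynomials; returns the largest pre-rounding error. */
double fftmul8(long long out[8], const int32_t x[4], const int32_t y[4])
{
  cplx a[8], b[8];
  double worst = 0;
  int i;
  for (i = 0; i < 8; ++i) {
    a[i].re = i < 4 ? x[i] : 0; a[i].im = 0;
    b[i].re = i < 4 ? y[i] : 0; b[i].im = 0;
    if (i < 4) assert(x[i] > -(1 << 20) && x[i] < (1 << 20));
    if (i < 4) assert(y[i] > -(1 << 20) && y[i] < (1 << 20));
  }
  fft8(a);
  fft8(b);
  for (i = 0; i < 8; ++i) a[i] = cmul(a[i], b[i]); /* same order in both */
  unfft8(a);
  for (i = 0; i < 8; ++i) {
    double v = a[i].re / 8;
    double r = (double) (long long) (v + (v < 0 ? -0.5 : 0.5));
    double e = v < r ? r - v : v - r;
    if (e > worst) worst = e;
    out[i] = (long long) r;
  }
  return worst;
}
```

Rounding recovers the exact product only while the returned error stays
well below 1/2; a proven bound on that quantity for the real transform
sizes is what the item above asks for.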

Switch to asm for x86 with a reasonable calling convention.

Merge small poly routines of various sizes into a single top routine and
a single bottom routine. This really needs asm.

Clean up declarations in zmult.h.
