include scaling and componentwise product routines for convolution
include end-to-end convolution tests
rewrite code in asm to schedule it properly
take advantage of the twiddle factor sqrt(1/2) (1+i) halfway through pass loop (equal real and imaginary parts, so 2 real multiplications instead of 4)
speed up fftc8_512, fftc8_1024
include larger transforms?
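
A rough sketch of the sqrt(1/2) (1+i) item above, not the library's actual code. The type name cplx and both function names are hypothetical, chosen only for illustration. Since (x + iy) * sqrt(1/2) (1+i) = sqrt(1/2) ((x - y) + i (x + y)), the special case needs 2 real multiplications and 2 additions, where a general complex multiplication needs 4 multiplications and 2 additions.

```c
/* Hypothetical complex type for illustration; not the library's own. */
typedef struct { double re; double im; } cplx;

/* General complex multiply a*w: 4 real multiplications, 2 additions. */
static cplx mul_general(cplx a, cplx w)
{
  cplx r;
  r.re = a.re * w.re - a.im * w.im;
  r.im = a.re * w.im + a.im * w.re;
  return r;
}

/* Multiply a by sqrt(1/2) (1+i): 2 real multiplications, 2 additions.
   Uses (x + iy)(1 + i) = (x - y) + i(x + y). */
static cplx mul_sqrthalf_1plusi(cplx a)
{
  const double s = 0.70710678118654752440; /* sqrt(1/2) */
  cplx r;
  r.re = (a.re - a.im) * s;
  r.im = (a.re + a.im) * s;
  return r;
}
```

Whether this saving is worth a separate code path at the midpoint of the pass loop would depend on how the loop is unrolled and scheduled; that is left open here.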
