I'm trying to speed up FFT processing on a dual 2 GHz Pentium 4 Xeon. When I use FFTW, which is fast C code available on the web, it takes ~1 second to take 20000 1k FFTs. It takes ~0.75 seconds using MKL 5.2 or 6.0, and ~0.5 seconds using IPP. Why isn't the Intel microcode faster? Is it simply that their code has not been optimized for Xeon processors yet? It is disappointing to get at best a factor of 2 using parallelized microcode. Also, IPP doesn't seem to support either Fortran calls or parallelization, both of which I need. I asked Intel's Premier support about this a couple of weeks ago but haven't received an explanation. In general, how many FLOPS should you get per clock cycle on a Xeon? Isn't it 4, or 8 GFLOPS per processor? I'm certainly not seeing this using their microcode.
Even if you didn't allow for memory access, the fastest possible issue rate for SSE or SSE2 parallel multiplication instructions would be 1 per 2 clocks. Assuming that all the additions, memory accesses, and other operations could be fit in without slowing this down, you would still be taking at least 6 clocks for 10 flops, which looks to me like 3.3 GFLOPS peak speed at 2 GHz. That has to be extremely optimistic. Even with algorithms which would allow for approaching 2 flops per clock, I haven't seen more than 60% of that in sustained performance.
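To spell out that arithmetic, here is a small sketch in Python. It is only a calculator for the figures quoted above; the function name `peak_gflops` is made up for illustration.

```python
def peak_gflops(clock_ghz, flops, clocks):
    """Upper bound in GFLOPS when `flops` operations need at least `clocks` cycles."""
    return clock_ghz * flops / clocks

# 10 flops in no fewer than 6 clocks on a 2 GHz processor
optimistic_peak = peak_gflops(2.0, 10, 6)

# Algorithms approaching 2 flops per clock, with about 60% of that sustained
sustained = 0.60 * peak_gflops(2.0, 2, 1)

print(f"optimistic peak: {optimistic_peak:.2f} GFLOPS")  # 3.33
print(f"sustained:       {sustained:.2f} GFLOPS")        # 2.40
```

So the optimistic bound is about 3.3 GFLOPS per processor, and a more realistic sustained figure is nearer 2.4 GFLOPS.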
The main point of MKL and IPP over compiler-generated code is to organize more effective use of cache. Evidently, memory operations, which you and I have tried to ignore, are of equal importance.
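For comparison, the quoted timings can be turned into rough flop rates. This sketch makes two assumptions that are not stated in the question: that "1k" means 1024-point complex transforms, and that the usual 5 N log2(N) operation count for a complex FFT is a fair estimate. The function name `fft_gflops` is hypothetical.

```python
import math

def fft_gflops(n, count, seconds):
    """Estimated GFLOPS for `count` complex FFTs of length `n` (assumes 5*n*log2(n) flops each)."""
    flops = 5 * n * math.log2(n) * count
    return flops / seconds / 1e9

# Timings as quoted in the question, for 20000 transforms
timings = [("FFTW", 1.0), ("MKL", 0.75), ("IPP", 0.5)]
rates = {name: fft_gflops(1024, 20000, secs) for name, secs in timings}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} GFLOPS")
# FFTW: 1.02 GFLOPS
# MKL: 1.37 GFLOPS
# IPP: 2.05 GFLOPS
```

If those assumptions hold, the measured rates fall between roughly 1 and 2 GFLOPS, which is not far below the 3.3 GFLOPS figure that was already called extremely optimistic.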