I'm trying to optimize a program I've been working on for a while. Among other things, it performs about 2400 real-to-complex or complex-to-real FFTs of size 4096 (order 12) per second. So tonight I downloaded IPP to see how much the speed would increase if I replaced my own (SSE2 optimized) FFT code with IPP's FFT.
I found some benchmarks online that stated that IPP's FFT would run in about 20 µs on a CPU that's close in speed (at least per core) to mine (the test was run on a Xeon 5100 CoreDuo @ 3.00 GHz, 32 bit code). I'm working on a Q9450 quad core @ 2.67 GHz, which has - per core - roughly the same speed (of course memory speed could be different, it wasn't reported in the benchmark).
So I just wrote a small program that calls IPP's FFT with bogus data (only 0's). Per call I'm measuring a time of about 60 µs (note: 32 bit code) - three times as high as what I had expected. What's more, my own FFT implementation uses 70 µs - and there are still some opportunities for optimizations. So I'm probably doing something wrong.
Here's the code that I'm using. Am I doing something wrong? (I'm not interested in correct code, only in performance issues):
[cpp]
IppsFFTSpec_R_32f* pFFTSpec = NULL; // FFT specification, allocated by IPP below
Ipp32f* src = ippsMalloc_32f(8192);
Ipp8u* buf = NULL;
int i;

// Fill the input with zeros
for (i = 0; i < 8192; i++)
    src[i] = 0.0f;

// Order 12 = 4096-point real FFT
IppStatus s = ippsFFTInitAlloc_R_32f(&pFFTSpec, 12, IPP_FFT_DIV_INV_BY_N, ippAlgHintFast);
Check(s != ippStsNoErr); // Ok

int buf_len = 0;
s = ippsFFTGetBufSize_R_32f(pFFTSpec, &buf_len);
Check(s != ippStsNoErr); // Ok
// buf_len returns 0 (is that correct for floats real-to-complex 4096?)
buf = ippsMalloc_8u(buf_len); // so 0 bytes

// START TIME MEASUREMENT
for (i = 0; i < 50000; i++)
    ippsFFTFwd_RToCCS_32f_I(src, pFFTSpec, buf);
// END TIME MEASUREMENT - takes a little over 3 seconds
[/cpp]
If I call my own (also real-to-complex, float) code instead, it takes 7 seconds - but I'm cheating by doing 2 FFTs simultaneously (each SSE2 register contains the real and imaginary values of 2 separate blocks of data - I'm doing audio processing and process the left and right channel simultaneously). So that works out to about 70 µs per FFT.
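The per-call figures quoted in this post come from dividing the total loop time by the number of calls. Here is a minimal sketch of such a measurement using `std::chrono`. The helper names `microseconds_per_call` and `per_call_us` are made up for illustration and are not part of IPP or of the original test program.

```cpp
#include <chrono>
#include <cstddef>

// Runs `work` `iterations` times and returns the average wall-clock
// time per call in microseconds. (Hypothetical helper, not an IPP API.)
template <typename Work>
double microseconds_per_call(Work work, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        work();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> total = stop - start;
    return total.count() / static_cast<double>(iterations);
}

// Converts a total loop time in seconds into microseconds per call.
double per_call_us(double total_seconds, std::size_t calls) {
    return total_seconds * 1e6 / static_cast<double>(calls);
}
```

With the numbers from the post, `per_call_us(3.0, 50000)` gives 60 µs for the IPP loop. `per_call_us(7.0, 50000)` gives 140 µs per call for the hand-written code, which is about 70 µs per FFT since each call transforms two channels. In a real test the lambda passed to `microseconds_per_call` would wrap the FFT call.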
Note: Since the input is all 0's, the data should remain 0 at each step, so there should be no denormals or other problems that affect the speed.
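The assumption that no denormals show up can be checked rather than taken on faith. Below is a small sketch, with a hypothetical helper name, that counts subnormal values in a buffer using `std::fpclassify`. It could be run on the data before and after the transform.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Counts the subnormal (denormal) values in a buffer.
// (Hypothetical helper for illustration only.)
std::size_t count_subnormals(const std::vector<float>& data) {
    std::size_t count = 0;
    for (float x : data) {
        if (std::fpclassify(x) == FP_SUBNORMAL) {
            ++count;
        }
    }
    return count;
}
```

An all-zero buffer yields a count of 0, because zero is classified as `FP_ZERO`, not as subnormal.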
If you link your test program with the static IPP libraries, please make sure you call ippStaticInit at the beginning of your application. This call initializes the IPP dispatcher to use the appropriate processor-specific code. If you don't do that, IPP will choose the generic C code by default.