gauss2561
Beginner
50 Views

Not seeing linear speedup

We have an application that uses the IPP FFT functions. There are several calls to a function using the FFT that we want to thread. We use OpenMP at the application level and link to the single-threaded version of the IPP libraries. We are using Visual Studio C++ 2008. A skeleton of the code looks like this:

#pragma omp parallel for
for (int i = 0; i < numLoops; i++) {
    doSomething(); // this function calls FFT
}

On a four-core i7 processor with Hyper-Threading, we see a speedup of about 2.5 over the non-threaded version. When we run the same test using FFTW instead of IPP, we see a speedup of 4, i.e., a linear speedup proportional to the number of real (non-hyperthreaded) cores.
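As a rough reading of these numbers, and assuming Amdahl's law is a fair model here (an assumption, not something measured in this thread), a speedup of 2.5 on 4 cores would correspond to about 80% of the run time being parallel. The helper names below are hypothetical and only illustrate the arithmetic.

```cpp
// Amdahl's law: speedup on n cores when a fraction p of the
// single-threaded run time can be executed in parallel.
double amdahlSpeedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

// Inverse of the formula above: the parallel fraction p implied
// by an observed speedup s on n cores (n must be greater than 1).
double impliedParallelFraction(double s, int n) {
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n);
}
```

With these, impliedParallelFraction(2.5, 4) comes out at roughly 0.8 and impliedParallelFraction(4.0, 4) at 1.0, so under this model the IPP run would be spending about a fifth of its single-threaded time in code that does not scale.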

My question is, if we see a linear speedup with FFTW, wouldn't we expect to see the same with IPP?

Bruce

4 Replies
Naveen_G_Intel
Employee

Hi Bruce,

What is the value of OMP_NUM_THREADS when you run the IPP FFT? What is the size of the FFT?

For additional information on improving IPP FFT performance, refer to another reply on the IPP forum here.

Thanks,

Naveen Gv

gauss2561
Beginner

Hi Naveen,

We don't set the environment variable OMP_NUM_THREADS. My understanding is that the number of threads will thus be automatically set to the number of available cores seen by the operating system, which would be 8 in our case.
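One way to check this rather than rely on the default is to ask the OpenMP runtime directly. The sketch below uses the standard omp_get_max_threads() and omp_get_num_procs() calls; the wrapper function names are hypothetical, and the fallback to 1 simply lets the code build when OpenMP is not enabled.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of threads a parallel region would use by default
// (affected by OMP_NUM_THREADS if it is set).
int defaultThreadCount() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1; // OpenMP not enabled at compile time
#endif
}

// Number of logical processors the OpenMP runtime sees,
// which includes hyperthreads.
int logicalProcessorCount() {
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1; // OpenMP not enabled at compile time
#endif
}
```

Printing both values once at startup would confirm whether the loop really runs with 8 threads on this machine.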

The size of the FFTs is 128K.
Thanks for the link about improving IPP FFT performance. We are following all the recommendations there.
Bruce
Chao_Y_Intel
Employee


Bruce,

I recall a similar performance issue. Before the FFT computation, the code calls a memory allocation function (ippsFFTInitAlloc or similar) to allocate memory for the FFT. Memory allocation is serial, so this part of your OpenMP code cannot run in parallel. To resolve the problem, allocate the memory before the for loop.
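A minimal sketch of that restructuring might look like the following. It does not use the IPP API: std::vector stands in for whatever the IPP allocation call returns, and runLoop, scratchSize, and the changed doSomething signature are hypothetical names for illustration. The point is only that every allocation happens once, before the parallel loop, with one scratch buffer per thread.

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical stand-in for the FFT-calling function. It receives a
// pre-allocated scratch buffer instead of allocating one itself.
void doSomething(std::vector<float>& scratch) {
    scratch[0] += 1.0f; // placeholder for the real FFT work
}

// Runs the loop and returns how many iterations were executed in total.
// scratchSize must be at least 1.
int runLoop(int numLoops, std::size_t scratchSize) {
#ifdef _OPENMP
    const int maxThreads = omp_get_max_threads();
#else
    const int maxThreads = 1;
#endif
    // Allocate serially, once, before the parallel region:
    // one zero-initialized buffer per thread.
    std::vector<std::vector<float> > scratch(
        maxThreads, std::vector<float>(scratchSize));

    #pragma omp parallel for
    for (int i = 0; i < numLoops; i++) {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();
#else
        const int tid = 0;
#endif
        doSomething(scratch[tid]); // no allocation inside the loop
    }

    // Each thread counted its own iterations in element 0 of its buffer.
    int total = 0;
    for (int t = 0; t < maxThreads; t++) {
        total += static_cast<int>(scratch[t][0]);
    }
    return total;
}
```

Indexing the buffers by thread number keeps the threads from sharing scratch memory, so no locking is needed inside the loop.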

Thanks,
Chao

gauss2561
Beginner

Thanks very much, Chao. We are indeed allocating memory inside the loop. I'll move it out and see if it helps.
Bruce