I have a problem with the FFT function ippsFFTFwd_CToC_32fc (IPP version 7.0). The FFT length is 2^19. According to ThreadedFunctionsList.txt, "ippsFFTFwd_CToC_32fc" is threaded.
I run it on a 12-core machine (2x6-core L5640) through Parallel Studio with Visual Studio 2010, under 64-bit Windows Server 2008.
I see that only one core is working, even though I did everything that is written in the documentation.
For comparison, the direct FIR function is parallelized very well.
Can you help me with the FFT?
This looks like a problem we discussed in the forum before. Please find below some comments on the performance from the function expert:
1) For rather small FFT orders (below about 19, depending on the platform's cache size), the FFT function uses a memory buffer roughly equal in size to the vector length. For such orders there is no performance difference between the in-place and out-of-place cases: the FFT is calculated in the buffer and the result is then copied to the destination, so for in-cache cases it doesn't matter whether the copy goes to the src or the dst vector. For rather large orders (>19), the in-place version is faster, because internally the FFT uses a smaller buffer (less than the input vector length). I think the HDD case should not be discussed here.
2) FFT is threaded only for cases that fit into a shared L2 cache, only on Core2 CPUs, and only on 2 threads. For small orders the OpenMP overhead is greater than the benefit; for large (out-of-cache) orders memory effects play a negative role. So the customer's investigation is right: there is no threading for order 19 and above.
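To make the cache argument above concrete, here is a rough C sketch of the arithmetic. It is not taken from IPP. The helper names are hypothetical, the element size assumes a single-precision complex type (two 32-bit floats, as in the 32fc data type), the "vector plus an equally sized buffer" estimate follows point 1, and the 6 MiB shared L2 is only an assumed example value for a Core2-class CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes in one complex vector of 2^order elements,
   assuming each element is two 32-bit floats (real and imaginary part). */
static size_t fft_vector_bytes(int order)
{
    assert(order >= 0 && order < 48);   /* keep the shift well-defined */
    return ((size_t)1 << order) * 2 * sizeof(float);
}

/* Hypothetical helper: rough in-cache check. Assumes the working set is
   the data vector plus a work buffer of about the same size, as described
   in point 1 for small orders. Returns 1 if it fits, 0 otherwise. */
static int fft_fits_in_cache(int order, size_t cache_bytes)
{
    size_t working_set = 2 * fft_vector_bytes(order);
    return working_set <= cache_bytes;
}
```

With these assumptions, a 2^19-point vector alone takes 4 MiB and the estimated working set is about 8 MiB, so fft_fits_in_cache(19, 6 * 1024 * 1024) returns 0, while order 17 (about 2 MiB in total) fits. This agrees with the statement that there is no threading at order 19 and above.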