Parallel Intel IPP FFT function

caosun · 03-19-2013

Hi experts:

I want to compute multiple FFTs in parallel. My code looks roughly like this:

ippsFFTGetSize_C_32fc(....)
ippsFFTInit_C_32fc(...FFTSpec, Buffer)
parallel_for(0, chunks, [=](size_t i){
    ippsFFTFwd_CToC_32fc(...FFTSpec, Buffer);
});

But the results are not correct. I suspect that FFTSpec and Buffer record state during the FFT operation, so the parallel FFT calls conflict with each other.

Could you please let me know the real reason?

And is there any way to run multiple FFTs in parallel? (I do not want to put ippsFFTInit_C_32fc in the loop, as it is time-consuming.)

SergeyKostrov · 03-19-2013

An array of FFTSpec and Buffer needs to be used, and the size of the array should be equal to the number of chunks:

...
parallel_for( 0, chunks, [=]( size_t i ) { ippsFFTFwd_CToC_32fc( ..., &FFTSpec, &Buffer ); } );
...

Also, please take a look at the article software.intel.com/en-us/articles/threading-and-intel-integrated-performance-primitives for more information. A complete list of threaded IPP functions should be in the IPP docs folder.

Gennady_F_Intel · 03-19-2013

In that case, to avoid thread oversubscription, you can call ippSetNumThreads(1).

SergeyKostrov · 03-19-2013

Please also take into account that ippsFFTFwd_CToC_32fc is threaded (I just verified this in v7.1) and that all threading will be removed in future versions of IPP. At the moment, if your data set is large and internal IPP threading is active, your own TBB-based threading could create more problems and degrade performance. I would call this "double-threaded" processing, and I think you really need to do a performance evaluation. Do you have any numbers as an example? Gennady's suggestion should force single-threaded processing by the IPP function, and in that case your code looks good.

caosun · 04-07-2013

Hi,

I have already disabled the internal IPP OpenMP threading.

It does not make sense to me to use arrays of FFTSpec and Buffer whose size N must equal the number of chunks. The reasons are:

1. The program might not know how large N is until it runs the parallel ippsFFTFwd_CToC_32fc calls; that is, the size is dynamic and cannot be known beforehand.

2. The number of threads created is limited by the number of cores, and each thread performs multiple ippsFFTFwd_CToC_32fc calls serially. So the array only needs to be as large as the number of cores, with each thread having its own FFTSpec and Buffer. But how can I control that?

Your comments are highly appreciated.

SergeyKostrov · 04-09-2013

>>...It does not make sense to me, if I use FFTSpec and Buffer where size of the array N should be equal to the number of chunks...

This is by design of the function: when it is used in a multi-threaded environment, different threads cannot share these parameters. A similar problem with using IPP functions together with TBB was solved by another IDZ user in 2012.

Igor_A_Intel · 04-09-2013

If all FFTs have the same order, one FFTSpec is enough. In IPP terminology (described in the manual), a Spec is always const, while a State (for example, for FIRs and IIRs) stores temporary function state in order to support stream processing. So for correct threading you should create one common FFTSpec and a number of unique buffers, one for each thread. Buffers are used for temporary storage after each butterfly, while the Spec contains only the precalculated twiddle factors and the bit-reversal table.

regards, Igor
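
Igor's advice (one shared, read-only spec plus one work buffer per thread) can be sketched without IPP. The following is a minimal illustration in standard C++, not IPP code: DftSpec, make_spec, dft_inplace, and transform_all are hypothetical names, and a naive DFT stands in for ippsFFTFwd_CToC_32fc. The spec holds only a precomputed twiddle table and is passed by const reference to every thread, while each thread allocates its own scratch buffer.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <thread>
#include <vector>

using cf = std::complex<float>;

// Hypothetical stand-in for an FFT spec: a read-only twiddle table, built once.
struct DftSpec {
    std::size_t n;
    std::vector<cf> twiddle;  // twiddle[k] = exp(-2*pi*i*k/n)
};

DftSpec make_spec(std::size_t n) {
    assert(n > 0);
    DftSpec s{n, std::vector<cf>(n)};
    const double step = -2.0 * std::acos(-1.0) / static_cast<double>(n);
    for (std::size_t k = 0; k < n; ++k)
        s.twiddle[k] = cf(static_cast<float>(std::cos(step * k)),
                          static_cast<float>(std::sin(step * k)));
    return s;
}

// Naive in-place DFT (a stand-in for the real FFT call).
// It only reads the shared spec and writes to `data` and `scratch`.
void dft_inplace(const DftSpec& spec, cf* data, std::vector<cf>& scratch) {
    const std::size_t n = spec.n;
    for (std::size_t k = 0; k < n; ++k) {
        cf acc(0.0f, 0.0f);
        for (std::size_t j = 0; j < n; ++j)
            acc += data[j] * spec.twiddle[(j * k) % n];
        scratch[k] = acc;
    }
    for (std::size_t k = 0; k < n; ++k) data[k] = scratch[k];
}

// Transforms `chunks` consecutive blocks of length spec.n with `workers` threads.
void transform_all(const DftSpec& spec, std::vector<cf>& signal,
                   std::size_t chunks, unsigned workers) {
    assert(workers > 0 && signal.size() == chunks * spec.n);
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&spec, &signal, chunks, workers, w] {
            std::vector<cf> scratch(spec.n);  // private to this thread
            // Thread w handles chunks w, w + workers, w + 2 * workers, ...
            for (std::size_t c = w; c < chunks; c += workers)
                dft_inplace(spec, signal.data() + c * spec.n, scratch);
        });
    }
    for (auto& t : pool) t.join();
}
```

Here the number of scratch buffers equals the number of worker threads, regardless of how many chunks are processed. With the real library, the assumption would be that the IPP spec plays the role of DftSpec and each thread's IPP work buffer plays the role of scratch.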

caosun · 04-09-2013

Thank you for the information.

Igor, it is good that for one FFT order we only need to create one FFTSpec. I still need to create a number of buffers, one for each thread. But the problem is that I do not want to create one buffer per chunk, because I do not know the number of chunks beforehand. If I just need to create number of chunks = number of CPU cores = number of threads, that would be great. Do you know how to do that with the Intel TBB tools?

SergeyKostrov · 04-10-2013

>>...If I just need to create number of chunks = number of CPU cores = number of threads, that's will be great...

This is actually what multi-threading guides recommend: ideally, you shouldn't exceed the number of logical cores. Oversubscription can also be used, but some performance impact is expected. Also, take a look at the Intel C++ Compiler User and Reference Guides:

...
Cache Blocking

Cache blocking involves structuring data blocks so that they conveniently fit into a portion of the L1 or L2 cache. By controlling data cache locality, an application can minimize performance delays due to memory bus access. The application controls the behavior by dividing a large array into smaller blocks of memory so a thread can make repeated accesses to the data while the data is still in cache.

For example, image processing and video applications are well suited to cache blocking techniques because an image can be processed on smaller portions of the total image or video frame. Compilers often use the same technique, by grouping related blocks of instructions close together so they execute from the L2 cache.

The effectiveness of the cache blocking technique depends on data block size, processor cache size, and the number of times the data is reused. Cache sizes vary based on processor. An application can detect the data cache size using the CPUID instruction and dynamically adjust cache blocking tile sizes to maximize performance.

As a general rule, cache block sizes should target approximately one-half to three-quarters the size of the physical cache. For systems that are Hyper-Threading Technology (HT Technology) enabled target one-quarter to one-half the physical cache size. (See Designing for Hyper-Threading Technology for more other design considerations.)
...
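
The block-size rule of thumb quoted above can be illustrated with a small sketch. This is only an example under stated assumptions: block_elems and scale_blocked are hypothetical helpers, the cache size is passed in by the caller rather than detected via CPUID, and a repeated scaling pass stands in for whatever work reuses the data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Number of elements per block when targeting `fraction` of a cache
// of `cache_bytes` bytes (for example, 0.5 for one-half of the cache).
std::size_t block_elems(std::size_t cache_bytes, double fraction,
                        std::size_t elem_bytes) {
    assert(elem_bytes > 0 && fraction > 0.0 && fraction <= 1.0);
    const std::size_t n =
        static_cast<std::size_t>(cache_bytes * fraction) / elem_bytes;
    return std::max<std::size_t>(n, 1);
}

// Runs all `passes` over one block before moving to the next block,
// so the block is reused while it is likely to still be in cache.
void scale_blocked(std::vector<float>& data, std::size_t block,
                   int passes, float factor) {
    assert(block > 0);
    for (std::size_t start = 0; start < data.size(); start += block) {
        const std::size_t end = std::min(start + block, data.size());
        for (int p = 0; p < passes; ++p)
            for (std::size_t i = start; i < end; ++i)
                data[i] *= factor;
    }
}
```

For instance, assuming a 256 KiB cache and 4-byte floats, block_elems(256 * 1024, 0.5, 4) gives 32768 elements per block. Whether blocking helps in practice depends on the data size and how often each block is reused, so it is worth measuring.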

SergeyKostrov · 04-10-2013

>>...how do that with Intel TBB tools?

Do you mean TBB classes? If yes, take a look at the TBB examples for details; the simple_partitioner class could be used in your case.
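
On the question of having one buffer per thread rather than one per chunk: TBB provides tbb::enumerable_thread_specific for per-thread storage, which may fit this case. The sketch below shows the same idea using only standard C++, so it is an illustration under assumptions rather than TBB or IPP code: process_chunk is a hypothetical stand-in for the FFT call, and local_scratch and run_chunks are made-up helper names. Each thread lazily creates one thread_local buffer and reuses it for every chunk it claims, so the number of buffers follows the number of threads, not the number of chunks.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the FFT call: reverses one chunk via a work buffer.
void process_chunk(float* data, std::size_t n, std::vector<float>& scratch) {
    for (std::size_t i = 0; i < n; ++i) scratch[i] = data[n - 1 - i];
    for (std::size_t i = 0; i < n; ++i) data[i] = scratch[i];
}

// Returns the calling thread's work buffer, allocating or growing it on demand.
std::vector<float>& local_scratch(std::size_t n) {
    thread_local std::vector<float> buf;
    if (buf.size() < n) buf.resize(n);
    return buf;
}

// Processes `chunks` blocks of length n and returns the number of threads used.
unsigned run_chunks(std::vector<float>& signal, std::size_t n,
                    std::size_t chunks) {
    assert(n > 0 && signal.size() == n * chunks);
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0) workers = 2;  // hardware_concurrency() may return 0
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&signal, &next, n, chunks] {
            // Each thread claims the next unprocessed chunk until none remain.
            for (std::size_t c = next.fetch_add(1); c < chunks;
                 c = next.fetch_add(1))
                process_chunk(signal.data() + c * n, n, local_scratch(n));
        });
    }
    for (auto& t : pool) t.join();
    return workers;
}
```

The chunk count only has to be known when run_chunks is called; the buffers are never sized by it. With TBB, the analogous approach would presumably be to call local() on an enumerable_thread_specific object inside the parallel_for body to obtain the current thread's buffer.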