According to ThreadedFunctionsList.txt, "ippsFFTFwd_CToC_32fc_I" is threaded. However, a simple timing loop shows no difference in execution time whether I leave the number of IPP threads at 4 (the default value per "ippGetNumThreads") or reduce it to 1 (via "ippSetNumThreads"). I have tried FFT lengths from 2^3 to 2^20. Parallel Amplifier shows CPU Usage is 1 regardless of number of threads. What's up/what should I check?
I am running the Intel IPP 6.1 dynamic libraries obtained through Parallel Studio (Composer Update 4) with Visual Studio 2008 under Windows Vista SP2 on an Intel Core2 Quad CPU Q6700 processor. I have successfully written and run other multithreaded programs using OpenMP and the Intel compiler, utilizing all four cores. Here is a code fragment from the timing program, which runs the FFT repeatedly on the same data:
[cpp]
is = ippsFFTInitAlloc_C_32fc(&pSpec, powerOf2, flag, hint);
is = ippsFFTGetBufSize_C_32fc(pSpec, &bufSize);
if (bufSize)
    pBuffer = (Ipp8u*) ippMalloc(bufSize);
else
    pBuffer = NULL;

is = ippSetNumThreads(1); // or 4, the default per ippGetNumThreads()

startMsec = timeGetTime();
for (long iter = 0; iter < numIter; iter++)
{
    is = ippsFFTFwd_CToC_32fc_I( (Ipp32fc*)x, (IppsFFTSpec_C_32fc*) pSpec, (Ipp8u*) pBuffer );
}
finishMsec = timeGetTime();

deltaTime = 0.001 * (finishMsec - startMsec);
perTime_mt = (deltaTime/numIter)*1e6;
cout << perTime_mt << endl;
[/cpp]
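Because timeGetTime() has millisecond resolution at best, a short FFT can only be timed by averaging over many calls. The sketch below is one possible way to collect per-call minimum and maximum times with a finer clock. It assumes a C++11 compiler, and the names `TimingStats` and `timePerCall` are illustrative, not part of IPP; the callable passed in would wrap the FFT call from the fragment above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Per-call timing summary in microseconds.
struct TimingStats {
    double minUsec;
    double maxUsec;
};

// Times `work` over `numTrials` trials of `callsPerTrial` calls each and
// returns the fastest and slowest per-call averages seen across the trials.
// `work` is any callable taking no arguments (hypothetical stand-in for the
// FFT call in the original program).
template <typename Work>
TimingStats timePerCall(Work work, int numTrials, int callsPerTrial) {
    assert(numTrials > 0 && callsPerTrial > 0);
    using Clock = std::chrono::steady_clock;
    TimingStats stats = { std::numeric_limits<double>::max(), 0.0 };
    work();  // one untimed warm-up call (first-use setup, cache fill)
    for (int trial = 0; trial < numTrials; ++trial) {
        const Clock::time_point start = Clock::now();
        for (int call = 0; call < callsPerTrial; ++call) work();
        const std::chrono::duration<double, std::micro> elapsed = Clock::now() - start;
        const double perCall = elapsed.count() / callsPerTrial;
        stats.minUsec = std::min(stats.minUsec, perCall);
        stats.maxUsec = std::max(stats.maxUsec, perCall);
    }
    return stats;
}
```

Reporting the minimum alongside the maximum makes it easier to tell a real threading gain from run-to-run noise.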
http://software.intel.com/sites/products/documentation/hpc/ipp/ia32/index.htm
Go to the "Supporting Multithreaded Applications" chapter in the manual.
Also, please review this article from the knowledge base, which I have updated with additional information from engineering regarding your FFT observations:
OpenMP and the Intel IPP Library
Paul
- - - - Supporting Multithreaded Applications - - - -
Intel IPP Threading and OpenMP* Support
All Intel IPP functions are thread-safe in both dynamic and static libraries and can be used in multithreaded applications.
Some Intel IPP functions contain OpenMP* code that gives significant performance gain on multi-processor and multi-core systems. These functions include color conversion, filtering, convolution, cryptography, cross correlation, matrix computation, square distance, and bit reduction, etc.
Refer to the ThreadedFunctionsList.txt document to see the list of all threaded functions in the doc directory of the Intel IPP installation.
Setting Number of Threads
The default number of threads for Intel IPP threaded libraries is equal to the number of processors in the system and does not depend on the value of the OMP_NUM_THREADS environment variable.
To set another number of threads used by Intel IPP internally, call the function ippSetNumThreads(n) at the very beginning of an application. Here n is the desired number of threads (1,...). If internal parallelization is not desired, call ippSetNumThreads(1).
Using Shared L2 Cache
Some functions in the signal processing domain are threaded on two threads, intended for the Intel Core2 processor family, and take advantage of the merged L2 cache. These functions (single- and double-precision FFT, Div, Sqrt, etc.) achieve maximum performance if both threads are executed on the same die. In this case the threads work on the same shared L2 cache. For processors with two cores on the die, this condition is satisfied automatically. For processors with more than two cores, a special OpenMP environment variable must be set:
KMP_AFFINITY=compact
Otherwise the performance may degrade significantly.
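A common pitfall is setting the variable in one shell (for example with `set KMP_AFFINITY=compact` in a Windows command prompt) and then launching the program from somewhere else, such as an IDE, where it is not defined. The following sketch, using only the standard library, checks at startup whether the variable actually reached the process. The function names are illustrative, and the substring test is a rough check under the assumption that the value may carry extra modifiers; it is not a full parser of the KMP_AFFINITY syntax.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// True if a KMP_AFFINITY value mentions the "compact" type. The value may
// carry modifiers (for example "verbose,compact"), so this is a substring
// check only. A null pointer means the variable is unset.
bool requestsCompactAffinity(const char* value) {
    if (value == nullptr) return false;
    return std::string(value).find("compact") != std::string::npos;
}

// Reads KMP_AFFINITY from this process's environment and applies the check.
bool compactAffinityIsSet() {
    return requestsCompactAffinity(std::getenv("KMP_AFFINITY"));
}
```

Calling `compactAffinityIsSet()` before the first threaded call and printing a warning when it returns false is one way to rule out an environment problem while benchmarking.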
Nested Parallelization
If a multithreaded application created with OpenMP uses a threaded Intel IPP function, this function will operate in a single thread because nested parallelization is disabled by default in OpenMP.
If a multithreaded application created with other tools uses a threaded Intel IPP function, it is recommended to disable multithreading in Intel IPP to avoid nested parallelization and possible performance degradation.
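When many independent transforms are needed, the recommendation above suggests threading at the application level and keeping each library call single-threaded. The sketch below illustrates that pattern with C++11 `std::thread`. The helper `forEachBlock` and the `transform` callable are hypothetical; with IPP, `transform` would presumably call the FFT on one data block after ippSetNumThreads(1), and each worker would likely need its own work buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs transform(blockIndex) for every block in [0, numBlocks), spreading
// the blocks over numWorkers application threads. Each call is expected to
// be single-threaded internally and to touch only its own block, so no
// locking is done here.
void forEachBlock(std::size_t numBlocks, unsigned numWorkers,
                  const std::function<void(std::size_t)>& transform) {
    assert(numWorkers > 0);
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < numWorkers; ++w) {
        workers.emplace_back([=, &transform]() {
            // Worker w takes blocks w, w + numWorkers, w + 2*numWorkers, ...
            for (std::size_t b = w; b < numBlocks; b += numWorkers) transform(b);
        });
    }
    for (std::thread& t : workers) t.join();
}
```

The interleaved assignment keeps the example short; contiguous chunks per worker would serve equally well when all blocks cost the same.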
Disabling Multithreading
To disable multithreading, link your application with the IPP non-threaded static libraries, or build a custom SO using the non-threaded static libraries.
It's possible that the overhead associated with the loop and the OpenMP setup/teardown is overwhelming the amount of time spent doing work in the function, in which case you may not see a significant difference in time to execute (possibly even longer time to execute with 4 threads). Likewise, it might explain why you don't see the CPU usage be something other than 1, since I believe that number is a sampled result.
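To make the overhead argument concrete, here is a deliberately simple model, offered as an assumption and not as a description of how IPP or OpenMP actually behave: the work divides evenly across threads, and every call pays a fixed cost for starting and joining the parallel region. Both function names are illustrative.

```cpp
#include <cassert>

// Estimated per-call time with `threads` threads under the simple model:
// perfectly divisible work plus a fixed per-call threading overhead.
double modeledParallelUsec(double serialUsec, int threads, double overheadUsec) {
    assert(threads > 0);
    return serialUsec / threads + overheadUsec;
}

// Largest per-call overhead at which threading still breaks even, i.e. the
// overhead for which modeledParallelUsec equals serialUsec.
double breakEvenOverheadUsec(double serialUsec, int threads) {
    assert(threads > 0);
    return serialUsec * (1.0 - 1.0 / threads);
}
```

Under this model, a hypothetical 20 usec transform split over two threads stops paying off once the per-call overhead reaches 10 usec, which is why short transforms may show no gain at all.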
Thanks for your reply. I discovered my investigation was incomplete. With further study using the Parallel Amplifier, I discovered that the IPP FFT I am calling does use multiple threads, but only for lengths between 2^13 and 2^17, inclusive. Further, it only uses two threads for those lengths, and the improvement over single-threaded performance is marginal or inconsistent. Here is a summary of my per-FFT timing results for even powers-of-2 from 12 to 20, running each length 10 times:
[plain]
          *** MaxThreads=4 ***    *** MaxThreads=1 ***    Threads Created
Length    min usec    max usec    min usec    max usec    For MaxThreads=4
------    --------    --------    --------    --------    ----------------
2^12         21.80       23.40       21.80       23.40    1
2^14         93.50      156.00      124.50      125.00    2
2^16        374.00      468.00      499.00      515.00    2
2^18       2800.00     3120.00     2800.00     3120.00    1
2^20      19960.00    20280.00    19960.00    20600.00    1
[/plain]
So, the IPP FFT is indeed multithreaded, but yields limited impact.
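The size of the gain can be read off the timings with a little arithmetic. The sketch below computes the speedup from a pair of times and the data size of an in-place transform, assuming 8 bytes per element (two 32-bit floats, as in Ipp32fc). The function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Speedup of the threaded run over the single-threaded run.
double speedup(double singleThreadUsec, double multiThreadUsec) {
    assert(multiThreadUsec > 0.0);
    return singleThreadUsec / multiThreadUsec;
}

// Bytes occupied by the data of an in-place complex FFT of length 2^order,
// assuming 8 bytes per element.
std::size_t fftDataBytes(unsigned order) {
    const std::size_t bytesPerElement = 8;
    return (static_cast<std::size_t>(1) << order) * bytesPerElement;
}
```

With the minimum times reported above, `speedup(124.50, 93.50)` and `speedup(499.00, 374.00)` both come out near 1.33, while the maximum times for 2^14 give `speedup(125.00, 156.00)`, about 0.80, which is slower than the single-threaded run. For reference, `fftDataBytes(14)` is 131072 bytes (128 KiB) and `fftDataBytes(18)` is 2097152 bytes (2 MiB).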
Interesting study. Thank you. I will request further clarification from engineering.
I have the same problem with the FFT (IPP version 7.0). The FFT length is 2^19.
I run it on a 12-core machine and see that only one core is working, even though I did everything you described.
By contrast, the Direct FIR function is parallelized very well.
Can you help me with the FFT?
Arkady