Dear all,
we use IPP (5.3.4) within a data acquisition application. Besides some other threads, it consists of two 'main' threads:
- a data acquisition thread in which images are converted to object features
- a GUI thread in which post processing takes place (e.g. file writing, image display)
Both threads make use of IPP, but neither thread uses the CPU at 100%. It seems that IPP uses local parallelism in most functions. This indeed makes the individual calls faster (roughly twice as fast) on dual-/multi-core machines. However, it also adds an extra 20-30% processor load (on a DELL T3400 dual-core machine) compared to disabling the threading in IPP (via 'ippSetNumThreads(1);'). Watching the Windows performance monitor, one can see the thread context switches increase from an average of 1,000 per second to 200,000 per second.
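For reference, a minimal sketch of how we toggle the IPP-internal threading in an isolated test (assuming the threaded ippcore library is linked; ippGetNumThreads/ippSetNumThreads are the only calls used):
[cpp]
#include <ipp.h>
#include <cstdio>

int main()
{
    int nThreads = 0;

    // number of threads the threaded IPP libraries would use by default
    ippGetNumThreads(&nThreads);
    std::printf("IPP default threads: %d\n", nThreads);

    // force all IPP calls in this process to run single threaded
    ippSetNumThreads(1);
    ippGetNumThreads(&nThreads);
    std::printf("IPP threads after ippSetNumThreads(1): %d\n", nThreads);

    return 0;
}
[/cpp]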
This effect seems to limit the performance of IPP severely. Is there a recommended strategy to minimize this effect? Can it be circumvented completely? In a test program we ran three tests (see the code below):
- single threaded
- IPP used from 2 threads
- IPP used from 1 thread, with a second thread flooding the CPU completely but not using IPP
In the last two options one can see that IPP is actually slower without the 'ippSetNumThreads(1);' call.
Thanks in advance.
P.s. 1: we checked that IPP 5.3.4 is loaded correctly, e.g. it uses 'ippip8-5.3.dll' on my dual-core machine.
P.s. 2: code:
[cpp]
#include <ipp.h>
#include <crtdbg.h>
#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <boost/detail/interlocked.hpp>

#pragma comment(lib, "ipps.lib")
#pragma comment(lib, "ippcore.lib")
#pragma comment(lib, "ippi53.lib")

// forward declarations, the functions are defined below main()
void TestIntlIppiImpl(size_t nMax);
void TestIntlIppiImplFlood(long* pContinue);

int main()
{
    //performance will be half on dual core
    //ippSetNumThreads(1);

    //Ippi: 7.733000 single threaded
    //Ippi: 14.389000 single threaded, with ippSetNumThreads(1)
    //Ippi: 8.640000 + 8.812000 multi threaded
    //Ippi: 7.296000 + 7.312000 multi threaded, with ippSetNumThreads(1)
    //Ippi: 16.450000 flood threaded
    //Ippi: 14.482000 flood threaded, with ippSetNumThreads(1)

    enum IppiThread
    {
        eItSingle,
        eItMulti,
        eItMultiFlood,
    };

    //const size_t nMax = 1000000;
    const size_t nMax = 5000000;

    //const IppiThread eIt = eItSingle;
    const IppiThread eIt = eItMulti;
    //const IppiThread eIt = eItMultiFlood;

    switch (eIt)
    {
    case eItSingle:
    {
        TestIntlIppiImpl(nMax);
    }
    break;
    case eItMulti:
    {
        // two application threads, each doing half of the IPP work
        boost::thread_group threads;
        for (int i = 0; i != 2; ++i)
        {
            threads.create_thread(boost::bind(&TestIntlIppiImpl, nMax / 2));
        }
        threads.join_all();
    }
    break;
    case eItMultiFlood:
    {
        // one IPP thread plus one busy-looping thread that does not use IPP
        long lContinue = 1;
        boost::thread thread1(&TestIntlIppiImplFlood, &lContinue);
        boost::thread thread2(&TestIntlIppiImpl, nMax);
        thread2.join();
        BOOST_INTERLOCKED_EXCHANGE(&lContinue, 0);
        thread1.join();
    }
    break;
    default:
        _ASSERT(false);
        break;
    }
    return 0;
}

//----------------------------------------------------------------------------
// Function TestIntlIppiImpl
//----------------------------------------------------------------------------
// Description : test ippi impl.
//----------------------------------------------------------------------------
void TestIntlIppiImpl(size_t nMax)
{
    const int nWidth = 320;
    const int nHeight = 200;
    int nStepSizeSource = 0;
    int nStepSizeTarget = 0;
    int nStepSizeSubtract = 0;
    IppiSize roiSize = {nWidth, nHeight};

    Ipp8u* pImageBufferSource = ippiMalloc_8u_C1(nWidth, nHeight, &nStepSizeSource);
    Ipp8u* pImageBufferTarget = ippiMalloc_8u_C1(nWidth, nHeight, &nStepSizeTarget);
    Ipp8u* pImageBufferSubtract = ippiMalloc_8u_C1(nWidth, nHeight, &nStepSizeSubtract);

    ippiImageJaehne_8u_C1R(pImageBufferSource, nStepSizeSource, roiSize);
    ippiImageJaehne_8u_C1R(pImageBufferTarget, nStepSizeTarget, roiSize);
    ippiImageJaehne_8u_C1R(pImageBufferSubtract, nStepSizeSubtract, roiSize);

    for (size_t n = 0; n != nMax; ++n)
    {
        ippiSub_8u_C1RSfs(pImageBufferSubtract, nStepSizeSubtract,
                          pImageBufferSource, nStepSizeSource,
                          pImageBufferTarget, nStepSizeTarget,
                          roiSize, 1);
    }

    ippiFree(pImageBufferSubtract);
    ippiFree(pImageBufferTarget);
    ippiFree(pImageBufferSource);
}

//----------------------------------------------------------------------------
// Function TestIntlIppiImplFlood
//----------------------------------------------------------------------------
// Description : flood cpu
//----------------------------------------------------------------------------
void TestIntlIppiImplFlood(long* pContinue)
{
    //do not use synchronisation like condition variables,
    //because they relinquish the processor
    for (;;)
    {
        if (!(*pContinue))
        {
            break;
        }
    }
}
[/cpp]
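P.s. 3: for completeness, a small sketch of how we double-check at run time which CPU-specific ippi library the dispatcher picked (this is not part of the timing test above; it only reads the version info via ippiGetLibVersion()):
[cpp]
#include <ippi.h>
#include <cstdio>

// Print the name/version of the dispatched ippi library; the targetCpu code
// should match the 'p8' in the DLL name mentioned in P.s. 1.
void PrintIppiVersion()
{
    const IppLibraryVersion* pVersion = ippiGetLibVersion();
    std::printf("%s %s (targetCpu %.2s)\n",
                pVersion->Name, pVersion->Version, pVersion->targetCpu);
}
[/cpp]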
Quoting - Vladimir Dudnik (Intel)
There is no magic. If your system has two cores, it can execute only two threads simultaneously. If the number of active threads (those that actually load the CPU) in your application is greater than the number of physically available cores, some threads will have to wait for their time slice, and this lowers overall application performance. In this case we recommend using the single-threaded IPP libraries or disabling threading in the multithreaded IPP libraries.
On systems with a larger number of cores there is an opportunity to balance: on a 4- or 8-core system, for example, you may enable two threads for IPP and use the remaining cores for your application's own needs.
You just need to avoid thread oversubscription.
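For example, a rough sketch of that balancing idea (the boost::thread::hardware_concurrency() call and the assumption of two busy application threads are only an illustration, they are not taken from your application):
[cpp]
#include <ipp.h>
#include <boost/thread.hpp>

// Give IPP only the cores that the application's own busy threads do not need.
void ConfigureIppThreads()
{
    const int nCores = static_cast<int>(boost::thread::hardware_concurrency());
    const int nAppThreads = 2;                    // e.g. acquisition thread + GUI thread
    const int nIppThreads = nCores - nAppThreads; // cores left over for IPP-internal threading

    // On a dual core nothing is left over, so fall back to serial IPP calls.
    ippSetNumThreads(nIppThreads > 1 ? nIppThreads : 1);
}
[/cpp]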
Regards,
Vladimir
Thanks.
The problem is of course in the word 'oversubscription': sometimes the other thread is busy and the cores do get oversubscribed, but at other times that thread is just waiting, and permanently disabling the use of multiple threads in IPP would then cost performance.
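One direction we are considering (just a sketch, not something we have in the real application yet): call ippSetNumThreads(1) once at startup and keep the parallelism at application level instead, e.g. by splitting the ROI of a call such as ippiSub_8u_C1RSfs over our own boost threads, so the number of busy threads stays under our own control:
[cpp]
#include <ipp.h>
#include <boost/thread.hpp>
#include <boost/bind.hpp>

// Subtract one half of the image; used as the per-thread work item below.
static void SubHalf(const Ipp8u* pSub, int nStepSub,
                    const Ipp8u* pSrc, int nStepSrc,
                    Ipp8u* pDst, int nStepDst, IppiSize roi)
{
    ippiSub_8u_C1RSfs(pSub, nStepSub, pSrc, nStepSrc, pDst, nStepDst, roi, 1);
}

// Split the ROI horizontally and run the two halves on our own threads,
// with IPP-internal threading disabled elsewhere via ippSetNumThreads(1).
void SubImageInTwoThreads(const Ipp8u* pSub, int nStepSub,
                          const Ipp8u* pSrc, int nStepSrc,
                          Ipp8u* pDst, int nStepDst,
                          int nWidth, int nHeight)
{
    const int nHalf = nHeight / 2;
    IppiSize roiTop = { nWidth, nHalf };
    IppiSize roiBottom = { nWidth, nHeight - nHalf };

    boost::thread_group threads;
    threads.create_thread(boost::bind(&SubHalf,
        pSub, nStepSub, pSrc, nStepSrc, pDst, nStepDst, roiTop));
    threads.create_thread(boost::bind(&SubHalf,
        pSub + nHalf * nStepSub, nStepSub,
        pSrc + nHalf * nStepSrc, nStepSrc,
        pDst + nHalf * nStepDst, nStepDst, roiBottom));
    threads.join_all();
}
[/cpp]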