Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

IPP with multithreaded applications

gast128
Beginner

Dear all,

we use IPP (5.3.4) within a data acquisition application. Besides some other threads it consists of two 'main' threads:
- a data acquisition thread in which images are converted to object features
- a GUI thread in which post processing takes place (e.g. file writing, image display)

Both threads make use of IPP, but neither thread uses the CPU at 100%. It seems that IPP uses local parallelism in most functions. This indeed makes the call faster (about twice as fast) on dual/multicore machines. However, it also gives an extra 20-30% processor load (on a DELL T3400 dual-core machine) compared to disabling the threading in IPP (through ippSetNumThreads(1)). Spying with the Windows performance monitor, one can see that thread context switches increase from an average of 1,000 per second to 200,000 per second.

This effect seems to limit the performance of IPP (severely). Is there a recommended strategy to minimize this effect? Can it be circumvented completely? In a test program we ran 3 tests (see below):
- single threaded
- IPP used from 2 threads
- IPP used from 1 thread, with a second thread flooding the CPU completely but not using IPP

In the last 2 scenarios one can see that IPP is actually slower without the ippSetNumThreads(1) call.

Thanks in advance.

P.s. 1: we checked that IPP 5.3.4 is correctly loaded; e.g. it is using 'ippip8-5.3.dll' on my dual-core machine.
P.s. 2: code:

[cpp]#include <ipp.h>
#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <crtdbg.h>

#pragma comment(lib, "ipps.lib")
#pragma comment(lib, "ippcore.lib")
#pragma comment(lib, "ippi53.lib")

//forward declarations: the test routines are defined after main
void TestIntlIppiImpl(size_t nMax);
void TestIntlIppiImplFlood(long* pContinue);

int main()
{
   //performance will be half on dual core
   //ippSetNumThreads(1);

   //Ippi:   7.733000              single threaded     
   //Ippi:  14.389000              single threaded, with ippSetNumThreads(1)
   //Ippi:   8.640000 + 8.812000   multi threaded
   //Ippi:   7.296000 + 7.312000   multi threaded, with ippSetNumThreads(1)
   //Ippi:  16.450000              flood threaded
   //Ippi:  14.482000              flood threaded, with ippSetNumThreads(1)
   
   enum IppiThread
   {
       eItSingle,
       eItMulti,
       eItMultiFlood,
   };

   //const size_t nMax = 1000000;
   const size_t nMax = 5000000;
   
   //const IppiThread eIt = eItSingle;
   const IppiThread eIt = eItMulti;
   //const IppiThread eIt = eItMultiFlood;

   switch (eIt)
   {
   case eItSingle:
      {
         TestIntlIppiImpl(nMax);
      }
      break;

   case eItMulti:
      {
         boost::thread_group threads;
         for (int i = 0; i != 2; ++i)
         {
            threads.create_thread(boost::bind(&TestIntlIppiImpl, nMax / 2));
         }
         
         threads.join_all();
      }
      break;

   case eItMultiFlood:
      {   
         long lContinue = 1; 

         boost::thread thread1(&TestIntlIppiImplFlood, &lContinue);
         boost::thread thread2(&TestIntlIppiImpl, nMax);
        
         thread2.join();

         BOOST_INTERLOCKED_EXCHANGE(&lContinue, 0);

         thread1.join();
      }
      break;

   default:
       _ASSERT(false);
       break;
   }

   return 0;
}


//----------------------------------------------------------------------------
// Function TestIntlIppiImpl
//----------------------------------------------------------------------------
// Description  : test ippi impl.
//----------------------------------------------------------------------------
void TestIntlIppiImpl(size_t nMax)
{
    const int nWidth            = 320;
    const int nHeight           = 200;
    int       nStepSizeSource   = 0;
    int       nStepSizeTarget   = 0;
    int       nStepSizeSubtract = 0;

    IppiSize roiSize = {nWidth, nHeight};
    
    Ipp8u* pImageBufferSource   = ippiMalloc_8u_C1(nWidth, nHeight, &nStepSizeSource);
    Ipp8u* pImageBufferTarget   = ippiMalloc_8u_C1(nWidth, nHeight, &nStepSizeTarget);
    Ipp8u* pImageBufferSubtract = ippiMalloc_8u_C1(nWidth, nHeight, &nStepSizeSubtract);
    
    ippiImageJaehne_8u_C1R(pImageBufferSource,   nStepSizeSource,   roiSize); 
    ippiImageJaehne_8u_C1R(pImageBufferTarget,   nStepSizeTarget,   roiSize); 
    ippiImageJaehne_8u_C1R(pImageBufferSubtract, nStepSizeSubtract, roiSize); 

    for (size_t n = 0; n != nMax; ++n)
    {
        ippiSub_8u_C1RSfs(pImageBufferSubtract, nStepSizeSubtract, pImageBufferSource, nStepSizeSource, pImageBufferTarget, nStepSizeTarget, roiSize, 1);
    }
    
    ippiFree(pImageBufferSubtract);
    ippiFree(pImageBufferTarget);
    ippiFree(pImageBufferSource);
}


//----------------------------------------------------------------------------
// Function TestIntlIppiImplFlood
//----------------------------------------------------------------------------
// Description  : flood cpu
//----------------------------------------------------------------------------
void TestIntlIppiImplFlood(long* pContinue)
{
    //do not use synchronisation like condition variables,
    //because they relinquish the processor

    for (;;)
    {
        if (!(*pContinue))
        {
            break;
        }
    }
}
[/cpp]
pvonkaenel
New Contributor III

Hi Peter,

the problem with TBB will be the same as for OpenMP. Not every application uses the TBB threading API. It is just not possible to provide a threaded IPP for every threading API. Instead we recommend using IPP threading when it makes sense (I've heard a lot of positive feedback on it) and using the non-threaded IPP libraries when you want to have full control over threading in your application (and in this case you can choose whatever API you like).

Regards,
Vladimir

Fair enough, but it seemed worth asking :). Along these lines, could you outline how ippiYCbCr420ToCbYCr422_Interlace_8u_P3C2R() is internally OMP threaded? I have not been able to figure out the top and bottom cases, and would like to get the threading gain back. Any chance of making the threading code layer source available to make it easier to port certain routines to other threading architectures? Another long shot, but again worth asking.

Thanks,
Peter
Rob_Ottenhoff
New Contributor I
Hi All,

I understand that IPP cannot support all kinds of threading APIs. But the whole raison d'être of IPP is its speed. If a more sophisticated threading strategy can make it faster, if only for a subset of users, that would be nice.

To see what TBB could do I wrote a little test. I parallelized the inner loop of gast128's code above with TBB and let the various scenarios run. The results are below.

As you can see, boost::thread and TBB are about equivalent when the CPU is not flooded. But TBB certainly eases the pain when it is (19.7 s vs 25.5 s). The last case, where boost, TBB and IPP all run in multiple threads, is dramatic, so care is needed. I don't have a quad-core available at the moment, but when I do I will see what effect it has.

Conclusion: a different threading strategy, like the one of TBB, can make IPP faster when the CPU is loaded by other threads.

NT = 1 single: 25.4549
NT = 2 single: 14.2245
NT = 1 multi: 12.8907
NT = 2 multi: 14.4793
NT = 1 TBB: 12.903
NT = 2 TBB: 14.5953
NT = 1 multi flood: 25.4987
NT = 2 multi flood: 28.7594
NT = 1 TBB flood: 19.6604
NT = 2 TBB flood: 164.819

Where:
NT = number of threads of IPP.
single = single threaded.
multi = 2 boost threads.
multi flood = 1 boost thread + 1 boost thread flooding the CPU
TBB = parallel with TBB
TBB flood = 1 boost thread calling parallel TBB + 1 boost thread flooding the CPU

Regards,

Rob
Vladimir_Dudnik
Employee
My feeling is that it is all about an additional layer built on top of IPP, which may add more benefit than just using a better threading technique inside the IPP functions.

Consider any real-life task, which will usually consist of several calls to IPP (for example, a Sobel filter, where you need to calculate vertical and horizontal derivatives, take their absolute values and add them together to form an output image with edges). Do you think it is better to parallelize each of these primitive operations independently (as would be done with threaded IPP functions), or is it better to build threading on top of IPP functions, where you can balance not only each core's workload (by knowing, for example, the computational complexity of each operation) but also process data in small enough chunks to keep all processed data in the L2 cache?

That is what we try to implement with the DMIP layer.

Regards,
Vladimir
Mikael_Grev
Beginner

IMO every parallelizable method should use TBB (if it is as good as Fork/Join) to run at full speed under all core loads. The methods should also be runnable in one thread (via a naming convention or an extra argument) to facilitate custom use of TBB by the user in the way you mention.

There's no need for simple core-spanning OpenMP. Method-local TBB or Fork/Join covers all the advantages of OpenMP and brings a lot more.

Cheers,
Mikael
Vladimir_Dudnik
Employee
I would not argue with this, but people who use OpenMP in their applications will ...

Vladimir
Mikael_Grev
Beginner


Since OpenMP is a compiler directive, can't the implementation be switched to whatever you choose? The only requirement is that it is supposed to take advantage of multiple processors. A better implementation should not break backwards compatibility. That said, I don't have the source code for IPP, so I can't really tell all the details.

Cheers,
Mikael
David_M_Intel3
Employee
Thanks for all of your input on this thread. We are certainly open to exploring a different threading model within IPP (other than OpenMP). Understanding the usage model helps define the parameters for the selection. The usage model that began this thread describes a case where IPP functions are called concurrently from two threads; we will certainly consider this as we evaluate threading models within IPP. In the meantime, what other usage models do you want or use? What types of control do you want over the threading within IPP? Please add your feedback here.

Additionally, here is some more background, first on a couple of threading models and finally on IPP and the Intel libraries. First, let me compare two popular threading abstractions: OpenMP and Intel Threading Building Blocks.

OpenMP originated in the HPC (high-performance computing) community. It supports both Fortran and C. It is principally pragma/directive based. Coming from the HPC community, OpenMP takes a greedy approach: when a process enters an OpenMP parallel region, it assumes all the system resources belong to it, and OpenMP will schedule the work across all the cores/processors on the system. This is the default behavior for Intel's OpenMP runtime library.

Intel Threading Building Blocks builds on top of the generic programming model. It targets C++ developers and is template based. The Threading Building Blocks library contains many common parallel algorithms as well as a number of parallel containers. Threading Building Blocks uses a Cilk-style work-stealing algorithm. For advanced users there are some highly tuned synchronization functions.

Threading Building Blocks is entirely based on C++ object-oriented programming; it has a rich set of constructs and is more flexible than OpenMP. When the OpenMP model is applicable, it can be added to your code with very few code changes. See http://software.intel.com/en-us/articles/intel-threading-building-blocks-openmp-or-native-threads/ for a more complete comparison of Intel Threading Building Blocks and OpenMP. Both Threading Building Blocks and OpenMP are part of the Intel C++ Compiler Pro product. Threading Building Blocks is also available separately and works with other compilers as well.

Second, the Intel IPP library is built using the Intel OpenMP runtime library for threading. When two threads each call an IPP function that is threaded, each invocation of a threaded IPP function will fork off a number of threads to match the number of cores, and you can end up with oversubscription (more threads than cores). If you are doing this, we recommend that you override the default greedy behavior of the OpenMP runtime library in IPP (ippSetNumThreads(1)). See http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-threading-openmp-faq/ for more details on Intel IPP and OpenMP.

Ying Song,
Consulting and support for Intel Performance Libraries
and
David Mackay, Ph.D.
Consulting and support for Performance Analysis and Threading


pvonkaenel
New Contributor III

I think that is where the Intel IPP layer approach really shines: you should be able to have multiple versions of a layer and then let the users decide what best fits their needs. I need TBB, so I do not use a threading layer at all - it's sad, but true.

Peter
Mikael_Grev
Beginner

Thank you for your evaluation, David.

I appreciate that someone with good knowledge about threading and parallel computing is looking into this.

As I know you know, the current IPP OpenMP implementation is really only good for a server whose sole purpose is to do processing using IPP. In every other scenario one or more cores will occasionally (or always) be exercised to the level where the algorithm is sub-optimal. Just by saturating 50% of the cores you get half the speed of non-OpenMP code, which should say everything (I know you understand this).

For instance, any client computer is out of bounds for a developer who knows this, since you don't exactly control the cores on a client computer. Everything from the OS to Photoshop may be running in the background, all outside the application's control, since the user selects the apps he runs.

Therefore I suggest you implement TBB as a layer that can be swapped in instead of the current OpenMP one.

If TBB is as good as Fork/Join, this should be trivial. If not, then Fork/Join should be implemented in C++ and used.

Parallel computing is the future (sounds tacky, but it's true) and you will have to do this going forward anyway. It's better to do it now, since that will earn the respect that IPP needs to make it in a multi-core environment.

I could prove this by benchmarking IPP with 1 to X cores saturated and comparing the results to a Fork/Join equivalent and the non-threaded version. The result wouldn't be pretty, as I know you know.

If IPP is intended for the future, please make it easy for developers to get great performance under any load without coding with TBB themselves (which I know quite few developers can do; parallel computing is hard unless you do it frequently).

Cheers,
Mikael
Vladimir_Dudnik
Employee
I have to point out that ippSetNumThreads is not just a one-bit switch to turn IPP internal threading on or off. The function actually sets the number of threads to be launched by IPP. That basically allows you to leave the desired number of cores free for any other background work you have in the system.


Regards,
Vladimir
Mikael_Grev
Beginner

Hello Vladimir,

I don't think you fully understand the problem. This is not about thread numbers or their allocation, but about performance in a system that is non-deterministic with regard to core load. This includes all desktop computers that have an actual user who can do what he wants with the computer. You cannot monitor the user, see what apps he runs and adjust the threads accordingly. Fork/Join and TBB are self-adjusting under these circumstances, where IPP's OpenMP is not.

I would suggest you also view the screencast linked above. It is very clear on the difference.

Cheers,
Mikael
Vladimir_Dudnik
Employee
Mikael,

I do understand that TBB's task-stealing mechanism will try to keep cores equally loaded with TBB tasks. There is a potential benefit when all the applications a user may launch on a computer share the same TBB scheduler. The question is: what if not all applications are based on TBB?

Regards,
Vladimir
Mikael_Grev
Beginner

Vladimir,

Yes, that is a very good question indeed. Something that probably needs investigating before making a decision on this.

Then again, it is hard to interpret the answer. Does a high number of TBB users mean that many feel the IPP OpenMP threading is inadequate and they roll their own layer, or do most use their own threading model to really tune the best out of their application by managing threads manually? Similarly, does low TBB usage mean that developers are ignorant of the TBB model (don't know about it or don't have the time to learn it), don't like it (implementation-wise), or are satisfied with the IPP OpenMP model?

These are complicated questions indeed, but very interesting.

My take on this is that a few really smart people should do the work so that the masses can benefit the most. That is the way to do great business, since it gives a real incentive to buy IPP. You get the most back.

Cheers,
Mikael
gast128
Beginner

Just purchased the Intel book 'Multi-Core Programming'. A paragraph is even devoted to this problem: chapter 11, 'Parallel Program Issues When Using Parallel Libraries'. If I read it correctly, the author recommends disabling threading altogether when using threads of your own. This might not even be just a performance issue; results may be incorrect.

Of course, the most flexible solution would be (the author also mentions this) for parallel libraries to share the same task-dividing framework. In our application we use TBB and sometimes raw Windows threads. The IPP libraries seem to use Intel's private OpenMP. So there is little chance that these libraries communicate with each other about the optimal task division (to prevent oversubscription).

Vladimir_Dudnik
Employee
My comment on this is that it is the developer's responsibility to design software to avoid thread oversubscription. Even when one uses Windows system threading together with TBB, it may also cause oversubscription. On the other hand, everything is under your control: you choose what techniques, libraries and tools to use and in what manner to solve your task. So it is possible to avoid oversubscription problems in cases like those mentioned above (system threading + TBB, or OpenMP threading + TBB). Learn the tools, know their potential and limitations and apply them correctly; that's all you need.

Regards,
Vladimir
gast128
Beginner

Yes, but ideally the libraries would solve this themselves. If there were just one task library on your system, it could do the management. Of course you could always create additional threads and flood them outside this task library, but still it seems a problem that might be solvable (perhaps in the OS? Still, you don't want kernel transitions; they tend to be heavy too). In this way IPP could spawn as many parallel calculations as it wanted to request, and the shared task library would prevent the oversubscription.

Vladimir_Dudnik
Employee
In theory, there is no difference between theory and practice. But, in practice, there is...

In practice, fortunately, there are many operating systems, and each has many implementations of task systems. And because of that you can feel the difference. I do not think there is a chance for a single, universal, unified solution which is equally efficient for everything.

Of course, if a library is flexible enough (like Intel IPP), it is possible to adapt it to several task systems. That is what we demonstrate with the IPP sample applications.

Regards,
Vladimir