Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

IPP with multithreaded applications

gast128
Beginner
1,768 Views

Dear all,

we use IPP (5.3.4) within a data acquisition application. Besides some other threads it consists of two 'main' threads:
- a data acquisition thread in which images are converted to object features
- a GUI thread in which post-processing takes place (e.g. file writing, image display)

Both threads make use of IPP, but neither thread uses the CPU at 100%. It seems that IPP uses local parallelism in most functions. This indeed makes the individual calls faster (roughly twice as fast) on dual/multicore machines. However it also gives an extra 20-30% processor load (on a Dell T3400 dual-core machine) compared to disabling the threading in IPP (through 'ippSetNumThreads(1);'). Watching with the Windows performance monitor one can see that the thread context switches increase from an average of 1,000 per second to 200,000 per second.

This effect seems to limit the performance of IPP severely. Is there a recommended strategy to minimize this effect? Can it be circumvented completely? In a test program we ran 3 tests (see below):
- single threaded
- IPP used from 2 threads
- IPP used from 1 thread, with a second thread flooding the CPU completely but not using IPP

In the last 2 options one can see that IPP is actually slower without the 'ippSetNumThreads(1);' call.

thx in advance.

P.s. 1: we checked that IPP 5.3.4 is correctly loaded, e.g. it is using 'ippip8-5.3.dll' on my dual-core machine.
P.s. 2: code:

[cpp]#include <ipp.h>
#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <boost/detail/interlocked.hpp>
#include <crtdbg.h>

#pragma comment(lib, "ipps.lib")
#pragma comment(lib, "ippcore.lib")
#pragma comment(lib, "ippi53.lib")

//forward declarations (definitions below main)
void TestIntlIppiImpl(size_t nMax);
void TestIntlIppiImplFlood(volatile long* pContinue);


int main()
{
   //performance will be half on dual core
   //ippSetNumThreads(1);

   //Ippi:   7.733000              single threaded     
   //Ippi:  14.389000              single threaded, with ippSetNumThreads(1)
   //Ippi:   8.640000 + 8.812000   multi threaded
   //Ippi:   7.296000 + 7.312000   multi threaded, with ippSetNumThreads(1)
   //Ippi:  16.450000              flood threaded
   //Ippi:  14.482000              flood threaded, with ippSetNumThreads(1)
   
   enum IppiThread
   {
       eItSingle,
       eItMulti,
       eItMultiFlood,
   };

   //const size_t nMax = 1000000;
   const size_t nMax = 5000000;
   
   //const IppiThread eIt = eItSingle;
   const IppiThread eIt = eItMulti;
   //const IppiThread eIt = eItMultiFlood;

   switch (eIt)
   {
   case eItSingle:
      {
         TestIntlIppiImpl(nMax);
      }
      break;

   case eItMulti:
      {
         boost::thread_group threads;
         for (int i = 0; i != 2; ++i)
         {
            threads.create_thread(boost::bind(&TestIntlIppiImpl, nMax / 2));
         }
         
         threads.join_all();
      }
      break;

   case eItMultiFlood:
      {   
         long lContinue = 1; 

         boost::thread thread1(&TestIntlIppiImplFlood, &lContinue);
         boost::thread thread2(&TestIntlIppiImpl, nMax);
        
         thread2.join();

         BOOST_INTERLOCKED_EXCHANGE(&lContinue, 0);

         thread1.join();
      }
      break;

   default:
       _ASSERT(false);
       break;
   }

   return 0;
}


//----------------------------------------------------------------------------
// Function TestIntlIppiImpl
//----------------------------------------------------------------------------
// Description  : test ippi impl.
//----------------------------------------------------------------------------
void TestIntlIppiImpl(size_t nMax)
{
    const int nWidth            = 320;
    const int nHeight           = 200;
    int       nStepSizeSource   = 0;
    int       nStepSizeTarget   = 0;
    int       nStepSizeSubtract = 0;

    IppiSize roiSize = {nWidth, nHeight};
    
    Ipp8u* pImageBufferSource   = ippiMalloc_8u_C1(nWidth, nHeight, &nStepSizeSource);
    Ipp8u* pImageBufferTarget   = ippiMalloc_8u_C1(nWidth, nHeight, &nStepSizeTarget);
    Ipp8u* pImageBufferSubtract = ippiMalloc_8u_C1(nWidth, nHeight, &nStepSizeSubtract);
    
    ippiImageJaehne_8u_C1R(pImageBufferSource,   nStepSizeSource,   roiSize); 
    ippiImageJaehne_8u_C1R(pImageBufferTarget,   nStepSizeTarget,   roiSize); 
    ippiImageJaehne_8u_C1R(pImageBufferSubtract, nStepSizeSubtract, roiSize); 

    for (size_t n = 0; n != nMax; ++n)
    {
        ippiSub_8u_C1RSfs(pImageBufferSubtract, nStepSizeSubtract, pImageBufferSource, nStepSizeSource, pImageBufferTarget, nStepSizeTarget, roiSize, 1);
    }
    
    ippiFree(pImageBufferSubtract);
    ippiFree(pImageBufferTarget);
    ippiFree(pImageBufferSource);
}


//----------------------------------------------------------------------------
// Function TestIntlIppiImplFlood
//----------------------------------------------------------------------------
// Description  : flood cpu
//----------------------------------------------------------------------------
void TestIntlIppiImplFlood(volatile long* pContinue)
{
    //do not use synchronisation like condition variables,
    //because they relinquish the processor
    //('volatile' keeps the compiler from hoisting the flag read out of the loop)

    for (;;)
    {
        if (!(*pContinue))
        {
            break;
        }
    }
}
[/cpp]
0 Kudos
37 Replies
Vladimir_Dudnik
Employee
1,150 Views
There is no magic. If your system has two cores then it can run only two threads simultaneously. If the number of active threads (those that load the CPU) in your application is greater than the number of physically available cores, some threads will wait for their time slice, and this will lower overall application performance. In this case we recommend using the single-threaded IPP libraries or disabling threading in the multithreaded IPP libraries.

On systems with a larger number of cores there is an opportunity to balance: on a 4- or 8-core system, for example, you may allow IPP two threads and use the remaining cores for your application's needs.

You just need to avoid thread oversubscription situations.
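
As a minimal sketch of that balancing idea (assuming the threaded IPP 5.x libraries; ippGetNumThreads/ippSetNumThreads are the knobs, while the helper name and the reserved-core count are only illustrative):

[cpp]#include <ipp.h>

//reserve some cores for the application's own threads and give IPP the rest
void ConfigureIppThreads(int nCoresReservedForApp)
{
    int nDefault = 0;
    ippGetNumThreads(&nDefault);        //by default IPP uses one thread per core

    int nForIpp = nDefault - nCoresReservedForApp;
    if (nForIpp < 1)
        nForIpp = 1;                    //never below one: IPP calls then run serially

    ippSetNumThreads(nForIpp);
}[/cpp]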

Regards,
Vladimir
0 Kudos
Mikael_Grev
Beginner
1,150 Views
Hello,

It looks like IPP is using a quite simple threading algorithm, basically dividing the workload onto a number of threads. I may be wrong, but the evidence above points to this.

There are many techniques today that divide workloads in such a way that it doesn't matter if one or more cores are already saturated. They also handle the problem of unequal workloads for every "work unit" (e.g. work stealing).

Does Intel have any plans to leverage one of these more modern algorithms, or will you stick with the simplistic forking used now for the near future?

IMHO this is key to good performance. I have myself turned off threading in IPP since it is too easy to disrupt unless you lock out everything except the IPP workloads.

Here's a good read for one of these algorithms (even though this one is for Java it is very informative): http://gee.cs.oswego.edu/dl/papers/fj.pdf

Cheers,
Mikael Grev
0 Kudos
pvonkaenel
New Contributor III
1,150 Views
Quoting - mikaelgrev

Internally, IPP uses OpenMP. I would recommend turning off the internal threading in favor of devising your own threading - I use TBB to thread my IPP code, but manually specifying OpenMP threading works well too. I've found that this approach, while requiring more coding, leaves me in total control of the parallel aspects of my application.
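
As an illustration of what I mean, here is a rough sketch (the function name is made up, and it assumes the non-threaded IPP libraries or ippSetNumThreads(1) so the two threading layers don't fight): split the ROI into horizontal strips and run the plain IPP call on each strip from an OpenMP parallel region.

[cpp]#include <ipp.h>
#include <omp.h>

//subtract pSrc1 from pSrc2 into pDst, one horizontal strip per OpenMP thread
void SubImageParallel(const Ipp8u* pSrc1, int nSrcStep1,
                      const Ipp8u* pSrc2, int nSrcStep2,
                      Ipp8u* pDst, int nDstStep, IppiSize roi, int nScaleFactor)
{
    #pragma omp parallel
    {
        const int nThreads = omp_get_num_threads();
        const int nThread  = omp_get_thread_num();
        const int nChunk   = (roi.height + nThreads - 1) / nThreads;
        const int nY       = nThread * nChunk;
        const int nRows    = (nY + nChunk <= roi.height) ? nChunk : roi.height - nY;

        if (nRows > 0)
        {
            IppiSize strip = { roi.width, nRows };
            ippiSub_8u_C1RSfs(pSrc1 + nY * nSrcStep1, nSrcStep1,
                              pSrc2 + nY * nSrcStep2, nSrcStep2,
                              pDst  + nY * nDstStep,  nDstStep,
                              strip, nScaleFactor);
        }
    }
}[/cpp]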

Peter
0 Kudos
pvonkaenel
New Contributor III
1,150 Views
Quoting - pvonkaenel


I forgot to mention that only about 20%-30% of IPP is threaded, not most of it as you mentioned. The library includes a ThreadedFunctionList.txt which lists exactly which functions are threaded.
0 Kudos
gast128
Beginner
1,150 Views

Quoting - Vladimir Dudnik (Intel)



Thx.

The problem is of course in the word 'oversubscription': sometimes the other thread is busy and the cores get oversubscribed, but sometimes that thread is waiting, and then disabling the extra threads in IPP gives a performance penalty.

0 Kudos
Mikael_Grev
Beginner
1,150 Views
Well, OpenMP is AFAIK more an easy way to fork and barrier threads than an efficient high-level algorithm to divide tasks. How to divide tasks properly is quite complex. Please read the paper I linked to for more info.

IMO, the simple divide algorithm that is used by IPP (again, as I understand it from reading other threads) is only appropriate for very simple tasks where IPP gets all the attention. It is not suitable in a larger system where threads are used extensively and unevenly, like on a desktop system where the user can run many applications, which have threads you can't control. With an algorithm like Fork/Join or Cilk, IPP threading would be usable in a lot more use cases.

Cheers,
Mikael Grev
0 Kudos
pvonkaenel
New Contributor III
1,150 Views
Quoting - mikaelgrev

This is why I mentioned TBB - from briefly skimming the paper you reference, it sounds like several of the Java fork/join framework facilities are implemented in TBB. This is what I use for threading my IPP-based application, and I've gotten fairly good results with it. As for applications that have non-TBB threads, you do need to manage oversubscription yourself, and I have no idea what to do when other applications in their own address space are also using system resources.
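
For reference, a sketch of the TBB variant of the same strip-splitting idea (the struct and function names are made up; it again assumes the non-threaded IPP libraries underneath): tbb::parallel_for carves the rows into chunks and its work-stealing scheduler keeps whichever cores happen to be free busy.

[cpp]#include <ipp.h>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
//note: older TBB versions also need a tbb::task_scheduler_init object alive in the thread

//functor: subtract one image strip from another with the plain IPP call
struct SubStrips
{
    const Ipp8u* pSrc1; int nSrcStep1;
    const Ipp8u* pSrc2; int nSrcStep2;
    Ipp8u*       pDst;  int nDstStep;
    int          nWidth;
    int          nScaleFactor;

    void operator()(const tbb::blocked_range<int>& r) const
    {
        IppiSize strip = { nWidth, r.end() - r.begin() };
        ippiSub_8u_C1RSfs(pSrc1 + r.begin() * nSrcStep1, nSrcStep1,
                          pSrc2 + r.begin() * nSrcStep2, nSrcStep2,
                          pDst  + r.begin() * nDstStep,  nDstStep,
                          strip, nScaleFactor);
    }
};

void SubImageTbb(const Ipp8u* pSrc1, int nSrcStep1,
                 const Ipp8u* pSrc2, int nSrcStep2,
                 Ipp8u* pDst, int nDstStep, IppiSize roi, int nScaleFactor)
{
    SubStrips body = { pSrc1, nSrcStep1, pSrc2, nSrcStep2,
                       pDst, nDstStep, roi.width, nScaleFactor };
    tbb::parallel_for(tbb::blocked_range<int>(0, roi.height), body);
}[/cpp]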

Let me know if you have suggestions.

Peter
0 Kudos
Mikael_Grev
Beginner
1,150 Views
Peter,

Yes, TBB seems like a similar thing.

Though I am wondering if the IPP libraries will use TBB themselves to get better performance under different circumstances. If I use TBB I guess I have to manage the threading myself. I could do that, but then it would be better if IPP were TBB-ized.

Cheers,
Mikael

Btw, I found some online video if anyone is interested. It's a good talk by Brian Goetz: http://www.infoq.com/presentations/brian-goetz-concurrent-parallel
0 Kudos
pvonkaenel
New Contributor III
1,150 Views
Quoting - mikaelgrev

I agree that TBB IPP would be nice, but I doubt it would happen since OpenMP is a compiler technology built into the Intel compiler, while TBB is a 3rd party library (Intel is the 3rd party, but still a 3rd party). I guess it doesn't hurt to ask though. In theory, I guess they could just add a second threading layer that uses TBB instead of OpenMP and allow the users to select the threading model that fits the rest of their system. I would be very interested in that, and it should not interfere with others who are already locked into the OpenMP threading layer.

Peter
0 Kudos
Mikael_Grev
Beginner
1,150 Views
Then we concur Peter!

Vladimir, what do you say, any chance?

Cheers,
Mikael
0 Kudos
Vladimir_Dudnik
Employee
1,150 Views

Mikael,

I did not get your point. What sense do you see in all those smart and self-balancing threading algorithms you mention when we talk about IPP functions? Let's consider the ippiAdd_8u_C1R function, just for example. I do not see a more efficient way to parallelize such a simple workload than OpenMP. And some people find it useful.
But when we consider more complicated things, something like what was mentioned in the beginning of this thread, data acquisition and analysis in different parts of a parallel application, then I would completely agree that smarter threading approaches should be used to better balance system performance. But that job is not the charter of IPP functions, is it?

Regards,
Vladimir
0 Kudos
Rob_Ottenhoff
New Contributor I
1,150 Views

Hi Vladimir,
Well, that is an easy way out! First you advertise the parallelism of IPP, but when it is used in a real application like the one above you say: 'Ah, that's not smart, just turn it off and do it yourself!' I think the suggestion from Peter and Mikael to enable the use of TBB (which is an Intel product, so why not?) makes sense; that way your customers have more choices than just on or off.

Regards,
Rob (btw I am a colleague of gast128)
0 Kudos
Mikael_Grev
Beginner
1,150 Views
Quoting - Vladimir Dudnik (Intel)


Vladimir,

The Fork/Join algorithm is the same as the generic, very old and simple divide-and-conquer algorithm (and related to Map/Reduce). It is not in any way advanced and it has an extremely small overhead. Yet it is, at least in the Fork/Join implementation, self-balancing in a couple of ways. It is almost as simple as the OpenMP way of just dividing the workload between threads. The specific smartness of Doug Lea's Fork/Join is that it works almost without thread locking and barriers, which increases performance a lot. Btw, that framework will be included in Java 7.

If you haven't already, I really think you should look at the Goetz screencast linked above. It is very informative and not that advanced. Further discussion on the subject almost demands this knowledge.

IPP threading is simply not usable in a typical desktop scenario where you aren't in control of all the cores. It is, however, very optimized towards benchmarks. When I read the threads on this forum I also get the feeling that people don't use IPP's OpenMP threading and instead use their own threading algorithms.

I suggest that you add an internal benchmarking set where say 30% of the cores are busy when you do the benchmark. That way you can see the benefit that Fork/Join would give.

Even simple operations like memcpy, memset and ippiAdd_8u_C1R benefit from Fork/Join compared to just dividing the array into equal parts.
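
To make that concrete, here is a rough sketch of the recursive Fork/Join idea applied to an IPP call (the names and the grain size are made up; it assumes a TBB version that provides tbb::parallel_invoke, boost::bind as already used in the test program above, and the non-threaded IPP libraries underneath):

[cpp]#include <ipp.h>
#include <boost/bind.hpp>
#include <tbb/parallel_invoke.h>

//in-place add of pSrc onto pSrcDst, split recursively until the strips are small
void AddRowsForkJoin(const Ipp8u* pSrc, int nSrcStep,
                     Ipp8u* pSrcDst, int nSrcDstStep,
                     int nWidth, int nYBegin, int nYEnd)
{
    const int nGrainRows = 32;                       //stop splitting below this many rows
    if (nYEnd - nYBegin <= nGrainRows)
    {
        IppiSize strip = { nWidth, nYEnd - nYBegin };
        ippiAdd_8u_C1IRSfs(pSrc + nYBegin * nSrcStep, nSrcStep,
                           pSrcDst + nYBegin * nSrcDstStep, nSrcDstStep,
                           strip, 0);                //leaf: a plain single-threaded IPP call
        return;
    }

    const int nYMid = nYBegin + (nYEnd - nYBegin) / 2;
    tbb::parallel_invoke(                            //fork: an idle core may steal either half
        boost::bind(&AddRowsForkJoin, pSrc, nSrcStep, pSrcDst, nSrcDstStep, nWidth, nYBegin, nYMid),
        boost::bind(&AddRowsForkJoin, pSrc, nSrcStep, pSrcDst, nSrcDstStep, nWidth, nYMid, nYEnd));
}[/cpp]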

I would suggest that you at least do a study on how much you would gain.

Performance is really tricky, especially so when you mix in parallelism. But trust me, Fork/Join is the way to go even for simple tasks, at least when you need to play nice with other processes (which you always have to do outside the lab :)

Also, as a testament to Fork/Join not being heavyweight, the code is only some 800 lines in Java.

Cheers,
Mikael Grev
MiG InfoCom AB
0 Kudos
Vladimir_Dudnik
Employee
1,150 Views

Hi Rob and Mikael,

yes, IPP uses simple threading, which has been proven to work for some applications. And yes, you may want to use another threading model in your application (fortunately that is possible).

I did read the article you pointed out and generally agree with what it states. It is just not directly applicable to IPP, because there is still no single 'ideal' threading approach which works perfectly for everyone's needs. BTW, Intel TBB uses a similar job-stealing approach, I believe. We also provide the Deferred Mode Image Processing (DMIP) layer as part of the IPP samples package. DMIP is a pipelining and threading layer built on top of IPP that combines the performance of the IPP kernels with threading above the IPP level, chaining a computational task into a sequence of calls to IPP and subdividing the task into smaller pieces in such a way that the data being processed stay hot in the CPU's L2 cache (especially when they are reused between IPP calls).

I do not dispute that current multi-core and future many-core architectures create a demand for parallel frameworks, or even languages, which would simplify programming of such complex systems.

Regards,
Vladimir
0 Kudos
pvonkaenel
New Contributor III
1,150 Views
Hi Vladimir,

If I understand the IPP architecture correctly, threading is a layer on top of the processing, correct? How difficult would it be to provide alternate threading-layer DLLs? Currently there are _t.dll files. Could TBB threading-layer DLLs be introduced as _tbb.dll in addition to the existing _t.dll files so that users can choose? I have already discovered several advantages of TBB over OpenMP, and have moved away from the _t.dll layer because of it - I don't want two threading systems oversubscribing the system. I would like to suggest the same thing for the UMC audio-video-codecs samples.

Thanks for your consideration,
Peter
0 Kudos
Vladimir_Dudnik
Employee
1,150 Views

Hi Peter,

the problem with TBB would be the same as for OpenMP. Not every application uses the TBB threading API. It is just not possible to provide a threaded IPP for every threading API. Instead we recommend using IPP threading when it makes sense (I've heard a lot of positive feedback on it) and using the non-threaded IPP libraries when you want full control over threading in your application (in that case you can choose whatever API you like).

Regards,
Vladimir
0 Kudos
Mikael_Grev
Beginner
1,150 Views
The core of the argumentation is:
With OpenMP in IPP the processing will take longer than the version without threading if not all cores are free during the call. Fork/Join (and possibly TBB) will never take longer than a single-threaded version if at least one core is free. That is the difference. Thus, Fork/Join is much more compatible with the surrounding environment and you are much less likely to have to turn it off.

Btw, I have just now confirmed this with a simple test case where I saturated one of two cores and ran the JPEG codec with and without OpenMP. (I also tested without the saturation, and then I see both cores fill to 90% and times improve by about 70%, so OpenMP is working.)

Cheers,
Mikael
0 Kudos
Vladimir_Dudnik
Employee
1,150 Views
Right, OpenMP is working. And I think the more interesting case for consideration is a 4- or 8- (or 24-) core system where you direct IPP to use, for example, only 2 threads and then do whatever you want at the application level (also keeping the oversubscription issue in mind). Although this brings a need for thread affinity functionality, which is not currently available in IPP but is being considered for addition in future releases.

Vladimir
0 Kudos
Mikael_Grev
Beginner
1,026 Views

Vladimir,

I think we are locked into our positions. You misread my last post and I don't think I can explain the problem well enough. Threading and the problems around it are a hard topic. Sorry. I will instead post something via premium support, or try to get hold of someone within Intel who is a parallel expert.

Thanks for trying.

Thanks,
Mikael
0 Kudos
Reply