Using OpenMP on unthreaded IPP routines

pvonkaenel · ‎02-24-2009

I started using the threaded version of the IPP libraries and saw a gain with some of the conversion routines, such as ippiCbYCr422ToYCbCr420_Interlace_8u_C2P3R(). However, not all conversion routines are threaded, so I thought I'd get fancy by putting a conversion routine in a loop and specifying OpenMP pragmas to get the same kind of gain I've seen with internally threaded IPP routines. Unfortunately, it looks like putting IPP routines within OpenMP threads slows things down.

Has anyone else noticed this, and is there a way to use OpenMP with non-threaded IPP routines?

Thanks,
Peter

Vladimir_Dudnik · ‎02-24-2009

Hi Peter,

Threading on top of IPP is one of the correct ways of using IPP, and for some applications the most preferable one.

While the threaded IPP libraries do not require significant effort from the user to quickly get at least some performance advantage on a multi-core system, their lack of knowledge of the whole task (which may consist of several IPP calls) may lead to incomplete or less efficient utilization of the available processor resources.

On the other hand, when you develop a complex task you have a complete understanding of the interdependencies between its blocks, the amount of data processed, and what the whole processing pipeline looks like. You can convert that knowledge into a task-specific threading model which definitely will utilize processor resources better than just sequential calls of threaded IPP functions.

If you take a look at the IPP sample package, you will find that this is how we implemented the IPP codecs (for example, the JPEG codec).

Of course, developing the threading model most efficient for your processing task might be a complex job which requires some knowledge of threading APIs, experience in parallel programming, and so on.

In the IPP 6.0 sample package we also offer a new high-level library which is built on top of IPP and is targeted at simplifying the creation of threaded processing pipelines for image processing tasks.

Regards,
Vladimir

pvonkaenel · ‎02-25-2009

Thank you for your detailed note. Just to make sure I get it: I should either use the threading layer in IPP and not try to use OpenMP around other IPP routines, or use the sequential layer and use OpenMP to my heart's content. Is this correct? Is it best not to mix IPP threading with my own OpenMP constructs?

Thanks again,
Peter

pvonkaenel · ‎02-25-2009

OK, I've run some tests, and I think I'm missing something very basic here. Consider this simple block of code, where the image I'm working with is 1920x1080:

Ipp8u *srcPix = static_cast<Ipp8u*>(srcImg.getPixels());
Ipp8u *dstPix = static_cast<Ipp8u*>(dstImg.getPixels());
IppiSize sz;
sz.width = srcImg.getWidth();
sz.height = srcImg.getHeight();
I32 srcStep = srcImg.getStep();
I32 dstStep = dstImg.getStep();
ippiCbYCr422ToYCbCr422_8u_C2R(srcPix, srcStep, dstPix, dstStep, sz);

Let's convert it to a very simple threaded version like the following:

Ipp8u *srcPix = static_cast<Ipp8u*>(srcImg.getPixels());
Ipp8u *dstPix = static_cast<Ipp8u*>(dstImg.getPixels());
IppiSize sz;
sz.width = srcImg.getWidth();
sz.height = 1; // ** The height is now 1 **
I32 srcStep = srcImg.getStep();
I32 dstStep = dstImg.getStep();
#pragma omp parallel for private(srcPix, dstPix)
for (I32 i = 0; i < srcImg.getHeight(); i++) {
    srcPix = static_cast<Ipp8u*>(srcImg.getPixel(i, 0));
    dstPix = static_cast<Ipp8u*>(dstImg.getPixel(i, 0));
    ippiCbYCr422ToYCbCr422_8u_C2R(srcPix, srcStep, dstPix, dstStep, sz);
}

Note that I have the above in a large loop to average out thread creation in the first pass. If I build the above without /openmp, the run-time is slightly longer than the original code. As soon as I compile with /openmp and link with libiomp5md.lib, the run-time more than doubles.

Please let me know what my dumb error is.

Thanks,
Peter

Vladimir_Dudnik · ‎02-25-2009

Peter, you may mix IPP threading with application threading; it is just a bit more complicated to keep the system balanced this way.

For your sample, please note that threading introduces some overhead, so it only makes sense when the time each thread spends processing its piece of data outweighs the overhead of thread creation/initialisation.

I would recommend processing the image in slices of, say, 64 to 128 rows.

Vladimir
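
The slice approach suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: convertRegion() is a hypothetical stand-in for an unthreaded IPP call such as ippiCbYCr422ToYCbCr422_8u_C2R (here it just swaps the two bytes of each pixel), and convertInSlices() and the default of 64 rows per slice are made-up names and values. The point is that each OpenMP iteration handles a block of rows instead of a single row, so the per-iteration overhead is spread over more work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an unthreaded IPP conversion call.
// It swaps the two bytes of every pixel over `height` rows.
// As in IPP, the steps are row strides in bytes.
void convertRegion(const std::uint8_t* src, int srcStep,
                   std::uint8_t* dst, int dstStep,
                   int width, int height)
{
    for (int y = 0; y < height; ++y) {
        const std::uint8_t* s = src + static_cast<std::ptrdiff_t>(y) * srcStep;
        std::uint8_t* d = dst + static_cast<std::ptrdiff_t>(y) * dstStep;
        for (int x = 0; x < width * 2; x += 2) {
            d[x] = s[x + 1];
            d[x + 1] = s[x];
        }
    }
}

// Splits the image into horizontal slices of `sliceRows` rows and
// converts each slice in its own OpenMP loop iteration.
void convertInSlices(const std::uint8_t* src, int srcStep,
                     std::uint8_t* dst, int dstStep,
                     int width, int height, int sliceRows = 64)
{
    assert(sliceRows > 0);
    const int numSlices = (height + sliceRows - 1) / sliceRows;

    #pragma omp parallel for
    for (int s = 0; s < numSlices; ++s) {
        const int firstRow = s * sliceRows;
        // The last slice may contain fewer rows than sliceRows.
        const int rows = std::min(sliceRows, height - firstRow);
        convertRegion(src + static_cast<std::ptrdiff_t>(firstRow) * srcStep, srcStep,
                      dst + static_cast<std::ptrdiff_t>(firstRow) * dstStep, dstStep,
                      width, rows);
    }
}
```

For a 1920x1080 image this gives 17 slices of up to 64 rows each. The pragma only takes effect when OpenMP is enabled at compile time (for example /openmp with MSVC or -fopenmp with GCC); otherwise the loop simply runs sequentially.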

pvonkaenel · ‎02-26-2009

Thanks for your help, Vladimir. I'll try out the slices. I also tracked down some of the slowdown problem. My timing test is built into a UnitTest application which:

1. Allocates an image
2. Performs the timing test and validates results
3. Deallocates the image
4. Updates to the next image size and starts again at step 1

It looks like if the image memory is deallocated the slowdown happens. I guess this is because of cache misses in the next pass where all cores fight for bandwidth to the memory (?). If I allocate the memory before my test loop and keep using different size portions of it, it works fine and I get a nice speedup.

While testing I did notice something strange: the ippiYCbCr422ToCbYCr422_8u_C2R() conversion is much faster than ippiCbYCr422ToYCbCr422_8u_C2R(), and no matter how I try to thread it, the sequential version always runs faster. Well, no matter, the sequential version seems fast enough in comparison to the other converters.

Thanks again,
Peter
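
The allocate-once arrangement Peter describes in this post (one buffer allocated before the test loop, with differently sized portions of it reused for each image size) might look something like the sketch below. Everything here is an assumption made for illustration: ImageSize, process() and timeAllSizes() are hypothetical names, process() is a trivial byte-swapping workload rather than an IPP call, and a packed 2-bytes-per-pixel layout is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ImageSize { int width; int height; };

// Hypothetical workload standing in for the conversion being timed:
// swaps the two bytes of every pixel in place.
void process(std::uint8_t* pixels, int step, ImageSize sz)
{
    for (int y = 0; y < sz.height; ++y) {
        std::uint8_t* row = pixels + static_cast<std::ptrdiff_t>(y) * step;
        for (int x = 0; x < sz.width * 2; x += 2)
            std::swap(row[x], row[x + 1]);
    }
}

// Returns the average time in milliseconds per call for each image size.
// A single buffer, big enough for the largest image, is allocated once
// and reused, instead of allocating and freeing one image per size.
std::vector<double> timeAllSizes(const std::vector<ImageSize>& sizes, int repeats)
{
    assert(repeats > 0);
    std::size_t maxBytes = 0;
    for (const ImageSize& sz : sizes)
        maxBytes = std::max(maxBytes,
                            static_cast<std::size_t>(sz.width) * 2 * sz.height);

    std::vector<std::uint8_t> buffer(maxBytes);  // allocated once, before timing

    std::vector<double> avgMs;
    for (const ImageSize& sz : sizes) {
        const int step = sz.width * 2;  // row stride in bytes
        const auto t0 = std::chrono::steady_clock::now();
        for (int r = 0; r < repeats; ++r)
            process(buffer.data(), step, sz);
        const auto t1 = std::chrono::steady_clock::now();
        avgMs.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count() / repeats);
    }
    return avgMs;
}
```

A call such as timeAllSizes({{640, 480}, {1920, 1080}}, 100) would return one average per size. Keeping the allocation outside the timed region means the measurements reflect the conversion itself rather than whatever the allocator and the memory system do with freshly allocated pages.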