Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Threading issue with ippiRemap()

hapatrick
Beginner
650 Views

Does anyone know if ippiRemap() is multi-threaded in IPP version 7.0? It is not listed in Documentation/en_US/ipp/ThreadedFunctionsList.txt, but this list of IPP 7.0 bug fixes:

http://software.intel.com/en-us/articles/intel-ipp-70-library-bug-fixes/

at least implies that ippiRemap() is multi-threaded (see DPD200084964).

This comes up because I'm actually seeing that the time to execute ippiRemap() on my data (4MP, 8-bit images) is *increasing* as I increase the number of threads allocated to IPP. If ippiRemap() is not threaded, I wouldn't expect a noticeable change in the runtime with more threads. Even if it is threaded, I wouldn't expect the runtime to *increase* with more threads. The increase is about 5% going from 1 to 2 threads, and about 40% (!) going from 2 to 4 threads on a CPU with four cores (four actual cores, I'm not counting hyperthreading).

Has anyone had a similar problem with ippiRemap()?

1 Solution
PaulF_IntelCorp
Employee
650 Views
Yes, ippiRemap() is multi-threaded. There was a mistake in the documentation that did not include this function in the threaded functions list. I believe that documentation issue will be addressed in the next release.

Regarding the increase in runtime as a result of threading: not all functions or data sets benefit from threading. The OpenMP threading engine adds overhead that can sometimes make things slower, especially if the data set is small. A 4MP image is not very large, so you may in fact be seeing that overhead, or the multiple threads may be causing additional cache traffic. For example, depending on the size of your CPU's cache, a 4MP image could fit within the cache, but multiple threads competing for cache resources might reduce the cache hit rate. This is just one theory; others are also plausible.
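
To make the cache argument concrete, here is a rough sketch in C that estimates how many bytes a single 8-bit, single-channel remap call touches. It assumes the source and destination images have the same dimensions and that the x and y coordinate maps hold one 32-bit float per pixel, as in the 8u variants of ippiRemap. The function names and the 2048x2048 size used below are illustrative assumptions, not part of IPP.

```c
#include <assert.h>
#include <stddef.h>

/* Estimated bytes touched by one 8-bit, single-channel remap call.
 * Assumes source and destination have the same dimensions and that
 * the x and y coordinate maps hold one 32-bit float per pixel. */
static size_t remap_working_set_bytes(size_t width, size_t height)
{
    assert(width > 0 && height > 0);
    size_t pixels = width * height;
    size_t images = 2 * pixels * sizeof(unsigned char); /* src + dst */
    size_t maps   = 2 * pixels * sizeof(float);         /* x map + y map */
    return images + maps;
}

/* Returns 1 if the estimated working set fits in a cache of the given size. */
static int fits_in_cache(size_t width, size_t height, size_t cache_bytes)
{
    return remap_working_set_bytes(width, height) <= cache_bytes;
}
```

For a hypothetical 2048x2048 image, remap_working_set_bytes(2048, 2048) gives 41,943,040 bytes (40 MiB), of which only 8 MiB is image data. Under these assumptions the coordinate maps dominate the working set, so whether "the image fits in cache" depends on more than the image itself.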

Please also look at these articles describing a sleep feature of OpenMP that might also be interfering with your measurements (depending on how you run your timing tests):

http://software.intel.com/en-us/articles/high-cpu-usage-and-intel-ipp-threaded-function/

http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/optaps/common/optaps_par_libs.htm

A few key points from the articles above:

- Intermittently calling threaded IPP functions (e.g., 30 times a second) can cause overall performance to drop.
- After completing execution of a parallel region, threads wait for new parallel work to become available. After a certain period of time has elapsed, they stop waiting and sleep.
- The amount of time to wait before sleeping is set by the KMP_BLOCKTIME environment variable.
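
As a sketch of the last point, the variable can be set from inside the process instead of from the shell. The helper names below are hypothetical. The value "0" asks the Intel OpenMP runtime to put idle worker threads to sleep immediately rather than spin. This sketch assumes the variable has to be in the environment before the OpenMP runtime initializes, so the call would need to happen before the first threaded IPP function is used.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() on POSIX systems */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sets an environment variable for the current process.
 * Returns 0 on success. */
static int set_env_var(const char *name, const char *value)
{
    assert(name != NULL && value != NULL);
#ifdef _WIN32
    return _putenv_s(name, value);
#else
    return setenv(name, value, 1); /* 1 = overwrite an existing value */
#endif
}

/* Returns 1 if the variable is set and equals the expected string. */
static int env_equals(const char *name, const char *expected)
{
    const char *v = getenv(name);
    return v != NULL && strcmp(v, expected) == 0;
}

/* Requests that idle OpenMP worker threads sleep immediately.
 * Assumption: call this before the OpenMP runtime starts up. */
static int request_zero_blocktime(void)
{
    return set_env_var("KMP_BLOCKTIME", "0");
}
```

Calling request_zero_blocktime() at the top of main() and then checking env_equals("KMP_BLOCKTIME", "0") confirms the value is in place. Whether it changes the timings depends on how often the threaded function is called.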

You can control the number of threads by using the ippSetNumThreads() function. It might also be worth linking against the single-threaded static library to see what sort of performance you get there, to use as a reference for one thread.
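
One way to compare thread counts is a small timing harness like the sketch below. It is deliberately independent of IPP: the thread-count setter and the workload are passed in as callbacks, so in real use set_threads would be a thin wrapper around ippSetNumThreads() and work a wrapper around a single ippiRemap call. All names here are hypothetical, and count_call is only a stand-in workload for checking the harness. It uses wall-clock time because CPU time would add up the time spent on every thread.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*set_threads_fn)(int n_threads); /* e.g. wraps ippSetNumThreads() */
typedef void (*work_fn)(void *ctx);            /* e.g. wraps one ippiRemap call */

/* Wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs work() once untimed as a warm-up, then `repeats` timed runs,
 * and returns the fastest run in seconds. */
static double best_seconds(set_threads_fn set_threads, work_fn work,
                           void *ctx, int n_threads, int repeats)
{
    assert(work != NULL && repeats > 0);
    if (set_threads != NULL)
        set_threads(n_threads);

    work(ctx); /* warm-up so one-time start-up cost is not counted */

    double best = 0.0;
    for (int i = 0; i < repeats; ++i) {
        double t0 = now_seconds();
        work(ctx);
        double dt = now_seconds() - t0;
        if (i == 0 || dt < best)
            best = dt;
    }
    return best;
}

/* Stand-in workload: counts how many times it was called. */
static void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

Calling best_seconds() once each for 1, 2, and 4 threads with the same workload gives directly comparable numbers. Taking the minimum of several back-to-back runs reduces noise from other processes, although it will not reproduce the effect of calling the function only intermittently.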

Regards,

Paul


7 Replies
Griffin_Myers
Beginner
650 Views
Hi Paul,

Can you confirm which release(s) of IPP 7 should contain a multi-threaded implementation of ippiRemap()? Looking at the list of bug fixes I see:

IPP v7.0 (12 Aug 2010)
DPD200084964 Multi-thread ippiRemap, ippiDilate3x3, ippiErode3x3, ippiMorphReconstructDilate_32f_C1IR and ippiRotate.

and

IPP v7.0 update 2 (20 Jan 2011)
DPD200084964 ippiResizeSqrPixel (antialiasing mode) and Remap/Rotate/RotateC/WarpAffine/Shear functions are now multi-

(source: http://software.intel.com/en-us/articles/intel-ipp-70-library-bug-fixes/)


This seems to indicate that the same bug was fixed in two different releases, so I'm not clear if the multi-threaded version is in all releases of 7.0 or only in update 2 and later.

I'm currently running update 1a and my timing results do not seem to indicate that ippiRemap() is multi-threaded.
PaulF_IntelCorp
Employee
650 Views
Hello Griffin,

They were threaded in 7.0.2, but I think they were not listed in the ThreadedFunctionsList.txt file until 7.0.4. So 7.0.2 contains the actual fix and 7.0.4 documents (externally) that fact.

Some related functions were threaded in 7.0; I believe that is the reason for the confusion.

Paul
Griffin_Myers
Beginner
650 Views
Hi Paul,

I have updated to 7.0.4 on 32-bit Windows and the ThreadedFunctionsList.txt still does not indicate that ippiRemap() is threaded, nor do any of my tests show that ippiRemap() is using more than one thread.

I have tried ippSetNumThreads() and the various KMP_XXXXX environment variable settings described in the referenced links. I have tried various image sizes, timed my test code, and watched processor utilization in Task Manager, and I never get any indication that more than one thread is used.

Could you please provide me with some sample code that exhibits ippiRemap()'s multithreaded capabilities, or confirm whether multithreading is only expected to work under certain conditions (specific image sizes, etc.)?

As a point of reference, I do observe benefits from many other functions listed in ThreadedFunctionsList.txt, so I know that in general the multithreaded IPP functions work for me.

Thanks,
Griffin
Chao_Y_Intel
Moderator
650 Views
Hi Griffin,

Sorry for the confusion here. The engineer who owns these functions found an error in a header file that prevents the threaded versions of the remap and affine functions from taking effect. The fix is planned for the next update release.

Thanks,
Chao
Griffin_Myers
Beginner
650 Views
Hi Chao, could you please confirm that this issue (i.e. ippiRemap() is not multithreaded as intended) was fixed in 7.0.5? I don't see any mention of it in the bug fix list, so I would like to know that it has been corrected before I upgrade to 7.0.5 and try it out.
igorastakhov
New Contributor II
650 Views

It is threaded only for the 64f data type with Lanczos interpolation. For other parameters, ippiRemap functionality has not been threaded in either 7.0.5 or 7.0.6.

Regards,
Igor
