Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

performance issue ippiCrossCorrNorm_8u32f_C1R

Loos__Stefan
Beginner
771 Views

Hello

I compared the IPP 7.0.3 call ippiCrossCorrValid_NormLevel_8u32f_C1R with the IPP 8.0.2 call ippiCrossCorrNorm_8u32f_C1R and measured the timing in endless loops (all buffers pre-allocated, 1000x1000 image, 10x10 template).

Results: see attachment.

First trial: the loop contained a sleep(0) call.

The 7.0.3 function turns out to be 4x faster (!) than the new 8.0.2 function, but the CPU load is extreme and would leave no room for other tasks in complex applications.

Second trial: the loop contained a sleep(100) call.

The 7.0.3 function still shows an extremely high CPU load, although my calculated CPU use time is only 5%! The 8.0.2 function works as expected, with little remaining load.

 

Now what should I do? I don't want to use 7.0.3 because it seems buggy: the CPU load stays constantly high even after the call has finished and the thread is sleeping. The 8.0.2 performance, however, is much poorer.

Stefan

6 Replies
Igor_A_Intel
Employee

Hi Stefan,

IPP is built with the Intel compiler and therefore uses Intel OpenMP (from your charts I conclude that you are using the threaded IPP version). The Intel OpenMP runtime distributed with IPP 7.0.x had a bug: the KMP_BLOCKTIME variable had a wrong initial value. This caused 100% CPU load even after all threads had stopped processing, because they did not go to sleep. For 7.0.x you can try setting this variable to 0; this may help.

Regarding the ippiCrossCorrNorm_8u32f_C1R function in 8.0.2: this is a new API, and in that IPP version it had not yet received all the optimizations developed for the deprecated ippiCrossCorrValid_NormLevel_8u32f_C1R. You can try the latest IPP update (8.2.1 or 8.2.2), where this issue is fixed, or, if you want to stay with 8.0.2, you can still use the deprecated ippiCrossCorrValid_NormLevel_8u32f_C1R.

100% CPU load while an IPP function is running is normal: IPP is a performance library and therefore uses all the advantages of the ISA (instruction set architecture) and the hardware. If you need to leave room for a background process when using the threaded IPP library, limit the number of CPUs/hardware threads used by IPP with the ippSetNumThreads() function.

regards, Igor.

Loos__Stefan
Beginner

Hello Igor

Thank you for your detailed explanation.

The 7.0.3 is probably the threaded version (I don't know whether there is any other).

For 8.x, I use the latest 8.2.3. Sorry for the confusion, I used my own, incorrect version numbering.

For 8.2.3 I currently use the non-threaded version. I also called the deprecated ippiCrossCorrValid_NormLevel_8u32f_C1R function in 8.2.3, but it behaved much like ippiCrossCorrNorm_8u32f_C1R, so I suppose it is routed to the latter call internally?

I'll try the threaded 8.2.3.

But for real applications it would be good to control this behaviour dynamically. Sometimes multiple objects use these functions, say 10 different marks in an image, so I could use OpenMP in my application at a higher level. Then the non-threaded version makes sense. At the same time, the application may use this function in a different task for only one object (e.g. a quality check in the image), and there one would like to use the threaded/parallel version in IPP. How would you deal with this? Is it possible to use the threaded version and control the number of threads in IPP, so that IPP does not split the load when I use OpenMP at a higher level in my application, while in the second case IPP is allowed to work on all cores?

Thank you for help

Best regards

Stefan

 

Igor_A_Intel
Employee

Stefan,

we have realized that internal threading doesn't make sense for IPP. There are many reasons for this; here are some of them.

First, external threading at the application level is much more efficient.

Second, if external threading is done with a tool other than the same Intel OpenMP version used by IPP (for example TBB, WinAPI, pthreads, etc.), that tool knows nothing about OpenMP, and OpenMP knows nothing about the tool. You will therefore face thread oversubscription, contention for hardware resources, etc.

Third, when you sequentially call two threaded IPP functions that work on the same data, there is a barrier between the two functions, and there is no guarantee that the mapping between logical threads and hardware threads stays the same. The first piece of data will be processed by the same logical thread #0 in both calls, but in hardware that can be thread #3 for the first function and #1 for the second, so there will be intensive data exchange between the caches of different CPUs.

And there are many other problems. This is why we marked the threaded libraries as deprecated in 8.x.

Regarding your question: you can link with the threaded IPP library and call ippSetNumThreads(1) before the first case and ippSetNumThreads(num cpus/threads in your app) in the second.
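
The choice between the two cases can be sketched as a small helper that picks the thread count to hand to ippSetNumThreads(). This is only an illustration under assumptions: the helper name ipp_threads_for and the even split of hardware threads among application threads are hypothetical and not part of IPP, and the ippSetNumThreads() calls appear only in a comment so that the snippet compiles without the IPP headers.

```c
/* Hypothetical helper, not an IPP function.
 * hw_threads:  hardware threads available to the process
 * app_threads: threads the application itself runs in parallel
 * Returns the number of threads the library may use per call.
 * When app_threads <= hw_threads, app_threads * result never
 * exceeds hw_threads; otherwise the result is 1. */
int ipp_threads_for(int hw_threads, int app_threads)
{
    if (hw_threads < 1)
        hw_threads = 1;
    if (app_threads < 1)
        app_threads = 1;

    int n = hw_threads / app_threads;   /* integer division rounds down */
    return n < 1 ? 1 : n;
}

/* Intended use (needs the threaded IPP library, not compiled here):
 *   ippSetNumThreads(ipp_threads_for(4, 10));   ten marks handled in parallel by the app
 *   ippSetNumThreads(ipp_threads_for(4, 1));    single quality check, IPP may use all cores
 */
```

On a hypothetical 4-core machine this gives 1 library thread when the application already runs 10 threads, and 4 library threads when the application calls IPP from a single thread.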

regards, Igor

Loos__Stefan
Beginner

Hello Igor

I tested the threaded DLL. I found that the new call ippiCrossCorrNorm behaves as in the non-threaded DLL: only one core is busy, the timing is poor, but the load is low.

The old call ippiCrossCorrValid_NormLevel behaves as with the 7.x DLL: it is 4-5x faster, but all cores are highly busy even with a sleep(100) and a theoretical busy time of only 5% (that is, the ratio of the call time to the loop time including the sleep is 5%).
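
To make the 5% figure concrete, here is a minimal sketch of how such a theoretical busy time follows from the call time and the sleep interval. The function name busy_fraction and the sample numbers in the note below are hypothetical, not measurements from this thread.

```c
/* Fraction of one loop iteration spent inside the call:
 *   busy = call_ms / (call_ms + sleep_ms)
 * Returns 0.0 for a degenerate (non-positive) loop time. */
double busy_fraction(double call_ms, double sleep_ms)
{
    double loop_ms = call_ms + sleep_ms;
    if (loop_ms <= 0.0)
        return 0.0;
    return call_ms / loop_ms;
}
```

For example, a 5 ms call followed by a 95 ms sleep gives busy_fraction(5.0, 95.0) = 0.05. If a process monitor shows all cores near 100% under such conditions, the extra load comes from somewhere other than the measured call itself.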

So the bug in OpenMP was apparently never fixed for the threaded DLL? I did find a thread saying it was fixed in 7.1, but I can no longer find a 7.1 evaluation version.

Can you tell me how to set the KMP_BLOCKTIME variable? I can't find much information about it.

Thank you

Best Regards

Stefan

Igor_A_Intel
Employee

Stefan,

please refer to https://software.intel.com/en-us/tags/21185/all

KMP_BLOCKTIME

High CPU usage and Intel® IPP threaded function

The article describes several possible scenarios and solutions for high CPU usage when calling IPP functions in a real application.

For those who still want to keep the IPP built-in threading functionality, you can change the IPP OpenMP thread execution mode by
setting the environment variable KMP_BLOCKTIME=0 system-wide,
or by setting it before running your application:
>set KMP_BLOCKTIME=0
>run.bat
or by calling the OpenMP function kmp_set_blocktime(0).

Note*: The function kmp_set_blocktime() is from the Intel OpenMP* run-time library libiomp*.lib/dll.
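
A further variant is to set the variable from inside the program before the first IPP call. The following is a minimal sketch, assuming that the OpenMP runtime reads KMP_BLOCKTIME from the process environment when it initializes; this is not verified here for every IPP/OpenMP version, and the helper name set_kmp_blocktime_env is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: put KMP_BLOCKTIME=<value> into the process
 * environment. Call it at the very start of main(), before any
 * threaded library function runs.
 * Returns 0 on success, non-zero on failure. */
int set_kmp_blocktime_env(const char *value)
{
#ifdef _WIN32
    return _putenv_s("KMP_BLOCKTIME", value);   /* Microsoft C runtime */
#else
    return setenv("KMP_BLOCKTIME", value, 1);   /* POSIX, overwrite existing value */
#endif
}
```

Calling set_kmp_blocktime_env("0") mirrors the `set KMP_BLOCKTIME=0` command above, but only for the current process. If the runtime has already been initialized by the time of the call, the new value may be ignored, in which case kmp_set_blocktime(0) or the system-wide variable remains the safer route.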

regards, Igor

Loos__Stefan
Beginner

Hello

Setting the variable as a normal Windows system environment variable worked for both the 7.xx and 8.xx threaded 32-bit DLLs.

It turned out that the bug was fixed in the 64-bit threaded DLL in version 8.2, but not in the 32-bit threaded DLL. So for 64-bit, setting the variable is not required.

Thank you for help

Best regards

Stefan
