Multi-threading support missing for ippiCrossCorrNorm_32f_C1R (Version 2019 Update 5)

Meng__Kevin · ‎02-04-2020

I recently made a very late upgrade from IPP 6.1 to IPP 2019.5.281. I found that the cross correlation API has gotten significantly slower compared to IPP 6.1, from 2-3ms to 5-6ms per run. I checked ThreadedFunctionsList.txt for IPP version 2019.5.281, and it appears that the cross correlation API no longer has multi-threading support. This is not a matter of not having the threaded libraries installed; I have tested both the single-threaded and threaded libraries. Threading actually makes the API slower, at 8-9ms.

Has internal multi-threading support really been removed from the cross correlation API? If so, what is the justification? Cross-correlation is a very widely used function, so it seems like an odd decision to make.

Gennady_F_Intel · ‎02-07-2020

Kevin,

Could you give us the input parameters of ippiCrossCorrNorm_32f_C1R? Specifically, we need to know the typical srcRoiSize, dstRoiSize and algType.

thanks

Meng__Kevin · ‎02-07-2020

Hello Gennady,

The algType used is the following: (IppEnum)(ippAlgAuto | ippiROISame | ippiNormCoefficient);

The srcRoiSize used in this use case is always width 498, height 498.

There is no dstRoiSize parameter for this function, but there is a tplRoiSize, which in this use case is width 15, height 15.

The same parameters are being used for the IPP 6.1 equivalent function, ippiCrossCorrSame_NormLevel_32f_C1R, although in IPP 6.1 there is no algType parameter since that appears to be hardcoded inside the API.

Please let me know if the above is sufficient information to debug, or if more information is needed. Thanks.
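For readers unfamiliar with what the ippiROISame | ippiNormCoefficient combination computes, here is a minimal plain C++ sketch of a normalized correlation coefficient with a same-size output. It is not IPP's implementation. The `Image` type and `nccRows` function are hypothetical names, and two details are assumptions: pixels outside the source are read as zero, and the template is anchored at its centre.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Plain row-major float image. Hypothetical helper type, not an IPP type.
struct Image {
    int width = 0;
    int height = 0;
    std::vector<float> data;

    Image(int w, int h)
        : width(w), height(h), data(static_cast<std::size_t>(w) * h, 0.0f) {}

    // Assumption: pixels outside the image read as zero.
    float at(int x, int y) const {
        if (x < 0 || y < 0 || x >= width || y >= height) return 0.0f;
        return data[static_cast<std::size_t>(y) * width + x];
    }

    float& ref(int x, int y) {
        return data[static_cast<std::size_t>(y) * width + x];
    }
};

// Fills rows [rowBegin, rowEnd) of dst with the correlation coefficient between
// the zero-mean template and the zero-mean source window centred on each pixel.
void nccRows(const Image& src, const Image& tpl, Image& dst,
             int rowBegin, int rowEnd) {
    assert(dst.width == src.width && dst.height == src.height);
    assert(0 <= rowBegin && rowBegin <= rowEnd && rowEnd <= src.height);
    const int n = tpl.width * tpl.height;
    assert(n > 0);

    // Template mean and sum of squared deviations are computed once.
    double tplMean = 0.0;
    for (float v : tpl.data) tplMean += v;
    tplMean /= n;
    double tplVar = 0.0;
    for (float v : tpl.data) tplVar += (v - tplMean) * (v - tplMean);

    const int ax = tpl.width / 2;   // assumed anchor: template centre
    const int ay = tpl.height / 2;

    for (int y = rowBegin; y < rowEnd; ++y) {
        for (int x = 0; x < src.width; ++x) {
            double sum = 0.0, sumSq = 0.0, cross = 0.0;
            for (int j = 0; j < tpl.height; ++j) {
                for (int i = 0; i < tpl.width; ++i) {
                    const double s = src.at(x + i - ax, y + j - ay);
                    const double t =
                        tpl.data[static_cast<std::size_t>(j) * tpl.width + i] - tplMean;
                    sum += s;
                    sumSq += s * s;
                    // The zero-mean template sums to zero, so subtracting the
                    // window mean from s would not change this cross term.
                    cross += s * t;
                }
            }
            const double winVar = std::max(0.0, sumSq - sum * sum / n);
            const double denom = std::sqrt(winVar * tplVar);
            dst.ref(x, y) = denom > 1e-12 ? static_cast<float>(cross / denom) : 0.0f;
        }
    }
}
```

Calling `nccRows(src, tpl, dst, 0, src.height)` fills the whole output. With the sizes above (498x498 source, 15x15 template), this direct form does 225 multiply-adds per output pixel, which is the kind of workload where threading was expected to help.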

Meng__Kevin · ‎02-07-2020

Also, to clarify the runtime results I was getting from testing the cross correlation API in IPP 6.1 and 2019 Update 5:

Using single thread, IPP 6.1 and 2019 Update 5 run at the same speed of 5-6ms.

When multi-threading, in this case using 4 threads, IPP 6.1 takes 2-3ms, and 2019 Update 5 takes 8-9ms.

Gennady_F_Intel · ‎02-10-2020

Thanks, Kevin.

I have learned from the IPP experts that the internal OpenMP threading was removed from these functions as of the IPP 9.0 legacy version. You could therefore try the legacy90packages, or submit a feature request to add an ippTL implementation for ippiCrossCorrNorm.
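Until such an implementation exists, one workaround is to thread the call from the outside by splitting the output into bands of rows. The following is a minimal sketch of that idea, not an IPP facility: `RowBand`, `splitRows` and `forEachBand` are hypothetical helpers, and the `work` callback stands in for whatever single-threaded correlation call processes one band.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: a contiguous band of output rows [begin, end).
struct RowBand {
    int begin;
    int end;
};

// Splits `rows` output rows into at most `parts` non-empty bands whose sizes
// differ by at most one row.
std::vector<RowBand> splitRows(int rows, int parts) {
    assert(rows >= 0 && parts >= 1);
    parts = std::min(parts, std::max(rows, 1));
    std::vector<RowBand> bands;
    int begin = 0;
    for (int p = 0; p < parts; ++p) {
        const int end = begin + rows / parts + (p < rows % parts ? 1 : 0);
        if (end > begin) bands.push_back({begin, end});
        begin = end;
    }
    return bands;
}

// Runs work(band) for every band on its own std::thread and waits for all of
// them. The bands are disjoint, so each thread may write its own output rows
// without locking.
void forEachBand(const std::vector<RowBand>& bands,
                 const std::function<void(RowBand)>& work) {
    std::vector<std::thread> pool;
    pool.reserve(bands.size());
    for (const RowBand& band : bands) pool.emplace_back(work, band);
    for (std::thread& t : pool) t.join();
}
```

For the 498-row case with 4 threads, `splitRows(498, 4)` yields bands of 125, 125, 124 and 124 rows. Each band also needs to read tplHeight / 2 source rows above and below it (7 rows for a 15x15 template), so the bands overlap on input but not on output. Whether a per-band call into IPP reproduces the full-image border handling at the band seams is something to verify against the single-call result; this thread does not confirm it.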

Adriaan_van_Os · ‎02-11-2020

How is a tiled implementation ever possible for fast normalized cross-correlation?

Regards,

Adriaan van Os

Meng__Kevin · ‎02-13-2020

Gennady F. (Blackbelt) wrote:
Thanks, Kevin.
I have learned from the IPP experts that the internal OpenMP threading was removed from these functions as of the IPP 9.0 legacy version. You could therefore try the legacy90packages, or submit a feature request to add an ippTL implementation for ippiCrossCorrNorm.

Thank you for the response, Gennady.

Can you elaborate on the reasoning behind Intel's choice to discontinue OpenMP threading support in the cross correlation function? We have a specific application that requires it to run fast in a linear sequence.

How can I go about submitting a feature request? And could that request be addressed in this version of IPP (2019 Update 5), or would it be scoped for a later release?

Gennady_F_Intel · ‎02-13-2020

Kevin, please go to the Intel Online Service Center, which is the official support channel, and submit the feature request. If the feature is re-implemented, it would probably go into a future version of IPP. The latest version is 2020.

Adriaan_van_Os · ‎02-14-2020

The removal of multi-threading support in IPP is a Never Ending Soap Story. I wonder why Intel sells multi-core processors .....

In Apple's vImage framework, you simply pass kvImageDoNotTile https://developer.apple.com/documentation/accelerate/1578976-processing_flags/kvimagedonottile?language=objc as a flag if you don't want internal multi-threading.

Sincerely,

Adriaan van Os

sloos1 · ‎10-07-2021

I also found the threaded ipp call ippiCrossCorrNorm_8u32f_C1R_T being *sometimes* slower than the call to the non-parallel version.

A colleague who has been watching the core load thinks that IPP sometimes does not use multiple cores. But the parallel version, when it runs on only one core, seems to be slower than the non-parallel version.

An example of very bad parallel performance is an image ROI of 400x400 with a pattern of 90x70 (the image ROI encloses the whole pattern area). The non-parallel version takes 5.5ms, the parallel one 8.7ms (tbb, 2021.3).

My question to Intel: What is the threshold for multi-core versus single-core processing? Is it possible to query it in advance? In cases where multiple cores are not used, can't you just route to the default call?

Regards

Stefan
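Since the thread does not document any such threshold or query, one caller-side option is to measure both paths once for a given image and pattern size and then route to whichever was faster. A minimal sketch under that assumption follows; `medianMs`, `Path` and `choosePath` are hypothetical names, and the two callables would wrap the non-parallel and the `_T` call with identical arguments.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Median wall-clock time of `runs` calls to `fn`, in milliseconds.
// The median is used so that a single slow outlier does not decide the result.
double medianMs(const std::function<void()>& fn, int runs) {
    assert(runs > 0);
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(ms.begin(), ms.end());
    return ms[ms.size() / 2];
}

enum class Path { Serial, Threaded };

// Times both callables and reports which one was faster for this workload.
// Ties go to the serial path, which has no thread start-up cost.
Path choosePath(const std::function<void()>& serial,
                const std::function<void()>& threaded,
                int runs = 5) {
    const double serialMs = medianMs(serial, runs);
    const double threadedMs = medianMs(threaded, runs);
    return threadedMs < serialMs ? Path::Threaded : Path::Serial;
}
```

The result can be cached per (image size, pattern size) pair, so the calibration cost is paid once per configuration rather than on every call.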