CrossCorrNorm performance issues Ipp17 and Ipp9

Herbert_K_ · 02-19-2019

Hello,

I have the following performance issues with the CrossCorrNorm function.

The first issue concerns how different source and template sizes behave compared to the "old" version of the algorithm.

With a 512 x 512 source and a 256 x 256 template, the calculation takes 3 ms with the new version and 3.8 ms with the old version.

With a 512 x 512 source and a 500 x 500 template, the calculation takes 13 ms with the new version and 3.3 ms with the old version.

Where is this coming from?

The second issue occurs with both versions of the algorithm.

With a 512 x 512 source and a 256 x 256 template, the calculation takes 3 ms with the new version and 3.8 ms with the old version.

With a 513 x 512 source and a 256 x 256 template, the calculation takes 5.5 ms with the new version and 6.55 ms with the old version.

The calculation time nearly doubles. What is happening here? The same thing happens going from 256 to 257 and from 1024 to 1025: the calculation time always doubles.

Here is the source code I used to test this behavior.

#include <iostream>
#include "ipp.h"
#include "ippi90legacy.h"   // legacy ippiCrossCorrValid_NormLevel_8u32f_C1R

// Timer is our own stopwatch class (definition not shown here).
void testCrossCorr()
{
    Timer timer;

    IppStatus status;
    IppiSize srcRoiSize = { 513, 512 };
    IppiSize tplRoiSize = { 500, 500 };
    IppiSize dstRoiSize = { srcRoiSize.width - tplRoiSize.width + 1, srcRoiSize.height - tplRoiSize.height + 1 };

    int stepBytesSrc = 0;
    Ipp8u* pSrc = ippiMalloc_8u_C1(srcRoiSize.width, srcRoiSize.height, &stepBytesSrc);
    int stepBytesTpl = 0;
    Ipp8u* pTpl = ippiMalloc_8u_C1(tplRoiSize.width, tplRoiSize.height, &stepBytesTpl);
    int stepBytesDst = 0;
    Ipp32f* pDst = ippiMalloc_32f_C1(dstRoiSize.width, dstRoiSize.height, &stepBytesDst);

    // Query and allocate the work buffer needed by the new API.
    IppEnum funCfg = (IppEnum)(ippAlgAuto | ippiROIValid | ippiNorm);
    Ipp8u *pBuffer;
    int bufSize;
    status = ippiCrossCorrNormGetBufferSize(srcRoiSize, tplRoiSize, funCfg, &bufSize);
    if (status != ippStsNoErr) return;
    pBuffer = ippsMalloc_8u(bufSize);

    // Time the new function.
    timer.start();
    int loopSize = 10;
    for (int i = 0; i < loopSize; i++)
    {
        status = ippiCrossCorrNorm_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize, pTpl, stepBytesTpl, tplRoiSize, pDst,
            stepBytesDst, funCfg, pBuffer);
    }
    timer.stop();
    std::cout << "\n::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted " <<
        timer.elapsed_time() << "ms\n";

    // Time the old (legacy) function.
    timer.start();
    for (int i = 0; i < loopSize; i++)
    {
        status = ippiCrossCorrValid_NormLevel_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize,
            pTpl, stepBytesTpl, tplRoiSize, pDst, stepBytesDst);
    }
    timer.stop();
    std::cout << "\n::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted " <<
        timer.elapsed_time() << "ms\n";

    ippiFree(pSrc);
    ippiFree(pTpl);
    ippiFree(pDst);
    ippsFree(pBuffer);
}

Regards

Herb

Igor_A_Intel · 02-20-2019

Hi Herbert,

This functionality is optimized using the convolution theorem. Therefore, if one of the dimensions crosses the next power-of-2 boundary, the next-order FFT is used, which increases the time by roughly 2x.
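To make the power-of-2 effect concrete, here is a minimal sketch in plain C (it does not call IPP). It assumes the FFT length along each dimension is the smallest power of two that is not less than the source dimension. The exact padding rule IPP uses internally is not stated in this thread, and `next_pow2` and `fft_area` are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest power of two that is >= n (n must be >= 1). */
size_t next_pow2(size_t n)
{
    size_t p = 1;
    assert(n >= 1);
    while (p < n)
        p <<= 1;
    return p;
}

/* Hypothetical helper: number of points in the padded 2D FFT,
   ASSUMING each dimension is padded independently to a power of two.
   This is an illustration, not IPP's documented behavior. */
size_t fft_area(size_t width, size_t height)
{
    return next_pow2(width) * next_pow2(height);
}
```

Under this assumption `fft_area(512, 512)` is 262144 points while `fft_area(513, 512)` is 524288 points, so a single extra column doubles the transform size. That would be consistent with the roughly 2x jumps reported at 256 to 257, 512 to 513 and 1024 to 1025.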

Regarding the performance differences between IPP versions, please provide the library version output for both:

const IppLibraryVersion *lib;

lib = ippiGetLibVersion();
printf( "CPU : %s\n", lib->targetCpu );
printf( "Name : %s\n", lib->Name );
printf( "Version : %s\n", lib->Version );
printf( "Build date: %s\n", lib->BuildDate );

regards, Igor

Herbert_K_ · 02-20-2019

Hi Igor,

Thanks for the fast reply. Normally the processing time scales linearly with the source size and template size. When the calculation time doubles at specific boundaries, that behavior is bad. Maybe an internal clustering would give better scaling.

The Ipp9 library is:

targetCPU: l9

Name: ippIP AVX2 (l9 threaded) --> I use this with setNumThreads(1), and not the mt version of the library

Version: 9.0 Legacy (r48491) (-)

BuildDate: Oct 13 2015

The Ipp18 library is:

targetCPU: l9

Name: ippIP AVX2 (l9) --> I use this with setNumThreads(1), and not the mt version of the library

Version: 2018.0.3 (r58644)

BuildDate: Apr 7 2018

regards, Herb

Igor_A_Intel · 02-20-2019

OK, got it, thank you.

Also, please tell me which operating system you use: Windows or Linux?

regards, Igor

Herbert_K_ · 02-20-2019

The behavior happens on both Windows and Linux. We develop on Windows and our target system is Linux.

Igor_A_Intel · 02-20-2019

Hi Herb,

I don't see any issues (the same library versions, measured on my T470 laptop with the same l9 code version):

static linking:

::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 227858.505917 cpe

::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 304642.631953 cpe

dynamic linking:

::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 232007.209467 cpe

::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 276802.594083 cpe

threaded dynamic libs, numThreads = 1:

::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 51539.084615 cpe

::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 58517.840237 cpe

I slightly modified your code - see below:

#include <stdio.h>
#include "ippdefs.h"
#include "ipp.h"
#include "ippdefs90legacy.h"
#include "ippi90legacy.h"

int main(void)
{
    IppStatus status;
    IppiSize srcRoiSize = { 512, 512 };
    IppiSize tplRoiSize = { 500, 500 };
    IppiSize dstRoiSize = { srcRoiSize.width - tplRoiSize.width + 1, srcRoiSize.height - tplRoiSize.height + 1 };

    int stepBytesSrc = 0;
    Ipp8u* pSrc = ippiMalloc_8u_C1(srcRoiSize.width, srcRoiSize.height, &stepBytesSrc);
    int stepBytesTpl = 0;
    Ipp8u* pTpl = ippiMalloc_8u_C1(tplRoiSize.width, tplRoiSize.height, &stepBytesTpl);
    int stepBytesDst = 0;
    Ipp32f* pDst = ippiMalloc_32f_C1(dstRoiSize.width, dstRoiSize.height, &stepBytesDst);
    /* Fill source and template with a test pattern instead of uninitialized memory. */
    ippiImageJaehne_8u_C1R(pSrc, stepBytesSrc, srcRoiSize);
    ippiImageJaehne_8u_C1R(pTpl, stepBytesTpl, tplRoiSize);

    IppEnum funCfg = (IppEnum)(ippAlgAuto | ippiROIValid | ippiNorm);
    Ipp8u *pBuffer;
    int bufSize;
    status = ippiCrossCorrNormGetBufferSize(srcRoiSize, tplRoiSize, funCfg, &bufSize);
    if (status != ippStsNoErr) return 1;
    pBuffer = ippsMalloc_8u(bufSize);

    Ipp64u strt, stp;

    ippSetNumThreads(1);

    int loopSize = 10;

    /* Warm-up call before the timed loop. */
    status = ippiCrossCorrNorm_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize, pTpl, stepBytesTpl, tplRoiSize, pDst,
        stepBytesDst, funCfg, pBuffer);

    int nT;
    ippGetNumThreads(&nT);
    //printf("\nNum threads = %d\n", nT);
    strt = ippGetCpuClocks();
    for (int i = 0; i < loopSize; i++)
    {
        status = ippiCrossCorrNorm_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize, pTpl, stepBytesTpl, tplRoiSize, pDst,
            stepBytesDst, funCfg, pBuffer);
    }

    /* Clocks per destination element (cpe), averaged over loopSize calls. */
    stp = ippGetCpuClocks();
    Ipp64f tmp = (Ipp64f)(stp - strt);
    tmp = tmp/loopSize;
    tmp = tmp/dstRoiSize.width;
    tmp = tmp/dstRoiSize.height;
    printf( "\n::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted %f cpe\n", tmp);

    strt = ippGetCpuClocks();
    for (int i = 0; i < loopSize; i++)
    {
        status = ippiCrossCorrValid_NormLevel_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize,
            pTpl, stepBytesTpl, tplRoiSize, pDst, stepBytesDst);
    }

    stp = ippGetCpuClocks();
    tmp = (Ipp64f)(stp - strt);
    tmp = tmp/loopSize;
    tmp = tmp/dstRoiSize.width;
    tmp = tmp/dstRoiSize.height;
    printf( "\n::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted %f cpe\n", tmp);

    ippiFree(pSrc);
    ippiFree(pTpl);
    ippiFree(pDst);
    ippsFree(pBuffer);
    return 0;
}
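The "cpe" figure printed by this code is CPU clocks per output element: the total clock count divided by the loop count and by the destination width and height. As a rough sketch in plain C (the helper names `dst_elems`, `clocks_per_call` and `ms_per_call` are hypothetical, and the clock frequency must be supplied by the caller since it is not given in this thread), it can be converted back to clocks or milliseconds per call:

```c
#include <assert.h>

/* Number of output elements for a "valid" ROI:
   (src - tpl + 1) along each dimension. */
long dst_elems(int srcW, int srcH, int tplW, int tplH)
{
    assert(tplW <= srcW && tplH <= srcH);
    return (long)(srcW - tplW + 1) * (long)(srcH - tplH + 1);
}

/* Clocks for one call, given clocks per element (cpe). */
double clocks_per_call(double cpe, int srcW, int srcH, int tplW, int tplH)
{
    return cpe * (double)dst_elems(srcW, srcH, tplW, tplH);
}

/* Milliseconds for one call at a clock frequency given in GHz.
   The frequency is an input the caller has to know; it is an
   assumption of this sketch, not something IPP reports here. */
double ms_per_call(double clocks, double ghz)
{
    return clocks / (ghz * 1.0e6);
}
```

For the 512 x 512 source and 500 x 500 template used above, the destination is 13 x 13 = 169 elements, so 227858.505917 cpe corresponds to about 38.5 million clocks per call. Because the destination is so small in this configuration, the cpe values look very large even though the per-call cost is ordinary.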

regards, Igor

Herbert_K_ · 02-21-2019

Hi Igor,

Thank you for the fast testing. I used your code, also with static linking. The results are:

::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 224359.924852 cpe
::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 287181.53017 cpe

No performance issues here either. I have now also installed the threaded library for Ipp18 and get this result:

::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 59510.233136 cpe
::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 50513.827219 cpe

Again no performance problem. But I don't understand where this comes from. We both used ippSetNumThreads(1), so where does this 4x performance boost come from? I also checked whether it runs on more than one CPU. It does not. I simply have no explanation for this behavior. Can you help?

Regards, Herb

Herbert_K_ · 02-24-2019

Hi Igor,

Do you have an update on this topic? I looked into the list of threaded functions, and the CrossCorr function is not listed. I don't understand why the performance of this function depends on which library is used. Also, the "threaded" library is no longer available under Linux.

Regards, Herb

Igor_A_Intel · 02-25-2019

Hi Herb,

At the moment I don't have a final update. I did some investigation and found several interesting things in the behavior of the threaded libs that I can't explain yet, so this is still in progress. Regarding your statement that the "threaded" libs are no longer available under Linux: which IPP version do you mean? All IPP releases have threaded libs for Linux as well as for Windows.

regards, Igor.

Herbert_K_ · 02-25-2019

We used IPP 2019 Update 2. My colleagues said there was no rpm for the threaded version. On Windows the option was available.

Herbert_K_ · 02-25-2019

My colleagues found the threaded version. Sorry for the inconvenience.

ArtemMaklaev · 03-27-2019

Hi Herb,

The issue with the 4x performance gap between the sequential and threaded (launched with ippSetNumThreads(1)) versions of ippiCrossCorrNorm_8u32f_C1R has been resolved. The root cause was a difference in the algorithm's parameters between threaded and non-threaded modes.

The fix will be available in IPP 2020.

The fix for ippiCrossCorrValid_NormLevel_8u32f_C1R (legacy library) is still in progress.

BR, Artem.

Gennady_F_Intel · 03-28-2019

We will update this thread as soon as the fix for the problem is available.