Hello,
I have the following performance issues with the CrossCorrNorm function.
The first issue appears with certain source and template sizes when compared to the "old" version of the algorithm.
With a 512x512 source and a 256x256 template, the calculation takes 3 ms with the new version and 3.8 ms with the old version.
With a 512x512 source and a 500x500 template, the calculation takes 13 ms with the new version and 3.3 ms with the old version.
Where does this come from?
The second issue affects both versions of the algorithm.
With a 512x512 source and a 256x256 template, the calculation takes 3 ms with the new version and 3.8 ms with the old version.
With a 513x512 source and a 256x256 template, the calculation takes 5.5 ms with the new version and 6.55 ms with the old version.
The calculation time nearly doubles. What is happening here? The same happens going from 256 to 257 and from 1024 to 1025: the calculation time always doubles.
Here is the source code I used to test this behavior.
void testCrossCorr(){
    Timer timer;   // the poster's own stopwatch class (definition not shown)
    IppStatus status;
    IppiSize srcRoiSize = { 513,512 };
    IppiSize tplRoiSize = { 500,500 };
    // "valid" ROI: only positions where the template fits entirely inside the source
    IppiSize dstRoiSize = { srcRoiSize.width - tplRoiSize.width + 1, srcRoiSize.height - tplRoiSize.height + 1 };
    int stepBytesSrc = 0;
    Ipp8u* pSrc = ippiMalloc_8u_C1(srcRoiSize.width, srcRoiSize.height, &stepBytesSrc);
    int stepBytesTpl = 0;
    Ipp8u* pTpl = ippiMalloc_8u_C1(tplRoiSize.width, tplRoiSize.height, &stepBytesTpl);
    int stepBytesDst = 0;
    Ipp32f* pDst = ippiMalloc_32f_C1(dstRoiSize.width, dstRoiSize.height, &stepBytesDst);
    IppEnum funCfg = (IppEnum)(ippAlgAuto | ippiROIValid | ippiNorm);
    Ipp8u *pBuffer;
    int bufSize;
    status = ippiCrossCorrNormGetBufferSize(srcRoiSize, tplRoiSize, funCfg, &bufSize);
    if (status != ippStsNoErr) {
        // release the images before bailing out
        ippiFree(pSrc);
        ippiFree(pTpl);
        ippiFree(pDst);
        return;
    }
    pBuffer = ippsMalloc_8u(bufSize);
    timer.start();
    int loopSize = 10;
    // new API
    for (int i = 0; i < loopSize; i++)
    {
        status = ippiCrossCorrNorm_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize, pTpl, stepBytesTpl, tplRoiSize, pDst,
            stepBytesDst, funCfg, pBuffer);
    }
    timer.stop();
    std::cout << "\n::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted " <<
        timer.elapsed_time() << "ms\n";
    timer.start();
    // old (legacy) API
    for (int i = 0; i < loopSize; i++)
    {
        status = ippiCrossCorrValid_NormLevel_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize,
            pTpl, stepBytesTpl, tplRoiSize, pDst, stepBytesDst);
    }
    timer.stop();
    std::cout << "\n::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted " <<
        timer.elapsed_time() << "ms\n";
    ippiFree(pSrc);
    ippiFree(pTpl);
    ippiFree(pDst);
    ippsFree(pBuffer);
}
Regards
Herb
Hi Herbert,
This functionality is optimized using the convolution theorem. Therefore, if one of the dimensions crosses the next power-of-2 boundary, the next-order FFT is used, which increases the time by ~2x.
Regarding the performance differences between IPP versions, please provide the library version output for both:
const IppLibraryVersion *lib;
lib = ippiGetLibVersion();
printf( "CPU : %s\n", lib->targetCpu );
printf( "Name : %s\n", lib->Name );
printf( "Version : %s\n", lib->Version );
printf( "Build date: %s\n", lib->BuildDate );
regards, Igor
Hi Igor,
Thanks for the fast reply. Normally the processing time scales linearly with the source size and template size. Having the calculation time double at specific boundaries is bad behavior. Maybe internal clustering would give better scaling.
The Ipp9 library is:
targetCPU: I9
Name: ippIP AVX2 (I9 threaded) --> I use this with setNumThreads(1) and also not the mt version of the library
Version: 9.0 Legacy (r48491) (-)
BuildDate: Oct 13 2015
The Ipp18 library is:
targetCPU: I9
Name: ippIP AVX2 (I9) --> I use this with setNumThreads(1) and also not the mt version of the library
Version: 2018.0.3 (r58644)
BuildDate: Apr 7 2018
regards, Herb
OK, got it, thank you.
Also, please tell me which operating system you use: Windows or Linux?
regards, Igor
The behavior occurs on both Windows and Linux. We develop on Windows, and our target system is Linux.
Hi Herb,
I don't see any issues (the same library versions, measured on my T470 laptop, the same l9 code version):
static linking:
::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 227858.505917 cpe
::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 304642.631953 cpe
Press any key to continue . . .
dynamic linking:
::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 232007.209467 cpe
::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 276802.594083 cpe
Press any key to continue . . .
threaded dynamic libs, numThreads = 1:
::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 51539.084615 cpe
::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 58517.840237 cpe
Press any key to continue . . .
I slightly modified your code; see below:
#include <stdio.h>
#include "ippdefs.h"
#include "ipp.h"
#include "ippdefs90legacy.h"
#include "ippi90legacy.h"
int main(void)
{
    IppStatus status;
    IppiSize srcRoiSize = { 512,512 };
    IppiSize tplRoiSize = { 500,500 };
    /* "valid" ROI: 13x13 outputs for these sizes */
    IppiSize dstRoiSize = { srcRoiSize.width - tplRoiSize.width + 1, srcRoiSize.height - tplRoiSize.height + 1 };
    int stepBytesSrc = 0;
    Ipp8u* pSrc = ippiMalloc_8u_C1(srcRoiSize.width, srcRoiSize.height, &stepBytesSrc);
    int stepBytesTpl = 0;
    Ipp8u* pTpl = ippiMalloc_8u_C1(tplRoiSize.width, tplRoiSize.height, &stepBytesTpl);
    int stepBytesDst = 0;
    Ipp32f* pDst = ippiMalloc_32f_C1(dstRoiSize.width, dstRoiSize.height, &stepBytesDst);
    /* fill both images with a defined test pattern */
    ippiImageJaehne_8u_C1R(pSrc, stepBytesSrc, srcRoiSize);
    ippiImageJaehne_8u_C1R(pTpl, stepBytesTpl, tplRoiSize);
    IppEnum funCfg = (IppEnum)(ippAlgAuto | ippiROIValid | ippiNorm);
    Ipp8u *pBuffer;
    int bufSize;
    status = ippiCrossCorrNormGetBufferSize(srcRoiSize, tplRoiSize, funCfg, &bufSize);
    if (status != ippStsNoErr) return 1;
    pBuffer = ippsMalloc_8u(bufSize);
    Ipp64u strt, stp;
    ippSetNumThreads(1);
    int loopSize = 10;
    /* warm-up call, not timed */
    status = ippiCrossCorrNorm_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize, pTpl, stepBytesTpl, tplRoiSize, pDst,
        stepBytesDst, funCfg, pBuffer);
    int nT;
    ippGetNumThreads(&nT);
    //printf("\nNum threads = %d\n", nT);
    strt = ippGetCpuClocks();
    for (int i = 0; i < loopSize; i++)
    {
        status = ippiCrossCorrNorm_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize, pTpl, stepBytesTpl, tplRoiSize, pDst,
            stepBytesDst, funCfg, pBuffer);
    }
    stp = ippGetCpuClocks();
    /* clocks per element (cpe): per call, per output pixel */
    Ipp64f tmp = (Ipp64f)(stp - strt);
    tmp = tmp/loopSize;
    tmp = tmp/dstRoiSize.width;
    tmp = tmp/dstRoiSize.height;
    printf( "\n::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted %f cpe\n", tmp);
    strt = ippGetCpuClocks();
    for (int i = 0; i < loopSize; i++)
    {
        status = ippiCrossCorrValid_NormLevel_8u32f_C1R(pSrc, stepBytesSrc, srcRoiSize,
            pTpl, stepBytesTpl, tplRoiSize, pDst, stepBytesDst);
    }
    stp = ippGetCpuClocks();
    tmp = (Ipp64f)(stp - strt);
    tmp = tmp/loopSize;
    tmp = tmp/dstRoiSize.width;
    tmp = tmp/dstRoiSize.height;
    printf( "\n::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted %f cpe\n", tmp);
    ippiFree(pSrc);
    ippiFree(pTpl);
    ippiFree(pDst);
    ippsFree(pBuffer);
    return 0;
}
regards, Igor
Hi Igor,
Thank you for the fast testing. I used your code, also with static linking. The results are:
::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 224359.924852 cpe
::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 287181.53017 cpe
So no performance issues here either. I have now also installed the threaded library for Ipp18 and get this result:
::testTemplateMatch() ippiCrossCorrNorm_8u32f_C1R lasted 59510.233136 cpe
::testTemplateMatch() ippiCrossCorrValid_NormLevel_8u32f_C1R lasted 50513.827219 cpe
Also no performance problem. But I don't understand where this comes from. We both used ippSetNumThreads(1), so where does this 4x performance boost come from? I also checked whether it runs on more than one CPU. It does not. I simply have no explanation for this behavior. Can you help?
Regards, Herb
Hi Igor,
Do you have an update on this topic? I looked into the list of threaded functions. The CrossCorr function is not listed. I don't understand why the performance of this function depends on the library used. Also, the "threaded" library is no longer available under Linux.
Regards, Herb
Hi Herb,
At the moment I don't have a final update. I did some investigation and found several interesting things in the behavior of the threaded libs that I can't explain yet, so this is still in progress. Regarding your statement that "threaded" libs are no longer available under Linux: which IPP version do you mean? All IPP releases have threaded libs for Linux as well as for Windows.
regards, Igor.
We used IPP 2019 Update 2. My colleagues said there was no rpm for the threaded version. On Windows the option was available.
My colleagues found the threaded version. Sorry for the inconvenience.
Hi Herb,
The issue with the 4x performance gap between the sequential and threaded (launched with ippSetNumThreads(1)) versions of ippiCrossCorrNorm_8u32f_C1R has been resolved. The root cause was a difference in the algorithm's parameters between the threaded and non-threaded modes.
The fix will be available in IPP 2020.
The fix for ippiCrossCorrValid_NormLevel_8u32f_C1R (legacy library) is still in progress.
BR, Artem.
We will update this thread as soon as the fix for the problem is available.
