Hi Tom,
Thanks for the report. We need more information to investigate further, such as:
1. How you link the IPP library: 32-bit or Intel 64, Windows or Linux, etc.
2. How you measure the performance: over repeated calls or a single call?
If possible, could you please attach a small test case that shows the problem?
Best Regards,
Ying
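On the second question, a repeated measurement with a few untimed warm-up calls is usually more representative than timing a single call. A minimal sketch of such a harness follows; `mean_time_us` is a hypothetical helper written for illustration, not an IPP function, and it uses only the C++ standard library.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of IPP): calls `fn` `warmup` times untimed,
// then `reps` times timed, and returns the mean wall-clock time per timed
// call in microseconds.
template <typename Fn>
double mean_time_us(Fn&& fn, int reps, int warmup = 10) {
    assert(reps > 0 && warmup >= 0);
    for (int i = 0; i < warmup; ++i) fn();  // warm caches, trigger lazy init
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) fn();
    const auto t1 = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = t1 - t0;
    return elapsed.count() / reps;
}
```

Wrapping each of the two correlation calls in a lambda and passing it to the same harness would give per-call times that can be compared directly.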
Hi Ying,
We use Windows and link the IPP library in both 32-bit and 64-bit builds, both statically and dynamically.
All of these configurations show the same behavior.
For example, with the test code fragment below, we see a performance drop by a factor of 7 when using the ippiCrossCorrNorm function.
// Fragment; assumes #include "ipp.h".
// Initialize inputs (int, to match the int fields of IppiSize)
int rad_src = 10;
int rad_tpl = 8;
int len_src = 2*rad_src+1;            // 21
int len_tpl = 2*rad_tpl+1;            // 17
int len_roi = 2*(rad_src-rad_tpl)+1;  // 5 = len_src-len_tpl+1 (valid region)
float* corr = new float[len_roi*len_roi];
IppiSize roi_corr = {len_roi, len_roi};
IppiSize roi_tpl = {len_tpl, len_tpl};
IppiSize roi_src = {len_src, len_src};
float* pSrc = new float[len_src*len_src];
float* pTpl = new float[len_tpl*len_tpl];
// Fill source and template with uniform random values in [0, 1)
unsigned int seed = 27;
ippsRandUniform_Direct_32f(pSrc, len_src*len_src, 0.0f, 1.0f, &seed);
seed = 31;
ippsRandUniform_Direct_32f(pTpl, len_tpl*len_tpl, 0.0f, 1.0f, &seed);
// Create filter buffer for the new API
IppEnum funCfg = (IppEnum)(ippAlgAuto|ippiROIValid|ippiNormCoefficient);
Ipp8u *pBuffer;
int bufSize;
ippiCrossCorrNormGetBufferSize(roi_src, roi_tpl, funCfg, &bufSize);
pBuffer = ippsMalloc_8u(bufSize);
// Loop 100000 times; row steps are in bytes (4 = sizeof(Ipp32f))
for(long k = 0 ; k < 100000; k++)
{
// deprecated, but 7x faster than the new implementation
/*ippiCrossCorrValid_NormLevel_32f_C1R(
(const Ipp32f*)pSrc, 4*len_src, roi_src,
(const Ipp32f*)pTpl, 4*len_tpl, roi_tpl,
(Ipp32f*)corr, 4*len_roi);*/
ippiCrossCorrNorm_32f_C1R(
(const Ipp32f*)pSrc, 4*len_src, roi_src,
(const Ipp32f*)pTpl, 4*len_tpl, roi_tpl,
(Ipp32f*)corr, 4*len_roi, funCfg, pBuffer);
}
ippsFree( pBuffer );
delete [] pSrc;
delete [] pTpl;
delete [] corr;
Hi Tom,
Thank you for reporting this performance bug. As a workaround, please use the "old" deprecated function until the next IPP update (IPP 8.2 is already frozen, so the fix will be in the first update after 8.2). The old function uses two methods, chosen by workload size: a direct method and one based on the convolution theorem. The direct method, which is more efficient for small workloads such as yours, has not yet been ported to the new function.
regards, Igor
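To make the "direct" method concrete, here is a minimal plain C++ sketch of a spatial-domain normalized correlation coefficient over the valid region. It is only an illustration of the textbook formula, not IPP's implementation; the name `ncc_valid_direct` is hypothetical, images are assumed to be row-major and tightly packed, and the formula is assumed to correspond to what ippiNormCoefficient selects.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Direct normalized cross-correlation coefficient, "valid" region only.
// Output is (srcW - tplW + 1) x (srcH - tplH + 1), row-major.
std::vector<float> ncc_valid_direct(const std::vector<float>& src, int srcW, int srcH,
                                    const std::vector<float>& tpl, int tplW, int tplH) {
    assert(srcW >= tplW && srcH >= tplH && tplW > 0 && tplH > 0);
    assert(src.size() == static_cast<std::size_t>(srcW) * srcH);
    assert(tpl.size() == static_cast<std::size_t>(tplW) * tplH);
    const int outW = srcW - tplW + 1;
    const int outH = srcH - tplH + 1;
    const int n = tplW * tplH;

    // Template mean and sum of squared deviations (computed once).
    double tplMean = 0.0;
    for (float v : tpl) tplMean += v;
    tplMean /= n;
    double tplVar = 0.0;
    for (float v : tpl) tplVar += (v - tplMean) * (v - tplMean);

    std::vector<float> out(static_cast<std::size_t>(outW) * outH, 0.0f);
    for (int y = 0; y < outH; ++y) {
        for (int x = 0; x < outW; ++x) {
            // Mean of the source window under the template.
            double winMean = 0.0;
            for (int ty = 0; ty < tplH; ++ty)
                for (int tx = 0; tx < tplW; ++tx)
                    winMean += src[(y + ty) * srcW + (x + tx)];
            winMean /= n;
            // Cross term and window sum of squared deviations.
            double cross = 0.0, winVar = 0.0;
            for (int ty = 0; ty < tplH; ++ty) {
                for (int tx = 0; tx < tplW; ++tx) {
                    const double ds = src[(y + ty) * srcW + (x + tx)] - winMean;
                    const double dt = tpl[ty * tplW + tx] - tplMean;
                    cross += ds * dt;
                    winVar += ds * ds;
                }
            }
            const double denom = std::sqrt(winVar * tplVar);
            out[y * outW + x] = denom > 0.0 ? static_cast<float>(cross / denom) : 0.0f;
        }
    }
    return out;
}
```

For the sizes in this thread (21x21 source, 17x17 template) the output is 5x5, so each call needs 25 x 289 = 7225 inner-loop steps per pass. At that scale a direct loop has very little setup cost, which fits Igor's remark that the direct method suits small workloads.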
Hi,
Thanks for the quick reply. I will continue to use the deprecated function in 32-bit mode.
However, I ran into floating point overflow issues using ippiCrossCorrValid_NormLevel in 64-bit mode.
I will further investigate this 64-bit issue and try to post a test case.
Tom
In 64-bit mode, the deprecated function ippiCrossCorrValid_NormLevel accesses memory beyond the specified source ROI.
More specifically, in my test case with srcroi = [21,21] pixels, ippiCrossCorrValid_NormLevel accesses data in a region of up to [32,32] pixels.
Using a smaller srcroi of e.g. [15,15] results in access to data in a region of [16,16] pixels.
Depending on the data present in these 'invalid' regions, a floating point overflow can be raised.
This seems to occur only in 64-bit mode and is not an issue in 32-bit mode.
Is this known behavior?
Best regards,
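Both observed regions coincide with the source size rounded up to the next power of two (21 to 32, 15 to 16). Whether the library actually pads this way is not confirmed in this thread; the sketch below only reproduces the arithmetic, and `next_pow2` is a hypothetical helper name.

```cpp
#include <cassert>

// Hypothetical helper: smallest power of two that is >= n.
// Reproduces the sizes reported above; it makes no claim about IPP internals.
int next_pow2(int n) {
    assert(n >= 1 && n <= (1 << 30));
    int p = 1;
    while (p < n) p *= 2;
    return p;
}
```

With this helper, a source side of 21 maps to 32 and a side of 15 maps to 16, matching the two regions reported.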
Hi Tom,
Thank you for sharing.
You mentioned: "More specifically, in my test case with srcroi = [21,21] pixels, ippiCrossCorrValid_NormLevel accesses data in a region of up to [32,32] pixels.
Using a smaller srcroi of e.g. [15,15] results in access to data in a region of [16,16] pixels.
Depending on the data present in these 'invalid' regions, a floating point overflow can be raised."
I tried your example code with srcroi = [21, 21] and ippiCrossCorrValid_NormLevel, with the floating point overflow check enabled. I ran it several times but have not seen the error. Do you have a way to reproduce the problem?
Thanks
Ying
#include "ipp.h"
#include <cstdio>
#include <iostream>
#include <iomanip>

typedef unsigned char byte;

int main(int argc, char* argv[])
{
    ippInit();
    const IppLibraryVersion* lib = ippsGetLibVersion();
    printf("%s %s %d.%d.%d.%d\n", lib->Name, lib->Version, lib->major, lib->minor, lib->majorBuild, lib->build);

    long rad_src = 10;
    long rad_tpl = 8;
    long len_src = 2 * rad_src + 1;
    long len_tpl = 2 * rad_tpl + 1;
    long len_roi = 2 * (rad_src - rad_tpl) + 1;
    float* corr = new float[len_roi*len_roi];
    IppiSize roi_corr = { len_roi, len_roi };
    IppiSize roi_tpl = { len_tpl, len_tpl };
    IppiSize roi_src = { len_src, len_src };
    float* pSrc = new float[len_src*len_src];
    float* pTpl = new float[len_tpl*len_tpl];
    unsigned int seed = 27;
    ippsRandUniform_Direct_32f(pSrc, len_src*len_src, 0.0f, 1.0f, &seed);
    seed = 31;
    ippsRandUniform_Direct_32f(pTpl, len_tpl*len_tpl, 0.0f, 1.0f, &seed);

    // Create filter buffer
    IppEnum funCfg = (IppEnum)(ippAlgAuto | ippiROIValid | ippiNormCoefficient);
    Ipp8u *pBuffer;
    int bufSize;
    ippiCrossCorrNormGetBufferSize(roi_src, roi_tpl, funCfg, &bufSize);
    pBuffer = ippsMalloc_8u(bufSize);

    // Loop 100000 times
    // for (long k = 0; k < 100000; k++)
    // {
    // deprecated, but 7x faster than the new implementation
    IppStatus status;
    status = ippiCrossCorrValid_NormLevel_32f_C1R(
        (const Ipp32f*)pSrc, 4*len_src, roi_src,
        (const Ipp32f*)pTpl, 4*len_tpl, roi_tpl,
        (Ipp32f*)corr, 4*len_roi);
    printf("%s\n", ippGetStatusString(status));
    /* ippiCrossCorrNorm_32f_C1R(
        (const Ipp32f*)pSrc, 4 * len_src, roi_src,
        (const Ipp32f*)pTpl, 4 * len_tpl, roi_tpl,
        (Ipp32f*)corr, 4 * len_roi, funCfg, pBuffer); */
    // }

    ippsFree(pBuffer);
    delete[] pSrc;
    delete[] pTpl;
    delete[] corr;
}
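For anyone repeating this check, one portable way to look for a floating point overflow is to test the sticky status flag from `<cfenv>` after the call, rather than enabling trapping exceptions. This is a sketch under that assumption and may differ from the check used above; `raises_fp_overflow` is a hypothetical helper, not an IPP function.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper: clears the floating-point status flags, runs `fn`,
// and reports whether the overflow flag was raised while it ran.
template <typename Fn>
bool raises_fp_overflow(Fn&& fn) {
    std::feclearexcept(FE_ALL_EXCEPT);
    fn();
    return std::fetestexcept(FE_OVERFLOW) != 0;
}
```

The correlation call could be wrapped in a lambda and passed to this helper. Note that the flag reflects everything computed during the call, including any values the function may read outside the ROI.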
Hi,
If you run the example below, you will get an access violation in 64-bit mode but not in 32-bit mode.
The code allocates a matrix of 512x512 points.
The access violation occurs when accessing line 480 of the matrix.
That is why I think that IPP is internally using a 32x32 kernel instead of a 21x21 kernel.
long rad_src = 10;
long rad_tpl = 8;
long len_src = 2*rad_src+1;
long len_tpl = 2*rad_tpl+1;
long len_roi = 2*(rad_src-rad_tpl)+1;
float* corr = new float[len_roi*len_roi];
IppiSize roi_corr = {len_roi,len_roi};
IppiSize roi_tpl = {len_tpl, len_tpl};
IppiSize roi_src = {len_src, len_src};
int np = 512;
float* pSrc = new float[np*np];
float* pTpl = new float[np*np];
unsigned int seed = 0;
for(long k = 0 ; k < 1; k++)
{
    seed = k;
    ippsRandUniform_Direct_32f(pSrc, np*np,0.0f,1.0f,&seed);
    seed = k+3;
    ippsRandUniform_Direct_32f(pTpl, np*np,0.0f,1.0f,&seed);
    // Slide the 21x21 source window over the 512x512 matrix
    for(int j = 0 ; j < np-len_src; j++)
    {
        for(int i = 0; i < np-len_src; i++)
        {
            // deprecated, but 7x faster than the new implementation
            ippiCrossCorrValid_NormLevel_32f_C1R(
                (const Ipp32f*)(pSrc+j*np+i), 4*np, roi_src,
                (const Ipp32f*)(pTpl+(j+2)*np+i+2), 4*np, roi_tpl,
                (Ipp32f*)corr, 4*len_roi);
        }
    }
}
delete [] pSrc;
delete [] pTpl;
delete [] corr;
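The line-480 observation can be checked with plain index arithmetic, independent of IPP. The sketch below walks the same (j, i) positions as the loops above and reports the first position at which a window of a given side would reach past the end of the np x np buffer. The helper names are hypothetical, and the 32x32 window is Tom's hypothesis, not a confirmed detail of the library.

```cpp
#include <cassert>
#include <utility>

// True if a win x win window with top-left corner at (row j, column i)
// ends inside a tightly packed np x np buffer (linear index check).
bool window_in_buffer(int np, int j, int i, int win) {
    const long long last = static_cast<long long>(j + win - 1) * np + (i + win - 1);
    return last < static_cast<long long>(np) * np;
}

// Scans positions in the same order as the repro loops and returns the first
// (j, i) whose window leaves the buffer, or (-1, -1) if every window fits.
std::pair<int, int> first_out_of_buffer(int np, int len_src, int win) {
    for (int j = 0; j < np - len_src; ++j)
        for (int i = 0; i < np - len_src; ++i)
            if (!window_in_buffer(np, j, i, win))
                return {j, i};
    return {-1, -1};
}
```

With np = 512 and len_src = 21, a 21x21 window never leaves the buffer, while a 32x32 window first does so at row 480 (column 481). That is consistent with the reported line, although where an access violation actually triggers also depends on how the heap places the allocation.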
Hi Tom,
Thanks a lot. We can reproduce the problem. It appears to be a bug in the x64 assembly code for ippiCrossCorr. We will investigate it and keep you updated.
Thanks
Ying