Performance ippiCrossCorrNorm

Tom_B_1 · ‎07-16-2014

Since the function ippiCrossCorrValid_NormLevel_32f_C1R is marked as deprecated, we switched to the new function ippiCrossCorrNorm_32f_C1R. With the new function, we see a slowdown by a factor of 7 compared with the deprecated call in IPP 8.1. We use a srcRoiSize of 21x21 pixels and a tplRoiSize of 17x17 pixels. The algorithm type passed to ippiCrossCorrNorm is: IppEnum funCfg = (IppEnum)(ippAlgAuto | ippiROIValid | ippiNormCoefficient);

We also noticed that, for the given src and tpl sizes, ippiCrossCorrValid_NormLevel_32f_C1R is significantly faster than the alternatives ippiCrossCorrValid_Norm_32f_C1R and ippiCrossCorrValid_32f_C1R, which are mathematically simpler image proximity measures.

Any suggestions on how to reach the performance of ippiCrossCorrValid_NormLevel_32f_C1R with the new function ippiCrossCorrNorm?

Best regards,

Ying_H_Intel · ‎07-16-2014

Hi Tom,

Thanks for the report. We need more information for further investigation:

1. How do you link the IPP library (32-bit or Intel 64, Windows or Linux, etc.)?

2. How do you measure the performance (repeated calls or a single call)?

If possible, could you please attach a small test case that shows the problem?

Best Regards,

Ying

Tom_B_1 · ‎07-16-2014

Hi Ying,

We use Windows and we link the IPP library in both 32-bit and 64-bit builds, both statically and dynamically.
All these configurations result in the same behavior.

For example, with the test code fragment below, we see a performance drop by a factor of 7 when using the ippiCrossCorrNorm function:

// Initialize inputs
long rad_src = 10;
long rad_tpl = 8;

long len_src = 2*rad_src+1;
long len_tpl = 2*rad_tpl+1;
long len_roi = 2*(rad_src-rad_tpl)+1;

float* corr = new float[len_roi*len_roi];
IppiSize roi_corr = {len_roi,len_roi};
IppiSize roi_tpl = {len_tpl, len_tpl};
IppiSize roi_src = {len_src, len_src};

float* pSrc = new float[len_src*len_src];
float* pTpl = new float[len_tpl*len_tpl];
unsigned int seed = 27;
ippsRandUniform_Direct_32f(pSrc, len_src*len_src,0.0f,1.0f,&seed);
seed = 31;
ippsRandUniform_Direct_32f(pTpl, len_tpl*len_tpl,0.0f,1.0f,&seed);

// Create filter buffer
IppEnum funCfg = (IppEnum)(ippAlgAuto|ippiROIValid|ippiNormCoefficient);
Ipp8u *pBuffer;
int bufSize;

ippiCrossCorrNormGetBufferSize(roi_src, roi_tpl, funCfg, &bufSize);
pBuffer = ippsMalloc_8u(bufSize);

// Loop 100000 times
for(long k = 0 ; k < 100000; k++)
{

    // deprecated, but 7x faster than the new implementation
    /*ippiCrossCorrValid_NormLevel_32f_C1R(
         (const Ipp32f*)pSrc, 4*len_src, roi_src,
         (const Ipp32f*) pTpl, 4*len_tpl, roi_tpl,
         (Ipp32f*)corr, 4*len_roi);*/

    ippiCrossCorrNorm_32f_C1R(
         (const Ipp32f*)pSrc, 4*len_src, roi_src,
         (const Ipp32f*) pTpl, 4*len_tpl, roi_tpl,
         (Ipp32f*)corr, 4*len_roi,funCfg,pBuffer);

}

ippsFree( pBuffer );

delete [] pSrc;
delete [] pTpl;
delete [] corr;
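
The fragment above does not show how the loop is timed. A minimal sketch of one way to get a per-call average, assuming std::chrono::steady_clock is accurate enough for this purpose; the helper name averageMicrosPerCall and its warm-up count are illustrative and not part of IPP:

```cpp
#include <cassert>
#include <chrono>

// Runs fn() `warmup` times untimed, then `iterations` times timed,
// and returns the average wall-clock time per timed call in microseconds.
template <typename F>
double averageMicrosPerCall(F&& fn, long iterations, long warmup = 100)
{
    assert(iterations > 0 && warmup >= 0);

    // Warm-up calls let caches and any lazy initialization settle.
    for (long k = 0; k < warmup; ++k) fn();

    const auto t0 = std::chrono::steady_clock::now();
    for (long k = 0; k < iterations; ++k) fn();
    const auto t1 = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::micro> elapsed = t1 - t0;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Each of the two IPP calls from the loop can be wrapped in a lambda and passed to this helper with the same iteration count; the ratio of the two averages is then the slowdown factor being discussed.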

Igor_A_Intel · ‎07-17-2014

Hi Tom,

Thank you for reporting this performance bug. As a workaround, please use the "old" deprecated function until the next IPP update. IPP 8.2 is already frozen, so the fix will be in the first release after 8.2. The deprecated function uses two methods, depending on the workload size: a direct method and one based on the convolution theorem. The direct method, which is more efficient for small workloads such as yours, has not yet been ported to the new function.

regards, Igor
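
For readers unfamiliar with the "direct" method mentioned above, here is a rough sketch of what a direct evaluation of the normalized correlation coefficient over the valid region can look like. It assumes the textbook definition of the coefficient and tightly packed row-major buffers; it is not IPP's implementation, and the function name nccValidDirect is made up for illustration:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Direct "valid" normalized correlation coefficient (textbook definition).
// src is srcW x srcH, tpl is tplW x tplH, both row-major and tightly packed.
// Returns a (srcW - tplW + 1) x (srcH - tplH + 1) row-major result.
std::vector<float> nccValidDirect(const std::vector<float>& src, int srcW, int srcH,
                                  const std::vector<float>& tpl, int tplW, int tplH)
{
    assert(tplW >= 1 && tplH >= 1 && tplW <= srcW && tplH <= srcH);
    assert(src.size() == static_cast<std::size_t>(srcW) * srcH);
    assert(tpl.size() == static_cast<std::size_t>(tplW) * tplH);

    const int dstW = srcW - tplW + 1;
    const int dstH = srcH - tplH + 1;
    const double n = static_cast<double>(tplW) * tplH;

    // Template mean and centered energy are the same for every output pixel.
    double tplSum = 0.0;
    for (float v : tpl) tplSum += v;
    const double tplMean = tplSum / n;
    double tplEnergy = 0.0;
    for (float v : tpl) tplEnergy += (v - tplMean) * (v - tplMean);

    std::vector<float> dst(static_cast<std::size_t>(dstW) * dstH, 0.0f);
    for (int y = 0; y < dstH; ++y) {
        for (int x = 0; x < dstW; ++x) {
            // Mean of the source window currently under the template.
            double winSum = 0.0;
            for (int ty = 0; ty < tplH; ++ty)
                for (int tx = 0; tx < tplW; ++tx)
                    winSum += src[(y + ty) * srcW + (x + tx)];
            const double winMean = winSum / n;

            // Centered cross term and centered window energy.
            double cross = 0.0, winEnergy = 0.0;
            for (int ty = 0; ty < tplH; ++ty) {
                for (int tx = 0; tx < tplW; ++tx) {
                    const double s = src[(y + ty) * srcW + (x + tx)] - winMean;
                    const double t = tpl[ty * tplW + tx] - tplMean;
                    cross += s * t;
                    winEnergy += s * s;
                }
            }
            const double denom = std::sqrt(winEnergy * tplEnergy);
            // A flat window or flat template has no defined coefficient; report 0.
            dst[y * dstW + x] = denom > 0.0 ? static_cast<float>(cross / denom) : 0.0f;
        }
    }
    return dst;
}
```

For the sizes in this thread (21x21 source, 17x17 template) the result is 5x5, so the direct approach evaluates only 25 output values, each over 289 template pixels. That small amount of work is a plausible reason why a direct method can beat a transform-based one here.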

Tom_B_1 · ‎07-17-2014

Hi,

Thanks for the quick reply. I will continue to use the deprecated function in 32-bit mode.
However, I ran into floating point overflow issues using ippiCrossCorrValid_NormLevel in 64-bit mode.
I will further investigate this 64-bit issue and try to post a test case.

Tom

Tom_B_1 · ‎07-18-2014

In 64-bit mode, the deprecated function ippiCrossCorrValid_NormLevel accesses memory beyond the specified src ROI size.
More specifically, in my test case with srcroi = [21,21] pixels, ippiCrossCorrValid_NormLevel accesses data up to a region of [32,32].
Using a smaller srcroi of, e.g., [15,15] results in accesses to data in a region of [16,16] pixels.
Depending on the data present in these 'invalid' regions, a floating-point overflow can be raised.
This only seems to occur in 64-bit mode and is not an issue in 32-bit mode.
Is this known behavior?

Best regards,
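
One possible way to make such over-reads harmless while the issue is open is to copy the ROI into a larger, zero-filled scratch image before the call. The sketch below assumes the over-read stays inside the padded region (32x32 for a 21x21 ROI, going by the observation above), which has not been confirmed by Intel; the helper name copyRoiZeroPadded is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies a roiW x roiH region (row stride srcStride, in elements) into a
// zero-initialized padW x padH scratch image and returns it.
std::vector<float> copyRoiZeroPadded(const float* src, int srcStride,
                                     int roiW, int roiH, int padW, int padH)
{
    assert(src != nullptr);
    assert(roiW > 0 && roiH > 0 && srcStride >= roiW);
    assert(padW >= roiW && padH >= roiH);

    // Everything outside the copied ROI stays 0.0f, so stray reads see finite data.
    std::vector<float> scratch(static_cast<std::size_t>(padW) * padH, 0.0f);
    for (int y = 0; y < roiH; ++y)
        for (int x = 0; x < roiW; ++x)
            scratch[static_cast<std::size_t>(y) * padW + x] =
                src[static_cast<std::size_t>(y) * srcStride + x];
    return scratch;
}
```

The scratch image would then be passed as the source pointer with a step of padW * sizeof(float) and the original ROI size. The extra copy costs time, so this is only a stopgap and would need to be measured against the speed advantage of the deprecated function.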

Ying_H_Intel · ‎07-27-2014

Hi Tom,

Thank you for sharing.

You mentioned:

"More particular in my test case of a srcroi = [21,21] pixels, ippiCrossCorrValid_NormLevel is accessing data up to a region of [32,32]. Using a smaller srcroi of e.g. [15,15] result in accessing the data in a region of [16,16] pixels. Depending on the data present in these 'invalid' regions, a floating point overflow can be thrown."

I tried your example code with srcroi = [21, 21] and ippiCrossCorrValid_NormLevel, with the floating-point overflow check enabled. I ran it several times and did not see the error. Do you have a way to reproduce the problem?

Thanks

Ying

#include "ipp.h"

#include <cstdio>
#include <iostream>
#include <iomanip>

typedef unsigned char byte;

int main(int argc, char* argv[])
{
	ippInit();
	const IppLibraryVersion* lib = ippsGetLibVersion();
	printf("%s %s %d.%d.%d.%d\n", lib->Name, lib->Version, lib->major, lib->minor, lib->majorBuild, lib->build);

	long rad_src = 10;
	long rad_tpl = 8;

	long len_src = 2 * rad_src + 1;
	long len_tpl = 2 * rad_tpl + 1;
	long len_roi = 2 * (rad_src - rad_tpl) + 1;

	float* corr = new float[len_roi*len_roi];
	IppiSize roi_corr = { len_roi, len_roi };
	IppiSize roi_tpl = { len_tpl, len_tpl };
	IppiSize roi_src = { len_src, len_src };

	float* pSrc = new float[len_src*len_src];
	float* pTpl = new float[len_tpl*len_tpl];
	unsigned int seed = 27;
	ippsRandUniform_Direct_32f(pSrc, len_src*len_src, 0.0f, 1.0f, &seed);
	seed = 31;
	ippsRandUniform_Direct_32f(pTpl, len_tpl*len_tpl, 0.0f, 1.0f, &seed);

	// Create filter buffer
	IppEnum funCfg = (IppEnum)(ippAlgAuto | ippiROIValid | ippiNormCoefficient);
	Ipp8u *pBuffer;
	int bufSize;

	ippiCrossCorrNormGetBufferSize(roi_src, roi_tpl, funCfg, &bufSize);
	pBuffer = ippsMalloc_8u(bufSize);

	// Loop 100000 times (disabled here: a single call is enough for this check)
//	for (long k = 0; k < 100000; k++)
//	{
		// deprecated, but 7x faster than the new implementation
	IppStatus status;

	status = ippiCrossCorrValid_NormLevel_32f_C1R(
		(const Ipp32f*)pSrc, 4*len_src, roi_src,
		(const Ipp32f*)pTpl, 4*len_tpl, roi_tpl,
		(Ipp32f*)corr, 4*len_roi);
	printf("%s\n", ippGetStatusString(status));

	/*	ippiCrossCorrNorm_32f_C1R(
			(const Ipp32f*)pSrc, 4 * len_src, roi_src,
			(const Ipp32f*)pTpl, 4 * len_tpl, roi_tpl,
			(Ipp32f*)corr, 4 * len_roi, funCfg, pBuffer);
			*/
	//}

	ippsFree(pBuffer);

	delete[] pSrc;
	delete[] pTpl;
	delete[] corr;
}

Tom_B_1 · ‎07-28-2014

Hi,

If you run the example below, you will get an access violation in 64-bit mode but not in 32-bit mode.
The code allocates a matrix of 512x512 points.
The access violation occurs when accessing line 480 of the matrix.
That is why I think that IPP is internally using a 32x32 kernel instead of a 21x21 kernel.

   long rad_src = 10;
   long rad_tpl = 8;

   long len_src = 2*rad_src+1;
   long len_tpl = 2*rad_tpl+1;

   long len_roi = 2*(rad_src-rad_tpl)+1;
   float* corr = new float[len_roi*len_roi];
   IppiSize roi_corr = {len_roi,len_roi};

   IppiSize roi_tpl = {len_tpl, len_tpl};
   IppiSize roi_src = {len_src, len_src};

   // Both images are 512x512; the ROIs slide over them below
   int np = 512;
   float* pSrc = new float[np*np];
   float* pTpl = new float[np*np];

   unsigned int seed = 0;

   for(long k = 0 ; k < 1; k++)
   {
      seed = k;
      ippsRandUniform_Direct_32f(pSrc, np*np,0.0f,1.0f,&seed);
      seed = k+3;
      ippsRandUniform_Direct_32f(pTpl, np*np,0.0f,1.0f,&seed);

      for(int j = 0 ; j < np-len_src; j++)
      {
         for(int i = 0; i < np-len_src; i++)
         {
            // deprecated, but 7x faster than the new implementation
            ippiCrossCorrValid_NormLevel_32f_C1R(
               (const Ipp32f*)(pSrc+j*np+i), 4*np, roi_src,
               (const Ipp32f*)(pTpl+(j+2)*np+i+2), 4*np, roi_tpl,
               (Ipp32f*)corr, 4*len_roi);
         }
      }
   }

   delete [] pSrc;
   delete [] pTpl;
   delete [] corr;
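
The "line 480" observation can be checked with a little index arithmetic. The sketch below assumes the function reads a square footprint starting at the ROI origin (an assumption, not something the thread confirms) and reports the first window origin, in the scan order of the loops above, whose footprint touches an element past the end of the 512x512 allocation. The names Origin and firstOverrun are illustrative:

```cpp
#include <cassert>

struct Origin { int j; int i; };  // j = row, i = column; {-1, -1} means "none"

// Scans origins (j, i) with 0 <= j, i < scanLimit in row-major order and
// returns the first one for which a readW x readH footprint would touch an
// element beyond an np x np row-major buffer.
Origin firstOverrun(int np, int scanLimit, int readW, int readH)
{
    assert(np > 0 && scanLimit >= 0 && readW > 0 && readH > 0);
    const long long total = static_cast<long long>(np) * np;
    for (int j = 0; j < scanLimit; ++j) {
        for (int i = 0; i < scanLimit; ++i) {
            // Linear index of the bottom-right element of the footprint.
            // Reads past the end of a row merely wrap into the next row,
            // so only this last element can leave the allocation.
            const long long last =
                static_cast<long long>(j + readH - 1) * np + (i + readW - 1);
            if (last >= total) return Origin{j, i};
        }
    }
    return Origin{-1, -1};
}
```

With np = 512 and the loop limit np - len_src = 491, a 32x32 footprint first leaves the buffer at row 480 (column 481), while a 21x21 footprint never does. This is consistent with the 32x32 hypothesis, although the actual access violation depends on where the allocation sits relative to a page boundary and may occur later than the first out-of-bounds read.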

Ying_H_Intel · ‎07-28-2014

Hi Tom,

Thanks a lot. We can reproduce the problem. It seems to be a bug in the x64 asm code for ippiCrossCorr. We will investigate it and keep you updated.

Thanks
Ying