Solved: Resize with Cubic in IPP 2019 slower than IPP 5.2

Liu__Ning · ‎08-28-2019

Hello,

We currently use IPP 5.2 in our application, and I am trying to replace it with IPP 2019 from the NuGet package. I don't understand the performance difference in cubic resize between IPP 5.2 and IPP 2019.

In the resize test, the destination image is (240, 217), and a region of the source image is zoomed to the destination size.
When a (60 * 54) region is zoomed 4 times, the cubic resize function of IPP 5.2 runs faster than that of IPP 2019.
When a (30 * 27) region is zoomed 8 times, IPP 5.2 is still faster than IPP 2019. In this case IPP 2019 is also slower than it is when zooming 4 times.

My questions are:

Why is IPP 2019 slower than IPP 5.2?
Why is zooming 8 times with IPP 2019 slower than zooming 4 times, even though the source region processed at 8 times is only a quarter the size of the one processed at 4 times?

Thank you in advance.
Ning

Andrey_B_Intel · ‎10-22-2019

Hi.

I am a bit confused. I've modified the reproducer, and on my AVX2 systems I see the following numbers:

Xeon Silver 4116 2.10 GHz

ippIP AVX2 (h9), 2019.0.5 (r0xc95fdf5f)
( 30, 27) -> ( 60, 54), 42903.60, 9273.60
( 30, 27) -> (120, 108), 137363.20, 20176.80
( 90, 81) -> ( 60, 54), 30096.00, 18941.60
( 90, 81) -> (120, 108), 78498.00, 42430.40

Core i5 7300U 2.7 GHz

ippIP AVX2 (h9), 2019.0.5 (r0xc95fdf5f)
( 30, 27) -> ( 60, 54), 82597.00, 19693.80
( 30, 27) -> (120, 108), 474363.80, 57542.40
( 90, 81) -> ( 60, 54), 62416.60, 56303.40
( 90, 81) -> (120, 108), 106889.40, 50775.00

Ning, could you please build this reproducer as a separate application and send me its output?

Thanks.

Gennady_F_Intel · ‎08-29-2019

What is the CPU type you are running on?

Could you print the output of ippiGetLibVersion()?

const IppLibraryVersion* lib = ippiGetLibVersion();
printf("\t\t version of IPP is: %s %s %d.%d.%d.%d\n", lib->Name, lib->Version, lib->major, lib->minor, lib->majorBuild, lib->build);

Gennady_F_Intel · ‎08-29-2019

And which exact ippiResizeCubic_<mod> function do you use?

Liu__Ning · ‎09-01-2019

Gennady F. (Blackbelt) wrote:
And which exact ippiResizeCubic_<mod> function do you use?

Hello Gennady,

Thank you for your reply. My CPU is an Intel Core i7-8700K, and my project uses ippiResizeCubic_16u_C1R for cubic resize.

The result of "ippiGetLibVersion" is

name : 0x3b834220 "ippIP AVX2 (h9)"

Version : 0x3b834230 "2019.0.4 (r62443)"

major : 2019

minor : 0

majorBuild : 4

build : 62443

Thank you again and looking forward to your reply.

Kind regards,

Ning

Pavel_B_Intel1 · ‎09-02-2019

Hello Ning,

thanks, we will investigate it. It will take some time.

Pavel

Gennady_F_Intel · ‎09-02-2019

Ning, could you give us the same output when you link with version 5.2?

Liu__Ning · ‎09-03-2019

Pavel Berdnikov (Intel) wrote:
Hello Ning,
thanks, we will investigate it. It will take some time.
Pavel

Thank you Pavel

Liu__Ning · ‎09-03-2019

Gennady F. (Blackbelt) wrote:
Ning, could you give us the same output when you link with version 5.2?

Hello Gennady,

The version information is

Name : 0x3BBBF2A8 "ippip8-6.0.dll+"

Version : 0x3BBBF280 "6.0 Update 2 build 167.41"

major : 6

minor: 0

majorBuild : 167

build : 692

targetCpu: p8

Furthermore, the target CPU of IPP 2019 is h9.

Regards,

Ning

Gennady_F_Intel · ‎09-03-2019

Ning, we could not see the problem on our side. Could you give us a reproducer that we can build and run?

Liu__Ning · ‎09-04-2019

Gennady F. (Blackbelt) wrote:
Ning, we could not see the problem on our side. Could you give us a reproducer that we can build and run?

Hi Gennady, IPP is integrated into our application, so it is a little hard to extract a simple reproducer.

Our application still uses the IPL project, also from Intel, as a bridge to IPP. IPL works only with earlier IPP versions (such as IPP 5.2), so for resize I must replace the old IPP function with a new implementation.

Old IPP resize function in the IPL project:

ippiResize_16u_C1R((Ipp16u*)pSrc, srcSize, src->widthStep, srcRoi,
(Ipp16u*)pDst, dst->widthStep, dstRoiSize, xFactor, yFactor, interpolation);

New implementation with cubic interpolation:

IppiResizeSpec_32f* pSpec = 0;
int specSize = 0, initSize = 0, bufSize = 0;
Ipp16u borderValue = 0;
Ipp8u* pBuffer = 0;
Ipp8u* pInitBuf = 0;
IppiPoint dstOffset = { 0, 0 };
Ipp8u *pSrc, *pDst;
IppiSize srcSize, dstRoiSize;
Ipp32f CubicParameterB = 0.15f;  /* ippiResizeCubicInit_16u takes Ipp32f parameters */
Ipp32f CubicParameterC = 0.5f;

ippiResizeGetSize_16u(srcSize, dstRoiSize, ippCubic, 0, &specSize, &initSize);
pInitBuf = ippsMalloc_8u(initSize);
pSpec = (IppiResizeSpec_32f*)ippsMalloc_8u(specSize);
ippiResizeCubicInit_16u(srcSize, dstRoiSize, CubicParameterB, CubicParameterC, pSpec, pInitBuf);
/* The work-buffer query must match the 16u data type of the spec. */
ippiResizeGetBufferSize_16u(pSpec, dstRoiSize, 1, &bufSize);
pBuffer = ippsMalloc_8u(bufSize);
/* The border value is passed by pointer. */
ippiResizeCubic_16u_C1R((Ipp16u*)pSrc, src->widthStep, (Ipp16u*)pDst, dst->widthStep, dstOffset, dstRoiSize, ippBorderRepl, &borderValue, pSpec, pBuffer);

/* Memory obtained from ippsMalloc_8u is released with ippsFree. */
ippsFree(pInitBuf);
ippsFree(pSpec);
ippsFree(pBuffer);
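
The snippet above queries sizes, initializes the spec, and allocates buffers on every call. The thread does not say whether that setup is part of the timed region. If it is, and consecutive zoom steps reuse the same source and destination sizes, the setup could be cached. A minimal sketch of that idea in plain C, with hypothetical names (resize_cache, get_spec, demo_init) and malloc standing in for the IPP spec initialization:

```c
#include <stdlib.h>

/* Hypothetical single-entry cache keyed on the source and destination sizes. */
typedef struct {
    int srcW, srcH, dstW, dstH;
    void *spec;   /* stands in for the initialized spec and work buffer */
} resize_cache;

typedef void *(*spec_init_fn)(int srcW, int srcH, int dstW, int dstH);

/* Return the cached spec when the sizes match; otherwise rebuild it once. */
void *get_spec(resize_cache *c, int srcW, int srcH, int dstW, int dstH,
               spec_init_fn init)
{
    if (c->spec && c->srcW == srcW && c->srcH == srcH &&
        c->dstW == dstW && c->dstH == dstH)
        return c->spec;
    free(c->spec);                 /* free(NULL) is a no-op */
    c->spec = init(srcW, srcH, dstW, dstH);
    c->srcW = srcW; c->srcH = srcH;
    c->dstW = dstW; c->dstH = dstH;
    return c->spec;
}

/* Stand-in for the expensive setup; counts how often it runs. */
int demo_init_calls = 0;
void *demo_init(int srcW, int srcH, int dstW, int dstH)
{
    (void)srcW; (void)srcH; (void)dstW; (void)dstH;
    demo_init_calls++;
    return malloc(16);
}
```

With real IPP objects, the cached spec and buffers would be released with ippsFree rather than free, and whether caching helps depends on how often the sizes actually change between zoom steps.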

Is there anything wrong in my implementation? This is the only difference in my performance test. Also, have you compared the performance of cubic resize between IPP 2019 and earlier IPP versions (before the resize change in IPP 7.1)?

Thank you very much!

Kind regards,

Ning

Liu__Ning · ‎09-05-2019

Hello Gennady and Pavel,

I've run another comparison of cubic resize. This time I kept the source image size fixed and changed the resize factor. The test still uses ippiResizeCubic_16u_C1R with 1000 repetitions. I have attached three test results.
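
For readers who want to reproduce this kind of repeated-call measurement, here is a minimal sketch of a timing loop in plain C. It is not the benchmark used in this thread; avg_time_us and touch_counter are illustrative names, and the workload is passed in as a callback so the loop itself has no IPP dependency:

```c
#include <time.h>

typedef void (*bench_fn)(void *ctx);

/* Average CPU time per call in microseconds, measured with clock().
   A few untimed warm-up calls run first so that one-time costs
   (cold caches, lazy initialization) are not attributed to the loop. */
double avg_time_us(bench_fn fn, void *ctx, int warmup, int reps)
{
    for (int i = 0; i < warmup; i++) fn(ctx);
    clock_t t0 = clock();
    for (int i = 0; i < reps; i++) fn(ctx);
    clock_t t1 = clock();
    return (double)(t1 - t0) * 1e6 / CLOCKS_PER_SEC / reps;
}

/* Stand-in workload: increments the int that ctx points to. */
void touch_counter(void *ctx)
{
    (*(int *)ctx)++;
}
```

A real measurement would wrap the resize call in a function with the bench_fn signature. Because clock() has coarse resolution, very short workloads need a large repetition count to give a stable average.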

When the source image is small, (30, 27), IPP 5 performs better than IPP 2019.

When the source image is (150, 136), IPP 5 and IPP 2019 perform almost the same.

When the source image is larger than (150, 136), such as the third image at (480, 517), IPP 2019 is faster than IPP 5.

From these results, IPP 2019 is faster on larger images but slower on smaller ones. Is this because IPP 2019 uses a different cubic algorithm?

For the small image (30, 27), is the quality of the image resized with IPP 2019 better than with IPP 5?

Thank you for your help, any suggestions are appreciated!

Kind regards,

Ning

Liu__Ning · ‎09-23-2019

Gennady F. (Blackbelt) wrote:
Ning, we could not see the problem on our side. Could you give us a reproducer that we can build and run?

Hi Gennady,

Sorry to disturb you. May I ask whether you received my email with the modified test code?

Thank you and kind regards,

Ning

Gennady_F_Intel · ‎09-23-2019

Hi Ning, yes, the issue is confirmed for small input sizes (<= ~100). For medium and large input sizes, IPP 2019 outperforms IPP 6.0. We checked on AVX, AVX2, and AVX-512 based systems. The issue has been escalated, and we will keep this thread updated.

--Gennady

Liu__Ning · ‎09-24-2019

Gennady F. (Blackbelt) wrote:
Hi Ning, yes, the issue is confirmed for small input sizes (<= ~100). For medium and large input sizes, IPP 2019 outperforms IPP 6.0. We checked on AVX, AVX2, and AVX-512 based systems. The issue has been escalated, and we will keep this thread updated.
--Gennady

Hello Gennady,

Thank you for your reply. Is this because different cubic interpolation methods are applied? Could you please explain the differences between the two methods? I couldn't find much information in the IPP manual. Does the new cubic interpolation method perform better? Is it possible to use the old method when the image is small?
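
For context on the parameters involved: the valueB and valueC arguments passed to ippiResizeCubicInit_16u earlier in this thread are commonly read as the B and C of the two-parameter cubic filter family. Below is a minimal plain-C sketch, assuming the Mitchell-Netravali form of that family. The thread itself does not confirm which kernel either IPP version uses, and cubic_bc and cubic_weights are illustrative names:

```c
/* Two-parameter (B, C) cubic kernel in the Mitchell-Netravali form.
   Assumption: this is the family that valueB and valueC select. */
double cubic_bc(double x, double B, double C)
{
    if (x < 0) x = -x;
    if (x < 1.0)
        return ((12 - 9*B - 6*C) * x*x*x
              + (-18 + 12*B + 6*C) * x*x
              + (6 - 2*B)) / 6.0;
    if (x < 2.0)
        return ((-B - 6*C) * x*x*x
              + (6*B + 30*C) * x*x
              + (-12*B - 48*C) * x
              + (8*B + 24*C)) / 6.0;
    return 0.0;
}

/* Weights of the four source samples at offsets -1, 0, +1, +2 for a
   point lying a fraction t in [0, 1) past the sample at offset 0. */
void cubic_weights(double t, double B, double C, double w[4])
{
    w[0] = cubic_bc(1.0 + t, B, C);
    w[1] = cubic_bc(t, B, C);
    w[2] = cubic_bc(1.0 - t, B, C);
    w[3] = cubic_bc(2.0 - t, B, C);
}
```

In this family the four weights sum to 1 for any B and C, so changing the parameters trades sharpness against smoothing without changing overall brightness. Each output sample still reads four source samples per axis, which is why samples near the image edge need border handling.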

Thank you again.

Kind regards,

Ning

Pavel_B_Intel1 · ‎10-08-2019

Hello Ning,

The performance degradation comes from using the wider CPU registers on AVX2: they bring benefits on large enough data but hurt performance on small data. Since small data is important for you, we will tune the optimization for it in upcoming IPP releases. I'm sorry about this.

Could you provide some additional information: why is processing such small images important for you? What are your workloads? Is the resize operation critical in your pipeline (what percentage of the whole pipeline does it take)?

Pavel

Liu__Ning · ‎10-09-2019

Pavel Berdnikov (Intel) wrote:
Hello Ning,
The performance degradation comes from using the wider CPU registers on AVX2: they bring benefits on large enough data but hurt performance on small data. Since small data is important for you, we will tune the optimization for it in upcoming IPP releases. I'm sorry about this.
Could you provide some additional information: why is processing such small images important for you? What are your workloads? Is the resize operation critical in your pipeline (what percentage of the whole pipeline does it take)?
Pavel

Hello Pavel,

Thank you very much for your reply. Our product is medical image diagnostic software, and our customers are mostly doctors. One daily use of our software is zooming in on small series of CT images to diagnose disease, so zoom performance is very important for our customers and for us.
To provide an excellent zooming experience, our product has to guarantee that a series of CT images is zoomed together, smoothly, as the mouse wheel moves. A single mouse movement can trigger up to 2000 zoom operations.
Although this may make little difference on a recent CPU, some of our customers still use old PCs with relatively slow performance.

It would be really great if the performance of resizing small data were improved in upcoming IPP releases.
Thank you very much for your help!

Kind regards,
Ning

Pavel_B_Intel1 · ‎10-09-2019

Hello Ning,

I understand your case, thanks. If you have performance expectations for IPP and data sets for performance measurement, and can share them with us, it will be very helpful. We can add these cases to our regular test cycle for better validation.

In any case, I will contact you as soon as we have new results.

Pavel

Liu__Ning · ‎10-09-2019

Pavel Berdnikov (Intel) wrote:
Hello Ning,
I understand your case, thanks. If you have performance expectations for IPP and data sets for performance measurement, and can share them with us, it will be very helpful. We can add these cases to our regular test cycle for better validation.
In any case, I will contact you as soon as we have new results.
Pavel

Hello Pavel,

Last month Gennady sent me a benchmark example. I changed it to compare the performance of IPP 2019 and IPP 6 and sent it back to him. Its results show behavior similar to what we see in our application with medical image data. Would that be helpful?

I'm looking forward to your new results, thank you for all your and Gennady's help!

Kind regards,

Ning

Pavel_B_Intel1 · ‎10-09-2019

OK, thank you. We will use this benchmark.

Pavel

Andrey_B_Intel · ‎10-21-2019

Hello, Ning.

When the input image is small, the processing of border pixels affects the performance of ippiResizeCubic_16u more than it did in the previous function. Is it possible in your application to allocate an additional buffer and duplicate the border pixels in it? IPP has the necessary API, and I am attaching such a workaround. I see some speedup on my AVX2 system. Could you please test on your side too?
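
As a rough illustration of why borders weigh more on small inputs, the sketch below counts how many destination columns have a 4-tap cubic footprint that reaches outside the source row. The pixel-center mapping and the names floor_div and border_columns are assumptions for illustration; the thread does not describe IPP's internal mapping:

```c
/* Floor of n / d for d > 0 (C integer division truncates toward zero). */
static long floor_div(long n, long d)
{
    long q = n / d;
    if (n % d != 0 && n < 0) q--;
    return q;
}

/* Count destination columns whose 4-tap cubic footprint reaches outside
   the source row [0, srcW - 1].
   Assumed mapping (not taken from IPP): sx = (x + 0.5) * srcW / dstW - 0.5,
   with taps at floor(sx) - 1 .. floor(sx) + 2. */
int border_columns(int srcW, int dstW)
{
    int count = 0;
    for (int x = 0; x < dstW; x++) {
        /* floor(sx) computed exactly in integers */
        long i = floor_div((2L * x + 1) * srcW - dstW, 2L * dstW);
        if (i - 1 < 0 || i + 2 > srcW - 1) count++;
    }
    return count;
}
```

Under these assumptions, border_columns(30, 240) gives 24 and border_columns(60, 240) gives 12: an 8x zoom from a 30-pixel-wide source touches the border in 10% of the 240 destination columns, against 5% for a 4x zoom from a 60-pixel-wide source. The same holds per row, so the smaller source sends a larger share of output pixels through the border path even though the destination size is identical.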

Andrey.

Liu__Ning · ‎10-22-2019

Andrey Bakshaev (Intel) wrote:
Hello, Ning.
When the input image is small, the processing of border pixels affects the performance of ippiResizeCubic_16u more than it did in the previous function. Is it possible in your application to allocate an additional buffer and duplicate the border pixels in it? IPP has the necessary API, and I am attaching such a workaround. I see some speedup on my AVX2 system. Could you please test on your side too?
Andrey.

Hello Andrey,

Thank you very much for your solution. I've adapted your sample code to my test benchmark, using the same measurement method as before; the code is shown below. I tested it on my system (AVX2), and the performance is similar to before; the result is in the attachment. Did I change something incorrectly? In the image, the new result is labeled "IPP 2019 with border InMem ms".

Thank you and kind regards,

Ning

double resize_bench_additional_buffer(int srcw, int srch, int dstw, int dsth)
{
	IppStatus status;
	IppiResizeSpec_32f* pSpec = 0;
	IppiInterpolationType interpolation = ippCubic;
	Ipp8u *pInitBuf, *pBuffer;
	int specSize, initSize, bufSize, srcStep, dstStep, i, j;
	IppiPoint dstOffset = { 0, 0 };
	Ipp32f valueB = 0.15f;	/* Ipp32f: an Ipp16u here would truncate 0.15 to 0 */
	Ipp32f valueC = 0.5f;
	Ipp16u *pSrc, *pDst;
	Ipp16u borderVal[3] = { 0,0,0 };
	IppiSize srcSize, dstSize;
	Ipp16u *pSrc1;
	int srcStep1;
	IppiSize srcSize1;

	srcSize.width = srcw;
	srcSize.height = srch;
	dstSize.width = dstw;
	dstSize.height = dsth;

	__int64 cycles[2];
	double cpe = 0;
	int n, nloops = 10000;

	// Allocate the source and destination images and fill the source with a test pattern
	pSrc = ippiMalloc_16u_C1(srcSize.width, srcSize.height, &srcStep);
	pDst = ippiMalloc_16u_C1(dstSize.width, dstSize.height, &dstStep);

	for (i = 0; i < srcSize.height; i++) {
		for (j = 0; j < srcSize.width; j++) {
			pSrc[(srcStep >> 1)*i + j] = i + j;
		}
	}

	status = ippiResizeGetSize_16u(srcSize, dstSize, interpolation, 0, &specSize, &initSize);
	pInitBuf = ippsMalloc_8u(initSize);
	pSpec = (IppiResizeSpec_32f*)ippsMalloc_8u(specSize);
	status = ippiResizeCubicInit_16u(srcSize, dstSize, valueB, valueC, pSpec, pInitBuf);
	status = ippiResizeGetBufferSize_16u(pSpec, dstSize, 1, &bufSize);
	pBuffer = ippsMalloc_8u(bufSize);


	IppiBorderSize borderSize;
	status = ippiResizeGetBorderSize_16u(pSpec, &borderSize);
	srcSize1.width = borderSize.borderLeft + srcSize.width + borderSize.borderRight;
	srcSize1.height = borderSize.borderTop + srcSize.height + borderSize.borderBottom;
	pSrc1 = ippiMalloc_16u_C1(srcSize1.width, srcSize1.height, &srcStep1);

	ippiCopyReplicateBorder_16u_C1R(pSrc, srcStep, srcSize, pSrc1, srcStep1, srcSize1,
		borderSize.borderTop, borderSize.borderLeft);

	status = ippiResizeCubic_16u_C1R(pSrc1 + (srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft,
		srcStep1, pDst, dstStep, dstOffset, dstSize, ippBorderInMem, borderVal, pSpec, pBuffer);
	Ipp64s t1, t2;

	t1 = ippGetCpuClocks();
	for (n = 0; n < NIMAGES; n++) {
		status = ippiResizeCubic_16u_C1R(pSrc1 + (srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft,
			srcStep1, pDst, dstStep, dstOffset, dstSize, ippBorderInMem, borderVal, pSpec, pBuffer);
	}
	t2 = ippGetCpuClocks();
	double execTime = (double)(t2 - t1);
	int Mhz = 0;
	ippGetCpuFreqMhz(&Mhz);
	execTime = execTime / (1.e6*(double)Mhz);
	printf("... IPP2019 ippiResizeCubic_16u_C1R with border InMem ExecTime  ==  %lf sec, Src Image %d x %d, Dst Image %d x %d ... \n\n", execTime,
		srcSize.width, srcSize.height, dstSize.width, dstSize.height);


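	/* Release everything allocated above. */
	ippsFree(pInitBuf);
	ippsFree(pSpec);
	ippsFree(pBuffer);
	ippiFree(pSrc);
	ippiFree(pSrc1);
	ippiFree(pDst);

	return execTime;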
}
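
The padding step in the function above relies on ippiCopyReplicateBorder_16u_C1R. As a plain-C illustration of what replicated padding produces (a sketch, not IPP's implementation; copy_replicate_border_u16 is a hypothetical name, and steps are given in elements here whereas IPP steps are in bytes):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy a srcW x srcH image into a larger buffer and fill the added margin
   with the nearest edge pixel. dst must hold (srcH + top + bottom) rows of
   at least (srcW + left + right) elements; steps are in elements. */
void copy_replicate_border_u16(const uint16_t *src, size_t srcStep,
                               int srcW, int srcH,
                               uint16_t *dst, size_t dstStep,
                               int top, int bottom, int left, int right)
{
    int dstW = srcW + left + right;
    int dstH = srcH + top + bottom;
    for (int y = 0; y < dstH; y++) {
        int sy = y - top;                 /* clamp the row into the source */
        if (sy < 0) sy = 0;
        if (sy > srcH - 1) sy = srcH - 1;
        for (int x = 0; x < dstW; x++) {
            int sx = x - left;            /* clamp the column into the source */
            if (sx < 0) sx = 0;
            if (sx > srcW - 1) sx = srcW - 1;
            dst[(size_t)y * dstStep + x] = src[(size_t)sy * srcStep + sx];
        }
    }
}
```

Once the margin exists in memory, the resize call can be pointed at the first interior pixel and told that the border is already present, which is what the ippBorderInMem flag in the function above expresses.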