topic Re: ippiScaleC_32s32f_C1R is slower than simple loop in Intel® Integrated Performance Primitives

ippiScaleC_32s32f_C1R is slower than simple loop

andreypir — Sat, 25 Jul 2020 00:52:46 GMT

Attached is a simple console project showing that, when scaling a matrix, ippiScaleC_32s32f_C1R is slower than an equivalent plain C++ loop. In this example, one column of the source matrix is scaled into an output vector. On my PC with an i7-7700K, the C++ loop is about 20% faster.

Is there any way to improve the performance of ippiScaleC?
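For readers without the attachment, the kind of loop being compared might look roughly like the following. This is a minimal sketch, not the code from the attached project: the function name ScaleColumnWithLoop, the explicit column parameter, and the helper MakeSampleMatrix are assumptions made for illustration. It assumes a row-major matrix and the usual ScaleC behavior of dst = src * Factor + Shift.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the attached project's loop: scale one column of a
// row-major int32 matrix into a contiguous float vector.
// Computes dest[r] = source[r * nColumns + column] * factor + shift.
void ScaleColumnWithLoop(const std::int32_t* source, int nColumns, int nRows,
                         int column, float* dest, double factor, double shift)
{
    assert(column >= 0 && column < nColumns);
    for (int r = 0; r < nRows; ++r) {
        // Consecutive reads are nColumns elements apart (strided access).
        const std::size_t idx = static_cast<std::size_t>(r) * nColumns + column;
        dest[r] = static_cast<float>(source[idx] * factor + shift);
    }
}

// Small deterministic input (illustration only): a 3 x 4 matrix holding 0..11
// in row-major order.
std::vector<std::int32_t> MakeSampleMatrix()
{
    std::vector<std::int32_t> m(12);
    for (int i = 0; i < 12; ++i) m[i] = i;
    return m;
}
```

Calling ScaleColumnWithLoop on the sample matrix with nColumns = 4, nRows = 3, and column = 1 reads the values 1, 5, and 9 and writes their scaled versions to dest.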

Re: ippiScaleC_32s32f_C1R is slower than simple loop

Gennady_F_Intel — Sat, 25 Jul 2020 02:27:02 GMT

Andrey, how could we check this case? There is no reproducer attached to this thread.

Re: ippiScaleC_32s32f_C1R is slower than simple loop

andreypir — Sat, 25 Jul 2020 03:11:11 GMT

Sorry, I thought I had attached the zip. Here it is.

Re: ippiScaleC_32s32f_C1R is slower than simple loop

Gennady_F_Intel — Sat, 25 Jul 2020 04:19:55 GMT

OK, which version of IPP did you test with?

Re: ippiScaleC_32s32f_C1R is slower than simple loop

andreypir — Sat, 25 Jul 2020 06:00:48 GMT

2020.2.254

Re: ippiScaleC_32s32f_C1R is slower than simple loop

andreypir — Sat, 25 Jul 2020 06:02:47 GMT

ippi.dll reports file version 2020.0.2.1083.

Re: ippiScaleC_32s32f_C1R is slower than simple loop

Gennady_F_Intel — Mon, 27 Jul 2020 03:06:53 GMT

Yes, it seems there is a problem on the IPP side, and this function needs further optimization. We will escalate the case.

Re: ippiScaleC_32s32f_C1R is slower than simple loop

Andrey_B_Intel — Mon, 27 Jul 2020 22:39:32 GMT

Hi Andreypir!

IPP works better with rectangular ROIs where it can load whole SIMD registers.

Could you please replace this loop in your code

    for (int n = 0; n < NTESTS_SCALE; n++)
    {
        ScaleWithIPP(Source, nColumns, Dest, nRows, Factor, Shift);
    }

with the following code? I see some speedup on my 64-bit Skylake system.

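    for (int n = 0; n < NTESTS_SCALE; n++)
    {
        // Gather every nColumns-th element (one column) into the contiguous Dest buffer.
        int dLen = 0;
        int phase = 0;
        ippsSampleDown_32f((Ipp32f*)Source, nColumns * nRows, Dest, &dLen, nColumns, &phase);

        // Scale the gathered values in place as a single-row ROI of nRows elements.
        IppiSize roiSize = { nRows, 1 };
        ippiScaleC_32s32f_C1R((Ipp32s*)Dest, nRows * sizeof(__int32), Factor, Shift,
                              Dest, sizeof(float), roiSize, ippAlgHintFast);
    }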

Thanks.
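The idea behind this suggestion (gather the column into contiguous memory first, then scale it as one wide row) can be modeled in plain C++ without IPP. The sketch below is only a reference model of the two steps, not IPP's implementation: the names GatherEveryNth, ScaleContiguous, and ScaleColumnTwoStep are hypothetical, and it uses separate buffers for clarity where the code above works in place on Dest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1 (the role ippsSampleDown_32f plays above): keep every `factor`-th
// element starting at index `phase`. With factor == nColumns this pulls one
// column of a row-major matrix into a contiguous buffer.
std::vector<std::int32_t> GatherEveryNth(const std::vector<std::int32_t>& src,
                                         int factor, int phase)
{
    assert(factor > 0 && phase >= 0 && phase < factor);
    std::vector<std::int32_t> out;
    out.reserve(src.size() / factor + 1);
    for (std::size_t i = static_cast<std::size_t>(phase); i < src.size();
         i += static_cast<std::size_t>(factor)) {
        out.push_back(src[i]);
    }
    return out;
}

// Step 2 (the role of the single-row ippiScaleC call): scale a contiguous
// buffer, dst[i] = src[i] * factor + shift. Unit-stride access like this is
// what lets a vectorized routine load full SIMD registers.
std::vector<float> ScaleContiguous(const std::vector<std::int32_t>& src,
                                   double factor, double shift)
{
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = static_cast<float>(src[i] * factor + shift);
    }
    return dst;
}

// Both steps combined for one column of a row-major matrix.
std::vector<float> ScaleColumnTwoStep(const std::vector<std::int32_t>& matrix,
                                      int nColumns, int column,
                                      double factor, double shift)
{
    return ScaleContiguous(GatherEveryNth(matrix, nColumns, column), factor, shift);
}
```

For a 3 x 4 row-major matrix holding 0..11, ScaleColumnTwoStep with column 0 gathers 0, 4, 8 and then scales those three values. The result should match a direct strided loop over the same column; only the memory access pattern differs.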

Re: ippiScaleC_32s32f_C1R is slower than simple loop

andreypir — Mon, 27 Jul 2020 23:09:02 GMT

Hi Andrey,

Thank you. Yes, downsampling and then scaling is substantially faster than scaling alone, and somewhat faster than the loop:

ScaleWithLoop: 828
ScaleWithIPP: 625

on my computer. I had hoped for a bigger gain, but this will work.
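The thread does not say how these timings were taken or what units they are in. One common way to collect comparable numbers is a small wall-clock harness like the sketch below; the helper name TimeRepeatedMs and the choice of milliseconds are assumptions, not details from the attached project.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: runs `body` `iterations` times and returns the elapsed
// wall-clock time in whole milliseconds, measured with a monotonic clock.
long long TimeRepeatedMs(int iterations, const std::function<void()>& body)
{
    const auto start = std::chrono::steady_clock::now();
    for (int n = 0; n < iterations; ++n) {
        body();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

Each variant (the loop and the IPP calls) would be passed in as a lambda with the same iteration count, for example NTESTS_SCALE. Running each variant once before timing reduces cold-cache and first-call effects.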