Hello,
we are currently using IPP 5.2 in our application, and I am trying to replace it with IPP 2019 via the NuGet package. I don't understand the performance comparison of cubic resize between IPP 5.2 and IPP 2019.
In the resize test, the destination image size is (240, 217), and a part of the source image is zoomed to the destination's size.
When an image of 60 x 54 is zoomed 4 times, the cubic resize function of IPP 5.2 runs faster than IPP 2019's.
When an image of 30 x 27 is zoomed 8 times, IPP 5.2 is still faster than IPP 2019. In this case, IPP 2019 is also slower than IPP 2019 zooming 4 times.
My questions are:
Why is IPP 2019 slower than IPP 5.2?
Why is zooming 8 times with IPP 2019 slower than zooming 4 times, when the processed source image is only a quarter of the size?
Thank you in advance.
Ning
Hi.
I am a bit confused. I've modified the reproducer, and I see the following numbers on my AVX2 systems:
Xeon Silver 4116 2.10Ghz
ippIP AVX2 (h9), 2019.0.5 (r0xc95fdf5f)
( 30, 27) -> ( 60, 54), 42903.60, 9273.60
( 30, 27) -> (120, 108), 137363.20, 20176.80
( 90, 81) -> ( 60, 54), 30096.00, 18941.60
( 90, 81) -> (120, 108), 78498.00, 42430.40
Core i5 7300u 2.7Ghz
ippIP AVX2 (h9), 2019.0.5 (r0xc95fdf5f)
( 30, 27) -> ( 60, 54), 82597.00, 19693.80
( 30, 27) -> (120, 108), 474363.80, 57542.40
( 90, 81) -> ( 60, 54), 62416.60, 56303.40
( 90, 81) -> (120, 108), 106889.40, 50775.00
Ning, could you please build this reproducer as a separate application and send me its output?
Thanks.
Andrey Bakshaev (Intel) wrote:
Ning, could you please build this reproducer as a separate application and send me its output?
Hi Andrey,
here is the output from my computer. Thank you!
ippIP AVX2 (h9), 2019.0.4 (r62443)
( 30, 27) -> ( 60, 54), 35463.60, 8669.20
( 30, 27) -> (120, 108), 122966.40, 18862.40
( 90, 81) -> ( 60, 54), 27424.40, 17297.20
( 90, 81) -> (120, 108), 73044.80, 37000.80
Hi Ning.
The workaround with an additional buffer for small sizes runs 2x-4x faster. Could you please compare these two approaches in your application, inside the function "resize_bench_additional_buffer"?
Thanks.
Andrey Bakshaev (Intel) wrote:
Could you please compare these two approaches in your application, inside the function "resize_bench_additional_buffer"?
Hello Andrey,
I've tested your solution in our application, and it really is faster than before, even faster than IPP 6. Thank you again for this solution!
I don't understand this line:
ippiResizeCubic_16u_C1R(pSrc1 + (srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft, srcStep1, pDst, dst->widthStep, dstOffset, dstRoiSize, ippBorderInMem, &borderValue, pSpec, pBuffer);
specifically this part: (srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft.
Could you please explain what it means?
Thank you in advance!
Kind regards,
Ning
Hi Ning.
ippiMalloc_16u_C1 allocates an image and returns the step between consecutive lines in &srcStep1. This function aligns the data and the step to 64 bytes for performance reasons. Also, steps in IPP are always in bytes. So for the (Ipp16u*) pointers used in the workaround, to calculate the full offset from the image origin we use "offset_from_top" = (srcStep1 >> 1)*borderTop and "offset_from_left" = borderLeft, with "full_offset" = "offset_from_top" + "offset_from_left". I.e.
(srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft is correct.
Andrey.
Andrey Bakshaev (Intel) wrote:
(srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft is correct.
Hello Andrey,
thank you again for the detailed explanation. I still don't completely understand: to calculate "offset_from_top" we multiply by (srcStep1 >> 1), so why don't we have to multiply "offset_from_left" by it as well? And why does srcStep1 need to be divided by 2 (srcStep1 >> 1)?
Kind regards,
Ning
This is a specific of pointer arithmetic in C: adding an integer to an Ipp16u* pointer advances it in units of sizeof(Ipp16u) = 2 bytes. That is why the byte step srcStep1 must be divided by 2 for the row offset, while borderLeft is already a count of Ipp16u elements. To avoid the division we can convert to Ipp8u* pointers:
(Ipp16u*) ((Ipp8u*)pSrc1 + srcStep1*borderSize.borderTop + borderSize.borderLeft * sizeof(Ipp16u))
I am also attaching a picture where srcStep=14, top=2, left=2. The offset from A to B is 14*2 + 2*2 = 32 bytes (or, in shorts, 7*2 + 2 = 16 shorts).
Andrey.
Andrey Bakshaev (Intel) wrote:
The offset from A to B is 14*2 + 2*2 = 32 bytes (or, in shorts, 7*2 + 2 = 16 shorts).
Hello Andrey,
Thank you very much! Now I understand.
Kind regards,
Ning