Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

Resize with Cubic in IPP 2019 slower than IPP 5.2

Liu__Ning
Beginner

Hello,

Currently we are using IPP 5.2 in our application, and I am trying to replace it with IPP 2019 via the NuGet package. I don't understand the performance difference for resize with CUBIC interpolation between IPP 5.2 and IPP 2019.

The resize test works like this: the destination image size is (240, 217), and a part of the source image is zoomed to the destination's size.
When a (60 x 54) image is zoomed by a factor of 4, the cubic resize function of IPP 5.2 runs faster than that of IPP 2019.
When a (30 x 27) image is zoomed by a factor of 8, the cubic resize of IPP 5.2 is still faster than IPP 2019. In addition, IPP 2019 at 8x zoom is slower than IPP 2019 at 4x zoom.

My questions are:

Why is IPP 2019 slower than IPP 5.2?
Why is the 8x zoom slower than the 4x zoom with IPP 2019, even though at 8x the source image is only a quarter of the size used in the 4x case?

Thank you in advance.
Ning

28 Replies
Andrey_B_Intel
Employee

Hi.

I am a bit confused. I've modified the reproducer. On my AVX2 systems I see the following numbers:

Xeon Silver 4116 2.10Ghz

ippIP AVX2 (h9), 2019.0.5 (r0xc95fdf5f)
( 30,  27) -> ( 60,  54),     42903.60,      9273.60
( 30,  27) -> (120, 108),    137363.20,     20176.80
( 90,  81) -> ( 60,  54),     30096.00,     18941.60
( 90,  81) -> (120, 108),     78498.00,     42430.40

Core i5 7300u 2.7Ghz

ippIP AVX2 (h9), 2019.0.5 (r0xc95fdf5f)
( 30,  27) -> ( 60,  54),     82597.00,     19693.80
( 30,  27) -> (120, 108),    474363.80,     57542.40
( 90,  81) -> ( 60,  54),     62416.60,     56303.40
( 90,  81) -> (120, 108),    106889.40,     50775.00

Ning, could you please build this reproducer as a separate application and send me its output?

Thanks.

 

Liu__Ning
Beginner

Andrey Bakshaev (Intel) wrote:

Ning, could you please build this reproducer as a separate application and send me its output?

Hi Andrey,

Here is the output from my computer. Thank you!

ippIP AVX2 (h9), 2019.0.4 (r62443)
( 30,  27) -> ( 60,  54),     35463.60,      8669.20
( 30,  27) -> (120, 108),    122966.40,     18862.40
( 90,  81) -> ( 60,  54),     27424.40,     17297.20
( 90,  81) -> (120, 108),     73044.80,     37000.80

 

Andrey_B_Intel
Employee

Hi Ning.

Well, the workaround with an additional buffer for small sizes runs 2x-4x faster. Could you please compare these two approaches in your application using the function "resize_bench_additional_buffer"?

Thanks.

Liu__Ning
Beginner

Andrey Bakshaev (Intel) wrote:

Could you please compare these two approaches in your application using the function "resize_bench_additional_buffer"?

Hello Andrey,

I've tested your solution inside our application; it is really faster than before, even faster than IPP6. Thank you again for this solution!

However, I don't understand this line:

ippiResizeCubic_16u_C1R(pSrc1 + (srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft,
                srcStep1, pDst, dst->widthStep, dstOffset, dstRoiSize, ippBorderInMem, &borderValue, pSpec, pBuffer);

specifically this part: (srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft.

Could you please explain what it means?

Thank you in advance!

Kind regards,

Ning

Andrey_B_Intel
Employee

Hi Ning.

ippiMalloc_16u_C1 allocates the image and returns the step between consecutive lines in &srcStep1. For performance reasons this function aligns both the data and the step to 64 bytes, and steps in IPP are always in bytes. So for the (Ipp16u*) pointers used in the workaround, the full offset from the image origin is offset_from_top + offset_from_left, where offset_from_top = (srcStep1 >> 1) * borderTop (the byte step divided by 2 gives the step in Ipp16u elements) and offset_from_left = borderLeft. I.e.

(srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft is correct.

Andrey.

Liu__Ning
Beginner

Andrey Bakshaev (Intel) wrote:

(srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft is correct.

Hello Andrey,

Thank you again for the detailed explanation. I still don't completely understand: to calculate offset_from_top we multiply by (srcStep1 >> 1), so why doesn't offset_from_left need to be multiplied by anything? And why does srcStep1 have to be divided by 2 (srcStep1 >> 1)?

Kind regards,

Ning

Andrey_B_Intel
Employee

This is a peculiarity of pointer arithmetic in C: adding an integer to an Ipp16u* pointer advances it by that many 2-byte elements, not bytes. srcStep1 is in bytes, so it is shifted right by 1 to convert it to a count of Ipp16u elements per row, while borderLeft is already a count of pixels (elements) and needs no conversion. To avoid the division we can convert to Ipp8u* pointers instead:

(Ipp16u*) ((Ipp8u*)pSrc1 + srcStep1*borderSize.borderTop  +  borderSize.borderLeft * sizeof(Ipp16u))

I am also attaching a picture, where srcStep=14, top=2, left=2. The offset from A to B is 14*2 + 2*2 = 32 bytes (or, in shorts, (7*2) + 2 = 16 shorts).

Andrey.

 

Liu__Ning
Beginner

Andrey Bakshaev (Intel) wrote:

(Ipp16u*) ((Ipp8u*)pSrc1 + srcStep1*borderSize.borderTop  +  borderSize.borderLeft * sizeof(Ipp16u))

Hello Andrey,

Thank you very much! Now I understand.

Kind regards,

Ning
