Hello,
we are currently using IPP 5.2 in our application, and I am trying to replace it with IPP 2019 via the NuGet package. I don't understand the performance comparison of cubic resize between IPP 5.2 and IPP 2019.
In the resize test, the destination image size is (240, 217), and a part of the source image is zoomed to the destination's size.
When an image of 60 x 54 is zoomed 4 times, the cubic resize function of IPP 5.2 runs faster than IPP 2019's.
When an image of 30 x 27 is zoomed 8 times, IPP 5.2 is still faster than IPP 2019. In this case, IPP 2019 is also slower than IPP 2019 zooming 4 times.
My questions are:
Why is IPP 2019 slower than IPP 5.2?
Why is zooming 8 times with IPP 2019 slower than zooming 4 times, when the processed source image is only a quarter of the size?
Thank you in advance.
Ning
Hi.
I am a bit confused. I've modified the reproducer, and I see the following numbers on my AVX2 systems:
Xeon Silver 4116 2.10Ghz
ippIP AVX2 (h9), 2019.0.5 (r0xc95fdf5f)
( 30, 27) -> ( 60, 54), 42903.60, 9273.60
( 30, 27) -> (120, 108), 137363.20, 20176.80
( 90, 81) -> ( 60, 54), 30096.00, 18941.60
( 90, 81) -> (120, 108), 78498.00, 42430.40
Core i5 7300u 2.7Ghz
ippIP AVX2 (h9), 2019.0.5 (r0xc95fdf5f)
( 30, 27) -> ( 60, 54), 82597.00, 19693.80
( 30, 27) -> (120, 108), 474363.80, 57542.40
( 90, 81) -> ( 60, 54), 62416.60, 56303.40
( 90, 81) -> (120, 108), 106889.40, 50775.00
Ning, could you please build this reproducer as a separate application and send me its output?
Thanks.
Andrey Bakshaev (Intel) wrote:
Ning, could you please build this reproducer as a separate application and send me its output?
Hi Andrey,
here is the output from my computer. Thank you!
ippIP AVX2 (h9), 2019.0.4 (r62443)
( 30, 27) -> ( 60, 54), 35463.60, 8669.20
( 30, 27) -> (120, 108), 122966.40, 18862.40
( 90, 81) -> ( 60, 54), 27424.40, 17297.20
( 90, 81) -> (120, 108), 73044.80, 37000.80
Hi Ning.
The workaround with an additional buffer for small sizes runs 2x-4x faster. Could you please compare these two approaches in your application, inside the function "resize_bench_additional_buffer"?
Thanks.
Andrey Bakshaev (Intel) wrote:
Could you please compare these two approaches in your application, inside the function "resize_bench_additional_buffer"?
Hello Andrey,
I've tested your solution in our application, and it really is faster than before, even faster than IPP 6. Thank you again for this solution!
I don't understand this line:
ippiResizeCubic_16u_C1R(pSrc1 + (srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft, srcStep1, pDst, dst->widthStep, dstOffset, dstRoiSize, ippBorderInMem, &borderValue, pSpec, pBuffer);
specifically this part: (srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft.
Could you please explain what it means?
Thank you in advance!
Kind regards,
Ning
Hi Ning.
ippiMalloc_16u_C1 allocates an image and returns the step between consecutive lines in &srcStep1. This function aligns the data and the step to 64 bytes for performance reasons. Also, steps in IPP are always in bytes. So for the (Ipp16u*) pointers used in the workaround, to calculate the full offset from the image origin we use "offset_from_top" = (srcStep1 >> 1)*borderTop and "offset_from_left" = borderLeft, with "full_offset" = "offset_from_top" + "offset_from_left". I.e.
(srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft is correct.
Andrey.
Andrey Bakshaev (Intel) wrote:
(srcStep1 >> 1)*borderSize.borderTop + borderSize.borderLeft is correct.
Hello Andrey,
thank you again for the detailed explanation. I still don't completely understand: to calculate "offset_from_top" we multiply by (srcStep1 >> 1), so why don't we have to multiply "offset_from_left" by it as well? And why does srcStep1 need to be divided by 2 (srcStep1 >> 1)?
Kind regards,
Ning
This is a specific of pointer arithmetic in C: adding an integer to an Ipp16u* pointer advances it in units of sizeof(Ipp16u) = 2 bytes. That is why the byte step srcStep1 must be divided by 2 for the row offset, while borderLeft is already a count of Ipp16u elements. To avoid the division we can convert to Ipp8u* pointers:
(Ipp16u*) ((Ipp8u*)pSrc1 + srcStep1*borderSize.borderTop + borderSize.borderLeft * sizeof(Ipp16u))
I am also attaching a picture where srcStep=14, top=2, left=2. The offset from A to B is 14*2 + 2*2 = 32 bytes (or, in shorts, 7*2 + 2 = 16 shorts).
Andrey.
Andrey Bakshaev (Intel) wrote:
The offset from A to B is 14*2 + 2*2 = 32 bytes (or, in shorts, 7*2 + 2 = 16 shorts).
Hello Andrey,
Thank you very much! Now I understand.
Kind regards,
Ning