I have a 1.5 GB image that I would like to scale down using Super interpolation, on an Intel(R) Xeon(R) CPU E5-2460 v3 @ 2.66 GHz (2 processors, 32 cores) with 128 GB of memory.
Using Intel IPP 2019 and 2020, I see that the more threads I use, the slower the Super-interpolation scale-down becomes. When testing with ippiu8-5.2.dll (I don't know which IPP version that is...), I get faster performance when I use more threads. The problem does not exist for Cubic interpolation, which works as expected.
The sample below shows the time to scale down a 1.5 GB image with Super interpolation by a factor of 0.27 using different numbers of threads. Each case was tested 3 times:
Using Intel IPP 2019 and 2020:
threads = 4:  842 ms, 670 ms, 655 ms
threads = 8:  718 ms, 718 ms, 749 ms
threads = 16: 967 ms, 920 ms, 921 ms
threads = 24: 1201 ms, 1092 ms, 1170 ms
Using the old version (ippu8-5.2.dll):
threads = 4:  1092 ms, 1123 ms, 1092 ms
threads = 8:  577 ms, 562 ms, 562 ms
threads = 16: 375 ms, 375 ms, 374 ms
threads = 24: 249 ms, 249 ms, 265 ms
Is there any solution to get better results when using more threads in IPP 2019 and 2020 for Resize with Super interpolation?
Thanks for the information.
I came across another problem, which has to do with the number of rows into which you divide the image.
Scaling down an image is based on an output-driven concept: for each pixel of the output image, we calculate the corresponding source pixel, and according to the interpolation that pixel gets its value.
Now, you modified the code (by creating a function called CImage::ResizeMod) to improve the performance. Your approach was to set tileHeight to yScale * 100; for example, for scale 0.25 you create tileHeight = 25 rows. Each thread runs over several tiles.
The problem I see is:
The more tiles the image is divided into, the less accurate the scale-down becomes. For example, if a pixel from row 7348 of the source image is scaled down by 0.25, I expect to see it in row 7348 * 0.25 = 1837. However, I see the pixel shifted by 4 rows, which is a big deviation.
The problem is mostly visible for rows near the end of the image.
I decided to modify your function and called it CImage::ResizeMod2. That function sets tileHeight to image.rows / numThreads, and each thread works on one tile. So I have bigger tiles, and fewer of them, than with your function CImage::ResizeMod.
Doing so fixed the deviation.
I hope I have been clear about what I wanted to describe...
I attached the modified files from the test application.
Please check whether the pixel location in the output image is as expected according to the scale factor.
It is very simple to test: pick a pixel from the source image, note its (x, y) position, and calculate its expected position in the scaled-down image. Then verify that the pixel is positioned as expected in the scaled-down image. Do this for pixels close to the rows at the end of the image.
If I'm not clear, let me know and I will try to explain better.
The test application asks which resize scheme to use...
By the way, I also added an option to choose the parallel mechanism, since I wanted to compare the performance of parallel_for vs. std::thread.
Here is the updated example that splits the image both by rows and by columns. I added two new parameters: the number of vertical chunks and the number of horizontal chunks. On my side the example works well with 96 vertical chunks and 64 horizontal chunks, but you can find the parameters that work best for you; the best values may differ for different numbers of threads, image sizes, and scale factors. Also, it looks like there is no quality problem of the kind you described; I added a couple of test routines to the CImage class (CreateTestImage() and CheckTestImage()). Could you please check performance and quality on your side?
A short test indeed showed better results.
I don't understand how splitting the image into chunks along both the x and y axes helps to get better results. Could you please explain it with a diagram, or in some other simple way?
The test application crashes for images bigger than 2 GB. The image size is 2.11 GB (30720x24568).
I modified the code to read such an image (by replacing the int data type with size_t in various places).
However, the crash is in ippiResizeSuper_8u_C3R.
Could you please check why it crashes on images bigger than 2 GB?
I attached back your files with my modifications to use size_t...
I found the reason for the crash on images bigger than 2 GB and fixed it.
As part of changing the int data type to size_t, I changed a line in the function ResizeByIntel2019Mod_resize from
Ipp8u* psrc = (Ipp8u*)((Ipp8u*)pSrcPixels + (srcOffset.y * srcStride) + (srcOffset.x * channels));
to
Ipp8u* psrc = (Ipp8u*)((Ipp8u*)pSrcPixels + ((size_t)srcOffset.y * srcStride) + (srcOffset.x * channels));
For images bigger than 2 GB, srcOffset.y * srcStride overflows because it is evaluated as int rather than size_t.
For your information...
I attached the files that support using images bigger than 2 GB.
This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.