Threading example for Gaussian Blur

Chao_Y_Intel · 04-05-2017

Attached is an example of external threading with the IPP function ippiFilterGaussianBorder_32f_C1R.

The code shows a sequential IPP version, an OpenMP-threaded version, and a threaded version that uses the new IPP APIs with 64-bit lengths.
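For readers without the attachment, here is a minimal sketch of the general idea behind external threading of a border filter. It does not call IPP: blur_rows, band_range and banded_matches_sequential are hypothetical helpers, and a vertical 3-tap [1 2 1]/4 kernel with replicated borders stands in for the Gaussian. The point it illustrates is that each band must read its neighbouring rows from memory and replicate only at the true image edges, otherwise the seams between bands differ from the sequential result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Vertical 3-tap blur [1 2 1]/4 over rows [row0, row1) of a w x h image.
   Rows outside the image are replicated; rows outside the band but inside
   the image are read from memory, which keeps the band seams exact. */
static void blur_rows(const float *src, float *dst, int w, int h,
                      int row0, int row1)
{
    for (int y = row0; y < row1; ++y) {
        const float *up   = src + (size_t)(y > 0 ? y - 1 : 0) * w;
        const float *mid  = src + (size_t)y * w;
        const float *down = src + (size_t)(y < h - 1 ? y + 1 : h - 1) * w;
        for (int x = 0; x < w; ++x)
            dst[(size_t)y * w + x] =
                0.25f * up[x] + 0.5f * mid[x] + 0.25f * down[x];
    }
}

/* Split h rows into nbands nearly equal bands; band b covers [*row0, *row1). */
static void band_range(int h, int nbands, int b, int *row0, int *row1)
{
    assert(nbands > 0 && b >= 0 && b < nbands);
    int base = h / nbands, extra = h % nbands;
    *row0 = b * base + (b < extra ? b : extra);
    *row1 = *row0 + base + (b < extra ? 1 : 0);
}

/* Returns 1 if the banded result is bit-identical to the single-pass one. */
static int banded_matches_sequential(int w, int h, int nbands)
{
    size_t n = (size_t)w * h;
    float *src = malloc(n * sizeof *src);
    float *ref = malloc(n * sizeof *ref);
    float *out = malloc(n * sizeof *out);
    if (!src || !ref || !out) { free(src); free(ref); free(out); return 0; }
    for (size_t i = 0; i < n; ++i) src[i] = (float)((i * 31u) % 97u);

    blur_rows(src, ref, w, h, 0, h);   /* sequential reference */

    /* Bands write disjoint rows, so the loop can run in parallel. */
    #ifdef _OPENMP
    #pragma omp parallel for
    #endif
    for (int b = 0; b < nbands; ++b) {
        int r0, r1;
        band_range(h, nbands, b, &r0, &r1);
        blur_rows(src, out, w, h, r0, r1);
    }

    int same = memcmp(ref, out, n * sizeof *ref) == 0;
    free(src); free(ref); free(out);
    return same;
}
```

Calling banded_matches_sequential(64, 37, 4) exercises an uneven split (37 rows over 4 bands). With a real IPP border filter the same concern applies, but how inner bands are told that their neighbours are in memory depends on the function's border parameters, so check the attached code and the IPP reference for that part.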

Chao_Y_Intel · 04-05-2017

The file is attached.

Royi · 04-12-2017

@Chao,

Do you see any speed gains over the regular Gaussian Blur (Single Threaded)?

Thank You.

Pavel_V_Intel · 04-14-2017

Hi Royi,

Gaussian filtering isn't computationally intensive and is mostly memory-bound, so it doesn't scale well.

Below are some numbers for my 4-core i7-4771 (Haswell) CPU. The new API makes threading more effective for small sizes, but for large data the results are about the same. If you have more memory bandwidth relative to single-CPU compute power (e.g., a multi-CPU system), it should scale better.

Intel IPP Classic API, 4 threads:
Size            | Kernel        | Time ST (ms)  | Time MT (ms)  | Ratio         | Accuracy
640x480         | 3             | 0.102439      | 0.088249      | 1.160788      | 0.000000
1280x720        | 3             | 0.380726      | 0.221540      | 1.718545      | 0.000000
1920x1080       | 3             | 0.898260      | 0.853800      | 1.052072      | 0.000000
3840x2160       | 3             | 4.248261      | 4.722100      | 0.899655      | 0.000000
7680x4320       | 3             | 17.135358     | 18.369172     | 0.932832      | 0.000000
640x480         | 5             | 0.145480      | 0.104358      | 1.394053      | 0.000000
1280x720        | 5             | 0.567356      | 0.217716      | 2.605943      | 0.000000
1920x1080       | 5             | 1.149345      | 0.877235      | 1.310190      | 0.000000
3840x2160       | 5             | 5.410349      | 4.663034      | 1.160264      | 0.000000
7680x4320       | 5             | 24.268234     | 18.621293     | 1.303252      | 0.000000

Intel IPP Platform-Aware API, 4 threads:
Size            | Kernel        | Time ST (ms)  | Time MT (ms)  | Ratio         | Accuracy
640x480         | 3             | 0.102084      | 0.031085      | 3.284073      | 0.000000
1280x720        | 3             | 0.333712      | 0.127450      | 2.618374      | 0.000000
1920x1080       | 3             | 0.835933      | 0.901792      | 0.926968      | 0.000000
3840x2160       | 3             | 4.185757      | 4.617177      | 0.906562      | 0.000000
7680x4320       | 3             | 17.079477     | 12.959909     | 1.317870      | 0.000000
640x480         | 5             | 0.141791      | 0.046700      | 3.036188      | 0.000000
1280x720        | 5             | 0.522360      | 0.207828      | 2.513427      | 0.000000
1920x1080       | 5             | 1.143512      | 0.915082      | 1.249628      | 0.000000
3840x2160       | 5             | 5.303817      | 4.674198      | 1.134701      | 0.000000
7680x4320       | 5             | 21.571494     | 12.741365     | 1.693028      | 0.000000
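
One way to read these timings against the "memory bound" remark is to convert them into effective bandwidth. The sketch below is only a rough estimate under a stated assumption: each 32-bit float pixel is read once and written once, with no other memory traffic. effective_gbps is a hypothetical helper, not part of IPP.

```c
#include <assert.h>

/* Effective bandwidth in GB/s (1e9 bytes per second), ASSUMING each 32-bit
   float pixel is read once and written once and nothing else touches memory. */
static double effective_gbps(int width, int height, double time_ms)
{
    assert(width > 0 && height > 0 && time_ms > 0.0);
    double bytes = 2.0 * (double)width * (double)height * (double)sizeof(float);
    return bytes / (time_ms * 1e-3) / 1e9;
}
```

For the 7680x4320, kernel 3 row of the Platform-Aware table, effective_gbps(7680, 4320, 17.079477) gives roughly 15.5 GB/s for one thread and effective_gbps(7680, 4320, 12.959909) roughly 20.5 GB/s for four. If those figures are close to what the memory system can sustain, adding threads cannot help much, which would be consistent with the flat ratios at large sizes.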

Royi · 04-15-2017

@Pavel,

What's "Platform Aware API"?
I see it uses <>_L which I have never seen.

How do those compare to the Gaussian Blur implementation in the Multi Threaded Library?

Thank You.

Pavel_V_Intel · 04-17-2017

What's "Platform Aware API"?
I see it uses <>_L which I have never seen

This is a new API that supports 64-bit memory sizes. It is declared in the _l.h headers, and all of its functions have the _L suffix.

Also, the new API for Gaussian has improved accuracy and performance. I forgot to take that into account in the previous measurements. Here the threaded versions are compared with the sequential versions of the same API:

Intel IPP Classic API, 4 threads:
Size            | Kernel        | Time ST (ms)  | Time MT (ms)  | Ratio         | Accuracy
640x480         | 5             | 0.193650      | 0.106887      | 1.811725      | 0.000000
1280x720        | 5             | 0.612832      | 0.218480      | 2.804982      | 0.000000
1920x1080       | 5             | 1.901937      | 1.180005      | 1.611804      | 0.000000
3840x2160       | 5             | 7.701745      | 4.749115      | 1.621722      | 0.000000
7680x4320       | 5             | 34.441417     | 18.632457     | 1.848464      | 0.000000

Intel IPP Platform-Aware API, 4 threads:
Size            | Kernel        | Time ST (ms)  | Time MT (ms)  | Ratio         | Accuracy
640x480         | 5             | 0.141898      | 0.046471      | 3.053461      | 0.000000
1280x720        | 5             | 0.519318      | 0.190972      | 2.719336      | 0.000000
1920x1080       | 5             | 1.154827      | 0.919344      | 1.256143      | 0.000000
3840x2160       | 5             | 5.292439      | 4.695001      | 1.127250      | 0.000000
7680x4320       | 5             | 21.487982     | 12.724610     | 1.688695      | 0.000000

As for the Threaded Library, it doesn't seem that the Gaussian there is threaded at all:

Intel IPP Classic Threaded API, 4 threads:
Size            | Kernel        | Time ST (ms)  | Time MT (ms)  | Ratio         | Accuracy
640x480         | 5             | 0.196998      | 0.192450      | 1.023634      | 0.000000
1280x720        | 5             | 0.621252      | 0.611541      | 1.015880      | 0.000000
1920x1080       | 5             | 1.763726      | 1.761997      | 1.000981      | 0.000000
3840x2160       | 5             | 7.491765      | 7.533920      | 0.994405      | 0.000000
7680x4320       | 5             | 33.324637     | 32.030418     | 1.040406      | 0.000000

Also, here is a comparison between the APIs in single-threaded mode:

Intel IPP Classic API vs Platform-Aware API:
Size            | Kernel        | Time C (ms)   | Time PA (ms)  | Ratio         | Accuracy
640x480         | 5             | 0.194554      | 0.142810      | 1.362328      | 0.000000
1280x720        | 5             | 0.620718      | 0.512520      | 1.211109      | 0.000000
1920x1080       | 5             | 1.767950      | 1.133819      | 1.559288      | 0.000000
3840x2160       | 5             | 7.499314      | 5.308956      | 1.412578      | 0.000000
7680x4320       | 5             | 33.466464     | 21.056124     | 1.589393      | 0.000000

Pavel_V_Intel · 04-17-2017

My code is in the attachment.

Royi · 04-17-2017

@Pavel,

This is great service!

So it seems the new functions are even better optimized.
When it says "Platform Aware", what happens behind the scenes that makes it faster?
When you say 64 Bit Memory Size, what do you mean (my guess: better optimization for 64 Bit Memory Channels)?

As a feature request, why don't you make a "Multi Threading Template" for Border Type Filters where the user only needs to send a function pointer and have "Multi Threaded" performance out of the box?
You could also create an efficient template for "Pixel Wise" operations.

Thank You.

Pavel_V_Intel · 04-17-2017

When it says "Platform Aware", what happens behind the scenes that makes it faster?
When you say 64 Bit Memory Size, what do you mean (my guess: better optimization for 64 Bit Memory Channels)?

Nothing so fancy. It simply means that the functions take input parameters of architecture-dependent size, like size_t. Platform Aware functions use the IppSizeL type to pass parameters such as memory steps and memory sizes. On x86 IppSizeL is a 32-bit signed integer; on x86_64 it is a 64-bit signed integer. E.g., you can pass a size > 2 GB to ippMalloc_L, while ippMalloc is limited to the int type.

Gaussian in particular is faster with the new API simply because only the new API version was optimized further.
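
To make the size limit concrete, here is a small sketch in plain C. It does not use IPP; image_bytes_32f and fits_in_int are hypothetical helpers, and int64_t merely stands in for a 64-bit size type such as the one described above on x86_64.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Byte size of a width x height image of 32-bit floats, computed in
   64 bits so the product does not overflow for realistic dimensions. */
static int64_t image_bytes_32f(int64_t width, int64_t height)
{
    assert(width >= 0 && height >= 0);
    return width * height * (int64_t)sizeof(float);
}

/* Nonzero if the size could be passed through an int-typed parameter. */
static int fits_in_int(int64_t bytes)
{
    return bytes >= 0 && bytes <= INT_MAX;
}
```

An 8K frame (7680x4320, about 133 MB) still fits in an int, but a 32768x32768 float image needs 4 GiB, so fits_in_int returns 0 for it on platforms with a 32-bit int. That second case is the kind of buffer that needs a 64-bit size parameter.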

As a feature request, why don't you make a "Multi Threading Template" for Border Type Filters where the user only needs to send a function pointer and have "Multi Threaded" performance out of the box?
You could also create an efficient template for "Pixel Wise" operations.

Some function groups can be templated like this, but this would actually require quite a few different templates.

Our current long-term replacement for the Threading Library is the Threading Layer API (_tl.h headers). You can also try the Integration Wrappers. They don't provide threading themselves, but they make threading easier for complex functions. https://software.intel.com/en-us/forums/intel-integrated-performance-primitives/topic/704063
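
For the simplest case, pixel-wise operations, such a template could look roughly like the sketch below. All names here (pixel_op_fn, run_chunked, scale_op) are hypothetical and nothing in it is an IPP API; it only shows the function-pointer pattern from the request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: process `count` pixels from src into dst. */
typedef void (*pixel_op_fn)(const float *src, float *dst,
                            size_t count, void *ctx);

/* Hypothetical dispatcher: split n pixels into nchunks contiguous chunks
   and run `op` on each one. Chunks do not overlap, so the loop can be
   threaded when OpenMP is enabled; otherwise it runs sequentially. */
static void run_chunked(pixel_op_fn op, const float *src, float *dst,
                        size_t n, int nchunks, void *ctx)
{
    assert(op != NULL && nchunks > 0);
    #ifdef _OPENMP
    #pragma omp parallel for
    #endif
    for (int c = 0; c < nchunks; ++c) {
        size_t begin = n * (size_t)c / (size_t)nchunks;
        size_t end   = n * (size_t)(c + 1) / (size_t)nchunks;
        op(src + begin, dst + begin, end - begin, ctx);
    }
}

/* Example operation: multiply every pixel by the float that ctx points to. */
static void scale_op(const float *src, float *dst, size_t count, void *ctx)
{
    float k = *(const float *)ctx;
    for (size_t i = 0; i < count; ++i)
        dst[i] = src[i] * k;
}
```

A call such as run_chunked(scale_op, src, dst, n, 4, &k) scales n pixels in four chunks. This shape only works when each output pixel depends on one input pixel. Border filters also need neighbouring rows and border handling per chunk, and other function groups need other signatures, which may be part of why a single template would not be enough.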

Royi · 12-23-2017

@Pavel,

It seems there is a bug in your implementation.
The result isn't identical to that of the regular Gaussian Blur (something at the boundaries).

Have you compared it to the regular Gaussian Blur?
On my system the result isn't identical.

Thank You.