I'm using Intel IPP 8 (with the latest 2015 edition of Composer).
Intel IPP offers Min / Max Filters and Erode / Dilate Filters.
One could use Erode (with an all-ones mask) for Min and Dilate (with an all-ones mask) for Max.
Yet what I'm curious about is why Min / Max aren't faster than Erode / Dilate.
Since they are special ("private") cases, I would assume they could be optimized much further.
Could anyone comment on that?
Do any of them benefit from using the multi-threaded library?
You may check it yourself on your particular CPU type. For this purpose, IPP provides a performance measurement system. You can find it in IPPROOT\windows\ipp\tools\intel64\perfsys\
I don't understand how your answer addresses the issue.
The issue is as follows: how come the Min / Max filters are the same speed as the Erode / Dilate filters?
They should be much faster.
Moreover, I asked whether those are properly multi-threaded so that gains are made.
1. The performance depends on the CPU type you are working on; therefore, using the perf system on your CPU with your problem sizes would help you find the answers to these questions. 2. To see the list of all threaded APIs, refer to the ThreadedFunctionsList.txt file located in the documentation directory of the Intel IPP installation.
It's very easy to understand: when the morphology mask is filled with all "1"s, the IPP function calls FilterMin or FilterMax. Regarding your question "if those are properly Multi Threaded so gains are made": they are not threaded. You can find a list of threaded functions in the "threaded_functions_list.txt" file that is available with each IPP release; only the advanced morphology functionality was threaded.
You confirm what we thought happens behind the scenes: that those are the same.
We were afraid the opposite was happening, namely that calling FilterMax / FilterMin calls Dilation / Erosion, which would mean not all optimizations are applied.
Just for knowledge: what happens with functions which are not multi-threaded when I link against the multi-threaded static version of IPP?
The multi-threaded version of the IPP functions is built with the /Qopenmp compiler switch, and in some cases, if a function doesn't have an OMP code branch, the Intel compiler may perform auto-threading of the IPP code - but this is a very rare case. In most cases, non-threaded functions have the same code as in the single-threaded library.
One note: the IPP threaded libraries are in a deprecated state. Currently we provide a so-called "platform-aware" API set (functions with an "_L" suffix that use 32-bit int size parameters on ia32 and 64-bit on Intel64), and we provide a threading layer for this set ("_TL" suffix) that is available in source and prebuilt forms. The main goal is to force IPP customers to build threaded pipelines of IPP functions with "by-tile" processing instead of calling internally threaded functions. External threading for some pipeline that works by tiles is 10x-100x more efficient than the approach existing in the threaded IPP libraries. In IPP 2017 about 600-700 functions got such APIs, and in the future we are going to extend this list up to 100%.
> External threading for some pipeline that works by tiles is 10x-100x more efficient than approach existing in threaded IPP libraries
The conclusion from that is that the "approach existing" inside IPP is wrong and should be improved, instead of forcing IPP users to do the threading themselves...
> about ~600-700 functions got such APIs and in the future we are going to extend this list up to 100%.
As I noted before, not all functions can be tiled externally. One example is the Fourier transform.
Adriaan van Os
Hi Adriaan van Os,
~17 years ago we performed IPP optimization for the Pentium Pro CPU. Then for Itanium. Then for IXP... Now none of them are supported by the library. Does that mean the approach was wrong? We started OMP optimization for IPP more than 10 years ago. At the initial stage, that was the most advanced known/possible approach to threading. After several years we came to the DMIP approach (the Deferred Mode Image Processing sample shipped with IPP 7 and 8). It is the most advanced approach to threading (in our view): it creates pipelines of IPP functions and processes them by tiles. Internal threading is not a wrong approach, as it provides a visible speedup over single-threaded code and is very simple to use. As I wrote above, we provide source and binary versions of the "_TL" APIs; therefore anyone can use the "simplest" approach and link with the binary, or use the advanced one and create a pipeline from the provided sources (which are just wrappers over the "_L" functions that perform tiling and threading).
I agree with the statement that not all functions can be tiled/threaded externally, but the FFT can be tiled externally: for 2 threads you can run 2 FFTs of order N-1 and then perform a radix-2 butterfly; for 4 threads, radix-4, etc. For a 2D FFT you can use "slices" instead of "tiles" without any additional post-radix processing. Of course this will be a special node in such a pipeline, and the FFT by rows will be a "blocking" point, but after transposition and FFT over the "slice" columns, the first "slice" can already be used for the next stage. Life is not easy...
Thank you for your answer and sharing your knowledge.
I think you should give the user the option of both:
a library where each function is optimized to the max using multi-threading, and a library which is single-threaded with tiling optimization.
I'm not an expert and am taking my first steps.
In my case I'm doing Gaussian Blur, then Max, then Gaussian Blur.
I'm not sure that using tiling would help in this case, as all the operations are spatial.
Hence either each tile will carry a large overhead (and small areas to keep everything in cache), or I will need to synchronize after each operation, which basically brings me back to classic processing.
How would you use the new functions in this kind of case (cascaded simple spatial operations, each limited mainly by memory)?
I, for example, would expect Intel IPP to have 2 versions of Gaussian Blur.
One is the classic one, and the other is an "approximation" using an O(n) method optimized to the max (the world's fastest CPU Gaussian Blur).
Practically, it is much slower than Photoshop's Gaussian Blur.
Come on, that is absurd; then you can just as well do all of the FFT yourself. Libraries are there to make things simple, not complex. One uses a library to save time, not to have to write everything oneself. To keep things simple, the library can do the threading (and just as fast as external threading).
Of course, it is Intel that owns IPP and makes the decisions. But there are no technical reasons for that decision. I just strongly object to the impression given here that external threading is faster than (the right) internal threading. That is simply not true. If there is a political decision to discontinue internal threading (so be it), it shouldn't be sold as a technical decision.
Adriaan van Os