Why is IPP DMIP slower than Concurrency::parallel_for?

Mikhail_Matrosov — Tue, 30 Apr 2013 11:42:23 GMT

I've downloaded the Intel IPP DMIP sample package, ipp-samples.7.1.1.013, and built the application\dmip_bench\ utility against IPP v7.1.1. It showed a significant performance boost for the DMIP flavor over the IPP flavor.

I then refactored the ModifyBrightness::DoIPP method to simply process the image by rows, and parallelized this processing with Concurrency::parallel_for. Then I rebuilt the solution with each of the _IPP_SEQUENTIAL_STATIC and _IPP_PARALLEL_DYNAMIC macros. The results were unexpected.

With _IPP_SEQUENTIAL_STATIC:

DMIP 1.5 Jul 12 2012
ippIP SSSE3 (v8) 7.1.1 (r37466) Sep 24 2012
ippCV SSSE3 (v8) 7.1.1 (r37466) Sep 24 2012
ippCC SSSE3 (v8) 7.1.1 (r37466) Sep 25 2012
Number of threads: 2
DMIP Modify Brightness example time 3.16375 msec slice 34
IPP Modify Brightness example time 1.85974 msec slice 467
Close the session

With _IPP_PARALLEL_DYNAMIC:

DMIP 1.5 Jul 12 2012
ippIP SSSE3 (v8) 7.1.1 (r37466) Sep 27 2012
ippCV SSSE3 (v8) 7.1.1 (r37466) Sep 27 2012
ippCC SSSE3 (v8) 7.1.1 (r37466) Sep 28 2012
Number of threads: 2
DMIP Modify Brightness example time 2.34378 msec slice 34
IPP Modify Brightness example time 6.75662 msec slice 467
Close the session

As you can see, the manually parallelized version built with _IPP_SEQUENTIAL_STATIC (1.86 msec) works better than DMIP in either build (3.16 and 2.34 msec). Why?

I compiled with Visual Studio 2010 under Windows 7 x64; the solution configuration was x86. I have an Intel E6550 processor. I used an RGB 1200x467 image.

I have attached the modified sample, with compiled executables and output logs.
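
For readers without the attachment, the row-wise parallelization described in this post can be sketched roughly as follows. This is not the poster's code: it assumes std::thread in place of Concurrency::parallel_for so that it builds outside Visual Studio, and a plain saturating loop in place of the per-row IPP call. The names brighten_row and brighten_parallel are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Add `delta` to every byte of one row, saturating to [0, 255].
// Assumption: this loop stands in for the per-row IPP call of the modified sample.
void brighten_row(std::uint8_t* row, std::size_t row_bytes, int delta) {
    for (std::size_t i = 0; i < row_bytes; ++i) {
        const int v = static_cast<int>(row[i]) + delta;
        row[i] = static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
    }
}

// Split the image into contiguous bands of rows and process one band per thread.
// `step_bytes` is the distance between row starts, `row_bytes` the used bytes per row.
void brighten_parallel(std::uint8_t* data, int height, std::size_t step_bytes,
                       std::size_t row_bytes, int delta, unsigned workers) {
    if (height <= 0) return;
    workers = std::max(1u, std::min(workers, static_cast<unsigned>(height)));
    const int n = static_cast<int>(workers);
    const int rows_per_band = (height + n - 1) / n;  // ceiling division

    std::vector<std::thread> pool;
    for (int w = 0; w < n; ++w) {
        const int first = w * rows_per_band;
        const int last = std::min(height, first + rows_per_band);
        if (first >= last) break;  // no rows left for this worker
        pool.emplace_back([=] {
            for (int y = first; y < last; ++y)
                brighten_row(data + static_cast<std::size_t>(y) * step_bytes,
                             row_bytes, delta);
        });
    }
    for (auto& t : pool) t.join();  // wait for all bands to finish
}
```

For the 1200x467 RGB image on a two-core machine, a call would look like brighten_parallel(pixels, 467, 3600, 3600, 20, 2), assuming tightly packed rows. Each thread touches a disjoint band of rows, so no synchronization is needed beyond the final join.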

Mikhail_Matrosov — Thu, 16 May 2013 06:29:03 GMT

Still waiting for response.

Sergey_K_Intel — Thu, 16 May 2013 06:57:20 GMT

Hi Mikhail,

Sorry for the late response. Could you go back to the unmodified version of dmip_bench and compare the sequential_static vs. parallel_dynamic results?

I suspect that linking a DMIP-based application to the threaded libraries only harms overall performance because of thread oversubscription. Alternatively, you could call ippSetNumThreads(1) somewhere in main(). DMIP itself already uses all available CPU cores, and if the application splits execution further by adding new threads, nothing good will come of it.

Regards,
Sergey
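
The oversubscription concern in this reply comes down to simple arithmetic, sketched below. The helper names inner_thread_budget and total_threads are hypothetical and are not part of IPP or DMIP; the sketch only assumes that an outer layer (DMIP or the application) already runs a number of workers, and that each worker may call into a library that spawns threads of its own.

```cpp
#include <algorithm>

// Number of threads each inner (library-level) call should use when
// `outer_workers` application-level workers already run on `cores` cores.
// Hypothetical helper for illustration only.
unsigned inner_thread_budget(unsigned cores, unsigned outer_workers) {
    cores = std::max(1u, cores);
    outer_workers = std::max(1u, outer_workers);
    return std::max(1u, cores / outer_workers);
}

// Total threads runnable at once under nested parallelism.
unsigned total_threads(unsigned outer_workers, unsigned inner_threads) {
    return outer_workers * inner_threads;
}
```

On the two-core machine from the logs, two outer workers that each call a library running two threads give total_threads(2, 2) == 4 runnable threads on two cores. inner_thread_budget(2, 2) returns 1, which matches the suggestion to call ippSetNumThreads(1) when the outer layer already occupies every core.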

Mikhail_Matrosov — Thu, 16 May 2013 09:28:12 GMT

Sergey,

I believe there is no need for such a test. What is important is that my own naive parallelization with the single-threaded libraries works faster than DMIP linked against either the single- or the multi-threaded libraries. Could you run the provided code on your machine and check it?

SergeyKostrov — Thu, 16 May 2013 13:28:01 GMT

Hi Mikhail,

>>...What important is that my own naive parallelization for single-threaded libraries works faster than DMIP linked
>>against both single- and multi-threaded libraries...

This is possibly because your code has less overhead or partitions the data set in a better way (as you know, cache-related issues can significantly affect performance).

Mikhail_Matrosov — Thu, 16 May 2013 13:52:07 GMT

Sergey,

That's just it: I expected DMIP to partition the data in the most effective way. It is said to know the cache sizes and all the other hardware details.

My point is, I am afraid to use DMIP in my projects after these results. I had hoped I was doing something wrong.

SergeyKostrov — Thu, 16 May 2013 14:07:00 GMT

>>...My point is, I afraid to use DMIP in my projects after these results....

I don't think DMIP is very popular compared to IPP, and as you know, additional software layers add overhead (reduce performance). That is why .NET applications are slower than pure C/C++ applications, etc.

Sergey_K_Intel — Thu, 16 May 2013 15:27:51 GMT

Mikhail,

I am currently investigating this. The performance profile data looks quite strange (I'm not sure the table will display well):

Top Hotspots

Function                      CPU Time
DMIP::Trace::Space            33.166s
own_ipps_sExp_G9LAynn          5.493s
g9_innerRGBToGray_8u_C3C1R     3.551s
[dmip-1.5.dll]                 2.947s
ippGetCpuFreqMhz               2.325s
[Others]                      11.318s

I will check what's going wrong.

Regards,
Sergey

Sergey_K_Intel — Tue, 28 May 2013 13:06:59 GMT

Hi Mikhail,

I found the cause of the issue. The DMIP.dll linked to your application is statically linked with IPP 7.0.x (so it contains the IPP 7.0.x code), while your separate IPP calls are linked to the newer 7.1 library, which probably contains better-optimized versions of the functions you use. This is why the manually parallelized pipeline of IPP functions works better.

I linked the DMIP object files with IPP 7.1 into another DMIP.DLL, and this combination shows exactly the same results as your parallelized IPP calls.

If you ask me when a DMIP.dll linked with IPP 7.1 will be released, I can't answer. The future of the DMIP project is currently under consideration. Do you feel that this image processing implementation has some potential?

Regards,
Sergey

Mikhail_Matrosov — Tue, 28 May 2013 14:03:19 GMT

Dear Sergey,

Thank you for your thorough investigation of the issue!

I think DMIP will be popular only if it provides a simple and transparent interface, so that its ratio of simplicity to performance beats that of manual parallelization and GPU techniques. It is easier for us to use IPP than to write the arithmetic manually, and it's much faster. And it is way easier than going to the GPU.

I'm sure DMIP will become very popular if it is integrated into OpenCV's cv::Mat arithmetic expression system. OpenCV already constructs a graph automatically from very simple and intuitive operator overloading patterns, like D = (A + B) * C. For now, we are not using OpenCV because it lacks integration with IPP and internal parallelization. But resolving both of these issues doesn't look like a tricky task.

Sergey_K_Intel — Tue, 28 May 2013 15:15:57 GMT

Thank you for your valuable thoughts! They will definitely be taken into account.

Best regards,
Sergey

SergeyKostrov — Wed, 29 May 2013 13:49:38 GMT

>>...For now, we are not using OpenCV because it lacks integration with IPP and internal parallelization...

OpenCV is a very old (12+ years) library and was not designed to do processing in parallel. It simply wasn't a project objective.