Extremely poor performance for Sum functions in IPP

www_q_ · ‎11-13-2012

Consider the following code: //code 1: IPP for (int i=0;i; }; }; Code 2 is 30-40 times faster than code 1, even when the data size is fairly large (1 million in size). How could this be the case?

www_q_ · ‎11-13-2012

//code 1: IPP for (int i=0;i; }; };

SergeyKostrov · ‎11-13-2012

I recently evaluated the performance of the ippsAdd_32f and ippsSub_32f IPP functions against classic 'for'-based calculations (similar to yours), and the IPP functions were at least twice as fast. ippsSum_32f is similar to some degree to the functions I mentioned. Take into account that performance depends on the IPP version, the CPU, and the instruction set selected. You need to provide more technical details (IPP version, a complete test-case, CPU, etc.) because your statement is too generic.

www_q_ · ‎11-13-2012

See the attachment for the complete code I used for the test. The code is very simple: it computes the pair-wise distances between vectors.

The compiler I used is Intel C++ compiler 13.0, with the IPP that comes with it (Intel C++ Composer XE 2013). The compiler settings are x86-64 / favor faster code / AVX enabled / highest optimisation level (O3) / OpenMP enabled / IPP multi-threaded static lib. The system is Windows 7 64-bit. The test was run on a 4-core Intel Sandy Bridge i7 CPU (mobile, I think it is i7 2670Q..).

As for the test results: overall, the classic for-loop routine is about 4X faster than the IPP routine. I suspect the function-calling overheads/memory latencies are huge there.

Note: in the code, I used the timing functions timer::toc()/timer::tic() from arrayfire (a GPU computing package) to time my code. If you don't have that software installed, you can replace the timing functions with the classic clock() or whatever else you have, and of course, don't forget to remove the header (arrayfire.h) and (using namespace af) from the code altogether.
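For readers without the attachment, the kind of for-loop routine described above can be sketched as follows. This is not the attached code: the squared Euclidean metric, the row-major layout, and the names pairwise_sq_dist and time_ms are all assumptions made for illustration. The timing helper uses std::chrono from the standard library in place of the arrayfire timer.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: squared Euclidean distance between every pair of
// vectors. `points` is row-major and holds `n` vectors of dimension `dim`.
// Returns an n*n row-major matrix of squared distances.
std::vector<float> pairwise_sq_dist(const std::vector<float>& points,
                                    std::size_t n, std::size_t dim) {
    assert(points.size() == n * dim);
    std::vector<float> dist(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < dim; ++k) {
                const float d = points[i * dim + k] - points[j * dim + k];
                acc += d * d;
            }
            dist[i * n + j] = acc;  // the matrix is symmetric,
            dist[j * n + i] = acc;  // so fill both halves at once
        }
    }
    return dist;
}

// Hypothetical helper: wall-clock time of one call to `fn`, in milliseconds.
template <typename Fn>
double time_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as time_ms([&] { pairwise_sq_dist(points, n, dim); }) times one run. For a fair comparison, the same wrapper would be placed around the IPP-based routine.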

www_q_ · ‎11-13-2012

As for the code pasted in my first post here, it is simply a loop that sums a vector 100 times, comparing IPP's sum function against a classic for-loop. With the same compiler/system settings I listed, the for-loop is 30-40X faster than the IPP function.

SergeyKostrov · ‎11-13-2012

>>... the for-loops is 30-40X faster than the IPP functions there...

That is very impressive and something is wrong. Here are results of my quick verification:

Data Size : 1048576 bytes
Number of Tests: 1024

*** Test Results 1.1 - ippsSum_32f with ippAlgHintNone ***
[ Generic Sum ] Completed - Rolled Loops - 1-in-1
Base
Sum: 1048576.000000
[ Generic Sum ] Completed - UnRolled Loops - 4-in-1
76% faster than Base
Sum: 1048576.000000
[ ippsSum_32f ] Completed
86% faster than Base
Sum: 1048576.000000

*** Test Results 1.2 - ippsSum_32f with ippAlgHintFast ***
[ Generic Sum ] Completed - Rolled Loops - 1-in-1
Base
Sum: 1048576.000000
[ Generic Sum ] Completed - UnRolled Loops - 4-in-1
76% faster than Base
Sum: 1048576.000000
[ ippsSum_32f ] Completed
86% faster than Base
Sum: 1048576.000000

*** Test Results 1.3 - ippsSum_32f with ippAlgHintAccurate ***
[ Generic Sum ] Completed - Rolled Loops - 1-in-1
Base
Sum: 1048576.000000
[ Generic Sum ] Completed - UnRolled Loops - 4-in-1
76% faster than Base
Sum: 1048576.000000
[ ippsSum_32f ] Completed
86% faster than Base
Sum: 1048576.000000
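The "Rolled Loops - 1-in-1" and "UnRolled Loops - 4-in-1" labels in these results refer to two ways of writing the generic sum. The sketch below shows one plausible reading of them. The actual test-case is not shown in the thread, so the function names and the use of four separate accumulators are assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Rolled loop (1-in-1): one element is added per iteration.
float sum_rolled(const float* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    float sum = 0.0f;
    for (std::size_t i = 0; i < len; ++i) {
        sum += data[i];
    }
    return sum;
}

// Unrolled loop (4-in-1): four elements are added per iteration.
// Separate accumulators (an assumption of this sketch) break the
// dependency chain between consecutive additions.
float sum_unrolled4(const float* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < len; ++i) {  // leftover elements when len is not a multiple of 4
        s0 += data[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Because floating-point addition is not associative, the two versions can differ in the last bits on general data. With all-ones input both are exact as long as the running total stays below 2^24.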

SergeyKostrov · ‎11-13-2012

>>...I recently evaluated a performance of ippsAdd_32f and ippsSub_32f IPP functions against a classic 'for'-based calculations ( similar to
>>yours ) and IPP functions are at least twice faster...

Here are test results:

*** Test Results 2.1 ***
Data Size : 524288 bytes
Number of Tests: 1024
[ Generic Add ] Completed - Rolled Loops - 1-in-1
Executed in: 1270 ticks
[ Generic Sub ] Completed - Rolled Loops - 1-in-1
Executed in: 1161 ticks
[ Generic Add ] Completed - UnRolled Loops - 4-in-1
Executed in: 739 ticks
[ Generic Sub ] Completed - UnRolled Loops - 4-in-1
Executed in: 743 ticks
[ ippsAdd_32f ] Completed
Executed in: 581 ticks
[ ippsSub_32f ] Completed
Executed in: 575 ticks

*** Test Results 2.2 ***
Data Size : 1048576 bytes
Number of Tests: 1024
[ Generic Add ] Completed - Rolled Loops - 1-in-1
Executed in: 2410 ticks
[ Generic Sub ] Completed - Rolled Loops - 1-in-1
Executed in: 2293 ticks
[ Generic Add ] Completed - UnRolled Loops - 4-in-1
Executed in: 1490 ticks
[ Generic Sub ] Completed - UnRolled Loops - 4-in-1
Executed in: 1485 ticks
[ ippsAdd_32f ] Completed
Executed in: 1153 ticks
[ ippsSub_32f ] Completed
Executed in: 1160 ticks

SergeyKostrov · ‎11-13-2012

Attached is my test-case for the ippsSum_32f function. Could you try to execute it on a single core with OpenMP disabled (in other words, as a single-threaded application)?

SergeyKostrov · ‎11-13-2012

>>...I suspect the function-calling overheads/memory latencies are huge there... Let's wait for a response from Intel Software Engineers. What IPP version is installed on your computer?

Ying_H_Intel · ‎11-13-2012

Hi Sergey, www_q_, thanks for raising the issue here. Before we investigate it, I noticed some problems with the code:

1) With static linking, ippInit() must be called before any other IPP function. Please see http://software.intel.com/en-us/articles/ipp-dispatcher-control-functions-ippinit-functions
2) Regarding OpenMP threads and IPP internal threading, there is some discussion in http://software.intel.com/en-us/articles/openmp-and-the-intel-ipp-library . We don't recommend nesting IPP and OpenMP threads. Also disable HT, etc.
3) Function-call overhead: yes, it is possible. A series of IPP function calls has overhead, and running the operations one after another increases the number of memory reads and writes. As a result, memory latency may eat the benefit of the faster computation.

To simplify the performance comparison, as Sergey suggested, how about trying serial code for both the loops and the serial IPP library? Please let us know the result for the first post after adding ippInit() and changing the OpenMP setting. Best Regards, Ying
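Point 3 can be made concrete without IPP itself. The sketch below computes a squared distance two ways: as a chain of whole-vector passes through a temporary buffer (the pattern a sequence of calls such as ippsSub_32f followed by a squaring step and ippsSum_32f would produce), and as a single fused loop. The function names are hypothetical, and plain loops stand in for the library calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multi-pass version: each pass streams the whole vector through memory,
// and the first two also write a temporary buffer.
float sq_dist_multipass(const std::vector<float>& a,
                        const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> tmp(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        tmp[i] = a[i] - b[i];        // pass 1: stands in for a vector subtract
    }
    for (std::size_t i = 0; i < tmp.size(); ++i) {
        tmp[i] = tmp[i] * tmp[i];    // pass 2: stands in for a vector square
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < tmp.size(); ++i) {
        sum += tmp[i];               // pass 3: stands in for a vector sum
    }
    return sum;
}

// Fused version: one pass, no temporary buffer, each input read once.
float sq_dist_fused(const std::vector<float>& a,
                    const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
```

Both functions return the same value. The multi-pass version touches roughly three times as much memory and pays a per-call cost for each pass, which matters most when the vectors are short or do not fit in cache.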

SergeyKostrov · ‎11-14-2012

Hi Ying, what did you want to say here (between the left and right arrows)? >>...2) Regarding the OpenMP threads and IPP internal threading. Some discussion are in <>...

SergeyKostrov · ‎11-14-2012

>>...Don't recommend to use nested IPP and OpenMP threads. also disable HT etc... I've verified my test-case in my software development environment and it was a pure single-threaded test.

www_q_ · ‎11-14-2012

@Ying: This raises an interesting question: do I really need to call ippInit() there? I was intentionally not calling that function, hoping to reduce some function-call overhead. In the Intel compiler options (Visual Studio), I selected the option that tells the compiler to use the specialized, optimized AVX path, so the program can only run on a CPU that supports AVX (all my target systems have CPUs with AVX support, so I couldn't care less about the other paths). With this setting, I have not noticed any performance difference so far between calling ippInit() and not calling it, so I have not used that function very often lately (hoping to reduce function-call overhead a little...). So my question is: are you sure that I still need to call this initialization function in my code, even when the Intel compiler is set to use the AVX-only optimized path? As for the other suggestions from both of you, many thanks; I will try them when I am at work.

Ying_H_Intel · ‎11-15-2012

@www_q_, good question. Because IPP is a library, external compiler options such as AVX support won't influence the IPP code. ippInit() should be called once, before any other IPP function is called. With static linking, it dispatches the CPU-specific optimized code for your application. Otherwise, no CPU-optimized code will run, unless you manually initialize the dispatcher to ensure optimum performance of the IPP library with your application. Best Regards, Ying
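The general idea of run-time dispatch described here can be illustrated with a toy example. This is not IPP's implementation: the toy namespace and every name in it are hypothetical, and the bool argument stands in for whatever CPU detection a real library performs. The sketch only shows why compiler flags applied to the calling code do not select the code path inside an already-compiled library.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a library with run-time dispatch. All names are hypothetical.
namespace toy {

using SumFn = float (*)(const float*, std::size_t);

// Baseline path that runs on any CPU.
float sum_generic(const float* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Stand-in for a CPU-specific path (here just a 4-way unrolled loop).
float sum_optimized(const float* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i]; s1 += p[i + 1]; s2 += p[i + 2]; s3 += p[i + 3];
    }
    for (; i < n; ++i) s0 += p[i];
    return (s0 + s1) + (s2 + s3);
}

// The active implementation; it stays generic until init() is called.
inline SumFn& active_sum() {
    static SumFn fn = sum_generic;
    return fn;
}

// One-time initialization: chooses the path for all later calls.
void init(bool cpu_has_fast_path) {
    active_sum() = cpu_has_fast_path ? sum_optimized : sum_generic;
}

// Public entry point: always calls through the dispatch pointer.
float sum(const float* p, std::size_t n) { return active_sum()(p, n); }

bool using_optimized() { return active_sum() == sum_optimized; }

}  // namespace toy
```

In this model, a program that never calls toy::init() still gets correct results from toy::sum(), but only through the generic path, which mirrors the behaviour described above for static linking without ippInit().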