Decreased performance of a sort function with IPP 8.0

P_v_ · ‎07-31-2013

Hello!!

I recently moved from IPP 7.1 to IPP 8.0 and compared the performance of the two versions.

With IPP 8.0, and only for the sort function “ippsSortRadixIndexAscend_32f”, I see worse performance than with IPP 7.1.

I obtain:

55 CPE for IPP7.1
78 CPE for IPP8.0

(Each vector consists of 1024 random samples, and I run 40,000 executions to obtain these averages. I use the function ippGetCpuClocks() to read the CPU clock count. My processor is an Intel Dual Core E5400, 2.70 GHz.)

Do you have an explanation?

Thank you,

Pierre
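
Editor's sketch: the figures above are averages expressed in CPE, which is assumed here to mean clocks per element (the Perf System output later in the thread reports clocks "per e"). The following minimal Python sketch shows how such an average could be derived from a total clock count. The helper name `clocks_per_element` and the totals are hypothetical, chosen only to reproduce the reported 55 and 78 CPE; they are not measurements.

```python
def clocks_per_element(total_clocks, n_elements, n_runs):
    """Average clocks per element over n_runs calls of n_elements each."""
    if n_elements <= 0 or n_runs <= 0:
        raise ValueError("n_elements and n_runs must be positive")
    return total_clocks / (n_elements * n_runs)


# Setup described in the post: 1024 samples per vector, 40,000 executions.
N_ELEMENTS = 1024
N_RUNS = 40_000

# Hypothetical clock totals, back-computed from the reported averages.
total_71 = 55 * N_ELEMENTS * N_RUNS
total_80 = 78 * N_ELEMENTS * N_RUNS

cpe_71 = clocks_per_element(total_71, N_ELEMENTS, N_RUNS)
cpe_80 = clocks_per_element(total_80, N_ELEMENTS, N_RUNS)
ratio = cpe_80 / cpe_71  # how many times more clocks IPP 8.0 needs

print(f"IPP 7.1: {cpe_71:.1f} CPE, IPP 8.0: {cpe_80:.1f} CPE, ratio {ratio:.2f}")
```

Under these assumptions the reported numbers correspond to IPP 8.0 needing roughly 1.4 times the clocks of IPP 7.1 for this function.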

Thomas_Jensen1 · ‎07-31-2013

First, you should verify which CPU/library was used in the IPP 7 case as well as in the IPP 8 case.

Possibly your IPP 7 code selects the best CPU/library for your E5400, but your IPP 8 code does not.

If you link to the dynamic IPP DLLs, you could use, for instance, Sysinternals Process Explorer to see which IPP CPU/library DLL was loaded.

P_v_ · ‎07-31-2013

Thanks for your response.

The CPU/library is the same for the two versions (ippsv8-<version number>.dll).

I studied the influence of the vector size. The gap (in number of CPU clocks) between the two IPP versions is always the same whatever the size of the input vector (as if “nop” functions had been added to the sort function “ippsSortRadixIndexAscend_32f” in IPP 8.0).

I ran the same program on another processor (Intel Core i3-2100, 3.091 GHz). On this processor, the performance of the sort function is the same in IPP 8.0 and IPP 7.1.

Is the Dual Core E5400 perhaps deprecated in the new version of IPP?

Pierre

SergeyKostrov · ‎07-31-2013

>>...The vectors are constituted by 1024 random samples and I realize 40,000 executions to obtain these averages..

Could you verify the performance of the ippsSortRadixIndexAscend_32f function (v7.x and v8.x) on an 8 MB array with 1,024 executions?

P_v_ · ‎08-01-2013

I checked the performance of the sort function on an 8 MB array and obtained:

for IPP8-0 : 730CPU
for IPP7-1 : 685CPU

These two times are always different.

SergeyKostrov · ‎08-01-2013

>>>>...The vectors are constituted by 1024 random samples...
>>...These 2 time are always diferent

Try to make your tests deterministic. That is, pre-generate an array of numbers and then use it to measure the performance of both functions in all test iterations. Many sorting algorithms, by design, take different amounts of time on different data sets (not to be confused with the asymptotic complexity of a sorting algorithm). Overall, you should have reproducible measurements between tests, and performance numbers should not differ by more than +/-(0.5% - 1.0%).
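
Editor's sketch: one way to set up the deterministic test described above is to generate the input once from a fixed seed and reuse the same data for every timed iteration. The Python sketch below is only an illustration of that idea. It does not call IPP: `index_sort_ascending` is a hypothetical stand-in for the function under test, and the seed, run count, and helper names are assumptions.

```python
import random
import time


def make_input(n, seed=12345):
    """Pre-generate n floats from a fixed seed so every test sees the same data."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def index_sort_ascending(values):
    """Stand-in for the function under test: indices that order values ascending."""
    return sorted(range(len(values)), key=values.__getitem__)


def time_runs(func, data, runs):
    """Time func(data) `runs` times; return the per-run durations in nanoseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        func(data)
        timings.append(time.perf_counter_ns() - start)
    return timings


data = make_input(1024)  # same 1024 samples for every iteration and every build
timings = time_runs(index_sort_ascending, data, runs=200)

print("min ns :", min(timings))
print("mean ns:", sum(timings) / len(timings))
```

Because the input never changes, any remaining run-to-run spread comes from the environment rather than from the data, which is what makes the two library versions comparable.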

SergeyKostrov · ‎08-01-2013

>>I check the performance of sort function on an 8Mb array and I obtaine
>>
>>- for IPP8-0 : 730CPU
>>- for IPP7-1 : 685CPU

In that test with random numbers, ippsSortRadixIndexAscend_32f in IPP8-0 is ~6.2% slower than in IPP7-1. Please repeat the tests with the same numbers in the array, as I already described.
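
Editor's sketch: the size of the gap depends on which value is taken as the baseline. The short Python calculation below uses only the two numbers quoted above (730 and 685) and is meant as an arithmetic check, not as an additional measurement. The helper name `percent_change` is hypothetical.

```python
def percent_change(new, baseline):
    """Signed change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


ipp80 = 730.0  # value reported for IPP 8.0
ipp71 = 685.0  # value reported for IPP 7.1

extra_vs_71 = percent_change(ipp80, ipp71)  # IPP 8.0 relative to IPP 7.1
saved_vs_80 = percent_change(ipp71, ipp80)  # IPP 7.1 relative to IPP 8.0

print(f"IPP 8.0 vs 7.1 baseline: {extra_vs_71:+.1f}%")
print(f"IPP 7.1 vs 8.0 baseline: {saved_vs_80:+.1f}%")
```

With IPP 8.0 as the baseline the difference is about 6.2%; with IPP 7.1 as the baseline it is about 6.6%. Either way the gap is much smaller than the one reported for 1024-sample vectors.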

Igor_A_Intel · ‎08-01-2013

Hi Pierre,

"Average" is not the right statistic for performance measurements; please check the "min" instead. "Average" includes DLL load time and some other OS activity. I've checked both IPP versions with IPP PS (perf system, available in the package) for single-threaded static libs, and I don't observe any degradation. A reproducer would be appreciated for more detailed analysis.

regards, Igor
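
Editor's sketch: the point about "average" versus "min" can be illustrated with a handful of made-up timings in which one run is inflated by a one-time cost such as a DLL load. The sample values below are hypothetical and chosen only to show the effect.

```python
import statistics

# Hypothetical per-call clock counts; the last one stands for a run that
# also paid a one-time cost (for example a DLL load or an OS interruption).
samples = [55, 56, 55, 57, 55, 900]

mean_clocks = statistics.mean(samples)
median_clocks = statistics.median(samples)
min_clocks = min(samples)

print(f"mean   = {mean_clocks:.1f}")
print(f"median = {median_clocks:.1f}")
print(f"min    = {min_clocks}")
```

A single outlier pulls the mean far from the typical cost, while the minimum stays at the best achievable time, which is why the minimum is the more stable figure when comparing two library versions.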

P_v_ · ‎08-01-2013

>>>pre-generate an array of numbers

I tried pre-generated vectors of 1024 samples stored in a binary file.

>>>...for single threaded static libs - I don't observe any degradation

My previous test used dynamic linking.

I tested with the single-threaded static libs and I don't observe any degradation for either "average" or "min".

Thanks for your help.

Igor_A_Intel · ‎08-01-2013

Pierre,

I guess I know the root of this issue: I think you've linked with the dynamic libs installed by default. For 8.0, the default installation contains only single-threaded dynamic libraries (for multi-threaded ones you should check one more checkbox in the thin-client install), while 7.x has only multi-threaded DLLs. This functionality is internally threaded.

regards, Igor

SergeyKostrov · ‎08-01-2013

It is very easy to verify how many threads were created for tests with the function (in IPP versions 7 and 8). Just take a look at Windows Task Manager (Processes property page). For IPP version 7, try to set the number of threads to 1, repeat the tests, and please post the results. Thanks.

P_v_ · ‎08-02-2013

I have verified the number of threads and I have:

2 threads for IPP7-1 with dynamic linkage and 1 for static linkage
1 thread for IPP8-0 with dynamic linkage and 1 for static linkage

For IPP7-1 with dynamic linkage, I set the number of threads to 1 with the function ippSetNumThreads.

I still observe a significant difference between the two versions for both the average CPU and the min CPU (around 25% difference for both).

Pierre

Igor_A_Intel · ‎08-02-2013

Pierre,

IPP PS (perf system) doesn't show any difference, so could you attach your measuring program? I need a reproducer to understand and analyse the issue.

regards, Igor

P_v_ · ‎08-02-2013

I have tested the sort function "ippsSortRadixIndexAscend_32f" with Perf System.

For IPP7.1, I ran the program with the following command line: ps_ipps.exe -r -o -f"ippsSortRadixIndexAscend_32f" -N1. The option -N1 sets the number of threads to 1 (as in IPP8.0).

For IPP8.0, I ran the program with the following command line: ps_ipps.exe -r -o -f"ippsSortRadixIndexAscend_32f"

The results for IPP7-1 are

CPU,Processor supporting Supplemental Streaming SIMD Extension 3 instruction set, 2x2.66 GHz, Max cache size 2048 K
OS,Windows 7 Professional Service Pack 1 (Win32)
Computer,SIC-004
Library,ippSP SSE2 (w7), 7.1.1 (r37466), Sep 27 2012
Start,Fri Aug 02 17:03:25 2013
function,Parm1,Parm2,Parm3,Parm4,Parm5,Parm6,Parm7,Parm8,Comment,Clocks,per,Time (usec),MFlops
ippsSortRadixIndexAscend,32f,-,1024,1,-,-,-,-,nLps=8,64.7,e,24.9,-
ippsSortRadixIndexAscend,32f,-,1024,2,-,-,-,-,nLps=8,56.1,e,21.6,-

The results for IPP8-0 are

CPU,Processor supporting Supplemental Streaming SIMD Extension 3 instruction set, 2x2.66 GHz, Max cache size 2048 K
OS,Windows 7 Professional Service Pack 1 (Win32)
Computer,SIC-004
Library,ippSP SSE2 (w7), 8.0.0 (r40040), May 22 2013
Start,Fri Aug 02 17:18:13 2013
function,Parm1,Parm2,Parm3,Parm4,Parm5,Parm6,Parm7,Parm8,Comment,Clocks,per,Time (usec),MFlops
ippsSortRadixIndexAscend,32f,-,1024,1,-,-,-,-,nLps=8,80.3,e,30.9,-
ippsSortRadixIndexAscend,32f,-,1024,2,-,-,-,-,nLps=8,75.2,e,29,-

(I have also tested Perf System for the two versions with the Copy function on 1024 Ipp32f values; the results are similar between the two versions.)

I'm sorry, but I will be out of the office for the next three weeks with no internet access, so I will not be able to answer.

regards

Pierre
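
Editor's sketch: the two Perf System listings in the post above can be compared row by row. The Python sketch below copies the header and the four data rows verbatim from the post and computes how many more clocks IPP 8.0 needs than IPP 7.1. Rows are matched on the `Parm4` column simply because it is the only column that differs between the two rows of each run; the thread does not say what that parameter means. The helper name `clocks_by_parm4` is hypothetical.

```python
import csv
import io

HEADER = ("function,Parm1,Parm2,Parm3,Parm4,Parm5,Parm6,Parm7,Parm8,"
          "Comment,Clocks,per,Time (usec),MFlops")

# Data rows copied from the post (IPP 7.1.1 and IPP 8.0.0, w7 library).
ROWS_71 = """ippsSortRadixIndexAscend,32f,-,1024,1,-,-,-,-,nLps=8,64.7,e,24.9,-
ippsSortRadixIndexAscend,32f,-,1024,2,-,-,-,-,nLps=8,56.1,e,21.6,-"""
ROWS_80 = """ippsSortRadixIndexAscend,32f,-,1024,1,-,-,-,-,nLps=8,80.3,e,30.9,-
ippsSortRadixIndexAscend,32f,-,1024,2,-,-,-,-,nLps=8,75.2,e,29,-"""


def clocks_by_parm4(rows_text):
    """Map the Parm4 value of each row to its Clocks value."""
    reader = csv.DictReader(io.StringIO(HEADER + "\n" + rows_text))
    return {row["Parm4"]: float(row["Clocks"]) for row in reader}


clocks_71 = clocks_by_parm4(ROWS_71)
clocks_80 = clocks_by_parm4(ROWS_80)

# Extra clocks needed by IPP 8.0, in percent of the IPP 7.1 value.
slowdown = {key: (clocks_80[key] / clocks_71[key] - 1.0) * 100.0
            for key in clocks_71}

for key, pct in slowdown.items():
    print(f"Parm4={key}: IPP 8.0 needs {pct:.1f}% more clocks than IPP 7.1")
```

On these rows the difference is roughly 24% for the first row and roughly 34% for the second, which is in line with the gap of around 25% reported earlier in the thread.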