Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Java polynomial approx. faster than C code polynomial approx.

Bernard
Valued Contributor I
429 Views
Hi everybody!
This question slightly resembles my other thread posted here: http://software.intel.com/en-us/forums/showthread.php?t=105474
While porting my library of elementary and special functions from Java to C, I implemented polynomial approximation as a few posters advised me. Now I use polynomial approximation in my library wherever it is applicable. I was interested in comparing the performance of the same implementation written in managed code and in native code. To my big surprise, the Java code always executed faster than the native code.
After studying the asm code, and knowing that the Intel C++ compiler uses security cookie checking and fills the buffer with 128 int 3 (0xcc) instructions right after the function's prologue, I came to the conclusion that this compiler-induced overhead is responsible for the slower execution of the C code.
Here are the tests taken from my thread http://software.intel.com/en-us/forums/showthread.php?t=105474
Can anybody help me understand why native code can be so slow when compared to Java code?

Result for native code, 1 million iterations:

start value of fastsin(): 39492698 // Native code
end value of fastsin() : 39492760
delta of fastsin() is : 62 millisec
sine is: 0.841470444509448080000000

java -server

C:\Program Files\Java\jdk1.7.0\bin>java -server SineFunc
start value : 1339596068015
end value : 1339596068045
running time of fastsin() is :30 milisec

java -client

C:\Program Files\Java\jdk1.7.0\bin>java -client SineFunc
start value : 1339596081083
end value : 1339596081130
running time of fastsin() is :47 milisec

Here is the fastsin() prologue

00000  55                  push  ebp
00001  8b ec               mov   ebp, esp
00003  81 ec 80 00 00 00   sub   esp, 128           ; 00000080H
00009  57                  push  edi
0000a  8d 7d 80            lea   edi, DWORD PTR [ebp-128]
0000d  b9 20 00 00 00      mov   ecx, 32            ; 00000020H
00012  b8 cc cc cc cc      mov   eax, -858993460    ; ccccccccH
00017  f3 ab               rep stosd                ; <-- can this be the culprit for the slower code execution?

And here is the code. The Java implementation is identical to this code.

double fastsin(double x){
    double sum = 0;
    double half_pi, zero_arg;
    half_pi = Pi/2;          // Pi and Zero are assumed to be defined elsewhere in the library
    zero_arg = Zero;

    if(x > half_pi){         // simple input checking, range 0 .. Pi/2
        return (x-x)/(x-x);  // 0/0 = NaN for out-of-range input
    }else if (x < zero_arg){
        return (x-x)/(x-x);
    }else{
        double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10,coef11,rad,sqr;
        coef1  = -0.16666666666666666666666666666667;   // 1/3!
        coef2  = 0.00833333333333333333333333333333;    // 1/5!
        coef3  = -1.984126984126984126984126984127e-4;  // 1/7!
        coef4  = 2.7557319223985890652557319223986e-6;  // 1/9!
        coef5  = -2.5052108385441718775052108385442e-8; // 1/11!
        coef6  = 1.6059043836821614599392377170155e-10; // 1/13!
        coef7  = -7.6471637318198164759011319857881e-13;// 1/15!
        coef8  = 2.8114572543455207631989455830103e-15; // 1/17!
        coef9  = -8.2206352466243297169559812368723e-18;// 1/19!
        coef10 = 1.9572941063391261230847574373505e-20; // 1/21!
        coef11 = -3.8681701706306840377169119315228e-23;// 1/23!
        rad = x;
        sqr = x*x;           // x^2

        // Taylor series of sin(x), evaluated with Horner's scheme
        sum = rad+rad*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));
    }
    return sum;
}
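For reference, here is a minimal higher-resolution timing harness ( a sketch only, assuming a Windows build; the loop bounds and names are illustrative, not the code used for the numbers above ). GetTickCount() typically ticks only every 10-16 ms, while QueryPerformanceCounter() resolves to well under a microsecond:

#include <windows.h>
#include <stdio.h>

double fastsin(double x);            /* the function shown above */

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    double arg = 0.0, acc = 0.0;
    int i;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    for (i = 0; i < 1000000; i++) {
        arg += 1.0e-7;
        acc += fastsin(arg);         /* accumulate so the calls are not optimized away */
    }

    QueryPerformanceCounter(&t1);

    printf("elapsed: %.3f ms, acc = %f\n",
           1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart,
           acc);
    return 0;
}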

0 Kudos
11 Replies
SergeyKostrov
Valued Contributor II
429 Views
Hi Iliya,

Quoting iliyapolak
...To my big surprise the Java code always executed faster than the native code. After studying the asm code and
knowing that the Intel C++ compiler uses
security cookie checking...

[SergeyK] Please see my comment 3 at the end of the post.

...and fills the buffer with 128 int 3 (0xcc)
instructions right after the function's prologue. I came to the conclusion that this compiler-induced overhead
is responsible for the slower execution of the C code.

[SergeyK] Could you provide a complete list of compiler options? I noticed that Intel C++ compiler codes
are always 2x slower when ALL optimizations are disabled. My point of view is based on
my test cases verified with Intel, Microsoft, MinGW and Borland C++ compilers.
It is interesting that your C implementation is also 2x slower.

...
result for native code 1 million iterations.

...delta of fastsin() is : 62 millisec...

java -server

...running time of fastsin() is : 30 milisec...

java -client

...running time of fastsin() is : 47 milisec...

Here is the fastsin() prologue

00000  55                  push  ebp
00001  8b ec               mov   ebp, esp
00003  81 ec 80 00 00 00   sub   esp, 128           ; 00000080H
00009  57                  push  edi
0000a  8d 7d 80            lea   edi, DWORD PTR [ebp-128]
0000d  b9 20 00 00 00      mov   ecx, 32            ; 00000020H
00012  b8 cc cc cc cc      mov   eax, -858993460    ; ccccccccH
00017  f3 ab               rep stosd                ; <-- can this be the culprit for the slower code execution?

And here is the code. The Java implementation is identical to this code.

double fastsin(double x){
    double sum = 0;
    double half_pi, zero_arg;
    half_pi = Pi/2;          // Pi and Zero are assumed to be defined elsewhere in the library
    zero_arg = Zero;

    if(x > half_pi){         // simple input checking, range 0 .. Pi/2
        return (x-x)/(x-x);  // 0/0 = NaN for out-of-range input
    }else if (x < zero_arg){
        return (x-x)/(x-x);
    }else{
        double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10,coef11,rad,sqr;
        coef1  = -0.16666666666666666666666666666667;   // 1/3!
        coef2  = 0.00833333333333333333333333333333;    // 1/5!
        coef3  = -1.984126984126984126984126984127e-4;  // 1/7!
        coef4  = 2.7557319223985890652557319223986e-6;  // 1/9!
        coef5  = -2.5052108385441718775052108385442e-8; // 1/11!
        coef6  = 1.6059043836821614599392377170155e-10; // 1/13!
        coef7  = -7.6471637318198164759011319857881e-13;// 1/15!
        coef8  = 2.8114572543455207631989455830103e-15; // 1/17!
        coef9  = -8.2206352466243297169559812368723e-18;// 1/19!
        coef10 = 1.9572941063391261230847574373505e-20; // 1/21!
        coef11 = -3.8681701706306840377169119315228e-23;// 1/23!
        rad = x;
        sqr = x*x;           // x^2

        // Taylor series of sin(x), evaluated with Horner's scheme
        sum = rad+rad*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));
    }
    return sum;
}


Here are a couple of tips:

1. Did you get 62 millisec for the C code in the Debug or the Release configuration?

2. Every time you call the 'FastSin' function, memory for 16 variables of type 'double' is allocated
on the stack ( I 'bolded' and 'underlined' the declarations of all these variables ).

3. As soon as memory is allocated for these 16 variables it has to be initialized with some default value,
like 0xcccccccc ( it will never be initialized with 0 ), and it looks like the 'rep stosd' instruction does exactly that.

4. Every time you call the 'FastSin' function you initialize the 'coef..' variables with the same constant
values, like '1/3!'. Why wouldn't you declare these 13 constants as global? ( see the sketch right after this list )

5. And one more thing: 16 ( variables ) x 8 ( sizeof( double ) ) is equal to 128, which matches the
128 bytes of '0xcc' you mentioned the assembler code is initializing.
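Here is a minimal sketch of tip #4 ( illustrative names only, not Iliya's actual library code ): the Taylor coefficients can live once at file scope in a 'static const' array, and the polynomial can be evaluated with a small Horner loop. With the coefficients out of the local frame, the Debug-build prologue should also have far less stack area to fill with 0xCC.

static const double sin_coef[11] = {
    -0.16666666666666666666666666666667,    /* 1/3!  */
     0.00833333333333333333333333333333,    /* 1/5!  */
    -1.984126984126984126984126984127e-4,   /* 1/7!  */
     2.7557319223985890652557319223986e-6,  /* 1/9!  */
    -2.5052108385441718775052108385442e-8,  /* 1/11! */
     1.6059043836821614599392377170155e-10, /* 1/13! */
    -7.6471637318198164759011319857881e-13, /* 1/15! */
     2.8114572543455207631989455830103e-15, /* 1/17! */
    -8.2206352466243297169559812368723e-18, /* 1/19! */
     1.9572941063391261230847574373505e-20, /* 1/21! */
    -3.8681701706306840377169119315228e-23  /* 1/23! */
};

/* Same polynomial as fastsin(), but the constants sit in read-only storage
   instead of being re-assigned on the stack on every call. Range checking
   is omitted here for brevity. */
static double fastsin_const(double x)
{
    const double sqr = x * x;
    double poly = 0.0;
    int i;

    for (i = 10; i >= 0; --i)        /* Horner scheme, highest-order term first */
        poly = sin_coef[i] + sqr * poly;

    return x + x * sqr * poly;
}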

0 Kudos
SergeyKostrov
Valued Contributor II
429 Views
Here is a screenshot that demonstrates the default initialization for '__int32' and '__int64' variables:
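The same default fill can be observed from code. A small probe ( a sketch only, assuming an MSVC Debug build with the default run-time checks ): the prologue fills the locals area with 0xCC bytes, which is why an uninitialized '__int32' shows up as 0xcccccccc ( -858993460 ) in the debugger.

#include <stdio.h>

int main(void)
{
    unsigned char probe[16];              /* deliberately left uninitialized */
    size_t k;

    for (k = 0; k < sizeof(probe); k++)
        printf("%02x ", probe[k]);        /* a Debug build typically prints "cc cc cc ..." */
    printf("\n");
    return 0;
}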


0 Kudos
SergeyKostrov
Valued Contributor II
429 Views
Quoting iliyapolak
0000a  8d 7d 80            lea   edi, DWORD PTR [ebp-128]
0000d  b9 20 00 00 00      mov   ecx, 32            ; 00000020H
00012  b8 cc cc cc cc      mov   eax, -858993460    ; ccccccccH
00017  f3 ab               rep stosd                ; <-- can this be the culprit for the slower code execution?


I think No.

Best regards,
Sergey

0 Kudos
Bernard
Valued Contributor I
429 Views
Hi Sergey!
Thanks for your answer :)

>>1. Did you get 62 millisec for the C code in the Debug or the Release configuration?

62 millisec was measured for the Debug configuration.
In Release mode the rep stosd instruction is gone and the code is inlined inside main()'s for-loop.

>>3. As soon as memory is allocated for these 16 variables it has to be initialized with some default value,
like 0xcccccccc

It could also be filled with the x86 int3 (0xcc) instruction to force a debugger break-in if execution ever lands in that memory, for example through a corrupted return address. I read about this behaviour in Chris Eagle's book "The IDA Pro Book".

>>4. ...values, like '1/3!'. Why wouldn't you declare these 13 constants as global?

I prefer not to give these values global scope.

>>5. You mentioned that the assembler code is initializing some buffer with 128 bytes of '0xcc'.

Look at this code:

00003  81 ec 80 00 00 00   sub   esp, 128           ; 00000080H    <-- here: the 128-byte buffer
00009  57                  push  edi
0000a  8d 7d 80            lea   edi, DWORD PTR [ebp-128]
0000d  b9 20 00 00 00      mov   ecx, 32            ; 00000020H
00012  b8 cc cc cc cc      mov   eax, -858993460    ; ccccccccH    <-- here: I think these are int3 (0xcc) fill bytes

Intel VTune Amplifier also recognized this block as the main hot-spot, with 14 instructions executed per iteration, so the whole block can add a significant overhead of a few nanoseconds per call.

P.S.
Sergey, go to my thread here: http://software.intel.com/en-us/forums/showthread.php?t=105474
I uploaded another book on accuracy and stability of numerical methods. :)

post #147
0 Kudos
Bernard
Valued Contributor I
429 Views
[SergeyK] Could you provide a complete list of compiler options? I noticed that Intel C++ compiler codes
are always 2x slower when ALL optimizations are disabled. My point of view is based on
my test cases verified with Intel, Microsoft, MinGW and Borland C++ compilers.
It is interesting that your C implementation is also 2x slower.


Compiler options for Release mode:

/Zi /nologo /W3 /WX- /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue


Here is the optimization introduced by the Microsoft compiler in the Release code. As you can see, the whole fastsin() body was inlined inside main(), and the increments of fastsin()'s argument were unrolled. But what is very strange is that the delta for the 1e6-iteration measurement was 0, and for 1e7 iterations the result was 31 millisec, i.e. 3.1 nanosec per iteration. Too fast to be true.

The result for 10 million iterations
running time of fastsin() release code is: 31 milisec
fastsin() is: 0.909297421962549370000000
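One more point on methodology: the posted asm shows the polynomial is still evaluated on every pass, but since only the last value of 'sine' is printed, the optimizer is in principle free to drop the intermediate work. A sketch only ( illustrative, not the benchmark actually used above ) of a loop that consumes every result:

double fastsin(double x);   /* the function under test */

double bench_fastsin(int iterations)
{
    double arg = 1.0, acc = 0.0;
    int i;

    for (i = 0; i < iterations; i++) {
        arg += 0.0000001;
        acc += fastsin(arg);   /* every result feeds the sum */
    }
    return acc;                /* print this in main() so it is actually consumed */
}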



Here is the asm code:


[bash]
; 23   : int main(void){

  00000  55                        push   ebp
  00001  8b ec                     mov    ebp, esp
  00003  83 e4 c0                  and    esp, -64                ; ffffffc0H
  00006  83 ec 30                  sub    esp, 48                 ; 00000030H

; 24   : double e1 = 0;
; 25   : double sine;
; 26   : sine = 0;
; 27   : double gam;
; 28   : gam = 0;
; 29   : double fastgam;
; 30   : fastgam = 0;
; 31   : double arg1;
; 32   : arg1 = 1.0f;

  00009  f2 0f 10 05 00 00 00 00   movsd  xmm0, QWORD PTR _one
  00011  53                        push   ebx
  00012  55                        push   ebp
  00013  56                        push   esi
  00014  57                        push   edi

; 33   : unsigned int start2,end2;
; 34   : start2 = GetTickCount();

  00015  8b 3d 00 00 00 00         mov    edi, DWORD PTR __imp__GetTickCount@0
  0001b  f2 0f 11 44 24 30         movsd  QWORD PTR _arg1$[esp+64], xmm0
  00021  ff d7                     call   edi
  00023  f2 0f 10 15 00 00 00 00   movsd  xmm2, QWORD PTR __real@3e7ad7f2a0000000
  0002b  f2 0f 10 25 00 00 00 00   movsd  xmm4, QWORD PTR __real@3b4761b41316381a
  00033  f2 0f 10 2d 00 00 00 00   movsd  xmm5, QWORD PTR __real@3bd71b8ef6dcf572
  0003b  f2 0f 10 35 00 00 00 00   movsd  xmm6, QWORD PTR __real@3c62f49b46814157
  00043  f2 0f 10 5c 24 30         movsd  xmm3, QWORD PTR _arg1$[esp+64]
  00049  8b f0                     mov    esi, eax
  0004b  b8 40 42 0f 00            mov    eax, 1000000            ; 000f4240H
$LL9@main:

; 35   : for(int i2 = 0;i2<10000000;i2++){

  00050  48                        dec    eax

; 36   : arg1 += 0.0000001f;

  00051  f2 0f 58 da               addsd  xmm3, xmm2
  00055  f2 0f 58 da               addsd  xmm3, xmm2
  00059  f2 0f 58 da               addsd  xmm3, xmm2
  0005d  f2 0f 58 da               addsd  xmm3, xmm2
  00061  f2 0f 58 da               addsd  xmm3, xmm2
  00065  f2 0f 58 da               addsd  xmm3, xmm2
  00069  f2 0f 58 da               addsd  xmm3, xmm2
  0006d  f2 0f 58 da               addsd  xmm3, xmm2
  00071  f2 0f 58 da               addsd  xmm3, xmm2
  00075  f2 0f 58 da               addsd  xmm3, xmm2

; 37   : sine = fastsin(arg1);

  00079  66 0f 28 cb               movapd xmm1, xmm3
  0007d  f2 0f 59 cb               mulsd  xmm1, xmm3
  00081  66 0f 28 f9               movapd xmm7, xmm1
  00085  f2 0f 59 fc               mulsd  xmm7, xmm4
  00089  66 0f 28 c5               movapd xmm0, xmm5
  0008d  f2 0f 5c c7               subsd  xmm0, xmm7
  00091  f2 0f 59 c1               mulsd  xmm0, xmm1
  00095  f2 0f 5c c6               subsd  xmm0, xmm6
  00099  f2 0f 59 c1               mulsd  xmm0, xmm1
  0009d  f2 0f 58 05 00 00 00 00   addsd  xmm0, QWORD PTR __real@3ce952c77030ad4a
  000a5  f2 0f 59 c1               mulsd  xmm0, xmm1
  000a9  f2 0f 5c 05 00 00 00 00   subsd  xmm0, QWORD PTR __real@3d6ae7f3e733b81f
  000b1  f2 0f 59 c1               mulsd  xmm0, xmm1
  000b5  f2 0f 58 05 00 00 00 00   addsd  xmm0, QWORD PTR __real@3de6124613a86d09
  000bd  f2 0f 59 c1               mulsd  xmm0, xmm1
  000c1  f2 0f 5c 05 00 00 00 00   subsd  xmm0, QWORD PTR __real@3e5ae64567f544e4
  000c9  f2 0f 59 c1               mulsd  xmm0, xmm1
  000cd  f2 0f 58 05 00 00 00 00   addsd  xmm0, QWORD PTR __real@3ec71de3a556c734
  000d5  f2 0f 59 c1               mulsd  xmm0, xmm1
  000d9  f2 0f 5c 05 00 00 00 00   subsd  xmm0, QWORD PTR __real@3f2a01a01a01a01a
  000e1  f2 0f 59 c1               mulsd  xmm0, xmm1
  000e5  f2 0f 58 05 00 00 00 00   addsd  xmm0, QWORD PTR __real@3f81111111111111
  000ed  f2 0f 59 c1               mulsd  xmm0, xmm1
  000f1  f2 0f 5c 05 00 00 00 00   subsd  xmm0, QWORD PTR __real@3fc5555555555555
  000f9  f2 0f 59 cb               mulsd  xmm1, xmm3
  000fd  f2 0f 59 c1               mulsd  xmm0, xmm1
  00101  f2 0f 58 c3               addsd  xmm0, xmm3
  00105  f2 0f 11 44 24 30         movsd  QWORD PTR _sine$[esp+64], xmm0
  0010b  0f 85 3f ff ff ff         jne    $LL9@main
[/bash]
Btw Sergey, do you prefer to continue our discussion in the Intel AVX and CPU Instructions forum? I posted the asm listings and measurement results there, and moreover there are also very interesting responses from bronxzv here: http://software.intel.com/en-us/forums/showthread.php?t=105474

0 Kudos
SergeyKostrov
Valued Contributor II
429 Views
Quoting iliyapolak
...do you prefer to continue our discussion in the Intel AVX and CPU Instructions forum? I posted the asm listings and measurement results there, and moreover there are also very interesting responses from bronxzv here http://software.intel.com/en-us/forums/showthread.php?t=105474


I will start reading all the posts I missed ( from Page #9 to the last ) soon.

I think there is nothing wrong with having both threads. However, the thread in the 'Intel AVX and CPU' forum
has grown significantly. You could also make a note that all the remaining discussion will be done, for example, in
the 'Software Tuning, Performance Optimization' forum.

So, it is up to you to decide which thread should stay 'alive' for discussions.

Best regards,
Sergey

0 Kudos
Bernard
Valued Contributor I
429 Views
Hi everybody!

I decided to continue this discussion in this thread http://software.intel.com/en-us/forums/showpost.php?p=187714
0 Kudos
Maycon_Oliveira
Beginner
429 Views
Have you run the test with the JRockit JVM?
0 Kudos
Maycon_Oliveira
Beginner
429 Views
...and re-run the test with different values for -XX:MinYoung and other tuning variables...
0 Kudos
Bernard
Valued Contributor I
429 Views

...and re-run the test with different values for -XX:MinYoung and other tuning variables


The tests were made with the Sun (Oracle) JVM with two settings, client and server. The code was compiled by the incremental Eclipse JDT compiler and run by the Java Runtime Environment.

java version "1.7.0"
Java SE Runtime Environment (build 1.7.0-b147)
Java HotSpot Client VM (build 21.0-b17, mixed mode)

Please bear in mind that the results posted in this thread are outdated, because the C code was compiled in Debug mode; hence the highly optimized JIT-compiled code could easily outperform the much slower native code.
Please look here, post #158: http://software.intel.com/en-us/forums/showthread.php?t=105474

And here are the results of the aggressively optimized native code:

Tested fastsin() today with 1e6 iterations; the result was 15 millisec (15 ms / 1e6 calls = 15 ns per call), i.e. ~33.39 cycles per iteration for my CPU.

results

start val of fastsin() 13214314
end val of fastsin() 13214329
running time of fastsin() release code is: 15 millisec
fastsin() is: 0.891207360591512180000000
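For comparison, the cycles-per-call figure can also be read directly from the time-stamp counter instead of being derived from a millisecond delta. A sketch only, assuming MSVC/Intel intrinsics and a reasonably constant TSC rate ( names and loop bounds are illustrative, not the code used for the numbers above ):

#include <intrin.h>
#include <stdio.h>

double fastsin(double x);            /* the function under test */

int main(void)
{
    const int n = 1000000;
    double arg = 0.0, acc = 0.0;
    unsigned __int64 c0, c1;
    int i;

    c0 = __rdtsc();                  /* read the time-stamp counter before the loop */
    for (i = 0; i < n; i++) {
        arg += 1.0e-7;
        acc += fastsin(arg);         /* accumulate so the work is kept */
    }
    c1 = __rdtsc();                  /* and after the loop */

    printf("approx. cycles per call: %.2f (acc = %f)\n",
           (double)(c1 - c0) / (double)n, acc);
    return 0;
}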
0 Kudos
Bernard
Valued Contributor I
429 Views
No!
The compiler was the incremental Eclipse JDT. The tests were made with the Sun JVM.
0 Kudos
Reply