<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Java polynomial approx. faster than C code polynomial approx. in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802119#M687</link>
    <description>Hi Sergey!&lt;BR /&gt;Thanks for your answer :)&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;gt;&amp;gt;1. Did you get 62 millisec for C codes in Debug or Release configuration?&lt;BR /&gt;&lt;/STRONG&gt;&lt;BR /&gt;The 62 millisec result was measured in the Debug configuration.&lt;BR /&gt;In Release mode the &lt;SPAN style="text-decoration: underline;"&gt;rep stosd&lt;/SPAN&gt; instruction is gone and the code is inlined inside main()'s for-loop.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;gt;&amp;gt;3. As soon as a memory is allocated for these 15 variables it has to be initialized with some default value,&lt;BR /&gt; like 0xcccccccc&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;It could also be filled with the x86 int3 opcode (0xcc) to force a debugger break-in if execution ever lands in that buffer, for example through a corrupted return address. I read about this behaviour in Chris Eagle's book "The IDA Pro Book".&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;gt;&amp;gt;4. values, like '1/3!'. Why wouldn't you declare these 13 constants as global?&lt;BR /&gt;&lt;/STRONG&gt;&lt;BR /&gt;I prefer not to give these values global scope.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;gt;&amp;gt;5. You mentioned that the assembler&lt;BR /&gt; codes are initializing some buffer with 128&lt;/STRONG&gt; &lt;STRONG&gt;0xcc&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;Look at this code:&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt; 00003 81 ec 80 00 00 00 sub esp, 128; 00000080H &amp;lt;-- here &lt;/STRONG&gt;a 128-byte buffer&lt;BR /&gt;&lt;STRONG&gt; 00009 57 push edi&lt;BR /&gt; 0000a 8d 7d 80 lea edi, DWORD PTR [ebp-128]&lt;BR /&gt; 0000d b9 20 00 00 00 mov ecx, 32; 00000020H&lt;BR /&gt; 00012 b8 cc cc cc cc mov eax, -858993460 &amp;lt;-- here &lt;/STRONG&gt;I think these are int3 (0xcc) bytes&lt;BR /&gt;&lt;STRONG&gt;&lt;BR /&gt;&lt;/STRONG&gt;This was also recognized by Intel Amplifier as the source of the main hot-spot, with 14 instructions executed per iteration, and this whole block can add a significant overhead of a few nanosec.&lt;BR /&gt;&lt;BR /&gt;P.S.&lt;BR /&gt;Sergey, please go to my thread here &lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=105474"&gt;http://software.intel.com/en-us/forums/showthread.php?t=105474&lt;/A&gt;&lt;BR /&gt;I uploaded another book on the accuracy and stability of numerical methods. :)&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;post #147&lt;/STRONG&gt;&lt;BR /&gt;</description>
    <pubDate>Sat, 16 Jun 2012 19:27:07 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2012-06-16T19:27:07Z</dc:date>
    <item>
      <title>Java polynomial approx. faster than C code polynomial approx.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802115#M683</link>
      <description>Hi everybody!&lt;BR /&gt;This question is somewhat similar to my other thread posted here &lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=105474"&gt;http://software.intel.com/en-us/forums/showthread.php?t=105474&lt;/A&gt;&lt;BR /&gt;While porting my library of elementary and special functions from Java to C, I implemented polynomial approximation as a few posters advised me. Now I use polynomial approximation in my library wherever it is applicable. I was interested in comparing the performance of the same implementation written in managed code and in native code. To my big surprise, the Java code always executed faster than the native code.&lt;BR /&gt;After studying the asm code, and knowing that the Intel C++ compiler uses security cookie checking and fills a 128-byte buffer with int 3 (0xcc) bytes right after the function's prologue, I came to the conclusion that this compiler-induced overhead is responsible for the slower execution of the C code.&lt;BR /&gt;Here are the tests taken from my thread &lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=105474"&gt;http://software.intel.com/en-us/forums/showthread.php?t=105474&lt;/A&gt;&lt;BR /&gt;Can anybody help me understand why native code can be so slow compared to Java code?&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN style="text-decoration: underline;"&gt;Result for native code, 1 million iterations&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;start value of fastsin(): 39492698 // Native code&lt;BR /&gt;end value of fastsin() : 39492760&lt;BR /&gt;delta of fastsin() is : 62 millisec&lt;BR /&gt;sine is: 0.841470444509448080000000&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;java -server&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;C:\Program Files\Java\jdk1.7.0\bin&amp;gt;java -server SineFunc&lt;BR /&gt;start value : 1339596068015&lt;BR /&gt;end value : 1339596068045&lt;BR /&gt;running time of fastsin() is :30 milisec&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;java -client&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;C:\Program Files\Java\jdk1.7.0\bin&amp;gt;java -client SineFunc&lt;BR /&gt;start value : 1339596081083&lt;BR /&gt;end value : 1339596081130&lt;BR /&gt;running time of fastsin() is :47 milisec&lt;BR /&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;SPAN style="text-decoration: underline;"&gt;Here is the fastsin() prologue&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt; 00000 55 push ebp&lt;BR /&gt; 00001 8b ec mov ebp, esp&lt;BR /&gt; 00003 81 ec 80 00 00 00 sub esp, 128; 00000080H&lt;BR /&gt; 00009 57 push edi&lt;BR /&gt; 0000a 8d 7d 80 lea edi, DWORD PTR [ebp-128]&lt;BR /&gt; 0000d b9 20 00 00 00 mov ecx, 32; 00000020H&lt;BR /&gt; 00012 b8 cc cc cc cc mov eax, -858993460; ccccccccH&lt;BR /&gt; 00017 f3 ab rep stosd &lt;STRONG&gt;&amp;lt;--&lt;/STRONG&gt; &lt;STRONG&gt;Could this be the culprit for slower code execution?&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;And here is the code. The Java implementation is identical to this code.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;P&gt; double fastsin(double x){&lt;BR /&gt; double sum = 0;&lt;BR /&gt; double half_pi,zero_arg;&lt;BR /&gt; half_pi = Pi/2;&lt;BR /&gt; zero_arg = Zero;&lt;BR /&gt;&lt;BR /&gt; if(x &amp;gt; half_pi){ // simple input range check: 0 &amp;lt;= x &amp;lt;= Pi/2&lt;BR /&gt; return (x-x)/(x-x); // 0/0 yields NaN&lt;BR /&gt; }else if (x &amp;lt; zero_arg){&lt;BR /&gt; return (x-x)/(x-x);&lt;BR /&gt; }else{&lt;BR /&gt; double coef1,coef2,coef3,coef4,coef5,coef6,coef7,coef8,coef9,coef10,coef11,rad,sqr;&lt;BR /&gt; coef1 = -0.16666666666666666666666666666667;// -1/3!&lt;BR /&gt; coef2 = 0.00833333333333333333333333333333;// 1/5!&lt;BR /&gt; coef3 = -1.984126984126984126984126984127e-4;// -1/7!&lt;BR /&gt; coef4 = 2.7557319223985890652557319223986e-6;// 1/9!&lt;BR /&gt; coef5 = -2.5052108385441718775052108385442e-8;// -1/11!&lt;BR /&gt; coef6 = 1.6059043836821614599392377170155e-10;// 1/13!&lt;BR /&gt; coef7 = -7.6471637318198164759011319857881e-13;// -1/15!&lt;BR /&gt; coef8 = 2.8114572543455207631989455830103e-15;// 1/17!&lt;BR /&gt; coef9 = -8.2206352466243297169559812368723e-18;// -1/19!&lt;BR /&gt; coef10 = 1.9572941063391261230847574373505e-20;// 1/21!&lt;BR /&gt; coef11 = -3.8681701706306840377169119315228e-23;// -1/23!&lt;BR /&gt; rad = x;&lt;BR /&gt; sqr = x*x; // x^2&lt;BR /&gt;&lt;BR /&gt; sum = rad+rad*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));&lt;BR /&gt; }&lt;BR /&gt; return sum;&lt;BR /&gt; }&lt;/P&gt;</description>
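      <content:encoded><![CDATA[
A minimal sketch of the same Horner scheme, assuming the coefficients are kept in a file-scope `static const` table instead of eleven locals that are re-initialized on every call. The names `fastsin_sketch`, `SIN_COEF`, `SIN_COEF_COUNT` and `HALF_PI` are hypothetical and do not come from the post. The post's `Pi` and `Zero` globals are not shown, so the range bounds used here are assumptions.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Taylor coefficients of sin(x) for x^3 .. x^23: -1/3!, +1/5!, ..., -1/23!.
   Same values as coef1..coef11 in the post; the table name is hypothetical. */
static const double SIN_COEF[] = {
    -0.16666666666666666666666666666667,
     0.00833333333333333333333333333333,
    -1.984126984126984126984126984127e-4,
     2.7557319223985890652557319223986e-6,
    -2.5052108385441718775052108385442e-8,
     1.6059043836821614599392377170155e-10,
    -7.6471637318198164759011319857881e-13,
     2.8114572543455207631989455830103e-15,
    -8.2206352466243297169559812368723e-18,
     1.9572941063391261230847574373505e-20,
    -3.8681701706306840377169119315228e-23
};

#define SIN_COEF_COUNT (sizeof SIN_COEF / sizeof SIN_COEF[0])

/* Assumed upper bound of the accepted range (Pi/2). */
static const double HALF_PI = 1.5707963267948966;

double fastsin_sketch(double x)
{
    /* Sanity check: eleven terms, x^3 through x^23. */
    assert(SIN_COEF_COUNT == 11);

    /* Out-of-range input yields NaN, like (x-x)/(x-x) in the post. */
    if (x < 0.0 || x > HALF_PI) {
        return NAN;
    }

    double sqr = x * x;

    /* Horner evaluation, starting from the highest-order coefficient. */
    double p = SIN_COEF[SIN_COEF_COUNT - 1];
    for (size_t i = SIN_COEF_COUNT - 1; i-- > 0; ) {
        p = SIN_COEF[i] + sqr * p;
    }

    return x + x * sqr * p;
}
```

Whether this changes the timing depends on the build: with optimizations enabled a compiler will usually fold the local constants anyway, so the table mainly matters for unoptimized builds.
]]></content:encoded>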
      <pubDate>Thu, 14 Jun 2012 10:42:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802115#M683</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-06-14T10:42:51Z</dc:date>
    </item>
    <item>
      <title>Java polynomial approx. faster than C code polynomial approx.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802116#M684</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Hi Iliya,&lt;BR /&gt;&lt;BR /&gt;Quoting &lt;A jquery1339826567437="58" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=563040" href="https://community.intel.com/en-us/profile/563040/" class="basic"&gt;iliyapolak&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;EM&gt;...To my big surprise, the Java code always executed faster than the native code. After studying the asm code, and&lt;BR /&gt;knowing that the Intel C++ compiler uses &lt;/EM&gt;&lt;STRONG&gt;security cookie&lt;/STRONG&gt;&lt;EM&gt; checking...&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt; [&lt;STRONG&gt;SergeyK&lt;/STRONG&gt;] Please see my comment 3 at the end of the post.&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;...and fills a 128-byte buffer with int 3 (0xcc)&lt;BR /&gt;bytes right after the function's prologue, I came to the conclusion that this compiler-induced overhead&lt;BR /&gt;is responsible for the slower execution of the C code.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt; [&lt;STRONG&gt;SergeyK&lt;/STRONG&gt;] Could you provide a complete list of compiler options? 
I noticed that Intel C++ compiler codes&lt;BR /&gt; are always 2x slower when ALL optimizations are disabled. My point of view is based on&lt;BR /&gt; my test cases verified with Intel, Microsoft, MinGW and Borland C++ compilers.&lt;BR /&gt; It is interesting that your C implementation is also 2x slower.&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;...&lt;BR /&gt;&lt;/EM&gt;&lt;EM&gt;&lt;SPAN style="text-decoration: underline;"&gt;Result for native code, 1 million iterations&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;EM&gt;&lt;STRONG&gt;...delta of fastsin() is : 62 millisec...&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;java -server&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;STRONG&gt;&lt;EM&gt;...running time of fastsin() is : 30 milisec...&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/STRONG&gt;&lt;EM&gt;&lt;SPAN style="text-decoration: underline;"&gt;java -client&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;EM&gt;&lt;STRONG&gt;...running time of fastsin() is : 47 milisec...&lt;BR /&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;EM&gt;&lt;SPAN style="text-decoration: underline;"&gt;Here is the fastsin() prologue&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt; 00000 55 push ebp&lt;BR /&gt; 00001 8b ec mov ebp, esp&lt;BR /&gt; 00003 81 ec 80 00 00 00 sub esp, 128; 00000080H&lt;BR /&gt; 00009 57 push edi&lt;BR /&gt; 0000a 8d 7d 80 lea edi, DWORD PTR [ebp-128]&lt;BR /&gt; 0000d b9 20 00 00 00 mov ecx, 32; 00000020H&lt;BR /&gt; 00012 b8 cc cc cc cc mov eax, -858993460; ccccccccH&lt;BR /&gt; 00017 f3 ab rep stosd &lt;STRONG&gt;&amp;lt;--&lt;/STRONG&gt; &lt;/EM&gt;&lt;STRONG&gt;&lt;EM&gt;Could this be the culprit for slower code execution?&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;&lt;EM&gt;And here is the code. The Java implementation is identical to this code.&lt;BR /&gt;&lt;BR /&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;P&gt;&lt;EM&gt; double fastsin(double x){&lt;BR /&gt; double sum = 0;&lt;BR /&gt; 
double &lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;half_pi&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;, &lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;zero_arg&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;; &lt;BR /&gt;half_pi = Pi/2;&lt;BR /&gt;zero_arg = Zero;&lt;BR /&gt;&lt;BR /&gt; if(x &amp;gt; half_pi){ // simple input range check: 0 &amp;lt;= x &amp;lt;= Pi/2&lt;BR /&gt; return (x-x)/(x-x);&lt;BR /&gt; }else if (x &amp;lt; zero_arg){&lt;BR /&gt; return (x-x)/(x-x);&lt;BR /&gt; }else{&lt;BR /&gt;&lt;BR /&gt; double&lt;/EM&gt; &lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;coef1&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;coef2&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;coef3&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;coef4&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;coef5&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;coef6&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;coef7&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;coef8&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;coef9&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;coef10&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;coef11&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: 
underline;"&gt;rad&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;,&lt;/EM&gt;&lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;sqr&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;EM&gt;;&lt;BR /&gt; coef1 = -0.16666666666666666666666666666667;// -1/3!&lt;BR /&gt; coef2 = 0.00833333333333333333333333333333;// 1/5!&lt;BR /&gt; coef3 = -1.984126984126984126984126984127e-4;// -1/7!&lt;BR /&gt; coef4 = 2.7557319223985890652557319223986e-6;// 1/9!&lt;BR /&gt; coef5 = -2.5052108385441718775052108385442e-8;// -1/11!&lt;BR /&gt; coef6 = 1.6059043836821614599392377170155e-10;// 1/13!&lt;BR /&gt; coef7 = -7.6471637318198164759011319857881e-13;// -1/15!&lt;BR /&gt; coef8 = 2.8114572543455207631989455830103e-15;// 1/17!&lt;BR /&gt; coef9 = -8.2206352466243297169559812368723e-18;// -1/19!&lt;BR /&gt; coef10 = 1.9572941063391261230847574373505e-20;// 1/21!&lt;BR /&gt; coef11 = -3.8681701706306840377169119315228e-23;// -1/23!&lt;BR /&gt; rad = x;&lt;BR /&gt; sqr = x*x; // x^2&lt;BR /&gt; &lt;BR /&gt; sum = rad+rad*sqr*(coef1+sqr*(coef2+sqr*(coef3+sqr*(coef4+sqr*(coef5+sqr*(coef6+sqr*(coef7+sqr*(coef8+sqr*(coef9+sqr*(coef10+sqr*(coef11)))))))))));&lt;BR /&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt; }&lt;BR /&gt; return sum;&lt;BR /&gt; }&lt;/EM&gt;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;Here are a couple of tips:&lt;BR /&gt;&lt;BR /&gt;1. Did you get &lt;STRONG&gt;62&lt;/STRONG&gt; millisec for the C code in the Debug or the Release configuration?&lt;BR /&gt;&lt;BR /&gt;2. Every time you call the '&lt;STRONG&gt;FastSin&lt;/STRONG&gt;' function, memory for &lt;STRONG&gt;15&lt;/STRONG&gt; variables of type '&lt;STRONG&gt;double&lt;/STRONG&gt;' is allocated&lt;BR /&gt; on the stack (I '&lt;STRONG&gt;bolded&lt;/STRONG&gt;' and '&lt;SPAN style="text-decoration: underline;"&gt;underlined&lt;/SPAN&gt;' the declarations of all these variables).&lt;BR /&gt;&lt;BR /&gt;3. As soon as memory is allocated for these &lt;STRONG&gt;15&lt;/STRONG&gt; variables it has to be initialized with some default value,&lt;BR /&gt; like &lt;STRONG&gt;0xcccccccc&lt;/STRONG&gt; (it will never be initialized with &lt;STRONG&gt;0&lt;/STRONG&gt;), and it looks like the 'rep stosd' instruction does it.&lt;BR /&gt;&lt;BR /&gt;4. Every time you call the '&lt;STRONG&gt;FastSin&lt;/STRONG&gt;' function you initialize the '&lt;STRONG&gt;coef..&lt;/STRONG&gt;' variables with the same constant&lt;BR /&gt; values, like '1/3!'. Why wouldn't you declare these &lt;STRONG&gt;13&lt;/STRONG&gt; constants as global?&lt;BR /&gt;&lt;BR /&gt;5. And one more thing: &lt;STRONG&gt;15&lt;/STRONG&gt; (variables) &lt;STRONG&gt;x&lt;/STRONG&gt; &lt;STRONG&gt;8&lt;/STRONG&gt; (sizeof(double)) is equal to &lt;STRONG&gt;120&lt;/STRONG&gt;. You mentioned that the assembler&lt;BR /&gt; codes are initializing some buffer with 128 '&lt;STRONG&gt;0xcc&lt;/STRONG&gt;'.&lt;/P&gt;</description>
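      <content:encoded><![CDATA[
The arithmetic in tip 5 can be checked with a short sketch. It assumes the 128-byte frame from the quoted prologue and emulates the `mov ecx, 32` / `mov eax, -858993460` / `rep stosd` sequence in portable C. The names `dword_count`, `fill_dwords` and `as_signed32` are hypothetical. As an observation that is not stated in the thread: counting `sum` along with the 15 highlighted variables gives 16 doubles, and 16 x 8 = 128 bytes, which would match the buffer size exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { FRAME_BYTES = 128, DWORD_BYTES = 4 };

/* Number of 4-byte stores needed to cover the frame: the value loaded into ecx. */
size_t dword_count(size_t frame_bytes)
{
    assert(frame_bytes % DWORD_BYTES == 0);
    return frame_bytes / DWORD_BYTES;
}

/* Emulates rep stosd: store `pattern` into `count` consecutive dwords. */
void fill_dwords(uint32_t *dst, size_t count, uint32_t pattern)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = pattern;
    }
}

/* Reinterpret the bit pattern as a signed 32-bit value, the way the
   listing prints the immediate operand of mov eax. */
int32_t as_signed32(uint32_t bits)
{
    int32_t value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Calling `dword_count(FRAME_BYTES)` gives the loop count for the fill, and `as_signed32(0xCCCCCCCCu)` shows why the listing prints a negative decimal number for the same bit pattern.
]]></content:encoded>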
      <pubDate>Sat, 16 Jun 2012 06:43:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802116#M684</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-06-16T06:43:02Z</dc:date>
    </item>
    <item>
      <title>Java polynomial approx. faster than C code polynomial approx.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802117#M685</link>
      <description>Here is a screenshot that demonstrates the default initialization for '&lt;STRONG&gt;__int32&lt;/STRONG&gt;' and '&lt;STRONG&gt;__int64&lt;/STRONG&gt;' variables:&lt;BR /&gt;&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper"&gt;&lt;img src="https://community.intel.com/skins/images/7B13F55A7CE623EF42E69096FA81A3A1/2021_redesign/images/image_not_found.png" /&gt;&lt;/span&gt;&lt;BR /&gt;</description>
      <pubDate>Sat, 16 Jun 2012 06:43:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802117#M685</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-06-16T06:43:29Z</dc:date>
    </item>
    <item>
      <title>Java polynomial approx. faster than C code polynomial approx.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802118#M686</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1339829516843="58" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=563040" href="https://community.intel.com/en-us/profile/563040/" class="basic"&gt;iliyapolak&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt; 0000a 8d 7d 80 lea edi, DWORD PTR [ebp-128]&lt;BR /&gt; 0000d b9 20 00 00 00 mov ecx, 32; 00000020H&lt;BR /&gt; 00012 b8 cc cc cc cc mov eax, -858993460; ccccccccH&lt;BR /&gt; 00017 f3 ab rep stosd &lt;STRONG&gt;&amp;lt;--&lt;/STRONG&gt; &lt;STRONG&gt;Could this be the culprit for slower code execution?&lt;BR /&gt;&lt;/STRONG&gt;&lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;I don't think so.&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Sergey&lt;/P&gt;</description>
      <pubDate>Sat, 16 Jun 2012 06:54:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802118#M686</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-06-16T06:54:24Z</dc:date>
    </item>
    <item>
      <title>Java polynomial approx. faster than C code polynomial approx.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802119#M687</link>
      <description>Hi Sergey!&lt;BR /&gt;Thanks for your answer :)&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;gt;&amp;gt;1. Did you get 62 millisec for C codes in Debug or Release configuration?&lt;BR /&gt;&lt;/STRONG&gt;&lt;BR /&gt;The 62 millisec result was measured in the Debug configuration.&lt;BR /&gt;In Release mode the &lt;SPAN style="text-decoration: underline;"&gt;rep stosd&lt;/SPAN&gt; instruction is gone and the code is inlined inside main()'s for-loop.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;gt;&amp;gt;3. As soon as a memory is allocated for these 15 variables it has to be initialized with some default value,&lt;BR /&gt; like 0xcccccccc&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;It could also be filled with the x86 int3 opcode (0xcc) to force a debugger break-in if execution ever lands in that buffer, for example through a corrupted return address. I read about this behaviour in Chris Eagle's book "The IDA Pro Book".&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;gt;&amp;gt;4. values, like '1/3!'. Why wouldn't you declare these 13 constants as global?&lt;BR /&gt;&lt;/STRONG&gt;&lt;BR /&gt;I prefer not to give these values global scope.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;&amp;gt;&amp;gt;5. You mentioned that the assembler&lt;BR /&gt; codes are initializing some buffer with 128&lt;/STRONG&gt; &lt;STRONG&gt;0xcc&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;Look at this code:&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt; 00003 81 ec 80 00 00 00 sub esp, 128; 00000080H &amp;lt;-- here &lt;/STRONG&gt;a 128-byte buffer&lt;BR /&gt;&lt;STRONG&gt; 00009 57 push edi&lt;BR /&gt; 0000a 8d 7d 80 lea edi, DWORD PTR [ebp-128]&lt;BR /&gt; 0000d b9 20 00 00 00 mov ecx, 32; 00000020H&lt;BR /&gt; 00012 b8 cc cc cc cc mov eax, -858993460 &amp;lt;-- here &lt;/STRONG&gt;I think these are int3 (0xcc) bytes&lt;BR /&gt;&lt;STRONG&gt;&lt;BR /&gt;&lt;/STRONG&gt;This was also recognized by Intel Amplifier as the source of the main hot-spot, with 14 instructions executed per iteration, and this whole block can add a significant overhead of a few nanosec.&lt;BR /&gt;&lt;BR /&gt;P.S.&lt;BR /&gt;Sergey, please go to my thread here &lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=105474"&gt;http://software.intel.com/en-us/forums/showthread.php?t=105474&lt;/A&gt;&lt;BR /&gt;I uploaded another book on the accuracy and stability of numerical methods. :)&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;post #147&lt;/STRONG&gt;&lt;BR /&gt;</description>
      <pubDate>Sat, 16 Jun 2012 19:27:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802119#M687</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-06-16T19:27:07Z</dc:date>
    </item>
    <item>
      <title>Java polynomial approx. faster than C code polynomial approx.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802120#M688</link>
      <description>[&lt;STRONG&gt;SergeyK&lt;/STRONG&gt;] Could you provide a complete list of compiler options? I noticed that Intel C++ compiler codes&lt;BR /&gt; are always 2x slower when ALL optimizations are disabled. My point of view is based on&lt;BR /&gt; my test cases verified with Intel, Microsoft, MinGW and Borland C++ compilers.&lt;BR /&gt; It is interesting that your C implementation is also 2x slower&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Compiler options for release mode:&lt;BR /&gt;&lt;BR /&gt;/Zi /nologo /W3 /WX- /O2 /Ob2 /Oi /Ot /Oy /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\inline.c.pch" /FAcs /Fa"Release" /Fo"Release" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Here is the optimization introduced by the Microsoft compiler in the release code. As you can see, the whole fastsin() code was inlined inside main(), and the increment of the fastsin() argument was unrolled. What is very strange is that the &lt;SPAN style="text-decoration: underline;"&gt;delta for the 1e6 iteration measurement was 0&lt;/SPAN&gt;. For 1e7 iterations the result was &lt;SPAN style="text-decoration: underline;"&gt;31 millisec, i.e. 3.1 nanosec per iteration&lt;/SPAN&gt;. That looks too fast to be true.&lt;BR /&gt;&lt;BR /&gt;The result for 10 million iterations:&lt;BR /&gt;&lt;STRONG&gt;running time of fastsin() release code is: 31 milisec&lt;BR /&gt;fastsin() is: 0.909297421962549370000000&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;asm code&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;[bash]; 23   :  int main(void){

  00000	55		 push	 ebp
  00001	8b ec		 mov	 ebp, esp
  00003	83 e4 c0	 and	 esp, -64		; ffffffc0H
  00006	83 ec 30	 sub	 esp, 48			; 00000030H

; 24   : 	double e1 = 0;
; 25   : 	 double sine;
; 26   : 	 sine = 0;
; 27   : 	double gam;
; 28   : 	gam = 0;
; 29   : 	double fastgam;
; 30   : 	fastgam = 0;
; 31   : 	 double arg1;
; 32   : 	 arg1 = 1.0f;

  00009	f2 0f 10 05 00
	00 00 00	 movsd	 xmm0, QWORD PTR _one
  00011	53		 push	 ebx
  00012	55		 push	 ebp
  00013	56		 push	 esi
  00014	57		 push	 edi

; 33   : 	unsigned int start2,end2;
; 34   : 	 start2 = GetTickCount();

  00015	8b 3d 00 00 00
	00		 mov	 edi, DWORD PTR __imp__GetTickCount@0
  0001b	f2 0f 11 44 24
	30		 movsd	 QWORD PTR _arg1$[esp+64], xmm0
  00021	ff d7		 call	 edi
  00023	f2 0f 10 15 00
	00 00 00	 movsd	 xmm2, QWORD PTR __real@3e7ad7f2a0000000
  0002b	f2 0f 10 25 00
	00 00 00	 movsd	 xmm4, QWORD PTR __real@3b4761b41316381a
  00033	f2 0f 10 2d 00
	00 00 00	 movsd	 xmm5, QWORD PTR __real@3bd71b8ef6dcf572
  0003b	f2 0f 10 35 00
	00 00 00	 movsd	 xmm6, QWORD PTR __real@3c62f49b46814157
  00043	f2 0f 10 5c 24
	30		 movsd	 xmm3, QWORD PTR _arg1$[esp+64]
  00049	8b f0		 mov	 esi, eax
  0004b	b8 40 42 0f 00	 mov	 eax, 1000000		; 000f4240H
$LL9@main:

; 35   : 	 for(int i2 = 0;i2&amp;lt;10000000;i2++){

  00050	48		 dec	 eax

; 36   : 		 arg1 += 0.0000001f;

  00051	f2 0f 58 da	 addsd	 xmm3, xmm2
  00055	f2 0f 58 da	 addsd	 xmm3, xmm2
  00059	f2 0f 58 da	 addsd	 xmm3, xmm2
  0005d	f2 0f 58 da	 addsd	 xmm3, xmm2
  00061	f2 0f 58 da	 addsd	 xmm3, xmm2
  00065	f2 0f 58 da	 addsd	 xmm3, xmm2
  00069	f2 0f 58 da	 addsd	 xmm3, xmm2
  0006d	f2 0f 58 da	 addsd	 xmm3, xmm2
  00071	f2 0f 58 da	 addsd	 xmm3, xmm2
  00075	f2 0f 58 da	 addsd	 xmm3, xmm2

; 37   : 		 sine = fastsin(arg1);

  00079	66 0f 28 cb	 movapd	 xmm1, xmm3
  0007d	f2 0f 59 cb	 mulsd	 xmm1, xmm3
  00081	66 0f 28 f9	 movapd	 xmm7, xmm1
  00085	f2 0f 59 fc	 mulsd	 xmm7, xmm4
  00089	66 0f 28 c5	 movapd	 xmm0, xmm5
  0008d	f2 0f 5c c7	 subsd	 xmm0, xmm7
  00091	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  00095	f2 0f 5c c6	 subsd	 xmm0, xmm6
  00099	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  0009d	f2 0f 58 05 00
	00 00 00	 addsd	 xmm0, QWORD PTR __real@3ce952c77030ad4a
  000a5	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000a9	f2 0f 5c 05 00
	00 00 00	 subsd	 xmm0, QWORD PTR __real@3d6ae7f3e733b81f
  000b1	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000b5	f2 0f 58 05 00
	00 00 00	 addsd	 xmm0, QWORD PTR __real@3de6124613a86d09
  000bd	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000c1	f2 0f 5c 05 00
	00 00 00	 subsd	 xmm0, QWORD PTR __real@3e5ae64567f544e4
  000c9	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000cd	f2 0f 58 05 00
	00 00 00	 addsd	 xmm0, QWORD PTR __real@3ec71de3a556c734
  000d5	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000d9	f2 0f 5c 05 00
	00 00 00	 subsd	 xmm0, QWORD PTR __real@3f2a01a01a01a01a
  000e1	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000e5	f2 0f 58 05 00
	00 00 00	 addsd	 xmm0, QWORD PTR __real@3f81111111111111
  000ed	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  000f1	f2 0f 5c 05 00
	00 00 00	 subsd	 xmm0, QWORD PTR __real@3fc5555555555555
  000f9	f2 0f 59 cb	 mulsd	 xmm1, xmm3
  000fd	f2 0f 59 c1	 mulsd	 xmm0, xmm1
  00101	f2 0f 58 c3	 addsd	 xmm0, xmm3
  00105	f2 0f 11 44 24
	30		 movsd	 QWORD PTR _sine$[esp+64], xmm0
  0010b	0f 85 3f ff ff
	ff		 jne	 $LL9@main  [/bash]&lt;BR /&gt;&lt;STRONG&gt;Btw Sergey, do you prefer to continue our discussion in the Intel AVX and CPU instructions forum? I posted asm listings and measurement results there, and there are also very interesting responses from bronxzv here&lt;/STRONG&gt; &lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=105474"&gt;http://software.intel.com/en-us/forums/showthread.php?t=105474&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;</description>
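      <content:encoded><![CDATA[
Two hedged observations on the listing above, followed by a sketch. First, the loop counter is loaded with 1000000 and each pass performs ten `addsd` increments but only one polynomial evaluation, so it appears the compiler unrolled the loop by ten and dropped nine of every ten `fastsin()` results, because `sine` is overwritten without being read. If that reading is right, 31 millisec corresponds to about 1e6 evaluations, roughly 31 nanosec each, not 3.1. Second, `GetTickCount()` typically advances in steps of about 10 to 16 millisec, which would explain a delta of 0 for a short run. The sketch below shows one way to keep every call observable: it folds each result into a checksum and times the loop with the standard `clock()` function. `poly_sin5`, `run_benchmark` and `bench_result` are hypothetical names, and the short five-term polynomial is a stand-in, not the post's `fastsin()`.

```c
#include <assert.h>
#include <time.h>

typedef struct {
    double checksum;   /* sum of all results, keeps every call live */
    double seconds;    /* elapsed time as reported by clock() */
} bench_result;

/* Stand-in polynomial: x - x^3/3! + x^5/5! - x^7/7! + x^9/9! in Horner form. */
double poly_sin5(double x)
{
    double s = x * x;
    return x + x * s * (-1.0 / 6.0 + s * (1.0 / 120.0
             + s * (-1.0 / 5040.0 + s * (1.0 / 362880.0))));
}

/* Evaluates the polynomial `iterations` times, stepping the argument
   by `step` before each call, and accumulates the results. */
bench_result run_benchmark(long iterations, double start, double step)
{
    assert(iterations > 0);

    bench_result r = { 0.0, 0.0 };
    double arg = start;

    clock_t t0 = clock();
    for (long i = 0; i < iterations; i++) {
        arg += step;
        r.checksum += poly_sin5(arg);   /* result is consumed, not overwritten */
    }
    clock_t t1 = clock();

    r.seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return r;
}
```

Printing `checksum` after the run (or otherwise using it) stops the optimizer from discarding the work; dividing `seconds` by `iterations` then gives the time per call. For short runs the iteration count should be large enough that the elapsed time is many timer ticks.
]]></content:encoded>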
      <pubDate>Sat, 16 Jun 2012 19:37:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802120#M688</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-06-16T19:37:16Z</dc:date>
    </item>
    <item>
      <title>Java polynomial approx. faster than C code polynomial approx.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802121#M689</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1339887703843="58" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=563040" href="https://community.intel.com/en-us/profile/563040/" class="basic"&gt;iliyapolak&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt;&lt;STRONG&gt;...&lt;/STRONG&gt;&lt;STRONG&gt;do you prefer to continue our discussion in the Intel AVX and CPU instructions forum? I posted asm listings and measurement results there, and there are also very interesting responses from bronxzv here&lt;/STRONG&gt; &lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=105474"&gt;http://software.intel.com/en-us/forums/showthread.php?t=105474&lt;/A&gt;&lt;BR /&gt;&lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;I will start reading all the posts I missed (from Page #9 to the last) soon.&lt;BR /&gt;&lt;BR /&gt;I think there is nothing wrong with having both threads. However, the thread in the '&lt;STRONG&gt;Intel AVX and CPU&lt;/STRONG&gt;' forum&lt;BR /&gt;has grown significantly. You could also make a note that all further discussion will take place, for example, in&lt;BR /&gt;the '&lt;STRONG&gt;Software Tuning, Performance Optimization&lt;/STRONG&gt;' forum.&lt;BR /&gt;&lt;BR /&gt;So, it is up to you to decide which thread stays 'alive' for discussions.&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Sergey&lt;/P&gt;</description>
      <pubDate>Sat, 16 Jun 2012 23:11:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802121#M689</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-06-16T23:11:46Z</dc:date>
    </item>
    <item>
      <title>Java polynomial approx. faster than C code polynomial approx.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802122#M690</link>
      <description>Hi everybody!&lt;BR /&gt;&lt;BR /&gt;I decided to continue this discussion in this thread &lt;A href="http://software.intel.com/en-us/forums/showpost.php?p=187714"&gt;http://software.intel.com/en-us/forums/showpost.php?p=187714&lt;/A&gt;</description>
      <pubDate>Sun, 17 Jun 2012 09:36:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802122#M690</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-06-17T09:36:09Z</dc:date>
    </item>
    <item>
      <title>Java polynomial approx. faster than C code polynomial approx.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802123#M691</link>
      <description>Have you run the test with the JRockit JVM?</description>
      <pubDate>Sun, 24 Jun 2012 16:56:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802123#M691</guid>
      <dc:creator>Maycon_Oliveira</dc:creator>
      <dc:date>2012-06-24T16:56:38Z</dc:date>
    </item>
    <item>
      <title>Java polynomial approx. faster than C code polynomial approx.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802124#M692</link>
      <description>...and change the test with different values for -XX:MinYoung and other tuning variables...</description>
      <pubDate>Sun, 24 Jun 2012 16:58:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802124#M692</guid>
      <dc:creator>Maycon_Oliveira</dc:creator>
      <dc:date>2012-06-24T16:58:38Z</dc:date>
    </item>
    <item>
      <title>Java polynomial approx. faster than C code polynomial approx.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802125#M693</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;...and change the test with different values for -XX:MinYoung and other tuning variables&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;The tests were made with the Sun (Oracle) JVM in two settings, client and server. The code was compiled by the incremental Eclipse JDT compiler and run with the Java Runtime Environment.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;java version "1.7.0"&lt;BR /&gt;Java SE Runtime Environment (build 1.7.0-b147)&lt;BR /&gt;Java HotSpot Client VM (build 21.0-b17, mixed mode)&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;Please bear in mind that the results posted in this thread are outdated: the C code was compiled in Debug mode, so code highly optimized by the JIT compiler could easily outperform the much slower native code.&lt;BR /&gt;Please look here, &lt;STRONG&gt;post #158&lt;/STRONG&gt; &lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=105474"&gt;http://software.intel.com/en-us/forums/showthread.php?t=105474&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;And here are the results of the aggressively optimized native code:&lt;BR /&gt;&lt;BR /&gt;Today I tested fastsin() with 1e6 iterations and the result was 15 millisec, i.e. ~33.39 cycles per iteration on my CPU.&lt;BR /&gt;&lt;BR /&gt;Results:&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;start val of fastsin() 13214314&lt;BR /&gt;end val of fastsin() 13214329&lt;BR /&gt;running time of fastsin() release code is: 15 millisec&lt;BR /&gt;fastsin() is: 0.891207360591512180000000&lt;BR /&gt;&lt;/STRONG&gt;</description>
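      <content:encoded><![CDATA[
The per-iteration figure can be reproduced with a small conversion helper. The clock frequency is not stated in the thread; 2.226 GHz is an assumption chosen only because it makes 15 millisec over 1e6 iterations come out at the quoted ~33.39 cycles. The function names are hypothetical.

```c
#include <assert.h>

/* Nanoseconds per iteration from a total time given in milliseconds. */
double ns_per_iteration(double total_ms, double iterations)
{
    assert(iterations > 0.0);
    return total_ms * 1.0e6 / iterations;
}

/* Cycles per iteration for a CPU running at `hz` cycles per second. */
double cycles_per_iteration(double total_ms, double iterations, double hz)
{
    return ns_per_iteration(total_ms, iterations) * 1.0e-9 * hz;
}
```

With these inputs, `ns_per_iteration(15.0, 1e6)` gives 15 nanosec and `cycles_per_iteration(15.0, 1e6, 2.226e9)` gives about 33.39. The same helper applied to the earlier 31 millisec over 1e7 iterations gives 3.1 nanosec.
]]></content:encoded>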
      <pubDate>Wed, 27 Jun 2012 05:55:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802125#M693</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-06-27T05:55:38Z</dc:date>
    </item>
    <item>
      <title>Java polynomial approx. faster than C code polynomial approx.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802126#M694</link>
      <description>No!&lt;BR /&gt;The compiler was the incremental Eclipse JDT compiler. The tests were made with the Sun JVM.</description>
      <pubDate>Thu, 28 Jun 2012 08:55:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Java-polynomial-approx-faster-than-C-code-polynomial-approx/m-p/802126#M694</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2012-06-28T08:55:52Z</dc:date>
    </item>
  </channel>
</rss>

