<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic When I read Jim's answer I in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084394#M5736</link>
    <description>&lt;P&gt;When I read Jim's answer I realized that there is a dependency on the FPU divider latency, which is around 15-20 cycles. I suppose the divider is not fully pipelined, so the next division uop(s) cannot start until the divider is free again, which is probably why you are seeing slower performance for the second line of code. Looking at the assembly snippet, I suppose that ri and rj are constants (I may be wrong here), so why not multiply by their inverses instead?&lt;/P&gt;</description>
    <pubDate>Thu, 14 Jan 2016 17:51:29 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2016-01-14T17:51:29Z</dc:date>
    <item>
      <title>code optimization and vtune</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084386#M5728</link>
      <description>&lt;P&gt;I am using VTune Amplifier to profile some code and noticed something rather unusual (perhaps to me but not to the experts here). There are two back-to-back statements in the code:&lt;/P&gt;

&lt;P&gt;Source Line&amp;nbsp;&amp;nbsp; &amp;nbsp;Source&amp;nbsp;&amp;nbsp; &amp;nbsp;Effective Time by Utilization&amp;nbsp;&amp;nbsp; &amp;nbsp;Spin Time&amp;nbsp;&amp;nbsp; &amp;nbsp;Overhead Time&amp;nbsp;&amp;nbsp; &amp;nbsp;Instructions Retired&lt;BR /&gt;
	590&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gi=gi*dd7/ri&amp;nbsp;&amp;nbsp; &amp;nbsp;133.138s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;1,471,982,200,000&lt;BR /&gt;
	591&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gj=gj*dd8/rj&amp;nbsp;&amp;nbsp; &amp;nbsp;1320.961s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;6,402,068,400,000&lt;/P&gt;

&lt;P&gt;The two lines of code have very different execution times, differing by almost a factor of 10, as reported by VTune Amplifier. The numbers of instructions retired for the two lines are also quite different: 1,471,982,200,000 for one and 6,402,068,400,000 for the other.&lt;/P&gt;

&lt;P&gt;Could somebody explain what resulted in the differences and what optimizations are possible?&lt;/P&gt;

&lt;P&gt;Thank you!&lt;/P&gt;

&lt;P&gt;Zhiyong&lt;/P&gt;</description>
      <pubDate>Tue, 12 Jan 2016 22:03:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084386#M5728</guid>
      <dc:creator>Zhiyong_Z_</dc:creator>
      <dc:date>2016-01-12T22:03:43Z</dc:date>
    </item>
    <item>
      <title>Are these statements</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084387#M5729</link>
      <description>&lt;P&gt;Are&amp;nbsp;these statements involving arrays?&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 12 Jan 2016 22:54:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084387#M5729</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-01-12T22:54:46Z</dc:date>
    </item>
    <item>
      <title>Hi Jim,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084388#M5730</link>
      <description>&lt;P&gt;Hi Jim,&lt;/P&gt;

&lt;P&gt;These are not arrays.&lt;/P&gt;

&lt;P&gt;Zhiyong&lt;/P&gt;</description>
      <pubDate>Tue, 12 Jan 2016 23:24:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084388#M5730</guid>
      <dc:creator>Zhiyong_Z_</dc:creator>
      <dc:date>2016-01-12T23:24:27Z</dc:date>
    </item>
    <item>
      <title>Where you rely on the non-</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084389#M5731</link>
      <description>&lt;P&gt;Where you rely on the non-"precise" event counters, you can't directly attribute the event counts to a single instruction or source line. In this case, it appears that the second group of operations spends much of its quoted time waiting for the first to complete. This is particularly likely on CPU models that have high latencies for division, especially when you count the time during which the FPU pipeline is blocked.&lt;/P&gt;

&lt;P&gt;Little can be said about possible optimizations without looking at the context: the possibility of vectorization, replacement of division by multiplication, ...&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jan 2016 04:55:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084389#M5731</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-01-13T04:55:00Z</dc:date>
    </item>
    <item>
      <title>Tim P. has a point about the</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084390#M5732</link>
      <description>&lt;P&gt;Tim P. has a point about the non-precise (with respect to location of instruction) event counters. You can confirm this to some extent yourself by making a test run after swapping those two statements (in the source, both appear to have the same computational complexity). The efficiency of the code may still differ, though, in that one or more of the variables in the faster statement may be held in registers while those in the slower statement are not. A secondary (possibly even primary) cause could be that the code &lt;EM&gt;preceding &lt;/EM&gt;the two statements in question tends to pre-load the L1 and/or L2 cache. Some of the VTune counters should be able to provide information on cache misses and/or memory stalls.&lt;/P&gt;
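
&lt;P&gt;A minimal sketch of that swap experiment, written in C rather than the original Fortran. The function and type names are made up for illustration; only the two statements themselves come from the original post. Since neither statement reads the other's result, both orders must produce identical values, so any change in the per-line times would point to how the profiler attributes samples rather than to the arithmetic:&lt;/P&gt;

&lt;PRE&gt;
```c
/* Hypothetical container for the two results; not part of the original code. */
typedef struct { double gi; double gj; } Pair;

/* Original order: source line 590 first, then source line 591. */
Pair update_original(double gi, double gj, double dd7, double dd8,
                     double ri, double rj)
{
    Pair p;
    p.gi = gi * dd7 / ri;   /* line 590 */
    p.gj = gj * dd8 / rj;   /* line 591 */
    return p;
}

/* Swapped order for the profiling experiment: line 591 first, then line 590. */
Pair update_swapped(double gi, double gj, double dd7, double dd8,
                    double ri, double rj)
{
    Pair p;
    p.gj = gj * dd8 / rj;   /* was line 591 */
    p.gi = gi * dd7 / ri;   /* was line 590 */
    return p;
}
```
&lt;/PRE&gt;

&lt;P&gt;Keep in mind that an optimizing compiler may schedule the instructions in the same order either way, so it is worth confirming in the VTune assembly view that the two division instructions really did trade places.&lt;/P&gt;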

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jan 2016 13:08:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084390#M5732</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-01-13T13:08:46Z</dc:date>
    </item>
    <item>
      <title>As Tim hinted there may be</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084391#M5733</link>
      <description>&lt;P&gt;As Tim hinted, there may be some kind of interdependency, perhaps at the uop level, that slows down the computation. Looking at your short code sample, I cannot see any dependency between those two statements: variable "gi" is not used as an input in computing variable "gj", at least not in the code snippet.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jan 2016 13:29:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084391#M5733</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2016-01-13T13:29:44Z</dc:date>
    </item>
    <item>
      <title>I have copied the source</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084392#M5734</link>
      <description>&lt;P&gt;I have copied the source lines and the corresponding assembly code for a "whole block" below. The few assignment statements in the source are crossed out, as they have no corresponding assembly. The two source lines 590 and 591 and their assembly code are highlighted in bold italic.&lt;/P&gt;

&lt;P&gt;The assembly code for the two source lines is always back to back and consists of move, mult, and div instructions. There are no dependencies between the two, and their operands are identical in register use and "memory access". The addresses in memory are &lt;EM&gt;&lt;STRONG&gt;-0x228 and -0x230&lt;/STRONG&gt;&lt;/EM&gt;, and the memory addresses accessed for the whole block range from -0x1d0 to -0x238. Please note that memory address -0x238 is associated with source line 583, the other line that does a division just like the two lines at issue here, but its reported time is only 45 sec., compared with 133 and 1320 for the other two lines respectively.&lt;/P&gt;

&lt;P&gt;With this information, can we conclude that the differences in time for these three lines, 583, 590, and 591, are due to cache and memory misses?&lt;/P&gt;

&lt;P&gt;Are there any other more explicit ways to see if there are indeed cache/memory misses?&lt;/P&gt;

&lt;P&gt;Why are the instructions retired for these three lines so different, by as much as an order of magnitude?&lt;/P&gt;

&lt;P&gt;It seems that division and multiplication take about the same time in this particular block of code?&lt;/P&gt;

&lt;P&gt;If one just wants to optimize the performance of this particular block of code, is it possible, and what would be the best way to go about it? Do I need to arrange the storage of the variables used here so that they are all fetched in the same memory access? Does reordering some of the operations help at all? Does the execution interleave memory access and computation?&lt;/P&gt;

&lt;P&gt;Thanks!&lt;/P&gt;

&lt;P&gt;Zhiyong&lt;/P&gt;

&lt;P&gt;Source Line&amp;nbsp;&amp;nbsp; &amp;nbsp;Source&amp;nbsp;&amp;nbsp; &amp;nbsp;Effective Time by Utilization&amp;nbsp;&amp;nbsp; &amp;nbsp;Spin Time&amp;nbsp;&amp;nbsp; &amp;nbsp;Overhead Time&amp;nbsp;&amp;nbsp; &amp;nbsp;Instructions Retired&lt;BR /&gt;
	571&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; if(ideriv.gt.0) then&amp;nbsp;&amp;nbsp; &amp;nbsp;9.355s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;102,596,000,000&lt;BR /&gt;
	&lt;S&gt;572&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gp=pc&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	573&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gu=pu&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	574&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; guu=puu&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	575&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gi=ppi&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	576&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gii=pii&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	577&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gj=pj&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	578&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gjj=pjj&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	579&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gui=pui&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	580&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; guj=puj&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;/S&gt;&lt;BR /&gt;
	581&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	582&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; guu=guu*dd1*dd1+gu*dd2&amp;nbsp;&amp;nbsp; &amp;nbsp;67.918s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;826,953,400,000&lt;BR /&gt;
	583&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gu=gu*dd1/rij&amp;nbsp;&amp;nbsp; &amp;nbsp;45.721s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;536,416,400,000&lt;BR /&gt;
	584&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	585&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gui=gui*dd1*dd7&amp;nbsp;&amp;nbsp; &amp;nbsp;32.883s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;375,057,800,000&lt;BR /&gt;
	586&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; guj=guj*dd1*dd8&amp;nbsp;&amp;nbsp; &amp;nbsp;0.228s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;1,957,800,000&lt;BR /&gt;
	587&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	588&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gii=gii*dd7*dd7+gi*dd9&amp;nbsp;&amp;nbsp; &amp;nbsp;93.586s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;1,034,625,800,000&lt;BR /&gt;
	589&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gjj=gjj*dd8*dd8+gj*dd10&amp;nbsp;&amp;nbsp; &amp;nbsp;586.273s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;6,776,933,800,000&lt;BR /&gt;
	&lt;EM&gt;&lt;STRONG&gt;590&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gi=gi*dd7/ri&amp;nbsp;&amp;nbsp; &amp;nbsp;133.138s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;1,471,982,200,000&lt;BR /&gt;
	591&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; gj=gj*dd8/rj&amp;nbsp;&amp;nbsp; &amp;nbsp;1320.961s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;6,402,068,400,000&lt;/STRONG&gt;&lt;/EM&gt;&lt;BR /&gt;
	592&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	593&amp;nbsp;&amp;nbsp; &amp;nbsp;!!!!&amp;nbsp; een for periodic systems&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; WAS&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	594&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; if(icutjasc .gt. 0 .or. iperiodic .ne. 0) then&amp;nbsp;&amp;nbsp; &amp;nbsp;626.434s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;963,357,200,000&lt;/P&gt;

&lt;P&gt;Assembly code:&lt;/P&gt;

&lt;P&gt;Address&amp;nbsp;&amp;nbsp; &amp;nbsp;Source Line&amp;nbsp;&amp;nbsp; &amp;nbsp;Assembly&amp;nbsp;&amp;nbsp; &amp;nbsp;Effective Time by Utilization&amp;nbsp;&amp;nbsp; &amp;nbsp;Spin Time&amp;nbsp;&amp;nbsp; &amp;nbsp;Overhead Time&amp;nbsp;&amp;nbsp; &amp;nbsp;Instructions Retired&lt;BR /&gt;
	0x8b36bf&amp;nbsp;&amp;nbsp; &amp;nbsp;571&amp;nbsp;&amp;nbsp; &amp;nbsp;jle 0x8b3da6 &amp;lt;Block 269&amp;gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	0x8b36c5&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;Block 255:&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;BR /&gt;
	0x8b36c5&amp;nbsp;&amp;nbsp; &amp;nbsp;582&amp;nbsp;&amp;nbsp; &amp;nbsp;movsdq&amp;nbsp; -0x210(%rbp), %xmm2&amp;nbsp;&amp;nbsp; &amp;nbsp;5.979s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;70,899,400,000&lt;BR /&gt;
	0x8b36cd&amp;nbsp;&amp;nbsp; &amp;nbsp;583&amp;nbsp;&amp;nbsp; &amp;nbsp;movaps %xmm11, %xmm1&amp;nbsp;&amp;nbsp; &amp;nbsp;14.281s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;173,641,000,000&lt;BR /&gt;
	0x8b36d1&amp;nbsp;&amp;nbsp; &amp;nbsp;588&amp;nbsp;&amp;nbsp; &amp;nbsp;movsdq&amp;nbsp; -0x208(%rbp), %xmm0&amp;nbsp;&amp;nbsp; &amp;nbsp;25.805s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;297,830,000,000&lt;BR /&gt;
	0x8b36d9&amp;nbsp;&amp;nbsp; &amp;nbsp;582&amp;nbsp;&amp;nbsp; &amp;nbsp;mulsdq&amp;nbsp; -0x1e8(%rbp), %xmm9&amp;nbsp;&amp;nbsp; &amp;nbsp;25.008s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;315,564,600,000&lt;BR /&gt;
	0x8b36e2&amp;nbsp;&amp;nbsp; &amp;nbsp;588&amp;nbsp;&amp;nbsp; &amp;nbsp;mulsdq&amp;nbsp; -0x1e0(%rbp), %xmm7&amp;nbsp;&amp;nbsp; &amp;nbsp;18.367s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;165,534,200,000&lt;BR /&gt;
	0x8b36ea&amp;nbsp;&amp;nbsp; &amp;nbsp;582&amp;nbsp;&amp;nbsp; &amp;nbsp;mulsd %xmm11, %xmm2&amp;nbsp;&amp;nbsp; &amp;nbsp;17.322s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;222,445,600,000&lt;BR /&gt;
	0x8b36ef&amp;nbsp;&amp;nbsp; &amp;nbsp;588&amp;nbsp;&amp;nbsp; &amp;nbsp;mulsd %xmm5, %xmm0&amp;nbsp;&amp;nbsp; &amp;nbsp;24.540s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;281,242,000,000&lt;BR /&gt;
	0x8b36f3&amp;nbsp;&amp;nbsp; &amp;nbsp;583&amp;nbsp;&amp;nbsp; &amp;nbsp;mulsdq&amp;nbsp; -0x1d0(%rbp), %xmm1&amp;nbsp;&amp;nbsp; &amp;nbsp;25.527s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;291,636,800,000&lt;BR /&gt;
	0x8b36fb&amp;nbsp;&amp;nbsp; &amp;nbsp;582&amp;nbsp;&amp;nbsp; &amp;nbsp;addsd %xmm9, %xmm2&amp;nbsp;&amp;nbsp; &amp;nbsp;12.469s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;137,586,800,000&lt;BR /&gt;
	0x8b3700&amp;nbsp;&amp;nbsp; &amp;nbsp;589&amp;nbsp;&amp;nbsp; &amp;nbsp;mulsdq&amp;nbsp; -0x200(%rbp), %xmm10&amp;nbsp;&amp;nbsp; &amp;nbsp;25.294s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;259,220,000,000&lt;BR /&gt;
	0x8b3709&amp;nbsp;&amp;nbsp; &amp;nbsp;588&amp;nbsp;&amp;nbsp; &amp;nbsp;addsd %xmm7, %xmm0&amp;nbsp;&amp;nbsp; &amp;nbsp;20.213s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;237,200,600,000&lt;BR /&gt;
	0x8b370d&amp;nbsp;&amp;nbsp; &amp;nbsp;585&amp;nbsp;&amp;nbsp; &amp;nbsp;mulsdq&amp;nbsp; -0x1f0(%rbp), %xmm12&amp;nbsp;&amp;nbsp; &amp;nbsp;26.725s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;305,502,600,000&lt;BR /&gt;
	0x8b3716&amp;nbsp;&amp;nbsp; &amp;nbsp;583&amp;nbsp;&amp;nbsp; &amp;nbsp;divsdq&amp;nbsp; -0x238(%rbp), %xmm1&amp;nbsp;&amp;nbsp; &amp;nbsp;5.913s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;71,138,600,000&lt;BR /&gt;
	0x8b371e&amp;nbsp;&amp;nbsp; &amp;nbsp;589&amp;nbsp;&amp;nbsp; &amp;nbsp;movsdq&amp;nbsp; -0x218(%rbp), %xmm9&amp;nbsp;&amp;nbsp; &amp;nbsp;554.227s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;6,440,345,600,000&lt;BR /&gt;
	&lt;EM&gt;&lt;STRONG&gt;0x8b3727&amp;nbsp;&amp;nbsp; &amp;nbsp;590&amp;nbsp;&amp;nbsp; &amp;nbsp;movsdq&amp;nbsp; -0x1c8(%rbp), %xmm7&amp;nbsp;&amp;nbsp; &amp;nbsp;1.367s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;15,813,200,000&lt;/STRONG&gt;&lt;/EM&gt;&lt;BR /&gt;
	&lt;EM&gt;&lt;STRONG&gt;0x8b372f&amp;nbsp;&amp;nbsp; &amp;nbsp;591&amp;nbsp;&amp;nbsp; &amp;nbsp;movsdq&amp;nbsp; -0x1d8(%rbp), %xmm3&amp;nbsp;&amp;nbsp; &amp;nbsp;3.675s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;47,980,400,000&lt;/STRONG&gt;&lt;/EM&gt;&lt;BR /&gt;
	0x8b3737&amp;nbsp;&amp;nbsp; &amp;nbsp;589&amp;nbsp;&amp;nbsp; &amp;nbsp;mulsd %xmm8, %xmm9&amp;nbsp;&amp;nbsp; &amp;nbsp;0.006s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;15,600,000&lt;BR /&gt;
	&lt;EM&gt;&lt;STRONG&gt;0x8b373c&amp;nbsp;&amp;nbsp; &amp;nbsp;590&amp;nbsp;&amp;nbsp; &amp;nbsp;mulsd %xmm5, %xmm7&amp;nbsp;&amp;nbsp; &amp;nbsp;65.909s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;735,233,200,000&lt;BR /&gt;
	0x8b3740&amp;nbsp;&amp;nbsp; &amp;nbsp;591&amp;nbsp;&amp;nbsp; &amp;nbsp;mulsd %xmm8, %xmm3&amp;nbsp;&amp;nbsp; &amp;nbsp;1.284s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;15,758,600,000&lt;/STRONG&gt;&lt;/EM&gt;&lt;BR /&gt;
	0x8b3745&amp;nbsp;&amp;nbsp; &amp;nbsp;589&amp;nbsp;&amp;nbsp; &amp;nbsp;addsd %xmm10, %xmm9&amp;nbsp;&amp;nbsp; &amp;nbsp;3.635s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;44,421,000,000&lt;BR /&gt;
	0x8b374a&amp;nbsp;&amp;nbsp; &amp;nbsp;586&amp;nbsp;&amp;nbsp; &amp;nbsp;mulsdq&amp;nbsp; -0x1f8(%rbp), %xmm6&amp;nbsp;&amp;nbsp; &amp;nbsp;0.005s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;36,400,000&lt;BR /&gt;
	&lt;EM&gt;&lt;STRONG&gt;0x8b3752&amp;nbsp;&amp;nbsp; &amp;nbsp;590&amp;nbsp;&amp;nbsp; &amp;nbsp;divsdq&amp;nbsp; -0x228(%rbp), %xmm7&amp;nbsp;&amp;nbsp; &amp;nbsp;65.862s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;720,935,800,000&lt;BR /&gt;
	0x8b375a&amp;nbsp;&amp;nbsp; &amp;nbsp;591&amp;nbsp;&amp;nbsp; &amp;nbsp;divsdq&amp;nbsp; -0x230(%rbp), %xmm3&amp;nbsp;&amp;nbsp; &amp;nbsp;1316.002s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;6,338,329,400,000&lt;/STRONG&gt;&lt;/EM&gt;&lt;BR /&gt;
	0x8b3762&amp;nbsp;&amp;nbsp; &amp;nbsp;594&amp;nbsp;&amp;nbsp; &amp;nbsp;cmpl&amp;nbsp; $0x0, -0x220(%rbp)&amp;nbsp;&amp;nbsp; &amp;nbsp;591.793s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;907,816,000,000&lt;BR /&gt;
	0x8b3769&amp;nbsp;&amp;nbsp; &amp;nbsp;594&amp;nbsp;&amp;nbsp; &amp;nbsp;jle 0x8b6ca3 &amp;lt;Block 376&amp;gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;0.028s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;0s&amp;nbsp;&amp;nbsp; &amp;nbsp;5,200,000&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 14 Jan 2016 00:31:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084392#M5734</guid>
      <dc:creator>Zhiyong_Z_</dc:creator>
      <dc:date>2016-01-14T00:31:28Z</dc:date>
    </item>
    <item>
      <title>The two movsdq's are from the</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084393#M5735</link>
      <description>&lt;P&gt;The two movsdq's load from the same cache line, yet the second instruction gets charged more. This illustrates what Tim P. was talking about, where the overhead appears to be billed to a different or following instruction. Your interpretation of VTune's counters has to take this into consideration.&lt;/P&gt;

&lt;P&gt;The two mulsd's, with the first taking more time, likely reflect that the first instruction, which depends on xmm7, was charged the memory fetch time for the cache line from which xmm3 also gets its data. The lesser time for the second mulsd reflects that no memory (or L3 or L2) stall occurred in getting the data into xmm3.&lt;/P&gt;

&lt;P&gt;The divsdq's (my interpretation) reflect a similar memory latency on the fetch of a shared cache line (holding -0x228 and -0x230 off rbp), with the first instruction of each pair (the mulsd pair and the divsdq pair) at around 65.9s; in addition, the SSE FPU can only perform one division at a time, so the second instruction had to wait.&lt;/P&gt;

&lt;P&gt;I predict that if you swap these two statements, you will still observe the second statement (whichever one it then is) taking longer.&lt;/P&gt;
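
&lt;P&gt;If the divider really is the limiting resource here, one thing to try is issuing a single division and sharing it between both results. The sketch below is in C rather than the original Fortran, and the function and type names are invented for illustration. It relies on the identities 1/ri = rj/(ri*rj) and 1/rj = ri/(ri*rj); this trades one division for a few extra multiplications, can differ from the original in the last bits, and assumes ri*rj neither overflows nor underflows:&lt;/P&gt;

&lt;PRE&gt;
```c
/* Hypothetical container for the two results; not part of the original code. */
typedef struct { double gi; double gj; } Scaled;

/* Straightforward form: two divisions, as in source lines 590 and 591. */
Scaled scale_two_div(double gi, double gj, double dd7, double dd8,
                     double ri, double rj)
{
    Scaled s;
    s.gi = gi * dd7 / ri;
    s.gj = gj * dd8 / rj;
    return s;
}

/* Variant with one shared division.
 * Assumes ri and rj are nonzero and that ri * rj stays in range. */
Scaled scale_one_div(double gi, double gj, double dd7, double dd8,
                     double ri, double rj)
{
    double inv = 1.0 / (ri * rj);   /* the only division */
    Scaled s;
    s.gi = gi * dd7 * rj * inv;     /* equals gi * dd7 / ri up to rounding */
    s.gj = gj * dd8 * ri * inv;     /* equals gj * dd8 / rj up to rounding */
    return s;
}
```
&lt;/PRE&gt;

&lt;P&gt;Whether this is a net win depends on the CPU and on what the compiler already does with the surrounding code, so it would need to be measured.&lt;/P&gt;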

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 14 Jan 2016 17:13:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084393#M5735</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-01-14T17:13:24Z</dc:date>
    </item>
    <item>
      <title>When I read Jim's answer I</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084394#M5736</link>
      <description>&lt;P&gt;When I read Jim's answer I realized that there is a dependency on the FPU divider latency, which is around 15-20 cycles. I suppose the divider is not fully pipelined, so the next division uop(s) cannot start until the divider is free again, which is probably why you are seeing slower performance for the second line of code. Looking at the assembly snippet, I suppose that ri and rj are constants (I may be wrong here), so why not multiply by their inverses instead?&lt;/P&gt;</description>
      <pubDate>Thu, 14 Jan 2016 17:51:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/code-optimization-and-vtune/m-p/1084394#M5736</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2016-01-14T17:51:29Z</dc:date>
    </item>
  </channel>
</rss>

