<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Single precision ADD and MUL latencies on Ivybridge in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849574#M1155</link>
    <description>Tim/Intel,&lt;DIV&gt;&amp;nbsp; &amp;nbsp; This latency issue was bothering me, so I dug in for half an hour this morning and built a simple asm test you can use to see what I'm talking about. Interestingly, the behavior appears to be different on SB and on IvyB.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; I've added 2 files. Just compile with gcc and run the test. You'll note you get a latency of 4 clocks.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; Now, if you comment out lines 21 and 22, which removes the replicated floats in bits 32-127, you'll see you get a latency of 3 clocks. Replicating the dest had no effect on latency, but having non-zero bits in 32-127 of the sources appears to extend the latency of addss by 1 clock, and likely mulss as well. I doubt this case comes up very often, but still, can you tell me why this happens? What changed from SB to IvyB?&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp;Addss only needs bits 0-31 and doesn't rely on bits 32-127 of source 2, so why does this matter?&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Perfwise&lt;/DIV&gt;</description>
    <pubDate>Sun, 26 Aug 2012 15:53:06 GMT</pubDate>
    <dc:creator>perfwise</dc:creator>
    <dc:date>2012-08-26T15:53:06Z</dc:date>
    <item>
      <title>Single precision ADD and MUL latencies on Ivybridge</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849570#M1151</link>
      <description>Hi,&lt;DIV&gt;&amp;nbsp; &amp;nbsp; I've looked in the optimization guide, and it states that the latencies of SP and DP FP ADD and MUL instructions are still 3 and 5 cycles, but I now measure a 4-cycle latency on ADDSS and ADDPS, whereas on Intel SB it was 3. The DP variants (ADDSD and ADDPD) are still 3 cycles. Likewise, I now measure a latency of 6 cycles on MULSS and MULPS, whereas I measured 5 before. DP is unchanged at 5 cycles. To determine throughput I run loops containing many copies of a single instruction; latency is determined the same way, but with chained dependencies. A chained dependency would be:&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;addss xmm0,xmm1&lt;/DIV&gt;&lt;DIV&gt;addss xmm0,xmm2&lt;/DIV&gt;&lt;DIV&gt;addss xmm0,xmm1&lt;/DIV&gt;&lt;DIV&gt;addss xmm0,xmm2&lt;/DIV&gt;&lt;DIV&gt;...&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;where xmm1 and xmm2 are the negatives of one another.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Thanks for any advice.&lt;/DIV&gt;&lt;DIV&gt;Perfwise&lt;/DIV&gt;</description>
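      <!--
The chained-dependency measurement described in the post above can be sketched in portable C. This is a rough illustration, not the poster's assembly test: the helper name chained_add_latency and its parameters are hypothetical, it reports wall-clock nanoseconds rather than core cycles, and it assumes the compiler turns each scalar float add into one addss (worth checking in the generated assembly).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict C modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: run a serial chain of 2*pairs float adds.
   Returns the final accumulator and stores the average time per add
   (in nanoseconds) through ns_per_add. */
static float chained_add_latency(uint64_t pairs, double *ns_per_add)
{
    assert(pairs > 0);

    /* Reading through volatile keeps the compiler from treating the
       operands as compile-time constants and folding the loop away. */
    volatile float vpos = 0.5f, vneg = -0.5f;
    float pos = vpos, neg = vneg;   /* negatives of one another */
    float acc = 1.0f;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < pairs; i++) {
        acc += pos;   /* depends on the result of the previous add */
        acc += neg;   /* depends on the add directly above */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    *ns_per_add = ns / (2.0 * (double)pairs);
    return acc;   /* back at 1.0f, since 0.5 and 1.5 are exact in binary */
}
```

Multiplying the reported nanoseconds per add by the core clock in GHz gives an approximate latency in cycles, provided the clock frequency stays fixed during the run.
      -->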
      <pubDate>Wed, 22 Aug 2012 19:18:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849570#M1151</guid>
      <dc:creator>perfwise</dc:creator>
      <dc:date>2012-08-22T19:18:31Z</dc:date>
    </item>
    <item>
      <title>Single precision ADD and MUL latencies on Ivybridge</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849571#M1152</link>
      <description>Hi,&lt;DIV&gt;&amp;nbsp; &amp;nbsp; Just inquiring whether anyone at Intel knows the answer to the question above. It's quite easy to build this code, run it, measure the latency of the instruction, and confirm or correct me. Thanks for any helpful response.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Perfwise&lt;/DIV&gt;</description>
      <pubDate>Sat, 25 Aug 2012 14:30:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849571#M1152</guid>
      <dc:creator>perfwise</dc:creator>
      <dc:date>2012-08-25T14:30:15Z</dc:date>
    </item>
    <item>
      <title>Single precision ADD and MUL latencies on Ivybridge</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849572#M1153</link>
      <description>According to what we were told yesterday, core update 3 instruction latencies haven't been validated for publication.&amp;nbsp; Yes, I find it even easier to make a mistake on those update numbers than on the internal model nicknames which we have been asked not to use.&lt;BR /&gt;The primary emphasis is on the vmulss, vmulps, and the like, which immediately zero the upper register contents rather than preserving dependencies on previous instructions.&amp;nbsp; The compilers don't necessarily account for those differences yet.&amp;nbsp; You might check the generated code, particularly if you aren't using intrinsics.&amp;nbsp; The compiler would insert instructions such as xorps to break those hidden dependencies for SSE.&lt;BR /&gt;I don't know if the legacy SSE instruction latencies would be quoted if they should come out different from the AVX ones.</description>
      <pubDate>Sat, 25 Aug 2012 15:48:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849572#M1153</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-08-25T15:48:39Z</dc:date>
    </item>
    <item>
      <title>Single precision ADD and MUL latencies on Ivybridge</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849573#M1154</link>
      <description>Tim,&lt;DIV&gt;&amp;nbsp; &amp;nbsp; Yeah, but I'm coding in assembly and have about 2500 different tests which measure latency and throughput, and they're accurate on x86 architectures. I've noted that both the SSE and AVX forms of the addss and addps instructions come up as 1 clk longer in latency on my Ivybridge processor than on SandyBridge. Can you confirm on your end that you see the same?&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; The optimization guide in the Intel developer PDF says there's no difference in latency, but something's amiss. I could code up an assembly test for you if it's too much work, though it's quite simple to build a test to do this. I just want to confirm what I'm seeing. FP load latency is the same as it was in SB, as is the integer load latency (int is 4 clks, FP looks like 5-6 clks [lea latencies interfere with this, though, because of the form I'm using and your implementation]).&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Let me know, and maybe tomorrow I'll post an assembly test you could run. I just thought I'd bring this to your attention.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Perfwise&lt;/DIV&gt;</description>
      <pubDate>Sat, 25 Aug 2012 22:52:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849573#M1154</guid>
      <dc:creator>perfwise</dc:creator>
      <dc:date>2012-08-25T22:52:27Z</dc:date>
    </item>
    <item>
      <title>Single precision ADD and MUL latencies on Ivybridge</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849574#M1155</link>
      <description>Tim/Intel,&lt;DIV&gt;&amp;nbsp; &amp;nbsp; This latency issue was bothering me, so I dug in for half an hour this morning and built a simple asm test you can use to see what I'm talking about. Interestingly, the behavior appears to be different on SB and on IvyB.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; I've added 2 files. Just compile with gcc and run the test. You'll note you get a latency of 4 clocks.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp; Now, if you comment out lines 21 and 22, which removes the replicated floats in bits 32-127, you'll see you get a latency of 3 clocks. Replicating the dest had no effect on latency, but having non-zero bits in 32-127 of the sources appears to extend the latency of addss by 1 clock, and likely mulss as well. I doubt this case comes up very often, but still, can you tell me why this happens? What changed from SB to IvyB?&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp;Addss only needs bits 0-31 and doesn't rely on bits 32-127 of source 2, so why does this matter?&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Perfwise&lt;/DIV&gt;</description>
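      <!--
The two source layouts compared in the post above (value replicated into bits 32-127 versus upper bits left at zero) can be set up with SSE intrinsics. This is a sketch only, and it is not the attached test: the helper name addss_chain and its replicate flag are hypothetical, it is x86-only, and the compiler may emit addss or vaddss depending on flags. Timing each variant is left to whatever cycle counter you already use.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>   /* SSE intrinsics; x86 only */

/* Hypothetical helper: serial chain of scalar single-precision adds on
   one accumulator register.
   replicate != 0: sources hold the value in all four lanes
                   (bits 32-127 non-zero)
   replicate == 0: sources hold the value in lane 0 only
                   (bits 32-127 zero) */
static float addss_chain(uint64_t pairs, int replicate)
{
    assert(pairs > 0);

    /* volatile read so the operand is not a compile-time constant */
    volatile float v = 0.5f;
    float x = v;

    __m128 pos = replicate ? _mm_set1_ps(x)  : _mm_set_ss(x);
    __m128 neg = replicate ? _mm_set1_ps(-x) : _mm_set_ss(-x);
    __m128 acc = _mm_set_ss(1.0f);   /* destination: upper lanes zero */

    for (uint64_t i = 0; i < pairs; i++) {
        acc = _mm_add_ss(acc, pos);  /* lane 0: acc + x, upper lanes from acc */
        acc = _mm_add_ss(acc, neg);  /* lane 0: back to the starting value */
    }
    return _mm_cvtss_f32(acc);       /* extract lane 0 */
}
```

Both variants compute the same lane-0 result, so any timing difference between them comes from the contents of the upper source bits rather than from the arithmetic.
      -->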
      <pubDate>Sun, 26 Aug 2012 15:53:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849574#M1155</guid>
      <dc:creator>perfwise</dc:creator>
      <dc:date>2012-08-26T15:53:06Z</dc:date>
    </item>
    <item>
      <title>Single precision ADD and MUL latencies on Ivybridge</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849575#M1156</link>
      <description>I'm not an expert on this question; I have remote access to an IVB single CPU lab machine when required, but my main job has been MIC/Phi.&lt;BR /&gt;You're correct AFAIK that IVB/core i7-3 is advertised as not changing latencies, aside from improvements in divide/sqrt/unaligned 256-bit.&amp;nbsp; We haven't reached the stage of seeing in practice whether more cores will be available.&lt;BR /&gt;As I pointed out, performance of AVX-128/256 is probably a higher priority than SSE2, and the treatment of both AVX and SSE by production compilers is probably of greatest importance to most people; there are potentially significant differences in those compiler treatments.&amp;nbsp; From a personal point of view, I would not be surprised by a small change in performance of "legacy" (SSE/SSE2) instructions.&amp;nbsp; I looked superficially at how compilers treat SNB vs. IVB and didn't see any distinctions that would take advantage of the (usually minor) changes in IVB.&lt;BR /&gt;You may have intended to attach something, but no attachment came through.</description>
      <pubDate>Mon, 27 Aug 2012 02:04:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849575#M1156</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-08-27T02:04:11Z</dc:date>
    </item>
    <item>
      <title>Single precision ADD and MUL latencies on Ivybridge</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849576#M1157</link>
      <description>Tim,&lt;DIV&gt;&amp;nbsp; &amp;nbsp; The files are attached now. I also believe the observations are the same for VADDSS and VMULSS, so this isn't specific to SSE. Though, as I said, it's unlikely to have much performance impact. I just wanted to know why I'm observing this. Thanks.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Perfwise&lt;/DIV&gt;</description>
      <pubDate>Mon, 27 Aug 2012 12:16:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Single-precision-ADD-and-MUL-latencies-on-Ivybridge/m-p/849576#M1157</guid>
      <dc:creator>perfwise</dc:creator>
      <dc:date>2012-08-27T12:16:46Z</dc:date>
    </item>
  </channel>
</rss>

