<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Gilles, in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/theoretical-peak-integer-performance/m-p/1036118#M4414</link>
    <description>&lt;P&gt;Gilles,&lt;/P&gt;

&lt;P&gt;Integer FMA for 32-bit data is only available on Intel Xeon Phi. For exploring the Intel instruction set, I like the &lt;A href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"&gt;interactive intrinsics guide&lt;/A&gt;.&lt;/P&gt;

&lt;P&gt;The latency and throughput of instructions are described in Appendix C of the&amp;nbsp;&lt;A href="https://www-ssl.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html"&gt;Intel® 64 and IA-32 Architectures Optimization Reference Manual&lt;/A&gt;.&lt;/P&gt;

&lt;P&gt;Kind regards&lt;/P&gt;

&lt;P&gt;Thomas&lt;/P&gt;</description>
    <pubDate>Fri, 12 Jun 2015 16:11:08 GMT</pubDate>
    <dc:creator>Thomas_W_Intel</dc:creator>
    <dc:date>2015-06-12T16:11:08Z</dc:date>
    <item>
      <title>theoretical peak integer performance</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/theoretical-peak-integer-performance/m-p/1036117#M4413</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;In order to play with roofline charts for &lt;STRONG&gt;32b integer&lt;/STRONG&gt;-based code, I am struggling to find the theoretical peak integer performance of the Ivy Bridge, Haswell and Knights Corner processors and coprocessors.&lt;/P&gt;

&lt;P&gt;For floating point, that's "easy": vector length / type length * 2 (for FMA) * #cores * freq&lt;/P&gt;
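
&lt;P&gt;A minimal sketch of that formula in Python, assuming one vector instruction issues per cycle per core. The function name, core count and clock are made-up example values, not figures for any specific part:&lt;/P&gt;

```python
def peak_flops(vector_bits, type_bits, cores, freq_hz, fma=True):
    """Peak floating-point operations per second from the formula above."""
    lanes = vector_bits // type_bits      # elements per vector register
    ops_per_lane = 2 if fma else 1        # an FMA counts as multiply plus add
    return lanes * ops_per_lane * cores * freq_hz


# Hypothetical machine: 256-bit vectors, 32-bit floats, 8 cores at 2.5 GHz.
example = peak_flops(256, 32, cores=8, freq_hz=2.5e9)
print(example / 1e9, "GFLOP/s")
```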

&lt;P&gt;Now, for integers, that's another story:&lt;/P&gt;

&lt;OL&gt;
	&lt;LI&gt;Ivy Bridge's 256b AVX doesn't support integer operations, but 128b SSE does support some... But which ones exactly? I saw an integer FMA for 16b integers, a 32b add with a 0.5-cycle throughput, and a 32b multiply with a 1-cycle throughput. Does that mean that I can on average expect a 1.5 multiply/add throughput (for a typical matrix multiplication)?&lt;/LI&gt;
	&lt;LI&gt;For Haswell, 256b AVX2 does support some integer operations. But again, I didn't find any FMA for 32b data, only the 0.5-cycle add and the 1-cycle multiply. So basically, same question here...&lt;/LI&gt;
	&lt;LI&gt;For Xeon Phi Knights Corner, apparently we do have a 512b vector FMA for 32b integers. However, the throughput isn't given (I assume it's 1 cycle). So I can go for "512 / 32 * 2 (for FMA) * freq * #cores" for the peak, right?&lt;/LI&gt;
&lt;/OL&gt;

&lt;P&gt;So altogether, what should my theoretical 32b integer peak performance be for these three architectures (and possibly others) for a 32b integer matrix-matrix multiplication kind of workload? And why?&lt;/P&gt;

&lt;P&gt;Thank you very much for any help with this.&lt;/P&gt;

&lt;P&gt;Gilles&lt;/P&gt;</description>
      <pubDate>Thu, 11 Jun 2015 12:36:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/theoretical-peak-integer-performance/m-p/1036117#M4413</guid>
      <dc:creator>gilles_c_1</dc:creator>
      <dc:date>2015-06-11T12:36:03Z</dc:date>
    </item>
    <item>
      <title>Gilles,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/theoretical-peak-integer-performance/m-p/1036118#M4414</link>
      <description>&lt;P&gt;Gilles,&lt;/P&gt;

&lt;P&gt;Integer FMA for 32-bit data is only available on Intel Xeon Phi. For exploring the Intel instruction set, I like the &lt;A href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/"&gt;interactive intrinsics guide&lt;/A&gt;.&lt;/P&gt;

&lt;P&gt;The latency and throughput of instructions are described in Appendix C of the&amp;nbsp;&lt;A href="https://www-ssl.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html"&gt;Intel® 64 and IA-32 Architectures Optimization Reference Manual&lt;/A&gt;.&lt;/P&gt;

&lt;P&gt;Kind regards&lt;/P&gt;

&lt;P&gt;Thomas&lt;/P&gt;</description>
      <pubDate>Fri, 12 Jun 2015 16:11:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/theoretical-peak-integer-performance/m-p/1036118#M4414</guid>
      <dc:creator>Thomas_W_Intel</dc:creator>
      <dc:date>2015-06-12T16:11:08Z</dc:date>
    </item>
    <item>
      <title>It is difficult to discuss</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/theoretical-peak-integer-performance/m-p/1036119#M4415</link>
      <description>&lt;P&gt;It is difficult to discuss peak integer performance without being more specific about what types of multiplication are required and whether the integers are signed or unsigned.&lt;/P&gt;

&lt;P&gt;For Ivy Bridge the peak 32-bit integer performance looks like 8 ops/cycle: a 4-wide add (128-bit SSE or 128-bit AVX with packed doublewords) plus a 4-wide multiply (SSE4.1 PMULLD or AVX VPMULLD 128-bit with signed packed doublewords, keeping only the low half of each result). If you need to keep all 64 bits of the multiply result, then the multiplication rate is halved: you can use the PMULDQ/PMULUDQ instructions to multiply 2 of the 4 elements in a 128-bit register and store the two 64-bit products in the output register. It looks like all of these are fully pipelined, issuing at least one instruction per cycle.&lt;/P&gt;
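
&lt;P&gt;To make the difference between the two kinds of multiply concrete, here is a rough Python model of the lane semantics described above. It is only a sketch: the helper names are hypothetical, the inputs are assumed to be valid signed 32-bit values, the unsigned PMULUDQ variant is not modeled, and it says nothing about performance.&lt;/P&gt;

```python
def to_signed(value, bits):
    """Reinterpret a value modulo 2**bits as a signed integer of that width."""
    half = 2 ** (bits - 1)
    return (value + half) % 2 ** bits - half


def pmulld(a, b):
    """Model of PMULLD: 4 lane-wise 32-bit multiplies, low 32 bits kept."""
    return [to_signed(x * y, 32) for x, y in zip(a, b)]


def pmuldq(a, b):
    """Model of PMULDQ: signed multiply of lanes 0 and 2, full 64-bit products."""
    return [to_signed(a[i] * b[i], 64) for i in (0, 2)]


a = [100000, -3, 70000, 5]
b = [100000, 7, -70000, 9]
print(pmulld(a, b))   # 4 results, large products wrap around to 32 bits
print(pmuldq(a, b))   # 2 results, exact 64-bit products
```

&lt;P&gt;The first call produces four results per instruction but loses the high bits of large products; the second keeps the full products but only for two of the four lanes, which is where the halved multiplication rate comes from.&lt;/P&gt;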

&lt;P&gt;For Haswell the peak 32-bit integer performance looks like 12 ops/cycle: an 8-wide AVX2 packed integer add plus either an 8-wide 32-bit packed integer multiply (VPMULLD) that saves the low-order 32 bits (but executes only once every 2 cycles) or a VPMULDQ that multiplies the even-numbered 32-bit sub-fields of two 256-bit registers and saves the four 64-bit results in an output register.&lt;/P&gt;

&lt;P&gt;For Xeon Phi the peak 32-bit integer performance is 16 ops/cycle if you can use the VPMADD instructions. These discard the upper 32 bits of the result. Also note that you must be running at least 2 threads per physical core if you want to issue instructions every cycle. Xeon Phi also supports an ordinary packed 32-bit add (VPADDD) and separate instructions for packed 32-bit multiplication that store the high-order and low-order 32 bits of the result. There is not a lot of documentation on latency and throughput for Xeon Phi vector instructions, but these are all very likely to be fully pipelined.&lt;/P&gt;
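
&lt;P&gt;Putting the per-core estimates together, a roofline ceiling could be computed roughly as follows. This is a sketch only: the table simply restates the ops/cycle figures from this post, and the core count and clock in the example are made up.&lt;/P&gt;

```python
# Per-core 32-bit integer operations per cycle, as estimated in this post.
OPS_PER_CYCLE = {
    "Ivy Bridge": 4 + 4,        # 4-wide add plus 4-wide multiply (PMULLD)
    "Haswell": 8 + 8 // 2,      # 8-wide add plus 8-wide VPMULLD every 2 cycles
    "Xeon Phi": 16,             # figure quoted above for VPMADD
}


def peak_int_ops(arch, cores, freq_hz):
    """Peak integer operations per second for a whole chip."""
    return OPS_PER_CYCLE[arch] * cores * freq_hz


# Hypothetical part: a 12-core Haswell at 2.5 GHz.
print(peak_int_ops("Haswell", cores=12, freq_hz=2.5e9) / 1e9, "Gop/s")
```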

&lt;P&gt;Of course I might have gotten confused in there somewhere...&lt;/P&gt;</description>
      <pubDate>Sat, 13 Jun 2015 00:54:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/theoretical-peak-integer-performance/m-p/1036119#M4415</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2015-06-13T00:54:27Z</dc:date>
    </item>
    <item>
      <title>Wow, thank you very much,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/theoretical-peak-integer-performance/m-p/1036120#M4416</link>
      <description>&lt;P&gt;Wow, thank you very much, that's quite the answer: In addition to be very detailed and answering all of my questions, it also makes me feel like I'm not ashamed I didn't find it all by myself despite my long efforts.&lt;/P&gt;

&lt;P&gt;Thanks again, I really appreciate it.&lt;/P&gt;

&lt;P&gt;Gilles&lt;/P&gt;</description>
      <pubDate>Sat, 13 Jun 2015 05:22:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/theoretical-peak-integer-performance/m-p/1036120#M4416</guid>
      <dc:creator>gilles_c_1</dc:creator>
      <dc:date>2015-06-13T05:22:43Z</dc:date>
    </item>
  </channel>
</rss>

