<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic http://www.agner.org/optimize in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Binary-operations-per-clock-cycle-in-Intel-Xeon-Phi-processors/m-p/1129725#M6263</link>
    <description>&lt;P&gt;&lt;A href="http://www.agner.org/optimize/instruction_tables.pdf"&gt;http://www.agner.org/optimize/instruction_tables.pdf&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;has some useful timing information. XOR on KNL shows a latency of 2 cycles and a reciprocal throughput of 0.5 cycles, so up to two independent XORs can be issued per cycle. If you want an XNOR, you will have to NOT the result. For performance, you would want to interleave the XOR and NOT with other instruction(s).&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Sun, 24 Sep 2017 13:29:28 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2017-09-24T13:29:28Z</dc:date>
    <item>
      <title>Binary operations per clock cycle in Intel Xeon Phi processors.</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Binary-operations-per-clock-cycle-in-Intel-Xeon-Phi-processors/m-p/1129724#M6262</link>
      <description>&lt;DIV class="gmail_default" style="color: rgb(34, 34, 34); font-size: 12.8px; font-family: arial, helvetica, sans-serif;"&gt;To see the acceleration of XNOR-nets on CPUs, I have been reading a paper, which claims that most CPUs execute 64 binary operations in one clock cycle. Thus the speedup is calculated accordingly.&lt;/DIV&gt;

&lt;DIV class="gmail_default" style="color: rgb(34, 34, 34); font-size: 12.8px; font-family: arial, helvetica, sans-serif;"&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV class="gmail_default" style="color: rgb(34, 34, 34); font-size: 12.8px; font-family: arial, helvetica, sans-serif;"&gt;To calculate the speed up in the XNOR-net, i need to know how many binary operations per clock cycle can be executed by KNL processors. How can I find this information for a CPU?&lt;/DIV&gt;

&lt;DIV class="gmail_default" style="color: rgb(34, 34, 34); font-size: 12.8px; font-family: arial, helvetica, sans-serif;"&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV class="gmail_default" style="color: rgb(34, 34, 34); font-size: 12.8px; font-family: arial, helvetica, sans-serif;"&gt;Does AVX-512 imply that 512 bitwise operations are possible every clock cycle?&lt;/DIV&gt;

&lt;DIV class="gmail_default" style="color: rgb(34, 34, 34); font-size: 12.8px; font-family: arial, helvetica, sans-serif;"&gt;If this is indeed correct, can you suggest some material with the reference of which I can attempt to code bitwise convolution operations which take advantage of the Intel architecture?&lt;/DIV&gt;

&lt;DIV class="gmail_default" style="color: rgb(34, 34, 34); font-size: 12.8px; font-family: arial, helvetica, sans-serif;"&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV class="gmail_default" style="color: rgb(34, 34, 34); font-size: 12.8px; font-family: arial, helvetica, sans-serif;"&gt;Thank you!&lt;/DIV&gt;</description>
      <pubDate>Fri, 22 Sep 2017 15:51:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Binary-operations-per-clock-cycle-in-Intel-Xeon-Phi-processors/m-p/1129724#M6262</guid>
      <dc:creator>YAkha</dc:creator>
      <dc:date>2017-09-22T15:51:24Z</dc:date>
    </item>
    <item>
      <title>http://www.agner.org/optimize</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Binary-operations-per-clock-cycle-in-Intel-Xeon-Phi-processors/m-p/1129725#M6263</link>
      <description>&lt;P&gt;&lt;A href="http://www.agner.org/optimize/instruction_tables.pdf"&gt;http://www.agner.org/optimize/instruction_tables.pdf&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;has some useful timing information. XOR on KNL shows a latency of 2 cycles and a reciprocal throughput of 0.5 cycles, so up to two independent XORs can be issued per cycle. If you want an XNOR, you will have to NOT the result. For performance, you would want to interleave the XOR and NOT with other instruction(s).&lt;/P&gt;
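
&lt;P&gt;As a rough illustration of the XOR-then-NOT point, here is a minimal sketch of the XNOR-and-count step an XNOR-net would use. It assumes each word packs values from {-1, +1} as bits (1 meaning +1), uses Python integers as a stand-in for the 512-bit registers, and the function names are hypothetical. It also shows that when the XNOR result is only fed into a bit count, the count can be derived from the XOR alone, so the separate NOT may not be needed. It says nothing about actual KNL timing.&lt;/P&gt;

&lt;PRE&gt;
```python
def xnor_popcount(a, b, width):
    """Count the bit positions where a and b agree, over `width` bits.

    Assumes a and b are non-negative and fit in `width` bits.
    XNOR is the complement of XOR, so the number of agreeing bits is
    `width` minus the number of differing bits; no explicit NOT is used.
    """
    differing = bin(a ^ b).count("1")
    return width - differing


def binary_dot(a, b, width):
    """Dot product of two {-1, +1} vectors packed as bits (1 means +1)."""
    agree = xnor_popcount(a, b, width)
    # Each agreeing position contributes +1, each differing one -1.
    return 2 * agree - width


a = 0b1011
b = 0b1101
print(xnor_popcount(a, b, 4))
print(binary_dot(a, b, 4))
```
&lt;/PRE&gt;

&lt;P&gt;On real hardware the same arithmetic would be applied per 512-bit vector; whether the bit count or the XOR is the bottleneck is something the timing tables would have to answer.&lt;/P&gt;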

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sun, 24 Sep 2017 13:29:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Binary-operations-per-clock-cycle-in-Intel-Xeon-Phi-processors/m-p/1129725#M6263</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-09-24T13:29:28Z</dc:date>
    </item>
    <item>
      <title>AVX-512 includes XOR</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Binary-operations-per-clock-cycle-in-Intel-Xeon-Phi-processors/m-p/1129726#M6264</link>
      <description>&lt;P&gt;AVX-512 includes XOR operations, see&amp;nbsp;&lt;A href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=xor&amp;amp;techs=AVX_512"&gt;https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=xor&amp;amp;techs=AVX_512&lt;/A&gt; for instance.&lt;/P&gt;

&lt;P&gt;You also need to consider the difference between instruction latency and throughput. The statement that "the CPU can perform one XOR per cycle" is likely a statement about throughput (when you have a lot of them, they complete at a rate of one per cycle), not about latency (the time from one specific instruction starting to it finishing).&lt;/P&gt;
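
&lt;P&gt;The latency versus throughput distinction can be sketched with a small idealized model. It assumes the figures quoted elsewhere in this thread for XOR on KNL (latency 2, reciprocal throughput 0.5) apply to 512-bit operands, and it ignores loads, stores, and every other instruction in the loop, so the result is an upper bound per core rather than a measurement. The function names are hypothetical.&lt;/P&gt;

&lt;PRE&gt;
```python
def peak_bit_ops_per_cycle(vector_bits, reciprocal_throughput):
    """Upper bound on bitwise operations per cycle for one core."""
    return vector_bits / reciprocal_throughput


def dependent_chain_cycles(n, latency):
    """Each instruction waits for the previous result, so latency dominates."""
    return n * latency


def independent_stream_cycles(n, latency, reciprocal_throughput):
    """Independent instructions overlap; only the first pays full latency."""
    return latency + (n - 1) * reciprocal_throughput


print(peak_bit_ops_per_cycle(512, 0.5))
print(dependent_chain_cycles(100, 2))
print(independent_stream_cycles(100, 2, 0.5))
```
&lt;/PRE&gt;

&lt;P&gt;Under these assumptions, 100 dependent XORs would take about four times as many cycles as 100 independent ones, which is why the dependency structure of the code matters as much as the vector width.&lt;/P&gt;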

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/articles/intel-architecture-code-analyzer"&gt;Intel Architecture Code Analyzer&lt;/A&gt; can show you throughput of small code-sequences on different Intel micro-architectures if you want to go that deep...&lt;/P&gt;</description>
      <pubDate>Mon, 25 Sep 2017 09:22:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Binary-operations-per-clock-cycle-in-Intel-Xeon-Phi-processors/m-p/1129726#M6264</guid>
      <dc:creator>James_C_Intel2</dc:creator>
      <dc:date>2017-09-25T09:22:35Z</dc:date>
    </item>
    <item>
      <title>Additionally:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Binary-operations-per-clock-cycle-in-Intel-Xeon-Phi-processors/m-p/1129727#M6265</link>
      <description>&lt;P&gt;Additionally:&lt;/P&gt;

&lt;P&gt;Timing, in addition to interleaving considerations, also depends on whether the entire net is:&lt;/P&gt;

&lt;P&gt;a) contained in registers&lt;BR /&gt;
b) contained in L1 cache&lt;BR /&gt;
c) contained in L2 cache&lt;BR /&gt;
d) contained in L3/LL cache&lt;BR /&gt;
e) permutations of the above&lt;BR /&gt;
f) and most importantly: whether all the XNORs operate on the same bit position within and across the 512-bit vectors, or on bits placed arbitrarily within and between the 512-bit vectors.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 25 Sep 2017 17:47:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Binary-operations-per-clock-cycle-in-Intel-Xeon-Phi-processors/m-p/1129727#M6265</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-09-25T17:47:49Z</dc:date>
    </item>
  </channel>
</rss>

