<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Cheng, in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003471#M31103</link>
    <description>&lt;P&gt;Cheng,&lt;/P&gt;

&lt;P&gt;As you can see from my abstract notation, you have at least 12 * and + operations. Add to this the carry propagation and you would be up to at least 20 operations, possibly somewhat more.&lt;/P&gt;

&lt;P&gt;This would be an interesting case to study. The scalar code, though it can do the multiply in one operation, has the issue of more loads and stores (even if they come from L1).&lt;/P&gt;

&lt;P&gt;Phi: two loads, 20-30 operations, two stores (24 to 34 instructions, 4x L1 latencies) to produce 8 128-bit results.&lt;/P&gt;

&lt;P&gt;Scalar: 16 loads, 8 operations, 16 stores (40 instructions, 32x L1 latencies) to produce 8 128-bit results.&lt;/P&gt;

&lt;P&gt;As ugly as the vector form will be, it could still outperform the scalar code.&lt;/P&gt;

&lt;P&gt;This does assume you have full vectors to work with.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Fri, 18 Jul 2014 13:00:43 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2014-07-18T13:00:43Z</dc:date>
    <item>
      <title>64-bit integer multiplication</title>
      <link>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003466#M31098</link>
      <description>&lt;P&gt;Dear all,&lt;/P&gt;

&lt;P&gt;I need to conduct some 64-bit integer multiplications.&lt;/P&gt;

&lt;P&gt;On a Xeon CPU, for the scalar multiplication operation, I have the mulq assembly instruction. What it does is multiply two 64-bit integers and store the 128-bit result in two 64-bit integers, one for the low 64 bits and one for the high 64 bits of the result. I want to conduct the calculation on Phi and vectorize it.&lt;/P&gt;

&lt;P&gt;I looked through the Xeon Phi Coprocessor Instruction Set Reference Manual, and I found the vectorizable VPMULHD and VPMULLD instructions (and the equivalent intrinsics). However, they only operate on 32-bit integers, and I have to use two instructions (VPMULHD and VPMULLD) to get the low and high parts of the 64-bit result separately. Obviously, they would be less efficient for 64-bit integer multiplication than a vectorized mulq would be, if I could vectorize the mulq instruction on Phi.&lt;/P&gt;

&lt;P&gt;My questions are:&lt;BR /&gt;
	1. Is there an equivalent instruction to mulq on Xeon Phi, so that I can vectorize the operations?&lt;BR /&gt;
	2. If I have to use the VPMULHD and VPMULLD instructions for 64-bit integer multiplication, I need to multiply two 32-bit integers first. These instructions multiply two 32-bit integers but each stores only the high part or the low part of the 64-bit result, so I need two instructions per 32-bit multiplication. Is there a way to conduct the 32-bit integer multiplication in only one instruction, and still vectorize it?&lt;/P&gt;

&lt;P&gt;Thanks very much for the help.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jul 2014 15:05:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003466#M31098</guid>
      <dc:creator>Cheng_C_</dc:creator>
      <dc:date>2014-07-17T15:05:14Z</dc:date>
    </item>
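The mulq behavior the question describes can be sketched in C for reference. This is a hypothetical helper (the name mulq_like and the reliance on the GCC/Clang unsigned __int128 extension are assumptions, not anything from the thread); on x86-64 a compiler typically lowers the 128-bit product below to a single mulq:

```c
#include <stdint.h>

/* Sketch of what mulq computes: an unsigned 64x64 -> 128-bit multiply,
   returning the low and high 64-bit halves separately (mulq leaves them
   in RAX and RDX). Assumes a compiler that provides unsigned __int128. */
static void mulq_like(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;          /* low 64 bits  (RAX) */
    *hi = (uint64_t)(p >> 64);  /* high 64 bits (RDX) */
}
```

On an ISA without a 64-bit widening multiply, the same result has to be assembled from 32-bit partial products, which is exactly what the replies below walk through.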
    <item>
      <title>It is much worse than that</title>
      <link>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003467#M31099</link>
      <description>&lt;P&gt;It is much worst than that&lt;/P&gt;

&lt;P&gt;AL = low 32-bits of 64-bit A, AH = high 32-bits of 64-bit A, BL = low 32-bits of 64-bit B, BH = high 32-bits of 64-bit B.&lt;/P&gt;

&lt;P&gt;CLL = low 32-bits of low 64-bits of 128-bit result C, CLH= high 32-bits of low 64-bits of 128-bit result C&lt;/P&gt;

&lt;P&gt;CHL = low 32-bits of high 64-bits of 128-bit result C, CHH= high 32-bits of&amp;nbsp;high 64-bits of 128-bit result C&lt;/P&gt;

&lt;P&gt;I will let you sort out the unsigned and signed intrinsics, as well as the suitability of using temps versus redoing the multiply.&lt;/P&gt;

&lt;P&gt;CLL = low(AL * BL)&lt;/P&gt;

&lt;P&gt;CLH = high(AL*BL) + low(AH*BL) + low(AL*BH)&lt;/P&gt;

&lt;P&gt;CHL =&amp;nbsp;high(AH*BL) + high(AL*BH) + low(AH * BH)&lt;/P&gt;

&lt;P&gt;CHH = high(AH * BH)&lt;/P&gt;

&lt;P&gt;Added to the above is potential carry detection and propagation.&lt;/P&gt;

&lt;P&gt;I think you are much better off using scalar instructions in GP registers.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jul 2014 15:42:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003467#M31099</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-07-17T15:42:19Z</dc:date>
    </item>
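Jim's CLL/CLH/CHL/CHH decomposition above can be written out as scalar C for the unsigned case (a sketch; the helper name mul64x64_128 is invented here, and signed handling is left to the reader, as Jim notes):

```c
#include <stdint.h>

/* Unsigned 64x64 -> 128-bit multiply built only from 32x32 -> 64-bit
   products, mirroring the CLL/CLH/CHL/CHH decomposition: the result is
   hh*2^64 + (lh + hl)*2^32 + ll, with explicit carry propagation. */
static void mul64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t al = a & 0xFFFFFFFFu, ah = a >> 32;   /* AL, AH */
    uint64_t bl = b & 0xFFFFFFFFu, bh = b >> 32;   /* BL, BH */

    uint64_t ll = al * bl;   /* high(AL*BL):low(AL*BL) */
    uint64_t lh = al * bh;   /* high(AL*BH):low(AL*BH) */
    uint64_t hl = ah * bl;   /* high(AH*BL):low(AH*BL) */
    uint64_t hh = ah * bh;   /* high(AH*BH):low(AH*BH) */

    /* Middle 64 bits: the three partial products landing in bits 32..63.
       The sum fits in 64 bits (max < 3 * 2^32), and mid >> 32 is exactly
       the carry detection and propagation Jim mentions. */
    uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);

    *lo = (mid << 32) | (ll & 0xFFFFFFFFu);          /* CLH:CLL */
    *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32); /* CHH:CHL */
}
```

Each of the four 32x32 products needs both its halves, which is why the vector form costs two instructions (VPMULHD plus VPMULLD) per product on Phi.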
    <item>
      <title>Quote:jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003468#M31100</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;It is much worse than that.&lt;/P&gt;

&lt;P&gt;AL = low 32-bits of 64-bit A, AH = high 32-bits of 64-bit A, BL = low 32-bits of 64-bit B, BH = high 32-bits of 64-bit B.&lt;/P&gt;

&lt;P&gt;CLL = low 32-bits of low 64-bits of 128-bit result C, CLH= high 32-bits of low 64-bits of 128-bit result C&lt;/P&gt;

&lt;P&gt;CHL = low 32-bits of high 64-bits of 128-bit result C, CHH= high 32-bits of&amp;nbsp;high 64-bits of 128-bit result C&lt;/P&gt;

&lt;P&gt;I will let you sort out the unsigned and signed intrinsics, as well as the suitability of using temps versus redoing the multiply.&lt;/P&gt;

&lt;P&gt;CLL = low(AL * BL)&lt;/P&gt;

&lt;P&gt;CLH = high(AL*BL) + low(AH*BL) + low(AL*BH)&lt;/P&gt;

&lt;P&gt;CHL =&amp;nbsp;high(AH*BL) + high(AL*BH) + low(AH * BH)&lt;/P&gt;

&lt;P&gt;CHH = high(AH * BH)&lt;/P&gt;

&lt;P&gt;Added to the above is potential carry detection and propagation.&lt;/P&gt;

&lt;P&gt;I think you are much better off using scalar instructions in GP registers.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Dear Jim,&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Thanks very much for the detailed explanation and your suggestion. So you mean there is no vectorizable instruction on Xeon Phi with which I could directly multiply two 64-bit integers?&lt;/P&gt;

&lt;P&gt;Regards,&lt;/P&gt;

&lt;P&gt;Cheng&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jul 2014 19:19:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003468#M31100</guid>
      <dc:creator>Cheng_C_</dc:creator>
      <dc:date>2014-07-17T19:19:15Z</dc:date>
    </item>
    <item>
      <title>The Xeon Phi Instruction Set</title>
      <link>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003469#M31101</link>
      <description>&lt;P&gt;The Xeon Phi Instruction Set Architecture manual (document 327364-001, September 2012) does not list any arithmetic instructions for vectors of 64-bit integers.&amp;nbsp; There appears to be a complete set of load/store and Boolean operations for this data type, but no add/subtract/multiply/etc.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jul 2014 22:07:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003469#M31101</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-07-17T22:07:09Z</dc:date>
    </item>
    <item>
      <title>Quote:John D. McCalpin wrote:</title>
      <link>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003470#M31102</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;John D. McCalpin wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;The Xeon Phi Instruction Set Architecture manual (document 327364-001, September 2012) does not list any arithmetic instructions for vectors of 64-bit integers.&amp;nbsp; There appears to be a complete set of load/store and Boolean operations for this data type, but no add/subtract/multiply/etc.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Dear John,&lt;/P&gt;

&lt;P&gt;Got it. Thanks very much for the reply.&lt;/P&gt;

&lt;P&gt;Regards,&lt;/P&gt;

&lt;P&gt;Cheng&lt;/P&gt;</description>
      <pubDate>Thu, 17 Jul 2014 22:10:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003470#M31102</guid>
      <dc:creator>Cheng_C_</dc:creator>
      <dc:date>2014-07-17T22:10:59Z</dc:date>
    </item>
    <item>
      <title>Cheng,</title>
      <link>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003471#M31103</link>
      <description>&lt;P&gt;Cheng,&lt;/P&gt;

&lt;P&gt;As you can see from my abstract notation, you have at least 12 * and + operations. Add to this the carry propagation and you would be up to at least 20 operations, possibly somewhat more.&lt;/P&gt;

&lt;P&gt;This would be an interesting case to study. The scalar code, though it can do the multiply in one operation, has the issue of more loads and stores (even if they come from L1).&lt;/P&gt;

&lt;P&gt;Phi: two loads, 20-30 operations, two stores (24 to 34 instructions, 4x L1 latencies) to produce 8 128-bit results.&lt;/P&gt;

&lt;P&gt;Scalar: 16 loads, 8 operations, 16 stores (40 instructions, 32x L1 latencies) to produce 8 128-bit results.&lt;/P&gt;

&lt;P&gt;As ugly as the vector form will be, it could still outperform the scalar code.&lt;/P&gt;

&lt;P&gt;This does assume you have full vectors to work with.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 18 Jul 2014 13:00:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003471#M31103</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-07-18T13:00:43Z</dc:date>
    </item>
    <item>
      <title>Quote:jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003472#M31104</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Cheng,&lt;/P&gt;

&lt;P&gt;As you can see from my abstract notation, you have at least 12 * and + operations. Add to this the carry propagation and you would be up to at least 20 operations, possibly somewhat more.&lt;/P&gt;

&lt;P&gt;This would be an interesting case to study. The scalar code, though it can do the multiply in one operation, has the issue of more loads and stores (even if they come from L1).&lt;/P&gt;

&lt;P&gt;Phi: two loads, 20-30 operations, two stores (24 to 34 instructions, 4x L1 latencies) to produce 8 128-bit results.&lt;/P&gt;

&lt;P&gt;Scalar: 16 loads, 8 operations, 16 stores (40 instructions, 32x L1 latencies) to produce 8 128-bit results.&lt;/P&gt;

&lt;P&gt;As ugly as the vector form will be, it could still outperform the scalar code.&lt;/P&gt;

&lt;P&gt;This does assume you have full vectors to work with.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Jim,&lt;/P&gt;

&lt;P&gt;Thanks very much. This is really helpful. By the way, do all these instructions take the same number of CPU cycles? Intuitively, I would think mulq needs more CPU cycles than VPMULHD and VPMULLD, as it apparently does more work.&lt;/P&gt;

&lt;P&gt;Regards,&lt;/P&gt;

&lt;P&gt;Cheng&lt;/P&gt;</description>
      <pubDate>Fri, 18 Jul 2014 13:47:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003472#M31104</guid>
      <dc:creator>Cheng_C_</dc:creator>
      <dc:date>2014-07-18T13:47:16Z</dc:date>
    </item>
    <item>
      <title>MUL of 64 bit (GP registers),</title>
      <link>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003473#M31105</link>
      <description>&lt;P&gt;MUL of 64 bit (GP registers), depending on architecture, has latency of 10, throughput of 1 (this is on host processor, I do not have the table handy for Xeon Phi, I imagine it is&amp;nbsp;about the&amp;nbsp;same).&lt;/P&gt;

&lt;P&gt;PMULxx on the host has a latency of 3 and a throughput of 1 (again, this is on the host processor; I do not have the table handy for Xeon Phi, but I imagine it is about the same).&lt;/P&gt;

&lt;P&gt;Assuming you have large arrays of Int64's to multiply, both have a throughput of 1 cycle. So with loop unrolling and sufficient registers, you could conceivably hit one cycle per multiply and one cycle per non-dependent add. The determining factors would then be the number of instructions .AND. the number of L1 (and L2, and RAM) latencies. The scalar method incurs significantly more L1 latencies; the number of L2 and RAM latencies would likely be the same.&lt;/P&gt;

&lt;P&gt;I think rather than philosophizing on the matter, it would be more beneficial to write the code and perform the acid test.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 18 Jul 2014 14:44:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003473#M31105</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-07-18T14:44:00Z</dc:date>
    </item>
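The "acid test" Jim suggests might start from a scalar baseline like this sketch (the names, array size, and clock()-based timer are invented stand-ins; a real measurement would use a cycle counter and pit this against the 32-bit-decomposition vector version):

```c
#include <stdint.h>
#include <stddef.h>
#include <time.h>

/* Scalar baseline: n independent 64x64 -> 128-bit multiplies, as in the
   thread's example of producing 128-bit results from arrays of Int64's.
   Returns an XOR checksum so the compiler cannot discard the work. */
static uint64_t scalar_mul_pass(const uint64_t *a, const uint64_t *b,
                                uint64_t *hi, uint64_t *lo, size_t n)
{
    uint64_t check = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)a[i] * b[i]; /* one mulq */
        lo[i] = (uint64_t)p;
        hi[i] = (uint64_t)(p >> 64);
        check ^= lo[i] ^ hi[i];
    }
    return check;
}

/* Coarse wall-clock timing of one pass, in milliseconds. clock() is a
   placeholder with poor granularity; repeat the pass many times or use
   hardware counters for a serious comparison. */
static double time_pass(const uint64_t *a, const uint64_t *b,
                        uint64_t *hi, uint64_t *lo, size_t n)
{
    clock_t t0 = clock();
    (void)scalar_mul_pass(a, b, hi, lo, n);
    clock_t t1 = clock();
    return 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Timing this loop and then swapping in a vectorized partial-product version gives exactly the instructions-versus-latencies comparison estimated earlier in the thread.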
    <item>
      <title>Quote:jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003474#M31106</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;MUL of 64-bit (GP registers), depending on the architecture, has a latency of 10 and a throughput of 1 (this is on the host processor; I do not have the table handy for Xeon Phi, but I imagine it is about the same).&lt;/P&gt;

&lt;P&gt;PMULxx on the host has a latency of 3 and a throughput of 1 (again, this is on the host processor; I do not have the table handy for Xeon Phi, but I imagine it is about the same).&lt;/P&gt;

&lt;P&gt;Assuming you have large arrays of Int64's to multiply, both have a throughput of 1 cycle. So with loop unrolling and sufficient registers, you could conceivably hit one cycle per multiply and one cycle per non-dependent add. The determining factors would then be the number of instructions .AND. the number of L1 (and L2, and RAM) latencies. The scalar method incurs significantly more L1 latencies; the number of L2 and RAM latencies would likely be the same.&lt;/P&gt;

&lt;P&gt;I think rather than philosophizing on the matter, it would be more beneficial to write the code and perform the acid test.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Jim,&lt;/P&gt;

&lt;P&gt;OK. Thanks for all the help and have a nice day.&lt;/P&gt;

&lt;P&gt;Regards,&lt;/P&gt;

&lt;P&gt;Cheng&lt;/P&gt;</description>
      <pubDate>Fri, 18 Jul 2014 15:02:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/64-bit-integer-multiplication/m-p/1003474#M31106</guid>
      <dc:creator>Cheng_C_</dc:creator>
      <dc:date>2014-07-18T15:02:08Z</dc:date>
    </item>
  </channel>
</rss>

