<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Optimizing SSE2 code and beyond... in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimizing-SSE2-code-and-beyond/m-p/877812#M2483</link>
    <description>Given a flow of SSE2 instructions on a Linux x86_64 Intel 5345 processor as below -&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;---------------(a)--------------&lt;/STRONG&gt;&lt;BR /&gt;"movaps %xmm5, %xmm12 \n\t"&lt;BR /&gt;"mulsd %xmm15, %xmm12 \n\t"&lt;BR /&gt;"addsd %xmm2, %xmm12 \n\t"&lt;BR /&gt;&lt;BR /&gt;"movaps %xmm9, %xmm0 \n\t"&lt;BR /&gt;"mulsd %xmm14, %xmm0 \n\t"&lt;BR /&gt;"addsd %xmm0, %xmm12 \n\t"&lt;BR /&gt;&lt;BR /&gt;"movaps %xmm11, %xmm0 \n\t"&lt;BR /&gt;"mulsd %xmm13, %xmm0 \n\t"&lt;BR /&gt;"addsd %xmm0, %xmm12 \n\t"&lt;BR /&gt;&lt;BR /&gt;"cvtsd2ss %xmm12, %xmm12 \n\t"&lt;BR /&gt;"movss %xmm12, (%r10,%rdi) \n\t"&lt;BR /&gt;----------------------------------&lt;BR /&gt;&lt;BR /&gt;for the section of code -&lt;BR /&gt;-------------&lt;BR /&gt;&lt;STRONG&gt;crd[apple][X] = (double)crdhello[X] + d[X] * k[X][X] + d[Y] * k[X][Y] + d[Z] * k[X][Z];&lt;/STRONG&gt;&lt;BR /&gt;-------------&lt;BR /&gt;&lt;BR /&gt;The above pattern is for "d[X] * k[X][X]" followed by "d[Y] * k[X][Y]" and finally by "d[Z] * k[X][Z]" respectively.&lt;BR /&gt;&lt;BR /&gt;Similarly for -&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;------------------------(b)-------------------&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;crd[apple][Y] = (double)crdhello[Y] + d[X] * k[Y][X] + d[Y] * k[Y][Y] + d[Z] * k[Y][Z];&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;whose corresponding inline asm is -&lt;BR /&gt;&lt;BR /&gt;--------------&lt;BR /&gt;"movsd 40(%rsp), %xmm0 \n\t"&lt;BR /&gt;"mulsd %xmm15, %xmm0 \n\t"&lt;BR /&gt;"addsd %xmm4, %xmm0 \n\t"&lt;BR /&gt;&lt;BR /&gt;"movaps %xmm6, %xmm12 \n\t"&lt;BR /&gt;"mulsd %xmm14, %xmm12 \n\t"&lt;BR /&gt;"addsd %xmm12, %xmm0 \n\t"&lt;BR /&gt;&lt;BR /&gt;"movaps %xmm7, %xmm12 \n\t"&lt;BR /&gt;"mulsd %xmm13, %xmm12 \n\t"&lt;BR /&gt;"addsd %xmm12, %xmm0 \n\t"&lt;BR /&gt;&lt;BR /&gt;"cvtsd2ss %xmm0, %xmm0 \n\t"&lt;BR /&gt;"movss %xmm0, 4(%r10,%rdi) \n\t"&lt;BR /&gt;&lt;STRONG&gt;---------------------------&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;and for the last pattern, which is -&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;-------------(c)------------&lt;BR /&gt;crd[apple][Z] = (double)crdhello[Z] + d[X] * k[Z][X] + d[Y] * k[Z][Y] + d[Z] * k[Z][Z];&lt;/STRONG&gt;&lt;BR /&gt;------&lt;BR /&gt;&lt;BR /&gt;the inline asm is -&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;"mulsd %xmm8, %xmm15 \n\t"&lt;BR /&gt;"addsd %xmm3, %xmm15 \n\t"&lt;BR /&gt;&lt;BR /&gt;"mulsd %xmm10, %xmm14 \n\t"&lt;BR /&gt;"addsd %xmm14, %xmm15 \n\t"&lt;BR /&gt;&lt;BR /&gt;"mulsd %xmm1, %xmm13 \n\t"&lt;BR /&gt;"addsd %xmm13, %xmm15 \n\t"&lt;BR /&gt;&lt;BR /&gt;"cvtsd2ss %xmm15, %xmm13 \n\t"&lt;BR /&gt;"movss %xmm13, 8(%r10,%rdi) \n\t"&lt;BR /&gt;---&lt;BR /&gt;&lt;BR /&gt;I see that in &lt;STRONG&gt;(b)&lt;/STRONG&gt; alignment hasn't been done, as &lt;STRONG&gt;"movsd 40(%rsp), %xmm0"&lt;/STRONG&gt; has been called. Moreover, in &lt;STRONG&gt;(c)&lt;/STRONG&gt; none of the SSE2 alignment instructions like &lt;STRONG&gt;movaps/movapd/movdqa or movups/movupd/movdqu&lt;/STRONG&gt; are being called. Probably the reason is that only three parameters (X, Y, Z) exist here.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Suggestions needed:&lt;BR /&gt;(i)&lt;/STRONG&gt; Is the call of "&lt;STRONG&gt;movsd 40(%rsp), %xmm0&lt;/STRONG&gt;" correct from an optimization point of view, or should it be replaced with an alignment SSE instruction?&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;(ii)&lt;/STRONG&gt; Could the above patterns for &lt;STRONG&gt;(a), (b) &amp;amp; (c)&lt;/STRONG&gt; be optimized further (sped up) with some other SSE instructions, or replaced by SSE3 or SSSE3 instructions? If yes, which SSE3/SSSE3 instructions could be used to replace the above SSE2 instructions?&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;(iii)&lt;/STRONG&gt; The algorithm here has 3 parameters, and the asm is written only for these 3 parameters. Do I need to generate dummy asm instructions for a 4th parameter (say W) with void contents, to maintain the DP FP alignment and effective vectorization?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;~BR</description>
    <pubDate>Fri, 04 Sep 2009 11:58:34 GMT</pubDate>
    <dc:creator>srimks</dc:creator>
    <dc:date>2009-09-04T11:58:34Z</dc:date>
    <item>
      <title>Optimizing SSE2 code and beyond...</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimizing-SSE2-code-and-beyond/m-p/877812#M2483</link>
      <description>Given a flow of SSE2 instructions on a Linux x86_64 Intel 5345 processor as below -&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;---------------(a)--------------&lt;/STRONG&gt;&lt;BR /&gt;"movaps %xmm5, %xmm12 \n\t"&lt;BR /&gt;"mulsd %xmm15, %xmm12 \n\t"&lt;BR /&gt;"addsd %xmm2, %xmm12 \n\t"&lt;BR /&gt;&lt;BR /&gt;"movaps %xmm9, %xmm0 \n\t"&lt;BR /&gt;"mulsd %xmm14, %xmm0 \n\t"&lt;BR /&gt;"addsd %xmm0, %xmm12 \n\t"&lt;BR /&gt;&lt;BR /&gt;"movaps %xmm11, %xmm0 \n\t"&lt;BR /&gt;"mulsd %xmm13, %xmm0 \n\t"&lt;BR /&gt;"addsd %xmm0, %xmm12 \n\t"&lt;BR /&gt;&lt;BR /&gt;"cvtsd2ss %xmm12, %xmm12 \n\t"&lt;BR /&gt;"movss %xmm12, (%r10,%rdi) \n\t"&lt;BR /&gt;----------------------------------&lt;BR /&gt;&lt;BR /&gt;for the section of code -&lt;BR /&gt;-------------&lt;BR /&gt;&lt;STRONG&gt;crd[apple][X] = (double)crdhello[X] + d[X] * k[X][X] + d[Y] * k[X][Y] + d[Z] * k[X][Z];&lt;/STRONG&gt;&lt;BR /&gt;-------------&lt;BR /&gt;&lt;BR /&gt;The above pattern is for "d[X] * k[X][X]" followed by "d[Y] * k[X][Y]" and finally by "d[Z] * k[X][Z]" respectively.&lt;BR /&gt;&lt;BR /&gt;Similarly for -&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;------------------------(b)-------------------&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;crd[apple][Y] = (double)crdhello[Y] + d[X] * k[Y][X] + d[Y] * k[Y][Y] + d[Z] * k[Y][Z];&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;whose corresponding inline asm is -&lt;BR /&gt;&lt;BR /&gt;--------------&lt;BR /&gt;"movsd 40(%rsp), %xmm0 \n\t"&lt;BR /&gt;"mulsd %xmm15, %xmm0 \n\t"&lt;BR /&gt;"addsd %xmm4, %xmm0 \n\t"&lt;BR /&gt;&lt;BR /&gt;"movaps %xmm6, %xmm12 \n\t"&lt;BR /&gt;"mulsd %xmm14, %xmm12 \n\t"&lt;BR /&gt;"addsd %xmm12, %xmm0 \n\t"&lt;BR /&gt;&lt;BR /&gt;"movaps %xmm7, %xmm12 \n\t"&lt;BR /&gt;"mulsd %xmm13, %xmm12 \n\t"&lt;BR /&gt;"addsd %xmm12, %xmm0 \n\t"&lt;BR /&gt;&lt;BR /&gt;"cvtsd2ss %xmm0, %xmm0 \n\t"&lt;BR /&gt;"movss %xmm0, 4(%r10,%rdi) \n\t"&lt;BR /&gt;&lt;STRONG&gt;---------------------------&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;and for the last pattern, which is -&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;-------------(c)------------&lt;BR /&gt;crd[apple][Z] = (double)crdhello[Z] + d[X] * k[Z][X] + d[Y] * k[Z][Y] + d[Z] * k[Z][Z];&lt;/STRONG&gt;&lt;BR /&gt;------&lt;BR /&gt;&lt;BR /&gt;the inline asm is -&lt;BR /&gt;&lt;BR /&gt;------------------------------&lt;BR /&gt;"mulsd %xmm8, %xmm15 \n\t"&lt;BR /&gt;"addsd %xmm3, %xmm15 \n\t"&lt;BR /&gt;&lt;BR /&gt;"mulsd %xmm10, %xmm14 \n\t"&lt;BR /&gt;"addsd %xmm14, %xmm15 \n\t"&lt;BR /&gt;&lt;BR /&gt;"mulsd %xmm1, %xmm13 \n\t"&lt;BR /&gt;"addsd %xmm13, %xmm15 \n\t"&lt;BR /&gt;&lt;BR /&gt;"cvtsd2ss %xmm15, %xmm13 \n\t"&lt;BR /&gt;"movss %xmm13, 8(%r10,%rdi) \n\t"&lt;BR /&gt;---&lt;BR /&gt;&lt;BR /&gt;I see that in &lt;STRONG&gt;(b)&lt;/STRONG&gt; alignment hasn't been done, as &lt;STRONG&gt;"movsd 40(%rsp), %xmm0"&lt;/STRONG&gt; has been called. Moreover, in &lt;STRONG&gt;(c)&lt;/STRONG&gt; none of the SSE2 alignment instructions like &lt;STRONG&gt;movaps/movapd/movdqa or movups/movupd/movdqu&lt;/STRONG&gt; are being called. Probably the reason is that only three parameters (X, Y, Z) exist here.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Suggestions needed:&lt;BR /&gt;(i)&lt;/STRONG&gt; Is the call of "&lt;STRONG&gt;movsd 40(%rsp), %xmm0&lt;/STRONG&gt;" correct from an optimization point of view, or should it be replaced with an alignment SSE instruction?&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;(ii)&lt;/STRONG&gt; Could the above patterns for &lt;STRONG&gt;(a), (b) &amp;amp; (c)&lt;/STRONG&gt; be optimized further (sped up) with some other SSE instructions, or replaced by SSE3 or SSSE3 instructions? If yes, which SSE3/SSSE3 instructions could be used to replace the above SSE2 instructions?&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;(iii)&lt;/STRONG&gt; The algorithm here has 3 parameters, and the asm is written only for these 3 parameters. Do I need to generate dummy asm instructions for a 4th parameter (say W) with void contents, to maintain the DP FP alignment and effective vectorization?&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;~BR</description>
      <pubDate>Fri, 04 Sep 2009 11:58:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimizing-SSE2-code-and-beyond/m-p/877812#M2483</guid>
      <dc:creator>srimks</dc:creator>
      <dc:date>2009-09-04T11:58:34Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing SSE2 code and beyond...</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Optimizing-SSE2-code-and-beyond/m-p/877813#M2484</link>
      <description>In continuation, I had to generate asm for the algorithm of X, Y, Z parameters, since the original C/C++ code has been written in such a way that it fails to address MCA (multi-core architecture) design needs. This means that if I have a 4th parameter with local scope within the file, then optimization can be done by taking care of alignment and DP FP 2 or 4 vectorization.&lt;BR /&gt;&lt;BR /&gt;So I am looking for suggestions for the above (i), (ii) and (iii) queries.&lt;BR /&gt;</description>
      <pubDate>Sat, 05 Sep 2009 03:02:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Optimizing-SSE2-code-and-beyond/m-p/877813#M2484</guid>
      <dc:creator>srimks</dc:creator>
      <dc:date>2009-09-05T03:02:34Z</dc:date>
    </item>
  </channel>
</rss>

