<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic SSE runtime comparison (gcc 4.6.1) in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775332#M205</link>
    <description>Discussion thread: comparing dot-product runtimes of hand-written SSE intrinsics vs. scalar C code under gcc 4.6.1, and the effect of -ffast-math on auto-vectorization of the scalar reduction.</description>
    <pubDate>Mon, 08 Aug 2011 15:57:31 GMT</pubDate>
    <dc:creator>debasish83</dc:creator>
    <dc:date>2011-08-08T15:57:31Z</dc:date>
    <item>
      <title>SSE runtime comparison (gcc 4.6.1)</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775328#M201</link>
      <description>I was trying these two code snippets. My arrays are all 16-byte aligned:&lt;BR /&gt;&lt;BR /&gt;inline void vecDotSSE(double * s, double * x, double * y, int n)&lt;BR /&gt;{&lt;BR /&gt; int ii;&lt;BR /&gt; __m128d XMM0 = _mm_setzero_pd();&lt;BR /&gt; __m128d XMM1 = _mm_setzero_pd();&lt;BR /&gt; __m128d XMM2, XMM3, XMM4, XMM5;&lt;BR /&gt; for (ii = 0; ii &amp;lt; n; ii += 4)&lt;BR /&gt; {&lt;BR /&gt; XMM2 = _mm_load_pd(x+ii);&lt;BR /&gt; XMM3 = _mm_load_pd(x+ii+2);&lt;BR /&gt; XMM4 = _mm_load_pd(y+ii);&lt;BR /&gt; XMM5 = _mm_load_pd(y+ii+2);&lt;BR /&gt; XMM2 = _mm_mul_pd(XMM2, XMM4);&lt;BR /&gt; XMM3 = _mm_mul_pd(XMM3, XMM5);&lt;BR /&gt; XMM0 = _mm_add_pd(XMM0, XMM2);&lt;BR /&gt; XMM1 = _mm_add_pd(XMM1, XMM3);&lt;BR /&gt; }&lt;BR /&gt; XMM0 = _mm_add_pd(XMM0, XMM1);&lt;BR /&gt; XMM1 = _mm_shuffle_pd(XMM0, XMM0, _MM_SHUFFLE2(1, 1));&lt;BR /&gt; XMM0 = _mm_add_pd(XMM0, XMM1);&lt;BR /&gt; _mm_store_sd(s, XMM0);&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;inline void vecDot(double * s, double * x, double * y, int n)&lt;BR /&gt;{&lt;BR /&gt; int i;&lt;BR /&gt; *s = 0.;&lt;BR /&gt; for (i = 0; i &amp;lt; n; ++i)&lt;BR /&gt; {&lt;BR /&gt; *s += x[i] * y[i];&lt;BR /&gt; }&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;My compile flags:&lt;BR /&gt;&lt;BR /&gt;g++ -Wall -O3 -msse3&lt;BR /&gt;&lt;BR /&gt;These are my runtimes on vectors of size 1M:&lt;BR /&gt;&lt;BR /&gt;SSE : 0.0263s&lt;BR /&gt;Non-SSE : 1.87996e-07s&lt;BR /&gt;&lt;BR /&gt;Does that even make sense?&lt;BR /&gt;&lt;BR /&gt;I have seen a lot of people on the web complaining about the same problem. I will also try the BLAS from ATLAS and Intel MKL to get SSE BLAS runtimes.&lt;BR /&gt;&lt;BR /&gt;Did you change something about FPU performance on the new processors? It seems the FPU is much faster than the SSE arithmetic units.&lt;BR /&gt;&lt;BR /&gt;Thanks.&lt;BR /&gt;Deb</description>
      <pubDate>Mon, 08 Aug 2011 01:28:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775328#M201</guid>
      <dc:creator>debasish83</dc:creator>
      <dc:date>2011-08-08T01:28:45Z</dc:date>
    </item>
    <item>
      <title>SSE runtime comparison (gcc 4.6.1)</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775329#M202</link>
      <description>I think I understood what's going on: the SSE version is not inlined by GCC, while the non-SSE version is inlined!&lt;BR /&gt;&lt;BR /&gt;When I prevented inlining of both (took the inline keyword off both functions), I got comparable runtimes. Unfortunately I am still not seeing a good speedup from the SSE code.&lt;BR /&gt;&lt;BR /&gt;I am trying the embree-1.0beta code from Intel. I will update with the results from that experiment.&lt;BR /&gt;&lt;BR /&gt;Thanks.&lt;BR /&gt;Deb&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Mon, 08 Aug 2011 08:13:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775329#M202</guid>
      <dc:creator>debasish83</dc:creator>
      <dc:date>2011-08-08T08:13:37Z</dc:date>
    </item>
    <item>
      <title>SSE runtime comparison (gcc 4.6.1)</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775330#M203</link>
      <description>Respective assembly:&lt;BR /&gt;&lt;BR /&gt;vecDotSSE ASM:&lt;BR /&gt;&lt;BR /&gt; .cfi_startproc&lt;BR /&gt; xorpd %xmm2, %xmm2&lt;BR /&gt; testl %ecx, %ecx&lt;BR /&gt; movapd %xmm2, %xmm3&lt;BR /&gt; jle .L11&lt;BR /&gt; subl $1, %ecx&lt;BR /&gt; xorl %eax, %eax&lt;BR /&gt; shrl $2, %ecx&lt;BR /&gt; mov %ecx, %ecx&lt;BR /&gt; addq $1, %rcx&lt;BR /&gt; salq $5, %rcx&lt;BR /&gt; .p2align 4,,10&lt;BR /&gt; .p2align 3&lt;BR /&gt;.L12:&lt;BR /&gt; movapd (%rsi,%rax), %xmm1&lt;BR /&gt; movapd 16(%rsi,%rax), %xmm0&lt;BR /&gt; mulpd (%rdx,%rax), %xmm1&lt;BR /&gt; mulpd 16(%rdx,%rax), %xmm0&lt;BR /&gt; addq $32, %rax&lt;BR /&gt; cmpq %rcx, %rax&lt;BR /&gt; addpd %xmm1, %xmm3&lt;BR /&gt; addpd %xmm0, %xmm2&lt;BR /&gt; jne .L12&lt;BR /&gt;.L11:&lt;BR /&gt; addpd %xmm3, %xmm2&lt;BR /&gt; movapd %xmm2, %xmm0&lt;BR /&gt; unpckhpd %xmm2, %xmm0&lt;BR /&gt; addpd %xmm2, %xmm0&lt;BR /&gt; movlpd %xmm0, (%rdi)&lt;BR /&gt; ret&lt;BR /&gt; .cfi_endproc&lt;BR /&gt;&lt;BR /&gt;vecDot asm:&lt;BR /&gt;&lt;BR /&gt;.cfi_startproc&lt;BR /&gt; xorl %r8d, %r8d&lt;BR /&gt; testl %ecx, %ecx&lt;BR /&gt; movq %r8, (%rdi)&lt;BR /&gt; jle .L15&lt;BR /&gt; subl $1, %ecx&lt;BR /&gt; movq %r8, -8(%rsp)&lt;BR /&gt; xorl %eax, %eax&lt;BR /&gt; leaq 8(,%rcx,8), %rcx&lt;BR /&gt; movsd -8(%rsp), %xmm1&lt;BR /&gt; .p2align 4,,10&lt;BR /&gt; .p2align 3&lt;BR /&gt;.L17:&lt;BR /&gt; movsd (%rsi,%rax), %xmm0&lt;BR /&gt; mulsd (%rdx,%rax), %xmm0&lt;BR /&gt; addq $8, %rax&lt;BR /&gt; cmpq %rcx, %rax&lt;BR /&gt; addsd %xmm0, %xmm1&lt;BR /&gt; movsd %xmm1, (%rdi)&lt;BR /&gt; jne .L17&lt;BR /&gt;.L15:&lt;BR /&gt; rep&lt;BR /&gt; ret&lt;BR /&gt; .cfi_endproc&lt;BR /&gt;&lt;BR /&gt;It seems to me that gcc -O3 is using SSE for the scalar version as well: the vecDot body also works in XMM registers, just with the scalar SSE2 instructions (movsd/mulsd/addsd) instead of the packed ones.&lt;BR /&gt;&lt;BR /&gt;I will study the assembly and figure out why the SSE code is behaving badly.</description>
      <pubDate>Mon, 08 Aug 2011 09:50:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775330#M203</guid>
      <dc:creator>debasish83</dc:creator>
      <dc:date>2011-08-08T09:50:09Z</dc:date>
    </item>
    <item>
      <title>SSE runtime comparison (gcc 4.6.1)</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775331#M204</link>
      <description>You would need -ffast-math to enable vectorization of sum reductions. gcc ought to handle this quite well; gcc 4.6 cleaned up the list of aggressive optimizations enabled under -ffast-math, so it is safer than the similar optimizations in icc or in older gcc. If the compiler doesn't automatically perform scalar replacement on *s (it should, if you use __restrict pointers), it's simple enough to write that into your source code yourself.&lt;BR /&gt;As for the ridiculously short time: if the compiler can see that you never use the result of a loop, it may optimize the loop away entirely. That kind of benchmark-defeating optimization has been in high demand for decades.</description>
      <pubDate>Mon, 08 Aug 2011 12:15:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775331#M204</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-08-08T12:15:36Z</dc:date>
    </item>
    <item>
      <title>SSE runtime comparison (gcc 4.6.1)</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775332#M205</link>
      <description>Thanks Tim for the quick response. I am now printing the results as well, so that the compiler can't cheat.&lt;BR /&gt;&lt;BR /&gt;With the compile flags g++ -Wall -O3 -msse3:&lt;BR /&gt;&lt;BR /&gt;SSE version : 2.38&lt;BR /&gt;Non-SSE version : 3.99&lt;BR /&gt;&lt;BR /&gt;So there is clearly a 30-40% gain.&lt;BR /&gt;&lt;BR /&gt;Then I added the -ffast-math option (g++ -Wall -O3 -msse3 -ffast-math):&lt;BR /&gt;&lt;BR /&gt;SSE version : 2.41&lt;BR /&gt;Non-SSE version : 2.49&lt;BR /&gt;&lt;BR /&gt;Is gcc now also unrolling the non-SSE loop with this option? Oddly, the assembly of the non-SSE function does not look very different from the version without -ffast-math:&lt;BR /&gt;&lt;BR /&gt;.cfi_startproc&lt;BR /&gt; xorl %r8d, %r8d&lt;BR /&gt; testl %ecx, %ecx&lt;BR /&gt; movq %r8, (%rdi)&lt;BR /&gt; jle .L15&lt;BR /&gt; subl $1, %ecx&lt;BR /&gt; movq %r8, -8(%rsp)&lt;BR /&gt; xorl %eax, %eax&lt;BR /&gt; leaq 8(,%rcx,8), %rcx&lt;BR /&gt; movsd -8(%rsp), %xmm1&lt;BR /&gt; .p2align 4,,10&lt;BR /&gt; .p2align 3&lt;BR /&gt;.L17:&lt;BR /&gt; movsd (%rsi,%rax), %xmm0&lt;BR /&gt; mulsd (%rdx,%rax), %xmm0&lt;BR /&gt; addq $8, %rax&lt;BR /&gt; cmpq %rcx, %rax&lt;BR /&gt; addsd %xmm0, %xmm1&lt;BR /&gt; movsd %xmm1, (%rdi)&lt;BR /&gt; jne .L17&lt;BR /&gt;.L15:&lt;BR /&gt; rep&lt;BR /&gt; ret&lt;BR /&gt; .cfi_endproc&lt;BR /&gt;</description>
      <pubDate>Mon, 08 Aug 2011 15:57:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775332#M205</guid>
      <dc:creator>debasish83</dc:creator>
      <dc:date>2011-08-08T15:57:31Z</dc:date>
    </item>
    <item>
      <title>SSE runtime comparison (gcc 4.6.1)</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775333#M206</link>
      <description>Hi Tim,&lt;BR /&gt;&lt;BR /&gt;-ffast-math definitely makes the non-SSE code comparable to the SSE version. Can you let me know what exactly in -ffast-math is helping the scalar code?&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;Deb&lt;BR /&gt;</description>
      <pubDate>Mon, 08 Aug 2011 19:19:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775333#M206</guid>
      <dc:creator>debasish83</dc:creator>
      <dc:date>2011-08-08T19:19:17Z</dc:date>
    </item>
    <item>
      <title>SSE runtime comparison (gcc 4.6.1)</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775334#M207</link>
      <description>I was assuming you were compiling with an SSE option and asking gcc to vectorize (-O3). With -O3 -ffast-math, gcc enables auto-vectorization of reductions such as the one you posted, so you get to within about 60% of the best possible performance for that loop without changing your C source. The gcc option -ftree-vectorizer-verbose=n (n &amp;gt;= 1) will give you vectorization diagnostics.&lt;BR /&gt;For the source you posted, this is equivalent to icc's -fast or #pragma simd reduction() auto-vectorization, except that icc unrolls more aggressively to get more performance in the middle range (loop lengths of roughly 100-2000).</description>
      <pubDate>Mon, 08 Aug 2011 20:54:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/SSE-runtime-comparison-gcc-4-6-1/m-p/775334#M207</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-08-08T20:54:59Z</dc:date>
    </item>
  </channel>
</rss>

