<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic SSE vs AVX, kinda deceived in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/SSE-vs-AVX-kinda-deceived/m-p/829792#M1092</link>
    <pubDate>Sat, 05 May 2012 16:19:27 GMT</pubDate>
    <dc:creator>gol</dc:creator>
    <dc:date>2012-05-05T16:19:27Z</dc:date>
    <item>
      <title>SSE vs AVX, kinda deceived</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/SSE-vs-AVX-kinda-deceived/m-p/829792#M1092</link>
      <description>(Not sure this is the best place to post this; please tell me if it belongs somewhere else.)&lt;BR /&gt;&lt;BR /&gt;
I have a programmer's dream in a project:&lt;BR /&gt;
-most of the CPU time is spent in one small, simple loop&lt;BR /&gt;
-that loop hardly touches memory (writing only, and adding to a buffer that's already in the cache)&lt;BR /&gt;
-all the data is in 32-byte-aligned 32-byte packs&lt;BR /&gt;&lt;BR /&gt;
So it was a perfect candidate for a move from SSE1 to AVX. I had a new i7 running Win7, and everything was perfect except for one thing: I'm writing in Delphi, which doesn't support AVX (and has no intrinsics either, sigh..)&lt;BR /&gt;
So I painfully assembled the instructions with Flat Assembler and copied the machine code back into Delphi.&lt;BR /&gt;
Sadly.. the code was hardly faster, less than 10%. On one hand, I benefit from the 3-operand instructions, and since I replaced two 16-byte SSE registers with one 32-byte AVX register, I freed up registers (it's 32-bit code btw) and thus have even fewer memory accesses. On the other hand, I probably lost some instructions per cycle, because the code is less parallel: the SSE loop has two independent dependency chains, the AVX loop only one. But I can't believe it's still less than 10% faster..&lt;BR /&gt;
Could it be the extra byte per instruction in the AVX version? (Quite ironic, since one of the key advantages listed for AVX is that it's "compact".)&lt;BR /&gt;&lt;BR /&gt;
My loop looked like this:&lt;BR /&gt;&lt;BR /&gt;
 MOVAPS xmm3,DQWORD PTR TPckTone(EAX).cb&lt;BR /&gt;
 MOVAPS xmm7,DQWORD PTR TPckTone(EAX+16).cb&lt;BR /&gt;
 MULPS xmm3,xmm1&lt;BR /&gt;
 SUBPS xmm3,xmm2 // cos:=cb*snm1-snm2&lt;BR /&gt;
 MULPS xmm7,xmm5&lt;BR /&gt;
 SUBPS xmm7,xmm6 // cos:=cb*snm1-snm2&lt;BR /&gt;
 MOVAPS xmm2,xmm1 // snm2:=snm1&lt;BR /&gt;
 MOVAPS xmm6,xmm5 // snm2:=snm1&lt;BR /&gt;
 MOVAPS xmm1,xmm3 // snm1:=cos&lt;BR /&gt;
 MULPS xmm3,xmm0 // amp&lt;BR /&gt;
 MOVAPS xmm5,xmm7 // snm1:=cos&lt;BR /&gt;
 MULPS xmm7,xmm4 // amp&lt;BR /&gt;&lt;BR /&gt;
 ADDPS xmm3,[EDX]&lt;BR /&gt;
 ADDPS xmm3,xmm7 // sum outputs&lt;BR /&gt;&lt;BR /&gt;
 SUBPS xmm0,[EBX] // DQWORD PTR TPckTone(EAX).GoalVol&lt;BR /&gt;
 MULPS xmm0,[EDI] // DQWORD PTR TPckTone(EAX).VolSpeed&lt;BR /&gt;
 ADDPS xmm0,[EBX] // DQWORD PTR TPckTone(EAX).GoalVol&lt;BR /&gt;
 SUBPS xmm4,[ESI] // DQWORD PTR TPckTone(EAX+16).GoalVol&lt;BR /&gt;
 MULPS xmm4,[EBP] // DQWORD PTR TPckTone(EAX+16).VolSpeed&lt;BR /&gt;
 MOVAPS [EDX],xmm3 // output&lt;BR /&gt;
 ADDPS xmm4,[ESI] // DQWORD PTR TPckTone(EAX+16).GoalVol&lt;BR /&gt;&lt;BR /&gt;
 ADD EDX,16&lt;BR /&gt;
 SUB ECX,1&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;
and became this (I left out the surrounding code, so it's not quite the same as above, but it does the same thing):&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;
 DB $C5,$FC,$28,$FA //VMOVAPS ymm7,ymm2&lt;BR /&gt;
 DB $C5,$DC,$59,$D9 //VMULPS ymm3,ymm4,ymm1&lt;BR /&gt;
 DB $C5,$FC,$28,$D1 //VMOVAPS ymm2,ymm1 // snm2:=snm1&lt;BR /&gt;
 DB $C5,$E4,$5C,$CF //VSUBPS ymm1,ymm3,ymm7 // cos:=cb*snm1-snm2&lt;BR /&gt;
 DB $C5,$F4,$59,$D8 //VMULPS ymm3,ymm1,ymm0 // amp&lt;BR /&gt;&lt;BR /&gt;
 // add low &amp;amp; high, then output&lt;BR /&gt;
 DB $C4,$E3,$7D,$19,$DF,$01 //VEXTRACTF128 xmm7,ymm3,1&lt;BR /&gt;
 DB $C5,$FC,$5C,$C5 //VSUBPS ymm0,ymm0,ymm5 // TPckTone(EAX).GoalVol&lt;BR /&gt;
 DB $C5,$E0,$58,$DF //VADDPS xmm3,xmm3,xmm7&lt;BR /&gt;
 DB $C5,$FC,$59,$C6 //VMULPS ymm0,ymm0,ymm6 // TPckTone(EAX).VolSpeed&lt;BR /&gt;
 DB $C5,$E0,$58,$1A //VADDPS xmm3,xmm3,[EDX]&lt;BR /&gt;
 DB $C5,$FC,$58,$C5 //VADDPS ymm0,ymm0,ymm5 // TPckTone(EAX).GoalVol&lt;BR /&gt;
 DB $C5,$F8,$29,$1A //VMOVAPS [EDX],xmm3&lt;BR /&gt;&lt;BR /&gt;
 ADD EDX,16&lt;BR /&gt;
 SUB ECX,1&lt;BR /&gt;&lt;BR /&gt;
(Attempts at unrolling and reordering lines didn't really help. The 3 lines I interleaved at the end don't even make a difference, although I was expecting them to hide the latencies.)&lt;BR /&gt;&lt;BR /&gt;
Now, instead of processing 8 32-bit floats at once, I could make bigger changes in my code and process 16 at once, so that I'd have exactly the same structure as the original code, but using ymm registers. But I wonder if it would really improve the speed.. it's quite tedious and it'd be sad if it didn't.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;
Also, the more I try to understand the rules for getting multiple instructions per cycle, the less I do. In the past it seemed pretty simple: you could pair operations that were totally independent. But I don't understand how it works these days. For example, at the top you see&lt;BR /&gt;
 MULPS xmm3,xmm1&lt;BR /&gt;
 SUBPS xmm3,xmm2 // cos:=cb*snm1-snm2&lt;BR /&gt;
 MULPS xmm7,xmm5&lt;BR /&gt;
 SUBPS xmm7,xmm6 // cos:=cb*snm1-snm2&lt;BR /&gt;
which I would normally write like this&lt;BR /&gt;
 MULPS xmm3,xmm1&lt;BR /&gt;
 MULPS xmm7,xmm5&lt;BR /&gt;
 SUBPS xmm3,xmm2 // cos:=cb*snm1-snm2&lt;BR /&gt;
 SUBPS xmm7,xmm6 // cos:=cb*snm1-snm2&lt;BR /&gt;
..but if I left it as the first version, it's because I reordered stuff at random (what I always do these days) and kept the fastest. But I don't understand it. In the same way, my first SSE1 code was totally different and should have been faster (it had no memory reads and used only registers), but no, it was slower.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;
(Edit: thanks to the 3 operands per instruction, I could simplify these 3, and it's faster, but for "other reasons")&lt;BR /&gt;
DB $C5,$FC,$5C,$C5 //VSUBPS ymm0,ymm0,ymm5 // TPckTone(EAX).GoalVol&lt;BR /&gt;
DB $C5,$FC,$59,$C6 //VMULPS ymm0,ymm0,ymm6 // TPckTone(EAX).VolSpeed&lt;BR /&gt;
DB $C5,$FC,$58,$C5 //VADDPS ymm0,ymm0,ymm5 // TPckTone(EAX).GoalVol</description>
      <pubDate>Sat, 05 May 2012 16:19:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/SSE-vs-AVX-kinda-deceived/m-p/829792#M1092</guid>
      <dc:creator>gol</dc:creator>
      <dc:date>2012-05-05T16:19:27Z</dc:date>
    </item>
    <item>
      <title>SSE vs AVX, kinda deceived</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/SSE-vs-AVX-kinda-deceived/m-p/829793#M1093</link>
      <description>I suppose you'd have to count micro-ops to compare effective code size at execution. If you got more fusion in the SSE code, it could contribute to minimizing the difference in run time; as you've already pointed out, multiple issue also contributes.&lt;BR /&gt;You don't show what you do about loop body code alignment. I've noticed that Intel compilers take more care about it when the AVX option is set. Without specified alignments, you may get full performance, but more luck is involved.&lt;BR /&gt;If your data miss L1, you get no advantage for attempting to access 256 bits on one cycle, regardless of whether you do it with 1 or 2 instructions.</description>
      <pubDate>Tue, 08 May 2012 15:44:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/SSE-vs-AVX-kinda-deceived/m-p/829793#M1093</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-05-08T15:44:55Z</dc:date>
    </item>
    <item>
      <title>SSE vs AVX, kinda deceived</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/SSE-vs-AVX-kinda-deceived/m-p/829794#M1094</link>
      <description>&amp;gt;&amp;gt;You don't show what you do about loop body code alignment&lt;BR /&gt;&lt;BR /&gt;
I align the loop to 16 bytes, but it makes no difference.&lt;BR /&gt;
(Actually I only align in Delphi XE2 (but I still use Delphi 2007); can you believe that code alignment is a "new feature" in Delphi?)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;
&amp;gt;&amp;gt;If your data miss L1, you get no advantage for attempting to access 256 bits on one cycle, regardless of whether you do it with 1 or 2 instructions.&lt;BR /&gt;&lt;BR /&gt;
Well, in this loop I only have one read/write, and weirdly it's not what costs the most (not at all). I also prefetch it, so that it's ready when I need to read it (but it's normally in the cache already).&lt;BR /&gt;
Prefetching, too, is something I understand in theory but not in practice. I have never gotten any boost from prefetching.&lt;BR /&gt;&lt;BR /&gt;
So.. most likely it really is that the SSE code allows more instructions per cycle. Sadly, in order to port that same structure to AVX, I'd have to reorganize all my data so that it works in packs of 16 instead of 8.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;
(Is it me, or does quoting not work in the toolbar?)</description>
      <pubDate>Wed, 09 May 2012 03:46:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/SSE-vs-AVX-kinda-deceived/m-p/829794#M1094</guid>
      <dc:creator>gol</dc:creator>
      <dc:date>2012-05-09T03:46:13Z</dc:date>
    </item>
  </channel>
</rss>

