<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: slow AVX2 instruction when ymm register is used in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/slow-AVX2-instruction-when-ymm-register-is-used/m-p/1533188#M8227</link>
    <description>&lt;P&gt;My findings so far:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;the slowness occurs only on a&amp;nbsp;&lt;EM&gt;performance core&lt;/EM&gt; of the CPU; &lt;EM&gt;efficient cores&lt;/EM&gt;&amp;nbsp;are not affected&lt;/LI&gt;&lt;LI&gt;the slowness occurs only when AVX versions of the instructions are mixed with legacy SSE instructions.&lt;BR /&gt;In the above code, the&amp;nbsp;&lt;EM&gt;movq&lt;/EM&gt; instruction is a legacy one. When the &lt;EM&gt;movq&lt;/EM&gt; instruction is changed to&amp;nbsp;&lt;EM&gt;vmovq&lt;/EM&gt;, the problem disappears.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 12 Oct 2023 17:44:14 GMT</pubDate>
    <dc:creator>Rafaello7</dc:creator>
    <dc:date>2023-10-12T17:44:14Z</dc:date>
    <item>
      <title>slow AVX2 instruction when ymm register is used</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/slow-AVX2-instruction-when-ymm-register-is-used/m-p/1532353#M8226</link>
      <description>&lt;P&gt;The program below shows a HUGE difference in speed depending on the destination operand of the&amp;nbsp;&lt;EM&gt;vpbroadcastq&lt;/EM&gt; instruction. On my computer the program finishes in ~200 ms when the xmm1 register is used as the destination. With the ymm1 register as the destination, the execution time increases to about 9 seconds. How is that possible? The CPU is an&amp;nbsp;i7-1260P.&amp;nbsp; On another machine, with an&amp;nbsp;i3-10110U CPU, the difference is minor.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="cpp"&gt;#include &amp;lt;stdio.h&amp;gt;

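unsigned long reverseasm(unsigned long edges) {
    unsigned long res;
    /* volatile keeps the optimizer from hoisting the asm out of the benchmark loop */
    asm volatile (
        "mov $1, %%r8\n"
        "movq %%r8, %%xmm0\n"               /* legacy SSE encoding */
#if 0
        "vpbroadcastq %%xmm0, %%xmm1\n"     /* 128-bit destination: ~200 ms */
#else
        "vpbroadcastq %%xmm0, %%ymm1\n"     /* 256-bit destination: ~9 s */
#endif
        "mov $1, %[res]\n"
        : [res]         "=r"    (res)
        :
        : "r8", "ymm0", "ymm1"
        );
    return res;
}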
}

int main(int argc, char *argv[])
{
    unsigned long sum = 0;
    for(unsigned i = 0; i &amp;lt; 100000000; ++i)
        sum += reverseasm(i);
    printf("sum=%lu\n", sum);
    return 0;
}&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Oct 2023 17:43:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/slow-AVX2-instruction-when-ymm-register-is-used/m-p/1532353#M8226</guid>
      <dc:creator>Rafaello7</dc:creator>
      <dc:date>2023-10-10T17:43:44Z</dc:date>
    </item>
    <item>
      <title>Re: slow AVX2 instruction when ymm register is used</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/slow-AVX2-instruction-when-ymm-register-is-used/m-p/1533188#M8227</link>
      <description>&lt;P&gt;My findings so far:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;the slowness occurs only on a&amp;nbsp;&lt;EM&gt;performance core&lt;/EM&gt; of the CPU; &lt;EM&gt;efficient cores&lt;/EM&gt;&amp;nbsp;are not affected&lt;/LI&gt;&lt;LI&gt;the slowness occurs only when AVX versions of the instructions are mixed with legacy SSE instructions.&lt;BR /&gt;In the above code, the&amp;nbsp;&lt;EM&gt;movq&lt;/EM&gt; instruction is a legacy one. When the &lt;EM&gt;movq&lt;/EM&gt; instruction is changed to&amp;nbsp;&lt;EM&gt;vmovq&lt;/EM&gt;, the problem disappears.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 12 Oct 2023 17:44:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/slow-AVX2-instruction-when-ymm-register-is-used/m-p/1533188#M8227</guid>
      <dc:creator>Rafaello7</dc:creator>
      <dc:date>2023-10-12T17:44:14Z</dc:date>
    </item>
    <item>
      <title>Re: slow AVX2 instruction when ymm register is used</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/slow-AVX2-instruction-when-ymm-register-is-used/m-p/1533442#M8228</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The explanation is in the Intel Architectures Optimization Reference Manual, Volume 1 (&lt;A href="http://www.intel.com/sdm" target="_blank"&gt;www.intel.com/sdm&lt;/A&gt;), section 3.11.6, Instruction Sequence Slowdowns. The Golden Cove performance core eliminated some hardware speed paths for switching between SSE and VEX (AVX) instructions and replaced them with microcode.&amp;nbsp; The solutions are: 1) use VEX-encoded instructions (like vmovq) when possible, or 2) insert VZEROUPPER.&amp;nbsp; Many more details can be found in the manual.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Roman&lt;/P&gt;</description>
      <pubDate>Fri, 13 Oct 2023 11:17:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/slow-AVX2-instruction-when-ymm-register-is-used/m-p/1533442#M8228</guid>
      <dc:creator>Roman_D_Intel</dc:creator>
      <dc:date>2023-10-13T11:17:22Z</dc:date>
    </item>
    <item>
      <title>Re: slow AVX2 instruction when ymm register is used</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/slow-AVX2-instruction-when-ymm-register-is-used/m-p/1533794#M8229</link>
      <description>&lt;P&gt;Thanks.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I admit, I was not aware that SSE and AVX instructions should not be mixed. I did read the&amp;nbsp;&lt;EM&gt;Software Developer's Manual&lt;/EM&gt; and didn't find any such information. I wondered why all the instructions have two variants, for example&amp;nbsp;&lt;EM&gt;movq/vmovq&lt;/EM&gt;. The vmovq instruction clears the upper part of the ymm register, but who cares? Preserving or zeroing the upper ymm part is rarely useful. Programmers can cope without that.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Similarly, I'm wondering why the description of the&amp;nbsp;movupd instruction states that it moves&amp;nbsp;&lt;EM&gt;double precision floating point values&lt;/EM&gt;. Is it important to have valid floating-point numbers in the registers? Or can it be any integer data as well? Are there any caveats?&lt;/P&gt;</description>
      <pubDate>Sat, 14 Oct 2023 10:10:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/slow-AVX2-instruction-when-ymm-register-is-used/m-p/1533794#M8229</guid>
      <dc:creator>Rafaello7</dc:creator>
      <dc:date>2023-10-14T10:10:52Z</dc:date>
    </item>
    <item>
      <title>Re: slow AVX2 instruction when ymm register is used</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/slow-AVX2-instruction-when-ymm-register-is-used/m-p/1533877#M8230</link>
      <description>&lt;P&gt;Another interesting thing: the&amp;nbsp;&lt;EM&gt;shld&lt;/EM&gt; instruction takes about 11 CPU cycles on an &lt;EM&gt;efficient&lt;/EM&gt; core. On a&amp;nbsp;&lt;I&gt;performance&lt;/I&gt;&amp;nbsp;core it takes one cycle.&lt;/P&gt;</description>
      <pubDate>Sun, 15 Oct 2023 09:53:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/slow-AVX2-instruction-when-ymm-register-is-used/m-p/1533877#M8230</guid>
      <dc:creator>Rafaello7</dc:creator>
      <dc:date>2023-10-15T09:53:43Z</dc:date>
    </item>
  </channel>
</rss>

