<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic AVX512 auto-vectorization on i9-7900X in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX512-auto-vectorization-on-i9-7900X/m-p/1150748#M6471</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;I wrote a small piece of code to test the auto-vectorization options of the icc compiler. The code sums 2 arrays (vectors) of doubles and stores the result in a 3rd array. I compiled it with -xCORE-AVX2 and -xCORE-AVX512, expecting some portion of the theoretical maximum 2x speedup. What I saw instead was almost the same execution time for both versions (sometimes even worse). At first I thought the cause was the array size, which doesn't fit in L1, but when I repeated the runs with an array size of 256 elements, the result was the same. I noticed the same effect with a series of other scientific apps, and I just can't accept that for every single app I tried I get no speedup at all (unless the app was precompiled somewhere).&lt;/P&gt;

&lt;P&gt;So I checked the objdump of the AVX512-targeted binary for my toy program and noticed that no zmm registers were used in the hot loop. However, when I used an online compiler explorer (https://godbolt.org/#) I saw the beautiful AVX512 code I expected to see on my machine as well. The funny thing is that I even use a more recent version of the icc compiler, and I still can't get it to produce that code. What could be the issue here? Are there hints I have to give the compiler? The online compiler doesn't seem to need any. I tried different pragmas, the trip counts are known in advance, the arrays are aligned, and -qopt-report says the loop was vectorized.&lt;/P&gt;
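&lt;P&gt;For reference, the check itself is a one-liner (the binary name toy is just an example):&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;# disassemble the binary and count uses of zmm registers (binary name assumed)
objdump -d toy | grep -c zmm&lt;/PRE&gt;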

&lt;P&gt;Any suggestion would be helpful.&lt;/P&gt;

&lt;P&gt;Here is the C code for the toy app.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;time.h&amp;gt;
#include &amp;lt;immintrin.h&amp;gt; /* _mm_malloc */

#define N 32768
#define MRUNS 100000000

void init_arrays(double *a, double *b, int n)
{
    int i;
    for(i=0; i&amp;lt;n; i++)
    {
        a[i] = rand();
        b[i] = rand();
    }
}

int main(int argc, char* argv[])
{
    double *a, *b, *c;
    int i, j;
    double sum = 0.0;
    srand(time(NULL));
    a = (double*) _mm_malloc(N * sizeof(double), 64);
    b = (double*) _mm_malloc(N * sizeof(double), 64);
    c = (double*) _mm_malloc(N * sizeof(double), 64);
    init_arrays(a, b, N);

    for(i=0; i&amp;lt;MRUNS; i++)
    {
        for(j=0; j&amp;lt;N; j++)
        {
            c[j] = a[j]+b[j];
        }
    }

    for(i=0; i&amp;lt;N; i++)
        sum += c[i];
    printf("%f\n", sum); /* use the result so the loops are not optimized away */
    return 0;
}&lt;/PRE&gt;

&lt;P&gt;Here is a portion of the assembly generated by ICC on my machine:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;..B1.12:                        # Preds ..B1.12 ..B1.11
                                # Execution count [3.28e+10]
..L12:
                # optimization report
                # LOOP WAS UNROLLED BY 4
                # LOOP WAS VECTORIZED
                # VECTORIZATION SPEEDUP COEFFECIENT 6.402344
                # VECTOR TRIP COUNT IS KNOWN CONSTANT
                # VECTOR LENGTH 4
                # MAIN VECTOR TYPE: 64-bits floating point
        vmovupd   (%r12,%rcx,8), %ymm0                          #36.20
        vmovupd   32(%r12,%rcx,8), %ymm2                        #36.20
        vmovupd   64(%r12,%rcx,8), %ymm4                        #36.20
        vmovupd   96(%r12,%rcx,8), %ymm6                        #36.20
        vaddpd    (%rbx,%rcx,8), %ymm0, %ymm1                   #36.25
        vaddpd    32(%rbx,%rcx,8), %ymm2, %ymm3                 #36.25
        vaddpd    64(%rbx,%rcx,8), %ymm4, %ymm5                 #36.25
        vaddpd    96(%rbx,%rcx,8), %ymm6, %ymm7                 #36.25
        vmovupd   %ymm1, (%rax,%rcx,8)                          #36.13
        vmovupd   %ymm3, 32(%rax,%rcx,8)                        #36.13
        vmovupd   %ymm5, 64(%rax,%rcx,8)                        #36.13
        vmovupd   %ymm7, 96(%rax,%rcx,8)                        #36.13
        addq      $16, %rcx                                     #34.9
        cmpq      $32768, %rcx                                  #34.9
        jb        ..B1.12       # Prob 99%                      #34.9&lt;/PRE&gt;

&lt;P&gt;And here is the assembly for the same loop generated by the online ICC compiler.&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;vmovups zmm0,ZMMWORD PTR [r12+rcx*8]
 vmovups zmm2,ZMMWORD PTR [r12+rcx*8+0x40]
 vmovups zmm4,ZMMWORD PTR [r12+rcx*8+0x80]
 vmovups zmm6,ZMMWORD PTR [r12+rcx*8+0xc0]
 vaddpd zmm1,zmm0,ZMMWORD PTR [rbx+rcx*8]
 vaddpd zmm3,zmm2,ZMMWORD PTR [rbx+rcx*8+0x40]
 vaddpd zmm5,zmm4,ZMMWORD PTR [rbx+rcx*8+0x80]
 vaddpd zmm7,zmm6,ZMMWORD PTR [rbx+rcx*8+0xc0]
 vmovupd ZMMWORD PTR [rax+rcx*8],zmm1
 vmovupd ZMMWORD PTR [rax+rcx*8+0x40],zmm3
 vmovupd ZMMWORD PTR [rax+rcx*8+0x80],zmm5
 vmovupd ZMMWORD PTR [rax+rcx*8+0xc0],zmm7
 add rcx,0x20
 cmp rcx,0x8000
 jb 400bf0 &amp;lt;main+0xd0&amp;gt;&lt;/PRE&gt;

&lt;P&gt;I use ICC 17.0.3 and the compiler explorer uses 17.0.0.&lt;BR /&gt;
	My CPU is a 10-core i9-7900X, running Linux kernel 4.10.0-35-generic.&lt;/P&gt;</description>
    <pubDate>Thu, 02 Nov 2017 18:50:53 GMT</pubDate>
    <dc:creator>Marko_S_</dc:creator>
    <dc:date>2017-11-02T18:50:53Z</dc:date>
    <item>
      <title>AVX512 auto-vectorization on i9-7900X</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX512-auto-vectorization-on-i9-7900X/m-p/1150748#M6471</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;I wrote a small piece of code to test the auto-vectorization options of the icc compiler. The code sums 2 arrays (vectors) of doubles and stores the result in a 3rd array. I compiled it with -xCORE-AVX2 and -xCORE-AVX512, expecting some portion of the theoretical maximum 2x speedup. What I saw instead was almost the same execution time for both versions (sometimes even worse). At first I thought the cause was the array size, which doesn't fit in L1, but when I repeated the runs with an array size of 256 elements, the result was the same. I noticed the same effect with a series of other scientific apps, and I just can't accept that for every single app I tried I get no speedup at all (unless the app was precompiled somewhere).&lt;/P&gt;

&lt;P&gt;So I checked the objdump of the AVX512-targeted binary for my toy program and noticed that no zmm registers were used in the hot loop. However, when I used an online compiler explorer (https://godbolt.org/#) I saw the beautiful AVX512 code I expected to see on my machine as well. The funny thing is that I even use a more recent version of the icc compiler, and I still can't get it to produce that code. What could be the issue here? Are there hints I have to give the compiler? The online compiler doesn't seem to need any. I tried different pragmas, the trip counts are known in advance, the arrays are aligned, and -qopt-report says the loop was vectorized.&lt;/P&gt;
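&lt;P&gt;For reference, the check itself is a one-liner (the binary name toy is just an example):&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;# disassemble the binary and count uses of zmm registers (binary name assumed)
objdump -d toy | grep -c zmm&lt;/PRE&gt;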

&lt;P&gt;Any suggestion would be helpful.&lt;/P&gt;

&lt;P&gt;Here is the C code for the toy app.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;time.h&amp;gt;
#include &amp;lt;immintrin.h&amp;gt; /* _mm_malloc */

#define N 32768
#define MRUNS 100000000

void init_arrays(double *a, double *b, int n)
{
    int i;
    for(i=0; i&amp;lt;n; i++)
    {
        a[i] = rand();
        b[i] = rand();
    }
}

int main(int argc, char* argv[])
{
    double *a, *b, *c;
    int i, j;
    double sum = 0.0;
    srand(time(NULL));
    a = (double*) _mm_malloc(N * sizeof(double), 64);
    b = (double*) _mm_malloc(N * sizeof(double), 64);
    c = (double*) _mm_malloc(N * sizeof(double), 64);
    init_arrays(a, b, N);

    for(i=0; i&amp;lt;MRUNS; i++)
    {
        for(j=0; j&amp;lt;N; j++)
        {
            c[j] = a[j]+b[j];
        }
    }

    for(i=0; i&amp;lt;N; i++)
        sum += c[i];
    printf("%f\n", sum); /* use the result so the loops are not optimized away */
    return 0;
}&lt;/PRE&gt;

&lt;P&gt;Here is a portion of the assembly generated by ICC on my machine:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;..B1.12:                        # Preds ..B1.12 ..B1.11
                                # Execution count [3.28e+10]
..L12:
                # optimization report
                # LOOP WAS UNROLLED BY 4
                # LOOP WAS VECTORIZED
                # VECTORIZATION SPEEDUP COEFFECIENT 6.402344
                # VECTOR TRIP COUNT IS KNOWN CONSTANT
                # VECTOR LENGTH 4
                # MAIN VECTOR TYPE: 64-bits floating point
        vmovupd   (%r12,%rcx,8), %ymm0                          #36.20
        vmovupd   32(%r12,%rcx,8), %ymm2                        #36.20
        vmovupd   64(%r12,%rcx,8), %ymm4                        #36.20
        vmovupd   96(%r12,%rcx,8), %ymm6                        #36.20
        vaddpd    (%rbx,%rcx,8), %ymm0, %ymm1                   #36.25
        vaddpd    32(%rbx,%rcx,8), %ymm2, %ymm3                 #36.25
        vaddpd    64(%rbx,%rcx,8), %ymm4, %ymm5                 #36.25
        vaddpd    96(%rbx,%rcx,8), %ymm6, %ymm7                 #36.25
        vmovupd   %ymm1, (%rax,%rcx,8)                          #36.13
        vmovupd   %ymm3, 32(%rax,%rcx,8)                        #36.13
        vmovupd   %ymm5, 64(%rax,%rcx,8)                        #36.13
        vmovupd   %ymm7, 96(%rax,%rcx,8)                        #36.13
        addq      $16, %rcx                                     #34.9
        cmpq      $32768, %rcx                                  #34.9
        jb        ..B1.12       # Prob 99%                      #34.9&lt;/PRE&gt;

&lt;P&gt;And here is the assembly for the same loop generated by the online ICC compiler.&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;vmovups zmm0,ZMMWORD PTR [r12+rcx*8]
 vmovups zmm2,ZMMWORD PTR [r12+rcx*8+0x40]
 vmovups zmm4,ZMMWORD PTR [r12+rcx*8+0x80]
 vmovups zmm6,ZMMWORD PTR [r12+rcx*8+0xc0]
 vaddpd zmm1,zmm0,ZMMWORD PTR [rbx+rcx*8]
 vaddpd zmm3,zmm2,ZMMWORD PTR [rbx+rcx*8+0x40]
 vaddpd zmm5,zmm4,ZMMWORD PTR [rbx+rcx*8+0x80]
 vaddpd zmm7,zmm6,ZMMWORD PTR [rbx+rcx*8+0xc0]
 vmovupd ZMMWORD PTR [rax+rcx*8],zmm1
 vmovupd ZMMWORD PTR [rax+rcx*8+0x40],zmm3
 vmovupd ZMMWORD PTR [rax+rcx*8+0x80],zmm5
 vmovupd ZMMWORD PTR [rax+rcx*8+0xc0],zmm7
 add rcx,0x20
 cmp rcx,0x8000
 jb 400bf0 &amp;lt;main+0xd0&amp;gt;&lt;/PRE&gt;

&lt;P&gt;I use ICC 17.0.3 and the compiler explorer uses 17.0.0.&lt;BR /&gt;
	My CPU is a 10-core i9-7900X, running Linux kernel 4.10.0-35-generic.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Nov 2017 18:50:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX512-auto-vectorization-on-i9-7900X/m-p/1150748#M6471</guid>
      <dc:creator>Marko_S_</dc:creator>
      <dc:date>2017-11-02T18:50:53Z</dc:date>
    </item>
    <item>
      <title>If I understand correctly,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX512-auto-vectorization-on-i9-7900X/m-p/1150749#M6472</link>
      <description>&lt;P&gt;If I understand correctly, there was a change between compiler versions 17.0.0 and 17.0.3 that caused the compiler to be less aggressive about using 512-bit registers.&amp;nbsp;&amp;nbsp; This is a challenging problem for compiler heuristics, because code using 256-bit registers can run at higher frequencies than code using 512-bit registers, and the compiler is making guesses about the relative sizes of the performance impacts of increased frequency vs increased vector width.&lt;/P&gt;

&lt;P&gt;One workaround with version 17.0.3 is to use -xCOMMON-AVX512 instead of -xCORE-AVX512. The COMMON-AVX512 target does not include the new heuristic trade-off code and will use zmm registers whenever it is possible to do so.&lt;/P&gt;

&lt;P&gt;Starting with 17.0.5 (and 18), there is a new compiler flag "-qopt-zmm-usage=high" to override the default heuristic (which corresponds to "-qopt-zmm-usage=low").&amp;nbsp; For the Intel 18 compiler, this option is described at &lt;A href="https://software.intel.com/en-us/cpp-compiler-18.0-developer-guide-and-reference-qopt-zmm-usage-qopt-zmm-usage" target="_blank"&gt;https://software.intel.com/en-us/cpp-compiler-18.0-developer-guide-and-reference-qopt-zmm-usage-qopt-zmm-usage&lt;/A&gt;&lt;/P&gt;
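&lt;P&gt;For example (assuming a source file named toy.c; adjust names to taste), the two approaches look like this on the command line:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;# icc 17.0.3: target COMMON-AVX512, which skips the new zmm heuristic
icc -O3 -xCOMMON-AVX512 toy.c -o toy

# icc 17.0.5 and 18.0: keep CORE-AVX512 and override the heuristic
icc -O3 -xCORE-AVX512 -qopt-zmm-usage=high toy.c -o toy&lt;/PRE&gt;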

</description>
      <pubDate>Wed, 08 Nov 2017 16:53:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX512-auto-vectorization-on-i9-7900X/m-p/1150749#M6472</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-11-08T16:53:33Z</dc:date>
    </item>
    <item>
      <title>Thank you John,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX512-auto-vectorization-on-i9-7900X/m-p/1150750#M6473</link>
      <description>&lt;P&gt;Thank you John,&lt;/P&gt;

&lt;P&gt;this helped a lot.&lt;/P&gt;

&lt;P&gt;I decided to use 18.0 and -xCORE-AVX512 with -qopt-zmm-usage=high.&lt;/P&gt;

</description>
      <pubDate>Mon, 04 Dec 2017 16:34:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX512-auto-vectorization-on-i9-7900X/m-p/1150750#M6473</guid>
      <dc:creator>Marko_S_</dc:creator>
      <dc:date>2017-12-04T16:34:11Z</dc:date>
    </item>
  </channel>
</rss>

