Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

slow AVX2 instruction when ymm register is used

Rafaello7
Beginner

The program below shows a huge difference in speed depending on the destination operand of the vpbroadcastq instruction. On my computer it finishes in ~200 ms when the xmm1 register is the destination. With the ymm1 register, the execution time increases to about 9 seconds. How is this possible? The CPU is an i7-1260P. On another machine, with an i3-10110U CPU, the difference is minor.

 

#include <stdio.h>

unsigned long reverseasm(unsigned long edges) {
    unsigned long res;
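    asm volatile (                          /* volatile: keep the compiler from dropping or hoisting the asm */
        "mov $1, %%r8\n"
        "movq %%r8, %%xmm0\n"               /* legacy SSE encoding */
#if 0
        "vpbroadcastq %%xmm0, %%xmm1\n"     /* fast case: ~200 ms */
#else
        "vpbroadcastq %%xmm0, %%ymm1\n"     /* slow case: ~9 s on the i7-1260P */
#endif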
        "mov $1, %[res]\n"
        : [res]         "=r"    (res)
        :
        : "r8", "ymm0", "ymm1"
        );
    return res;
}

int main(int argc, char *argv[])
{
    unsigned long sum = 0;
    for(unsigned i = 0; i < 100000000; ++i)
        sum += reverseasm(i);
    printf("sum=%lu\n", sum);
    return 0;
}

 

1 Solution
Roman_D_Intel
Employee

Hi,

 

The explanation can be found in the Intel Architectures Optimization Reference Manual, Volume 1 (www.intel.com/sdm), section 3.11.6, Instruction Sequence Slowdowns. The Golden Cove performance core eliminated some hardware speed paths for switching between SSE and VEX (AVX) code and replaced them with microcode. The solutions are: 1) use VEX-encoded instructions (like vmovq) when possible, or 2) insert VZEROUPPER. Many more details can be found in the manual.

 

Roman


Rafaello7
Beginner

My findings so far:

  • the slowness occurs only on the performance cores of the CPU; the efficient cores are not affected
  • the slowness occurs only when AVX versions of the instructions are mixed with legacy SSE instructions.
    In the code above, movq is a legacy SSE instruction. When it is changed to vmovq, the problem disappears.

 

Rafaello7
Beginner

Thanks.

 

I admit I was not aware that SSE and AVX instructions should not be mixed. I did read the Software Developer's Manual and didn't find any such information. I wondered why all the instructions have two variants, for example movq/vmovq. The vmovq instruction clears the upper part of the ymm register, but who cares? Preserving or zeroing the upper ymm part is rarely useful; programmers can cope without it.

 

Similarly, I'm wondering why the description of the movupd instruction states that it moves double-precision floating-point values. Is it important to have valid floating-point numbers in the registers, or can it be any integer data? Are there any caveats?
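
One way to check the integer-data part of this question is the small sketch below. It assumes an x86 compiler that provides the SSE2 intrinsics header; copy_u64_pair_via_pd is a made-up helper name. The "pd" load and store move the 128 bits without interpreting them, so integer bit patterns, including ones that would be NaNs when read as doubles, should come back unchanged. The compiler is free to pick a different move instruction than movupd for these intrinsics, and on some microarchitectures feeding data loaded with a floating-point move into integer instructions may cost an extra cycle or two of bypass latency, so treat this as a correctness check only.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Copies two 64-bit integers through an xmm register using the
   double-precision unaligned load/store intrinsics.
   Returns 1 if the destination is bit-identical to the source. */
int copy_u64_pair_via_pd(const uint64_t src[2], uint64_t dst[2]) {
    __m128d v = _mm_loadu_pd((const double *)src);  /* unaligned 128-bit load */
    _mm_storeu_pd((double *)dst, v);                /* unaligned 128-bit store */
    return memcmp(src, dst, 2 * sizeof(uint64_t)) == 0;
}
```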

Rafaello7
Beginner

Another interesting thing: the shld instruction takes about 11 CPU cycles on an efficient core, while on a performance core it takes one cycle.
