Solved: slow AVX2 instruction when ymm register is used

Rafaello7 · ‎10-10-2023

The program below shows a HUGE difference in speed depending on the destination operand of the vpbroadcastq instruction. On my computer the program finishes in ~200 ms when the xmm1 register is used as the destination. With the ymm1 register the execution time increases to about 9 seconds. How is this possible? The CPU is an i7-1260P. On another machine, with an i3-10110U CPU, the difference is minor.

#include <stdio.h>

unsigned long reverseasm(unsigned long edges) {
    unsigned long res;
    asm volatile (  /* volatile: keeps the optimizer from hoisting or dropping the block */
        "mov $1, %%r8\n"
        "movq %%r8, %%xmm0\n"
#if 0
        "vpbroadcastq %%xmm0, %%xmm1\n"
#else
        "vpbroadcastq %%xmm0, %%ymm1\n"
#endif
        "mov $1, %[res]\n"
        : [res]         "=r"    (res)
        :
        : "r8", "ymm0", "ymm1"
        );
    return res;
}

int main(int argc, char *argv[])
{
    unsigned long sum = 0;
    for(unsigned i = 0; i < 100000000; ++i)
        sum += reverseasm(i);
    printf("sum=%lu\n", sum);  /* %lu matches unsigned long */
    return 0;
}

Rafaello7 · ‎10-12-2023

My findings so far:

- The slowness occurs only on a performance core of the CPU; efficiency cores are not affected.
- The slowness occurs only when AVX versions of the instructions are mixed with legacy SSE instructions.

In the above code, the movq instruction is a legacy one. When the movq instruction is changed to vmovq, the problem disappears.

Roman_D_Intel · ‎10-13-2023

Hi,

The explanation can be found in the Intel Architectures Optimization Reference Manual Volume 1 (www.intel.com/sdm), section 3.11.6 Instruction Sequence Slowdowns. The Golden Cove performance core eliminated some hardware speed paths for switching between SSE and VEX (AVX) code and replaced them with microcode. The solutions are: 1) use VEX-encoded instructions (like vmovq) when possible, or 2) insert VZEROUPPER. Many more details can be found in the manual.

Roman

Rafaello7 · ‎10-14-2023

Thanks.

I admit I was not aware that SSE and AVX instructions should not be mixed. I did read the Software Developer's Manual and didn't find any such information. I wondered why all the instructions have two variants, for example movq/vmovq. The vmovq instruction does clear the upper part of the ymm register, but who cares? Preserving or zeroing the upper ymm part is rarely useful. Programmers can cope without that.

Similarly, I'm wondering why the movupd instruction description states that it moves double-precision floating-point values. Is it important to have valid floating-point numbers in the registers? Or can it also be any integer data? Is there any caveat?

Rafaello7 · ‎10-15-2023

Another interesting thing: the shld instruction takes about 11 CPU cycles on an efficiency core. On a performance core it takes one cycle.