Re: C/C++ compiler 2025.0.0 AVX512 code generation bug

Paul-C · 02-07-2025

Given the file load128.c as follows:

#include <immintrin.h>

extern double array[];
void f(__m512d);

int main() {
    __m512d x = _mm512_castpd128_pd512(_mm_loadu_pd(array));
    f(x);
}

and compiling as follows:

icx -xCORE-AVX512 -O3 -c load128.c

the compiler generates:

vinsertf128 $0x0,0x0(%rip),%ymm0,%ymm0

This unnecessarily preserves bits [255:128] of ymm0. It clears bits [511:256] of zmm0.

When compiling AVX-512 code, the 128-bit load intrinsic should not preserve the next-higher 128 bits while clearing the upper 256 bits.

Preserving some bits while clearing others is surprising and makes no sense.
The resulting code is slower: the vinsertf128 instruction both reads and writes ymm0. This creates a dependency in the hardware pipeline and reduces instruction-level parallelism. In a loop, for example, hardware cannot execute an instance of this instruction until the previous instruction that writes ymm0 has finished.
It can lead to correctness issues if the programmer relies on the upper 384 bits being all zero. For example, floating-point instructions in f(x) can overflow and cause traps.
If the programmer wanted vinsertf128 behavior, they would have used the intrinsic _mm256_insertf128_pd().
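
For code that does rely on the upper 384 bits being zero, one option is to request zero extension explicitly. Intel documents _mm512_castpd128_pd512 as leaving the upper bits undefined and _mm512_zextpd128_pd512 as zeroing them. The following is a minimal sketch, not taken from the original post: the helper names upper_lanes_are_zero and check_zero_extension are hypothetical, the target attribute and __builtin_cpu_supports assume a GCC- or Clang-compatible compiler, and it has not been checked whether this changes the instruction that icx 2025.0.0 emits.

```c
#include <assert.h>
#include <immintrin.h>
#include <string.h>

/* Hypothetical helper (not from the original post): loads two doubles,
   widens them to 512 bits with an explicit zero extension, and checks
   that the six upper lanes are all-zero bit patterns. */
__attribute__((target("avx512f")))
static int upper_lanes_are_zero(const double *src)
{
    assert(src != NULL);

    /* _mm512_zextpd128_pd512 is documented to zero bits [511:128];
       _mm512_castpd128_pd512 leaves those bits undefined. */
    __m512d x = _mm512_zextpd128_pd512(_mm_loadu_pd(src));

    double out[8];
    _mm512_storeu_pd(out, x);

    /* The low two lanes must carry the loaded values unchanged. */
    if (out[0] != src[0] || out[1] != src[1])
        return 0;

    /* Bit-exact comparison of lanes 2..7 against zero. */
    static const double zeros[6] = {0};
    return memcmp(out + 2, zeros, sizeof zeros) == 0;
}

/* Returns 1 if the check passed, 0 if it failed, and -1 if the CPU
   does not report AVX-512F support (the check is then skipped). */
int check_zero_extension(void)
{
    static const double values[2] = {1.5, -2.5};

    if (!__builtin_cpu_supports("avx512f"))
        return -1;
    return upper_lanes_are_zero(values);
}
```

Calling check_zero_extension() from main() on an AVX-512 machine exercises the zero-extending path. Because the check only looks at values stored back to memory, it confirms the intrinsic's documented result, not which load instruction the compiler chose.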

The oneAPI compiler version 2023.2.0.20230622 produces:

vmovups 0x0(%rip),%xmm0

using the VEX.128 prefix, so the upper 384 bits of zmm0 are cleared. (It does not emit the legacy SSE instruction of the same name, which would leave the upper bits unchanged.) This is correct. I hope someone at Intel can file a compiler bug for this regression on my behalf.

Sravani_K_Intel · 02-11-2025

Thanks for reporting. This has been escalated as a bug report to the Compiler team.

Sravani_K_Intel · 03-11-2025

This issue has been fixed. The fix will be available in the 2025.2 release, as we have already reached the cutoff for the upcoming 2025.1 release.