FMA instruction performance on AVX2 and AVX512

perera__niranda · 03-08-2019

Hi,

I am working with FMA instructions on AVX2 and AVX512. While going through the documentation, I found that the semantics are slightly different.

Consider VFMADD231PD (FMA on packed doubles):

In AVX2:
- VFMADD231PD ymm0, ymm1, ymm2/m256, which accepts a 32B (256b) aligned memory location.
But in AVX512:
- VFMADD231PD zmm0 {k1}{z}, zmm1, zmm2/m512/m64bcst{er}, which accepts both a 64B (512b) aligned memory location and an 8B (64b) aligned memory location.

I was under the impression that this memory alignment requirement was enforced for performance reasons, but in AVX512 the restriction appears to be relaxed! Would there be a performance difference between m512 and m64bcst?

I look forward to your opinion. Thank you in advance.

Best

McCalpinJohn · 03-08-2019

AVX2 instructions do not require naturally aligned memory references -- that was an SSE restriction. Some/most processors have a performance penalty for unaligned memory references -- especially for loads that cross cache line boundaries and (even bigger) for loads that cross 4KiB page boundaries. The compiler makes an effort to avoid unaligned loads for some target architectures, but this has not been a big performance issue since Sandy Bridge.
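To make the two penalty cases concrete, here is a minimal sketch in C of how one might check whether a given load would cross a cache line or page boundary. The 64-byte line size and 4 KiB page size are assumptions (typical for current x86-64 parts, not queried from the hardware), and the function names are hypothetical helpers, not part of any Intel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes: 64-byte cache lines and 4 KiB pages (typical for x86-64). */
#define CACHE_LINE_BYTES 64u
#define PAGE_BYTES 4096u

/* Returns true if an access of `size` bytes starting at `addr` touches more
 * than one `boundary`-sized block. `boundary` must be a power of two. */
bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    assert(size > 0 && boundary > 0 && (boundary & (boundary - 1)) == 0);
    uintptr_t mask  = ~(uintptr_t)(boundary - 1);
    uintptr_t first = addr & mask;               /* block holding the first byte */
    uintptr_t last  = (addr + size - 1) & mask;  /* block holding the last byte  */
    return first != last;
}

/* Hypothetical convenience wrappers for pointers. */
bool splits_cache_line(const void *p, size_t size)
{
    return crosses_boundary((uintptr_t)p, size, CACHE_LINE_BYTES);
}

bool splits_page(const void *p, size_t size)
{
    return crosses_boundary((uintptr_t)p, size, PAGE_BYTES);
}
```

Under these assumptions, a 64-byte (zmm) load splits a cache line whenever it is not 64-byte aligned, while a 32-byte (ymm) load splits one only at offsets 33 through 63 within the line.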

The description of the VFMADD*PD instruction in Volume 2 of the Intel 64 and IA-32 Architectures Software Developer's Manual lists the exception type as "Exceptions Type 2" for VEX-encoded instructions (AVX), and "Exceptions Type E2" for EVEX-encoded instructions (AVX-512). The only alignment-related exceptions in the table of "Type 2" exceptions are for "Legacy SSE" instructions. For "Type E2" exceptions, the only alignment exception possible requires both enabling alignment checking and using the broadcast function on an unaligned 8-Byte memory operand.
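For reference, m512 and m64bcst in the operand list describe how much memory is read, not an alignment requirement: m512 reads eight doubles, while m64bcst reads a single double and broadcasts it to all eight lanes. The scalar C model below is only an illustrative sketch of that difference. The function names are hypothetical, and it does not model the {k1}{z} masking, the {er} rounding control, or the single rounding of a true fused multiply-add.

```c
#include <assert.h>
#include <stddef.h>

#define ZMM_DOUBLES 8  /* a 512-bit register holds eight 64-bit doubles */

/* Models the m512 form of VFMADD231PD: dst[i] = a[i] * mem[i] + dst[i].
 * Reads 64 bytes (eight doubles) from memory. */
void fma231_m512_model(double dst[ZMM_DOUBLES],
                       const double a[ZMM_DOUBLES],
                       const double *mem)
{
    assert(dst && a && mem);
    for (size_t i = 0; i < ZMM_DOUBLES; ++i)
        dst[i] = a[i] * mem[i] + dst[i];  /* not fused: may round twice */
}

/* Models the m64bcst form: dst[i] = a[i] * mem[0] + dst[i].
 * Reads only 8 bytes (one double) and reuses it for every lane. */
void fma231_m64bcst_model(double dst[ZMM_DOUBLES],
                          const double a[ZMM_DOUBLES],
                          const double *mem)
{
    assert(dst && a && mem);
    const double b = *mem;                /* the single broadcast element */
    for (size_t i = 0; i < ZMM_DOUBLES; ++i)
        dst[i] = a[i] * b + dst[i];
}
```

The two forms give the same result only when all eight memory elements happen to be equal. Otherwise they compute different things, so the choice between them is usually dictated by the algorithm rather than by alignment.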