Intel® oneAPI Math Kernel Library

How to force AVX-2 vs AVX-512



I'm running benchmarks of my code on test hardware (an Intel Xeon Gold 5115), and I'm trying to isolate the impact of AVX-512 vs. AVX2 instructions on overall runtime. My problem is that I don't know whether I'm actually forcing my code (compiled with icc 2018.1.163 + MKL) to use either instruction set. For reference (I can't paste our entire codebase here; it's too long), the code is linear-algebra heavy and uses the Intel MKL libraries via gsl_cblas_* calls, where GSL is also compiled with icc + MKL.

Here’s the build scenario:

My AVX2 build is compiled on Intel Skylake (E3-1240 v5) hardware with the following compiler flags:

CFLAGS="-O3 -xcore-avx2 -I/ldcg/intel/2018u1/compilers_and_libraries_2018.1.163/linux/mkl/include"
LDFLAGS="-L/ldcg/intel/2018u1/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64 -lmkl_rt -lpthread -lm -ldl"

My AVX-512 build is compiled on Xeon Gold 5115 hardware with the following compiler flags:

CFLAGS="-O3 -xCORE-AVX512 -I/ldcg/intel/2018u1/compilers_and_libraries_2018.1.163/linux/mkl/include"
LDFLAGS="-L/ldcg/intel/2018u1/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64 -lmkl_rt -lpthread -lm -ldl"

Okay, so here are my scenarios. I'm using perf to see which shared libraries and symbols the time is going to (maybe this isn't the best way, but I'm open to other suggestions). A couple of years back I was able to directly count instruction cycles, but I think that functionality was deprecated after Sandy Bridge.
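If perf symbol profiles are ambiguous, one direct check is to disassemble each binary and count vector-register names: only AVX-512 instructions use the zmm registers, while AVX/AVX2 code tops out at ymm. A sketch (./myapp is a hypothetical binary name; the last line just demonstrates the filter on two sample disassembly lines):

```shell
# Count AVX-512 vs. AVX/AVX2 instructions in a binary (hypothetical name ./myapp):
#   objdump -d ./myapp | grep -c 'zmm'   # nonzero => AVX-512 instructions present
#   objdump -d ./myapp | grep -c 'ymm'   # AVX/AVX2 (256-bit) instructions
# The same filter, demonstrated on two sample disassembly lines:
printf 'vfmadd231pd %%zmm1,%%zmm2,%%zmm0\nvaddpd %%ymm1,%%ymm2,%%ymm0\n' | grep -c 'zmm'   # prints 1
```

Keep in mind that with -lmkl_rt the BLAS kernels live in MKL's shared libraries and are selected at runtime, so your own binary's disassembly only tells you about the code the compiler generated for your sources, not which MKL kernels actually run.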

1. Running the AVX2 build on the Skylake hardware: the primary overhead in perf was

2. Running the AVX-512 build on the Skylake hardware: Illegal Instruction. This was expected, since the binary was invoking instructions that the CPU doesn't support.

Now, when I run both the AVX2 and the AVX-512 builds on the Xeon Gold 5115 machine, I see almost exactly the same runtime (~0.05% difference). Further, perf reports the same primary overhead in both cases. When I performed a similar study going from Sandy Bridge to Haswell, I saw an overall 20% speedup, so I would expect to see *some* sort of difference here.

Right now, it seems like either I'm not properly compiling for AVX-512 instructions, or the compiler is causing both builds (-xcore-avx2 and -xCORE-AVX512) to use AVX-512.

Here’s my question(s):

1. Am I going about this in an inefficient way? What would be a more direct way to confirm which instruction set is being invoked?

2. Is there something I'm missing about compiling for and forcing AVX2 vs. AVX-512 instructions?



1 Reply

You may try the MKL_ENABLE_INSTRUCTIONS environment variable and not worry about the specific compiler option. Please refer to the MKL User's Guide to see how to use it.
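To expand on this (my reading of the 2018 MKL documentation): because the code links -lmkl_rt, MKL selects its kernels at runtime based on the CPU it detects, regardless of which -x flag the caller was compiled with. That would explain the identical runtimes on the Gold 5115: both builds are most likely executing the same AVX-512 MKL kernels. MKL_ENABLE_INSTRUCTIONS caps that dispatch level, so both paths can be benchmarked with a single binary (./myapp is a hypothetical program name):

```shell
# Cap MKL's runtime dispatch at AVX2, even on AVX-512-capable hardware.
# Documented values in MKL 2018 include SSE4_2, AVX, AVX2, and AVX512.
export MKL_ENABLE_INSTRUCTIONS=AVX2
env | grep '^MKL_ENABLE_INSTRUCTIONS'   # prints MKL_ENABLE_INSTRUCTIONS=AVX2; then run ./myapp
```

Note that the variable can only restrict the dispatch level to something the CPU supports; it cannot enable AVX-512 on hardware that lacks it.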
