I’m running benchmarks of my code on test hardware (Intel Xeon Gold 5115), and I’m trying to isolate the impact of AVX-512 vs. AVX2 instructions on overall runtime. My issue is that I can’t tell whether my code (compiled with icc 2018.1.163 + MKL) is actually being forced to use one instruction set or the other. For reference (our codebase is too long to paste here), the code is linear algebra heavy and calls Intel MKL through gsl_cblas_* calls, where GSL is also compiled with icc + MKL.
Here’s the build scenario:
My AVX2 build is compiled on Intel Skylake (E3-1240 v5) hardware, with the following compiler flags:
CFLAGS="-O3 -xcore-avx2 -I/ldcg/intel/2018u1/compilers_and_libraries_2018.1.163/linux/mkl/include" LDFLAGS="-L/ldcg/intel/2018u1/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64 -lmkl_rt -lpthread -lm -ldl"
My AVX-512 build is compiled on Xeon Gold 5115 hardware, with the following compiler flags:
CFLAGS="-O3 -xCORE-AVX512 -I/ldcg/intel/2018u1/compilers_and_libraries_2018.1.163/linux/mkl/include" LDFLAGS="-L/ldcg/intel/2018u1/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64 -lmkl_rt -lpthread -lm -ldl"
Okay, so here are my scenarios. I’m using perf to see which shared libraries the runtime is being spent in (maybe this isn’t the best way, and I’m open to other suggestions). A couple of years back, I was able to count instruction cycles directly, but I think that functionality was deprecated after Sandy Bridge.
1. Running the AVX2 build on Skylake hardware: the primary overhead in perf was libmkl_avx2.so.
2. Running the AVX-512 build on Skylake hardware: Illegal Instruction. This was expected, since the code was specifically invoking instructions that the CPU doesn’t support.
Now, when I run the AVX2 and the AVX-512 builds on the Xeon Gold 5115 machine, I see almost exactly the same runtime (~0.05% difference). Further, perf reports that the primary overhead is in libmkl_avx512.so in both cases. When I performed a similar study going from Sandy Bridge to Haswell, I saw an overall 20% speedup, so I would expect to see *some* sort of difference here.
Right now, it seems like either I’m not properly compiling to use AVX-512 instructions, or both builds (-xcore-avx2 and -xCORE-AVX512) are ending up on AVX-512 code paths regardless of the flag.
Here are my questions:
1. Am I going about this in an inefficient way? What would be a more direct way to confirm which set of instructions is actually being executed?
2. Is there something that I’m missing about compiling for, and forcing, AVX2 vs. AVX-512 instructions?