Code Faster with SSE4 than with AVX2

joe-griffin · 04-05-2019

If I run on an AVX2 or AVX512 system with:
export MKL_ENABLE_INSTRUCTIONS=SSE4_2
my code is almost twice as fast as when I run with:
export MKL_ENABLE_INSTRUCTIONS=AVX2
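For anyone who wants to reproduce the comparison, here is a minimal sketch of a timing harness. It assumes the program can be launched from the command line; the helper name run_with_isa and the example command are hypothetical and not part of MKL. It only sets the MKL_ENABLE_INSTRUCTIONS environment variable for the child process and measures wall-clock time.

```python
import os
import subprocess
import sys
import time


def run_with_isa(cmd, isa):
    """Run cmd with MKL_ENABLE_INSTRUCTIONS set to isa.

    Returns (elapsed_seconds, return_code). The helper name is hypothetical.
    """
    # Copy the current environment and override only the MKL setting.
    env = dict(os.environ, MKL_ENABLE_INSTRUCTIONS=isa)
    start = time.perf_counter()
    proc = subprocess.run(cmd, env=env)
    return time.perf_counter() - start, proc.returncode


if __name__ == "__main__":
    # Hypothetical usage: python compare_isa.py ./my_solver input.dat
    cmd = sys.argv[1:]
    if cmd:
        for isa in ("SSE4_2", "AVX2"):
            seconds, rc = run_with_isa(cmd, isa)
            print(f"{isa}: {seconds:.2f} s (exit code {rc})")
```

Wall-clock time of a single run can be noisy, so repeating each setting a few times is probably worthwhile before drawing conclusions.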

Details:

I tried with different compilers/libraries:

compilers_and_libraries_2017.5.239 and compilers_and_libraries_2019.3.199

I tried on different hardware (Linux):

Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz and Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz

I ran on both Linux and Windows.

I ran with MKL_VERBOSE=1. Looking just at the DSCAL calls:

SSE4_2:

sudev604 <97> grep DSCAL nas31343_SSE4_2.log | head -4
MKL_VERBOSE DSCAL(3,0x7ffdfc91db88,0x7f07c18da6a0,1) 77.53us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
MKL_VERBOSE DSCAL(3,0x7ffdfc91db88,0x7f07c18da6c8,1) 489ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
MKL_VERBOSE DSCAL(3,0x7ffdfc91db88,0x7f07c18da6f0,1) 231ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
MKL_VERBOSE DSCAL(3,0x7ffdfc91db88,0x7f07c18da718,1) 149ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000

And AVX2:

em64tn <104> grep DSCAL nas31343_AVX2.log | head -4
MKL_VERBOSE DSCAL(3,0x7ffeb11ad908,0x7f14898da6a0,1) 63.13us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
MKL_VERBOSE DSCAL(3,0x7ffeb11ad908,0x7f14898da6c8,1) 539ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
MKL_VERBOSE DSCAL(3,0x7ffeb11ad908,0x7f14898da6f0,1) 374ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
MKL_VERBOSE DSCAL(3,0x7ffeb11ad908,0x7f14898da718,1) 176ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000

My guess is that the "us" and "ns" values are the call times (microseconds and nanoseconds).
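Because the first four lines of each log are dominated by the first call, comparing totals over the whole log may be more telling. The following is a rough sketch of how the DSCAL times could be summed. It assumes the time field directly follows the closing parenthesis, as in the lines above, and that the only unit suffixes are ns, us, ms and s (only ns and us appear in these logs; the others are an assumption). The function name dscal_times_ns is made up for illustration.

```python
import re
import sys

# Elapsed-time field printed right after the call, e.g. ") 77.53us" or ") 489ns".
TIME_RE = re.compile(r"\)\s+([0-9.]+)(ns|us|ms|s)\b")

# Multipliers that convert each unit suffix to nanoseconds.
# "ms" and "s" are assumed; the logs above only show "ns" and "us".
TO_NS = {"ns": 1.0, "us": 1e3, "ms": 1e6, "s": 1e9}


def dscal_times_ns(lines):
    """Return the elapsed time of every MKL_VERBOSE DSCAL line, in nanoseconds."""
    times = []
    for line in lines:
        if not line.startswith("MKL_VERBOSE DSCAL("):
            continue  # skip other routines and non-MKL output
        match = TIME_RE.search(line)
        if match:
            times.append(float(match.group(1)) * TO_NS[match.group(2)])
    return times


if __name__ == "__main__":
    # Usage: python dscal_total.py nas31343_SSE4_2.log nas31343_AVX2.log
    for path in sys.argv[1:]:
        with open(path) as log:
            times = dscal_times_ns(log)
        print(f"{path}: {len(times)} DSCAL calls, {sum(times) / 1e3:.2f} us total")
```

Dropping the first entry of each list before summing would exclude the one-time startup cost that shows up in the first call.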

If I do not set MKL_ENABLE_INSTRUCTIONS, I get the same behavior as with the AVX2 setting, and my code runs slowly.

Steve_Lionel · 04-05-2019

AVX isn't always faster than SSE. The compiler usually gets it right, but maybe MKL doesn't. You should really take this up in the MKL forum as it's not related to the Fortran compiler.