If I run on an AVX2 or AVX512 system with:
export MKL_ENABLE_INSTRUCTIONS=SSE4_2
my code is almost twice as fast as when I run with:
export MKL_ENABLE_INSTRUCTIONS=AVX2
Details:
I tried with different compilers/libraries:
compilers_and_libraries_2017.5.239 and compilers_and_libraries_2019.3.199
I tried on different hardware (Linux):
Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz and Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
I ran on both Linux and Windows.
I ran with MKL_VERBOSE=1. Looking only at the DSCAL calls:
SSE4_2:
sudev604 <97> grep DSCAL nas31343_SSE4_2.log | head -4
MKL_VERBOSE DSCAL(3,0x7ffdfc91db88,0x7f07c18da6a0,1) 77.53us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
MKL_VERBOSE DSCAL(3,0x7ffdfc91db88,0x7f07c18da6c8,1) 489ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
MKL_VERBOSE DSCAL(3,0x7ffdfc91db88,0x7f07c18da6f0,1) 231ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
MKL_VERBOSE DSCAL(3,0x7ffdfc91db88,0x7f07c18da718,1) 149ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
And AVX2:
em64tn <104> grep DSCAL nas31343_AVX2.log | head -4
MKL_VERBOSE DSCAL(3,0x7ffeb11ad908,0x7f14898da6a0,1) 63.13us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
MKL_VERBOSE DSCAL(3,0x7ffeb11ad908,0x7f14898da6c8,1) 539ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
MKL_VERBOSE DSCAL(3,0x7ffeb11ad908,0x7f14898da6f0,1) 374ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
MKL_VERBOSE DSCAL(3,0x7ffeb11ad908,0x7f14898da718,1) 176ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8 WDiv:HOST:-1.000 WDiv:0:-1.000 WDiv:1:-1.000
My guess is that the values ending in "us" and "ns" are the call times (microseconds and nanoseconds).
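Since the first call in each log is far slower than the following ones, comparing only the first four lines may be misleading. A rough way to compare the two runs would be to sum the reported time per routine over the whole log. The sketch below assumes every call line has the same shape as the ones above and that the suffix is a time unit; the "ms" and "s" suffixes and the helper name `total_times` are my own assumptions, not something taken from the MKL documentation.

```python
import re
import sys
from collections import defaultdict

# Assumed meaning of the time suffixes in MKL_VERBOSE output, in seconds.
UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}

# Matches lines such as:
#   MKL_VERBOSE DSCAL(3,0x...,0x...,1) 77.53us CNR:OFF Dyn:1 ...
LINE_RE = re.compile(r"^MKL_VERBOSE\s+(\w+)\(.*\)\s+([0-9.]+)(ns|us|ms|s)\b")


def total_times(lines):
    """Return {routine name: total reported time in seconds}.

    Hypothetical helper: lines that do not match the pattern are skipped.
    """
    totals = defaultdict(float)
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            name, value, unit = match.groups()
            totals[name] += float(value) * UNIT_SECONDS[unit]
    return dict(totals)


if __name__ == "__main__":
    # Usage: python mkl_totals.py nas31343_SSE4_2.log nas31343_AVX2.log
    for path in sys.argv[1:]:
        with open(path) as log:
            totals = total_times(log)
        print(path)
        # Print the most expensive routines first.
        for name, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
            print(f"  {name:12s} {seconds:.6f} s")
```

Running it on both logs would show which routines account for the difference, rather than relying on DSCAL alone.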
If I do not set MKL_ENABLE_INSTRUCTIONS, the AVX2 settings are used and my code runs slowly.
AVX isn't always faster than SSE. The compiler usually gets it right, but maybe MKL doesn't. You should really take this up in the MKL forum as it's not related to the Fortran compiler.
