When I compile code that uses AVX-512 intrinsics with the processor-specific optimization set to AVX2, the generated assembly contains AVX-512 instructions that use only zmm0-zmm15.
So only 16 of the 32 zmm registers are used.
With the processor-specific optimization set to AVX512 core, all 32 zmm registers are used.
(I need to compile with the processor-specific optimization set to AVX2; otherwise the AVX2 code path does not run.)
Is this a known issue, and can it be fixed?
Intrinsics are assembly-coded, so if you call AVX-512 intrinsics you will see zmm registers being used.
If you target AVX2, I would suggest using AVX2 intrinsics instead.
Sorry if my explanation was not clear.
AVX-512 intrinsics of course generate assembly that uses zmm registers.
My point is that the generated assembly uses only 16 of the 32 zmm registers (zmm0-zmm15). It should be able to use all 32 to reduce register spilling.
I have filed issue CMPLRS-45119 for this, and the compiler developers are working on it. In the meantime, try the workaround _allow_cpu_features(_FEATURE_AVX512F). More information is documented at https://software.intel.com/en-us/articles/new-intrinsic-allow-cpu-features-support
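For anyone trying the workaround, here is a rough sketch of where the call could go. It is not taken from the original poster's code: the kernel sum_avx512, the fallback sum_fallback (standing in for the AVX2 path) and the dispatcher sum_floats are hypothetical names made up for illustration. It assumes the feature constant is spelled _FEATURE_AVX512F, as in the immintrin.h headers I know of, and that _allow_cpu_features applies to the function body that follows the call. For GCC or Clang, which do not provide that intrinsic, the sketch substitutes a per-function target attribute, which is a different mechanism with a similar purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* GCC/Clang need a per-function target attribute to accept AVX-512
   intrinsics in a build that otherwise targets AVX2. With the classic
   Intel compiler, the _allow_cpu_features call below plays that role. */
#if defined(__INTEL_COMPILER)
#define AVX512_FN
#else
#define AVX512_FN __attribute__((target("avx512f")))
#endif

/* Hypothetical kernel: sums n floats with AVX-512 intrinsics. */
static AVX512_FN float sum_avx512(const float *x, size_t n)
{
#if defined(__INTEL_COMPILER)
    /* Workaround from this thread: allow AVX-512F code generation here. */
    _allow_cpu_features(_FEATURE_AVX512F);
#endif
    __m512 acc = _mm512_setzero_ps();
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        acc = _mm512_add_ps(acc, _mm512_loadu_ps(x + i));

    /* Horizontal sum via a store, then handle the remaining tail. */
    float lanes[16];
    _mm512_storeu_ps(lanes, acc);
    float total = 0.0f;
    for (int k = 0; k < 16; ++k)
        total += lanes[k];
    for (; i < n; ++i)
        total += x[i];
    return total;
}

/* Scalar fallback; stands in for the AVX2 code path of a real application. */
static float sum_fallback(const float *x, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i)
        total += x[i];
    return total;
}

/* Runtime check so the AVX-512 path only runs on CPUs that support it. */
static int cpu_has_avx512f(void)
{
#if defined(__INTEL_COMPILER)
    return _may_i_use_cpu_feature(_FEATURE_AVX512F);
#else
    return __builtin_cpu_supports("avx512f");
#endif
}

/* Dispatcher: picks the AVX-512 kernel when available, else the fallback. */
float sum_floats(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    return cpu_has_avx512f() ? sum_avx512(x, n) : sum_fallback(x, n);
}
```

Keeping the AVX-512 intrinsics in their own function, behind a runtime check, means the rest of the build can stay at AVX2. To see whether the workaround helps with the register problem, check the generated assembly for zmm16-zmm31 in the AVX-512 function.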