While working with AVX2 and AVX-512, we noticed the following discrepancies:
1) Why does _mm256_i64gather_epi64 return an __m128i according to the documentation? We would expect an __m256i. Dash agrees.
2) Why is the AVX-512 stream load interface different from AVX2?
extern __m256i _mm256_stream_load_si256(__m256i const *);
extern __m512i _mm512_stream_load_si512(void * mem_addr);
The missing const qualifier in particular is a problem (albeit a minor one), because it forces a const_cast that should be unnecessary.
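To make both points concrete, here is a minimal sketch, assuming the code is built with AVX2 or AVX-512F enabled (e.g. -mavx2 / -mavx512f on GCC and clang, /arch:AVX2 or /arch:AVX512 on MSVC). The helper names gather4_epi64, stream_load_256 and stream_load_512 are hypothetical and not part of any Intel header. The gather wrapper only compiles if the intrinsic really yields an __m256i, and the 512-bit wrapper hides the const_cast in one place.

```cpp
#if defined(__AVX2__)
#include <immintrin.h>

// Hypothetical helper: gathers four 64-bit values base[idx[i]].
// The declared return type is __m256i; since __m128i does not convert
// implicitly to __m256i, this should only compile if the intrinsic
// itself yields a 256-bit vector. The scale of 8 is sizeof(long long).
inline __m256i gather4_epi64(const long long* base, __m256i idx) {
    return _mm256_i64gather_epi64(base, idx, 8);
}

// The AVX2 streaming load accepts a pointer to const as-is.
// src must be 32-byte aligned.
inline __m256i stream_load_256(const __m256i* src) {
    return _mm256_stream_load_si256(src);
}

#if defined(__AVX512F__)
// Hypothetical const-correct wrapper around the AVX-512 streaming load.
// The const_cast only satisfies headers that declare the parameter as a
// non-const void*; the load does not write through the pointer.
// src must be 64-byte aligned.
inline __m512i stream_load_512(const __m512i* src) {
    return _mm512_stream_load_si512(const_cast<__m512i*>(src));
}
#endif  // __AVX512F__
#endif  // __AVX2__
```

With headers that already declare the AVX-512 parameter as a pointer to const, the cast is redundant but harmless, so the wrapper should work in both cases.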
I didn't mean to address the touchy question of how you are meant to make the macro definitions for your preferred ISA available in your code. Hint: it's not (AFAIK) by including or not including zmmintrin.h directly. The arch or equivalent compiler flag must be set to the target ISA. With Intel compilers, that involves automatic promotion of SSE2 to a newer ISA according to the setting. The compiler may even avoid AVX2 macros where it recognizes them as slower than AVX. If you want your code to work with several compilers (e.g. MSVC, Intel, GNU, clang), I think you have to test each of them.
I supposed zmmintrin.h was named for its use of the zmm (512-bit) registers.