vmovdqu VS vmovdqu16

Guy_A · 05-06-2024

I was trying to understand the difference between `_mm256_loadu_epi16` and `_mm256_loadu_si256`.

According to the intrinsics manual, they both return the same type and take an unaligned pointer, but they compile to different instructions: `vmovdqu` vs. `vmovdqu16`.

`_mm256_loadu_epi16` requires more feature flags (AVX512BW + AVX512VL) and has slightly worse latency than `_mm256_loadu_si256`, which requires only AVX, so I could not understand what the benefit of the explicit `epi16` variant is.

In addition, there is also `_mm256_lddqu_si256` that should be equivalent to `_mm256_loadu_si256` but "may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary".
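To make the comparison concrete, here is a minimal sketch that loads the same 32 bytes with `_mm256_loadu_si256` and `_mm256_lddqu_si256` from an address that is deliberately misaligned and straddles a 64-byte boundary, then checks that both results are byte-identical. It assumes GCC or Clang on x86-64 (it relies on the `target` attribute and `__builtin_cpu_supports`), and the names `loads_match` and `compare_unaligned_loads` are hypothetical helpers, not part of any library. It only checks functional equivalence and says nothing about latency. `_mm256_loadu_epi16` is left out because it needs AVX512BW + AVX512VL hardware.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 32 bytes from p with both AVX intrinsics and
   return 1 if the two registers hold the same bytes as memory, else 0.
   The target attribute lets this compile without passing -mavx. */
__attribute__((target("avx")))
static int loads_match(const uint8_t *p)
{
    assert(p != NULL);

    __m256i a = _mm256_loadu_si256((const __m256i *)p);  /* vmovdqu */
    __m256i b = _mm256_lddqu_si256((const __m256i *)p);  /* vlddqu  */

    uint8_t out_a[32], out_b[32];
    _mm256_storeu_si256((__m256i *)out_a, a);
    _mm256_storeu_si256((__m256i *)out_b, b);

    return memcmp(out_a, out_b, 32) == 0 && memcmp(out_a, p, 32) == 0;
}

/* Returns 1 if both loads agree, 0 if they differ,
   and -1 if the CPU does not report AVX support. */
static int compare_unaligned_loads(void)
{
    /* 64-byte aligned buffer, so offset 49 is not 32-byte aligned and
       the 32 bytes at [49, 81) cross the 64-byte boundary. */
    static uint8_t buf[128] __attribute__((aligned(64)));

    for (int i = 0; i < 128; i++)
        buf[i] = (uint8_t)i;

    if (!__builtin_cpu_supports("avx"))
        return -1;

    return loads_match(buf + 49);
}
```

Calling `compare_unaligned_loads()` from `main` and printing the result is enough to try it. Any difference between the two intrinsics would have to show up in timing, not in the loaded data.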

Any advice or explanation?

I appreciate any help you can provide.

Alex_Y_Intel · 05-07-2024

Yes, I agree with you. I don't see any benefit from the comparison tables either.