Intel® oneAPI DPC++/C++ Compiler
Talk to fellow users of Intel® oneAPI DPC++/C++ Compiler and companion tools like Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and Intel® Distribution for GDB*

vmovdqu VS vmovdqu16

Guy_A
Beginner
599 Views

I was trying to understand the difference between `_mm256_loadu_epi16` and `_mm256_loadu_si256`.

According to the intrinsics manual, they both return the same type and take a pointer to unaligned memory, but they result in different instructions: `vmovdqu` vs. `vmovdqu16`.

Since `_mm256_loadu_epi16` requires more CPU feature flags (AVX512BW + AVX512VL) and has slightly worse latency than `_mm256_loadu_si256`, which requires only AVX, I could not understand the benefit of the explicit `epi16` variant.

In addition, there is also `_mm256_lddqu_si256`, which should be equivalent to `_mm256_loadu_si256` but "may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary".

Any advice or explanation?

I appreciate any help you can provide.

1 Reply
Alex_Y_Intel
Moderator
549 Views

Yes, I agree with you; from the comparison tables, I don't see any benefit either.
