I was trying to understand the difference between `_mm256_loadu_epi16` and `_mm256_loadu_si256`.
According to the intrinsics manual, they both return the same type and take a pointer with no alignment requirement, but they result in different instructions: `vmovdqu` vs. `vmovdqu16`.
`_mm256_loadu_epi16` requires more CPU feature flags (AVX512BW + AVX512VL) and has slightly worse latency than `_mm256_loadu_si256`, which requires only AVX, so I could not understand what the benefit of the explicit `epi16` variant is.
In addition, there is also `_mm256_lddqu_si256` that should be equivalent to `_mm256_loadu_si256` but "may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary".
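To make the comparison concrete, here is a minimal sketch of how I would check that the two AVX loads are interchangeable in terms of the data they return. It assumes GCC or Clang on x86-64 (the `target("avx")` attribute and `__builtin_cpu_supports` are compiler-specific), and the helper names `load_with_loadu`, `load_with_lddqu`, and `loads_agree` are made up for illustration. It only compares the loaded bytes from a deliberately misaligned address; it says nothing about latency or cache-line behavior.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 32 bytes with _mm256_loadu_si256 and copy them out.
 * target("avx") is an assumption that the compiler is GCC or Clang; it lets
 * this function use AVX without passing -mavx for the whole file. */
__attribute__((target("avx")))
static void load_with_loadu(const uint8_t *src, uint8_t out[32]) {
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Hypothetical helper: same as above, but using _mm256_lddqu_si256. */
__attribute__((target("avx")))
static void load_with_lddqu(const uint8_t *src, uint8_t out[32]) {
    __m256i v = _mm256_lddqu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Returns 1 if both loads produce the source bytes unchanged,
 * 0 if they differ, and -1 if the CPU does not report AVX support. */
static int loads_agree(void) {
    uint8_t buf[96];
    uint8_t a[32], b[32];

    if (!__builtin_cpu_supports("avx")) {
        return -1; /* cannot run the AVX helpers on this machine */
    }

    for (int i = 0; i < 96; i++) {
        buf[i] = (uint8_t)(i * 7 + 3); /* arbitrary test pattern */
    }

    /* buf + 1 is off by one byte from the array start, so the load address
     * is not aligned to the element size of the 16-bit lanes or to 32 bytes
     * in general. */
    const uint8_t *src = buf + 1;

    load_with_loadu(src, a);
    load_with_lddqu(src, b);

    return memcmp(a, b, 32) == 0 && memcmp(a, src, 32) == 0;
}
```

Calling `loads_agree()` from `main` and inspecting the compiler's assembly output for the two helpers should show which instruction each intrinsic is lowered to on a given compiler. I left `_mm256_loadu_epi16` out of the sketch because it needs AVX512BW + AVX512VL hardware to run.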
Any advice or explanation?
I appreciate any help you can provide.
Yes, I agree with you, I don't see any benefit from the comparison tables either.
