Hi,
Are there any alignment requirements (beyond 8 bytes) for _mm256_maskload_pd and likewise for _mm256_maskstore_pd?
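For context, the sort of use I have in mind is masked tail handling along these lines (the helper and its mask setup below are just my illustration, not from any library):
#include <immintrin.h>
/* Load the last n (0 < n < 4) doubles of an array; masked-off lanes are zeroed. */
static __m256d load_tail(const double *p, int n)
{
    /* A lane is loaded when the top bit of the corresponding 64-bit mask element is set. */
    __m256i mask = _mm256_setr_epi64x(n > 0 ? -1 : 0, n > 1 ? -1 : 0,
                                      n > 2 ? -1 : 0, n > 3 ? -1 : 0);
    return _mm256_maskload_pd(p, mask);
}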
Thanks
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
Hi Stephen,
My timing experiments with both on Haswell show that:
_mm256_maskload_pd
does not depend on alignment and is 4(!) times slower than _mm256_loadu_pd.
_mm256_maskstore_pd
is 12%..25% slower than _mm256_storeu_pd if you don't cross a cache-line boundary ((addr % 64) <= 32), and
has the same speed as _mm256_storeu_pd otherwise (3 times slower than with ((addr % 64) <= 32)).
Neither depends on the mask value.
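For reference, the numbers come from a loop of roughly this shape (a minimal sketch, not the exact harness; the iteration count and the rdtsc wrapper are illustrative):
#include <immintrin.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang; use <intrin.h> with VC */
#include <stdio.h>

int main(void)
{
    static __attribute__((aligned(64))) double buf[1024]; /* alignment attribute is compiler-specific */
    const __m256i mask = _mm256_set1_epi64x(-1);          /* all four lanes enabled */
    __m256d acc = _mm256_setzero_pd();
    const long iters = 100000000;

    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < iters; ++i)
        acc = _mm256_add_pd(acc, _mm256_maskload_pd(buf + (i & 511), mask));
    unsigned long long t1 = __rdtsc();

    /* printing the accumulator keeps the loop from being optimized away */
    printf("%.2f cycles/iteration (sum %g)\n", (double)(t1 - t0) / iters, _mm256_cvtsd_f64(acc));
    return 0;
}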
Hello,
Haswell seems to handle alignment specially. I had code using aligned loads that worked for all kinds of unaligned addresses, and I only noticed when it suddenly crashed on Sandy Bridge. I traced it back and found that on Haswell _mm256_load_xx worked with any address.
If maskload is that expensive, it looks like it is implemented similarly to the gather/scatter instructions.
Has anyone tested which is the best method for unaligned loads on Haswell?
Hi Christian,
_mm256_load_xx can't work with an unaligned address. This code crashes:
__ALIGN(32) float _f[100];                   /* __ALIGN(32): compiler-specific 32-byte alignment macro */
float * volatile f = _f;                     /* volatile pointer keeps the compiler from folding the load away */
volatile __m256 v = _mm256_load_ps(f + 1);   /* address misaligned by 4 bytes -> fault */
The compiler is usually smart enough to replace your loads with loadu's.
The best way to load an unaligned 256-bit value is:
__m256 v;
float p[16] = {1, 0, 0, 0, 1};
/* low half via an unaligned 128-bit load, high half folded into vinsertf128's memory operand */
v = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(p)), *((__m128 *)p + 1), 1);
It loads the low-order half and inserts the high-order half directly from memory.
This is the sequence the Intel compiler generates; VC and GCC compile _mm256_loadu_ps to a plain 256-bit unaligned load.
The same approach can be used for unaligned stores.
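For instance, the store-side counterpart could look like this (the helper name is mine, just for illustration):
#include <immintrin.h>
/* Store a 256-bit vector to a possibly unaligned address as two 128-bit halves. */
static void storeu_256_by_halves(float *p, __m256 v)
{
    _mm_storeu_ps(p,     _mm256_castps256_ps128(v));    /* low 128 bits */
    _mm_storeu_ps(p + 4, _mm256_extractf128_ps(v, 1));  /* high 128 bits */
}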
Hello Vladimir,
ah, that calms me down: so an aligned load really does not work with an unaligned address. Since the loop extracts a sliding window with an increment of one and a fixed window size (roughly the loop sketched below), the compiler apparently realized the loads can't always be aligned and replaced them.
I already use _mm256_loadu_xx. What about _mm256_loadu2_m128 and _mm256_lddqu_si256? Especially the last one might perform better according to the intrinsics guide, but it does not give any details. Agner Fog's instruction tables only list lddqu xmm, m128 but no vlddqu ymm, m256.
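To make the loop shape concrete, it is roughly like this (simplified, names are mine; the per-window work here is just a placeholder reduction):
#include <immintrin.h>

void process_windows(const float *src, float *dst, int n)
{
    /* Window of 8 floats, advanced by one element per iteration,
       so the load address is 32-byte aligned only once every 8 iterations. */
    for (int i = 0; i + 8 <= n; ++i) {
        __m256 w = _mm256_loadu_ps(src + i);
        /* placeholder for the real per-window computation: horizontal sum */
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(w), _mm256_extractf128_ps(w, 1));
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        dst[i] = _mm_cvtss_f32(s);
    }
}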
Hello Christian,
_mm256_lddqu_si256 is the same speed as _mm256_loadu_si256.
_mm256_loadu2_m128 is noticeably faster than loadu because it uses the insertion sequence I described above.
It is ~28% faster when loading from a small region in a loop (data in the L1 cache).
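For example (assuming your compiler's headers provide _mm256_loadu2_m128; on ones that don't, the explicit insert sequence below it is equivalent):
#include <immintrin.h>

/* Note the argument order: high-half address first, then low-half address. */
static __m256 loadu_256_two_halves(const float *p)
{
    return _mm256_loadu2_m128(p + 4, p);
}

/* The same load spelled out with the basic AVX intrinsics: */
static __m256 loadu_256_insert(const float *p)
{
    return _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(p)),
                                _mm_loadu_ps(p + 4), 1);
}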
Here is something about lddqu and movdqu: https://software.intel.com/en-us/blogs/2012/04/16/history-of-one-cpu-instructions-part-1-lddqumovdqu-explained
The article says that Core 2 and later systems implement lddqu and movdqu similarly but does not clarify how exactly. Given that the original lddqu was not suitable for all memory types, I would guess that on recent architectures both lddqu and movdqu simply load 16 bytes (32 bytes in the case of ymm registers) of unaligned data. I have not seen a confirmation of this though.
Thanks for all the information!
Vladimir,
as you said, with VC and the Intel compiler loadu2 won't give me a benefit, as loadu is already implemented with the insert.
andysem,
I read through the article. I agree, on modern CPUs both instructions should behave the same way since they support SSSE3.