Beginner
82 Views

Alignment requirements for _mm256_maskload_pd

Hi,

Are there any alignment requirements (beyond 8 bytes) for _mm256_maskload_pd and likewise for _mm256_maskstore_pd?

Thanks

7 Replies
New Contributor I

Hi Stephen,

My timing experiments with both on Haswell show that:

_mm256_maskload_pd does not depend on alignment and is 4(!) times as slow as _mm256_loadu_pd.

_mm256_maskstore_pd is 12%..25% slower than _mm256_storeu_pd if the access doesn't cross a cache-line boundary ((addr % 64) <= 32), and has the same speed as _mm256_storeu_pd otherwise (both being about 3 times slower than in the ((addr % 64) <= 32) case).

Neither depends on the mask value.

Beginner

Hello,

Haswell seems to handle alignment specially. I had code using aligned loads that worked for all kinds of unaligned addresses; I only noticed when the code suddenly crashed on Sandy Bridge. I traced it back and found that on Haswell, _mm256_load_xx worked with any address.

If maskload is that expensive, it looks like it's implemented similarly to the scatter/gather instructions.

Has anyone tested which is the best method for unaligned loads on Haswell?

New Contributor I

Hi Christian,

_mm256_load_xx can't work with an unaligned address. This code crashes:

    __ALIGN(32) float _f[100];
    float * volatile f = _f;
    volatile __m256 v = _mm256_load_ps(f + 1);

The compiler is usually smart enough to replace your load's with loadu's.

The best way to load an unaligned 256-bit value is:

    __m256 v;
    float p[16] = {1, 0, 0, 0, 1};

    v = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(p)), *((__m128 *)p + 1), 1);

It loads the low-order half and inserts the high-order half from memory. This is the method the Intel compiler uses; VC and GCC call _mm256_loadu_ps.

The same approach can be used for unaligned stores.
 

Beginner

Hello Vladimir,

Ah, that's reassuring, that an aligned load really doesn't work with an unaligned address. Since the loop extracted a sliding window with an increment of one and a fixed window size, the compiler seems to have realized it can't be aligned all the time.

I already use _mm256_loadu_xx. What about _mm256_loadu2_m128 and _mm256_lddqu_si256? Especially the last one might perform better, says the intrinsics guide, but it doesn't give any details. Agner Fog's instruction tables only list lddqu ymm, m128 but no vlddqu ymm, m256.

New Contributor I

Hello Christian,

_mm256_lddqu_si256 has the same speed as _mm256_loadu_si256.

_mm256_loadu2_m128 is noticeably faster than loadu because it uses the insertion method, as I said before. It's ~28% faster when loading from a limited area in a loop (L1 cache).

New Contributor III

Here is something about lddqu and movdqu: https://software.intel.com/en-us/blogs/2012/04/16/history-of-one-cpu-instructions-part-1-lddqumovdqu...

The article says that Core 2 and later systems implement lddqu and movdqu similarly but does not clarify exactly how. Given that the original lddqu was not suitable for all memory types, I would guess that on recent architectures both lddqu and movdqu simply load 16 bytes (32 bytes in the case of ymm registers) of unaligned data. I have not seen confirmation of this, though.
Beginner

Thanks for all the information!

Vladimir,

As you said, with VC and the Intel compiler loadu2 won't give me a benefit, as loadu is already implemented with the insert.

andysem,

I read through the article. I agree: on modern CPUs both instructions should behave the same way, as they support SSSE3.
