Beginner
82 Views

Alignment requirements for _mm256_maskload_pd

Hi,

Are there any alignment requirements (beyond 8 bytes) for _mm256_maskload_pd and likewise for _mm256_maskstore_pd?

Thanks

7 Replies
New Contributor I

Hi Stephen,

My timing experiments with both on Haswell show that:

_mm256_maskload_pd does not depend on alignment and is 4(!) times as slow as _mm256_loadu_pd.

_mm256_maskstore_pd is 12%..25% slower than _mm256_storeu_pd if the access doesn't cross a cache-line boundary ((addr % 64) <= 32), and has the same speed as _mm256_storeu_pd otherwise (both being about 3 times slower than in the ((addr % 64) <= 32) case).

Neither depends on the mask value.

Beginner

Hello,

Haswell seems to handle alignment specially. I had code using aligned loads that worked for all kinds of unaligned addresses; I only noticed when the code suddenly crashed on Sandy Bridge. I traced it back and found that on Haswell, _mm256_load_xx worked with any address.

If maskload is that expensive, it looks like it's implemented similarly to the scatter/gather instructions.

Has anyone tested which is the best method for unaligned loads on Haswell?

New Contributor I

Hi Christian,

_mm256_load_xx can't work with an unaligned address. This code crashes:

    __ALIGN(32) float _f[100];
    float * volatile f = _f;
    volatile __m256 v = _mm256_load_ps(f + 1);

The compiler is usually smart enough to replace your load's with loadu's.

The best way to load an unaligned 256-bit value is:

    __m256 v;
    float p[16] = {1, 0, 0, 0, 1};

    v = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(p)), *((__m128 *)p + 1), 1);

It loads the low-order half and inserts the high-order half from memory. This is the method the Intel compiler uses; VC and GCC call _mm256_loadu_ps.

The same approach can be used for unaligned stores.
 

Beginner

Hello Vladimir,

Ah, that's reassuring, that an aligned load really doesn't work with an unaligned address. Since the loop extracted a sliding window with an increment of one and a fixed window size, the compiler seems to have realized it can't be aligned all the time.

I already use _mm256_loadu_xx. What about _mm256_loadu2_m128 and _mm256_lddqu_si256? Especially the last one might perform better, says the intrinsics guide, but it doesn't give any details. Agner Fog's instruction tables only list lddqu ymm, m128 but no vlddqu ymm, m256.

New Contributor I

Hello Christian,

_mm256_lddqu_si256 has the same speed as _mm256_loadu_si256.

_mm256_loadu2_m128 is noticeably faster than loadu because it uses the insertion method, as I said before. It's ~28% faster when loading from a limited area in a loop (L1 cache).

New Contributor III

Here is something about lddqu and movdqu: https://software.intel.com/en-us/blogs/2012/04/16/history-of-one-cpu-instructions-part-1-lddqumovdqu...

The article says that Core 2 and later systems implement lddqu and movdqu similarly but does not clarify exactly how. Given that the original lddqu was not suitable for all memory types, I would guess that on recent architectures both lddqu and movdqu simply load 16 bytes (32 bytes in the case of ymm registers) of unaligned data. I have not seen confirmation of this, though.
Beginner

Thanks for all the information!

Vladimir,

As you said, with VC and the Intel compiler loadu2 won't give me a benefit, as loadu is already implemented with the insert.

andysem,

I read through the article. I agree: on modern CPUs both instructions should behave the same way, as they support SSSE3.
