load and loadu - alignment

Christian_M_2 · ‎01-08-2013

Hello,

For data fetching, there are always load and loadu intrinsics. load accepts only aligned addresses, while loadu works in both cases.
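
A minimal sketch of that difference, assuming an x86 target with SSE and the 128-bit float intrinsics `_mm_load_ps` and `_mm_loadu_ps`. The helper names `sum_aligned` and `sum_unaligned` are made up for illustration and are not from the thread. The aligned variant checks the 16-byte alignment itself, because passing a misaligned address to `_mm_load_ps` may fault.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_loadu_ps */

/* Hypothetical helper: add up the four lanes of a vector. */
static float horizontal_sum(__m128 v)
{
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Hypothetical helper: sum n floats (n must be a multiple of 4) using
   aligned loads. p must be 16-byte aligned, which is asserted here. */
static float sum_aligned(const float *p, size_t n)
{
    assert(((uintptr_t)p % 16) == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i));
    return horizontal_sum(acc);
}

/* Same loop with unaligned loads: p may have any alignment. */
static float sum_unaligned(const float *p, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    return horizontal_sum(acc);
}
```

With a 16-byte-aligned buffer `buf`, both functions can be called on `buf`, but only `sum_unaligned` is safe to call on `buf + 1`.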

But what about performance? According to the Intel Intrinsics Guide, the latency and throughput of both instructions are the same. What happens if loadu is executed on aligned addresses? Do I get the same performance as with load? Or is loadu slower regardless of the actual alignment of the given address?

Thanks for any hints!

TimP · ‎01-09-2013

On recent Intel CPU models (those which support SSE4), unaligned load is supposed to be as fast as aligned. Beginning with Intel 12.0 compilers, the aligned instructions aren't used for SSE4 or AVX, even when generating code which requires alignment. For AVX code, where alignment is expected, the compilers use AVX-256 movups/movupd, while the moves are split into AVX-128 pairs when alignment is unknown. On the Ivy Bridge corei7-3 CPUs, AVX-256 unaligned load should be faster than pairs of AVX-128 loads, regardless of alignment, but I haven't seen a compiler make that distinction.

Bernard · ‎01-09-2013

@Tim: When coding in inline assembly, can I issue this instruction: "add xmm0,[eax+offset]", where the eax register holds a pointer to an aligned SoA? I have a Core i3 CPU. Thanks in advance.

Christian_M_2 · ‎01-11-2013

@Tim: Thank you for your explanation! I did some tests and found that aligned and unaligned loads reach nearly the same speed.
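
The test code itself was not posted, so the following is only a rough sketch of how such a comparison might be timed, assuming SSE float loads and the coarse `clock()` timer from the C standard library. The names `time_loads` and `sink` are hypothetical. An optimizing compiler is free to pick the load instruction itself, so the numbers depend heavily on the compiler, its flags, and the CPU.

```c
#include <stddef.h>
#include <time.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_loadu_ps */

/* Written once per repetition so the compiler cannot drop the loop. */
static volatile float sink;

/* Hypothetical harness: sum n floats (n a multiple of 4) reps times and
   return the elapsed CPU time in seconds. If aligned is nonzero, p must
   be 16-byte aligned because _mm_load_ps is used. */
static double time_loads(const float *p, size_t n, int reps, int aligned)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r) {
        __m128 acc = _mm_setzero_ps();
        for (size_t i = 0; i < n; i += 4) {
            __m128 v = aligned ? _mm_load_ps(p + i) : _mm_loadu_ps(p + i);
            acc = _mm_add_ps(acc, v);
        }
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        sink = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling `time_loads` on the same aligned buffer with `aligned` set to 1 and then 0 compares load against loadu on aligned data. Passing an offset pointer with `aligned` set to 0 measures the truly unaligned case.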