Hello,
for loading data there are always two intrinsics, load and loadu. load accepts only aligned addresses, while loadu works for both aligned and unaligned addresses.
But what about performance? According to the Intel Intrinsics Guide, the latency and throughput of both instructions are the same. What happens if loadu is executed on an aligned address? Do I get the same performance as with load, or is loadu slower regardless of the actual alignment of the given address?
Thanks for any hints!
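To make the two cases in the question concrete, here is a minimal sketch in C using the SSE single-precision intrinsics _mm_load_ps and _mm_loadu_ps. The function names (hsum4, sum_load, sum_loadu, demo_load_vs_loadu) and the buffer size are made up for illustration. It only shows which combinations of pointer and intrinsic are valid; it says nothing about speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics: _mm_load_ps, _mm_loadu_ps */

/* Horizontal sum of the four lanes of an SSE register. */
static float hsum4(__m128 v) {
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Sums n floats with aligned loads.
   p must be 16-byte aligned and n a multiple of 4. */
float sum_load(const float *p, size_t n) {
    assert(((uintptr_t)p & 15u) == 0); /* _mm_load_ps requires alignment */
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i));
    return hsum4(acc);
}

/* Same loop with unaligned loads; p may have any alignment. */
float sum_loadu(const float *p, size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    return hsum4(acc);
}

/* Hypothetical usage: returns 1 if all three sums are as expected. */
int demo_load_vs_loadu(void) {
    _Alignas(16) float buf[20];
    for (int i = 0; i < 20; ++i)
        buf[i] = (float)i;
    float a = sum_load(buf, 16);      /* aligned data, aligned load   */
    float b = sum_loadu(buf, 16);     /* aligned data, unaligned load */
    float c = sum_loadu(buf + 1, 16); /* misaligned data, loadu only  */
    return a == 120.0f && b == 120.0f && c == 136.0f;
}
```

Calling sum_load(buf + 1, 16) would violate the alignment requirement, which is why the sketch guards it with an assert.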
3 Replies
On recent Intel CPU models (those which support SSE4), an unaligned load is supposed to be as fast as an aligned one.
Beginning with the Intel 12.0 compilers, the aligned instructions are no longer used for SSE4 or AVX targets, even when generating code which requires alignment.
For AVX code where alignment is expected, the compilers use AVX-256 vmovups/vmovupd, while the moves are split into pairs of AVX-128 loads when alignment is unknown. On Ivy Bridge (Core i7-3xxx) CPUs, an AVX-256 unaligned load should be faster than a pair of AVX-128 loads regardless of alignment, but I haven't seen a compiler make that distinction.
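The two code shapes described above (one 256-bit unaligned load versus a pair of 128-bit loads) can be written by hand with intrinsics. The following is a rough sketch, not what any particular compiler emits. The function names are hypothetical, and the target("avx") attribute and __builtin_cpu_supports are GCC/Clang extensions that I assume here so the file builds without a global AVX flag.

```c
#include <immintrin.h> /* AVX intrinsics */
#include <string.h>

/* Assumption: GCC or Clang. Enables AVX for the marked functions only. */
#define AVX_FN __attribute__((target("avx")))

/* One 256-bit unaligned load (vmovups ymm, [mem]). */
AVX_FN void copy8_one_256(const float *src, float *dst) {
    _mm256_storeu_ps(dst, _mm256_loadu_ps(src));
}

/* Two 128-bit unaligned loads combined into one 256-bit register. */
AVX_FN void copy8_two_128(const float *src, float *dst) {
    __m128 lo = _mm_loadu_ps(src);
    __m128 hi = _mm_loadu_ps(src + 4);
    __m256 v = _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    _mm256_storeu_ps(dst, v);
}

/* Hypothetical usage: returns 1 if both variants load the same data.
   Also returns 1 without comparing anything when the CPU lacks AVX. */
int demo_avx_load_shapes(void) {
    if (!__builtin_cpu_supports("avx"))
        return 1;
    float src[9], a[8], b[8];
    for (int i = 0; i < 9; ++i)
        src[i] = (float)i;
    copy8_one_256(src + 1, a); /* src + 1 is not 32-byte aligned */
    copy8_two_128(src + 1, b);
    return memcmp(a, b, sizeof a) == 0 && a[0] == 1.0f && a[7] == 8.0f;
}
```

Both variants produce the same result; which one is faster depends on the microarchitecture, as noted above.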
@Tim
When coding in inline assembly, can I issue an instruction such as "add xmm0,[eax+offset]", where the eax register holds a pointer to an aligned SoA (structure of arrays)?
I have a Core i3 CPU.
Thanks in advance.
@Tim:
Thank you for your explanation!
I ran some tests and found that aligned and unaligned loads reach nearly the same speed.
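The post does not say how the tests were done. One possible way to run such a comparison is sketched below in C, under the assumption that a simple clock()-based loop over an L1-sized buffer is good enough for a first impression. All names and sizes (BENCH_N, BENCH_REPS, pass_load, pass_loadu, time_passes, demo_timing) are hypothetical, and the timings it prints will vary by CPU, compiler, and optimization level.

```c
#include <stddef.h>
#include <stdio.h>
#include <time.h>
#include <xmmintrin.h> /* SSE intrinsics */

enum { BENCH_N = 4096, BENCH_REPS = 2000 }; /* arbitrary sizes */

/* One pass over BENCH_N floats using aligned loads. */
static float pass_load(const float *p) {
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < BENCH_N; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i));
    float l[4];
    _mm_storeu_ps(l, acc);
    return l[0] + l[1] + l[2] + l[3];
}

/* The same pass using unaligned loads. */
static float pass_loadu(const float *p) {
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < BENCH_N; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    float l[4];
    _mm_storeu_ps(l, acc);
    return l[0] + l[1] + l[2] + l[3];
}

typedef float (*pass_fn)(const float *);

/* Runs BENCH_REPS passes, stores a checksum in *sum, returns seconds. */
static double time_passes(pass_fn fn, const float *p, double *sum) {
    const float *volatile vp = p; /* reread each pass so work is not hoisted */
    double total = 0.0;
    clock_t t0 = clock();
    for (int r = 0; r < BENCH_REPS; ++r)
        total += fn(vp);
    clock_t t1 = clock();
    *sum = total;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Prints three timings; returns 1 if all checksums agree. */
int demo_timing(void) {
    static _Alignas(16) float buf[BENCH_N + 4];
    for (size_t i = 0; i < BENCH_N + 4; ++i)
        buf[i] = 1.0f;
    double s0, s1, s2;
    double t0 = time_passes(pass_load, buf, &s0);      /* load, aligned     */
    double t1 = time_passes(pass_loadu, buf, &s1);     /* loadu, aligned    */
    double t2 = time_passes(pass_loadu, buf + 1, &s2); /* loadu, misaligned */
    printf("load/aligned:     %.6f s\n", t0);
    printf("loadu/aligned:    %.6f s\n", t1);
    printf("loadu/misaligned: %.6f s\n", t2);
    return s0 == s1 && s1 == s2;
}
```

For more reliable numbers, one would repeat each measurement several times, use a higher-resolution timer, and also test buffers where misaligned loads cross cache-line or page boundaries.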
