We've learned that if the compiler emits an aligned SSE memory-move instruction for an unaligned address, it causes a SEGV. Will the same occur with AVX? Or, in the case of AVX, is the resulting behavior limited to degraded performance?
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
I think you answered the question yourself: aligned instructions require alignment!
In more detail: the Fine Manual gives the properties of each instruction, as does the online intrinsics guide, where you can see the properties of the AVX load instructions. You will observe that there are both aligned and unaligned loads, for instance:

__m256d _mm256_load_pd (double const * mem_addr)
Synopsis
#include "immintrin.h"
Instruction: vmovapd ymm, m256
CPUID Flags: AVX
Description
Load 256 bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

and

__m256d _mm256_loadu_pd (double const * mem_addr)
Synopsis
#include "immintrin.h"
Instruction: vmovupd ymm, m256
CPUID Flags: AVX
Description
Load 256 bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory. mem_addr does not need to be aligned on any particular boundary.
Aligned VEX-encoded loads and stores (e.g. vmovdqa) still require aligned memory operands. However, memory operands of other VEX-encoded instructions (e.g. vpaddd) need not be aligned. You will still pay a performance penalty for unaligned memory accesses, though. Refer to the Intel Software Developer's Manual for the descriptions of particular instructions.
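As a schematic illustration of the three cases (NASM-style syntax, not taken from the original post):

```
; Aligned VEX load: faults if [rsi] is not 32-byte aligned
vmovdqa ymm0, [rsi]

; Unaligned VEX load: always legal, possibly slower across cache lines
vmovdqu ymm0, [rsi]

; Memory operand folded into a VEX arithmetic instruction:
; no alignment requirement, unlike the legacy-SSE form (paddd xmm, m128)
vpaddd  ymm0, ymm0, [rsi]
```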
@Tim P. AFAIK, legacy SSE instructions (i.e. non-VEX-encoded) haven't changed and still require aligned memory operands where they previously did. Only the VEX-encoded equivalents have relaxed requirements.
I addressed this recently in a different forum topic, but I can't find the reference right now....
In the beginning, SSE supported unaligned 128-bit loads/stores only via the MOVUPS instruction. All 128-bit memory references that were input arguments to other instructions were required to be 128-bit aligned to avoid a protection fault. In the earliest SSE systems, MOVAPS was faster, so it was preferred when the data was known to be aligned. Later systems eliminated the performance penalty of MOVUPS in the case where the data was aligned, so the compiler switched to generating MOVUPS even in the cases where it knew the data was aligned.
AVX relaxed the alignment restrictions for input arguments for both 128-bit and 256-bit loads. BUT, every generation of processor had different performance penalties for executing these memory references without natural alignment.
From memory:

- Sandy Bridge
  - Loads
    - 2 loads per cycle (up to 128 bits) in the absence of bank conflicts or cache-line crossing, i.e., no penalty for unaligned loads that do not cross a cache-line boundary.
    - 128-bit loads that cross a cache-line boundary reduce the rate to 1 load every 2 cycles.
    - 256-bit loads take 2 cycles, but two can execute in parallel in the absence of bank conflicts or cache-line crossing.
    - 256-bit loads that cross a cache-line boundary reduce the rate to 1 load every 4 cycles.
    - Loads that cross a 4 KiB page boundary have a larger penalty, but at least part of that penalty can be overlapped with subsequent loads. The detailed mechanisms are not clear.
  - Stores
    - Big (?) penalty for any sized store that crosses a cache-line boundary.
    - Huge (>100 cycle) penalty for any store that crosses a 4 KiB page boundary.
  - Because there are only 2 address generation units, it is not possible to perform 2 loads and 1 store per cycle. 2 256-bit loads plus 1 256-bit store every 2 cycles is supported, but it is extremely difficult to avoid bank conflicts in this case.
- Ivy Bridge
  - I think there were reductions in the penalties for cache-line and page crossing, but I don't recall that I ever measured them in detail.
- Haswell
  - Loads
    - 2 loads per cycle (up to 256 bits) for any alignment in the absence of cache-line crossing.
    - 1 load per cycle for any sized load that crosses a cache-line boundary.
    - Loads that cross a 4 KiB page boundary have a larger penalty, but at least part of that penalty can be overlapped with subsequent loads. The detailed mechanisms are not clear.
  - Stores
    - One store per cycle (any size or alignment) as long as it does not cross a cache-line boundary.
    - I think that the penalties for cache-line crossing and 4 KiB-page crossing are much smaller than on SNB, but I don't have the numbers handy.
  - A 3rd address generation unit was added to allow 2 loads plus 1 store per cycle.
- Skylake Xeon
  - I have not tested this yet, but it certainly supports 2 512-bit aligned loads per cycle, or 1 512-bit aligned load plus any other load that does not cross a cache-line boundary. This could be built on the same physical interface that Haswell uses: dual read ports, 512-bit port width.
  - Skylake Xeon does not appear to be able to support 2 512-bit loads plus 1 512-bit store per cycle, but the reported performance is slightly higher than 2 512-bit loads per cycle. I have not checked whether this inability to fully overlap also applies to 128-bit and/or 256-bit 2-load-plus-1-store combinations.