Hello Forum –
I am learning SIMD intrinsics, and one of the requirements for the load and store instructions is that the data has to be 16-byte aligned. My question is: if the data bus is 64 bits wide on the x86_64 platform, shouldn't 8-byte alignment (the natural alignment) be enough?
In summary, why must the data be aligned to 16 bytes for SSE instructions rather than 8 bytes?
Most SSE instructions that include 128-bit memory references will generate a "general protection fault" if the address is not 16-byte-aligned.
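To make the distinction concrete, here is a minimal C sketch using the SSE intrinsics from xmmintrin.h. The helper names (is_aligned_16, load4_aligned, load4_any, hsum) are made up for illustration. _mm_load_ps is the intrinsic that maps to the alignment-requiring MOVAPS form, and _mm_loadu_ps maps to MOVUPS, which accepts any address.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_loadu_ps, _mm_storeu_ps */

/* Hypothetical helper: returns 1 if p is 16-byte aligned, 0 otherwise. */
static int is_aligned_16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Aligned load (MOVAPS form). The caller must pass a 16-byte-aligned
   pointer; the assert catches violations in debug builds before the
   hardware would raise a fault. */
static __m128 load4_aligned(const float *p) {
    assert(is_aligned_16(p));
    return _mm_load_ps(p);
}

/* Unaligned load (MOVUPS form). Works for any address. */
static __m128 load4_any(const float *p) {
    return _mm_loadu_ps(p);
}

/* Hypothetical helper: adds the four lanes of v. */
static float hsum(__m128 v) {
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

With a buffer declared as _Alignas(16) float buf[8], load4_aligned(buf) is fine, while buf + 1 is only 4-byte aligned and has to go through load4_any.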
With AVX, most instructions that reference memory no longer require special alignment, but performance is reduced by varying degrees depending on the instruction type and processor generation. Aligning memory references to the size of the memory operand is still preferred, with highest priority going to aligning stores, and secondary priority to aligning loads.
Non-temporal (streaming) stores require alignment to match the operand size in all of the instruction sets (SSE, AVX, AVX2, AVX-512).
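As a rough illustration of the streaming-store requirement, the sketch below fills a buffer with the SSE non-temporal store intrinsic _mm_stream_ps, which needs a 16-byte-aligned destination. The function name fill_stream and its error convention are assumptions made for this example, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: writes `value` to dst[0..n) with non-temporal
   stores. Returns 0 on success, or -1 (without writing anything) if
   dst is not 16-byte aligned. n must be a multiple of 4 floats. */
static int fill_stream(float *dst, size_t n, float value) {
    assert(n % 4 == 0);
    if (((uintptr_t)dst & 15u) != 0) {
        return -1;  /* a misaligned streaming store would fault */
    }
    const __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n; i += 4) {
        _mm_stream_ps(dst + i, v);
    }
    _mm_sfence();  /* order the streaming stores before later stores */
    return 0;
}
```

The same pattern applies to the wider instruction sets, with the alignment check scaled to the operand size (32 bytes for 256-bit stores, 64 bytes for 512-bit stores).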
John, thanks for the reply. Basically, I am trying to understand why 16 bytes. Is it because the bus from the execution unit to the L1 cache is 128 bits wide, so the natural alignment for it is 16 bytes?
Either the interface width or the L1 Data Cache banking of the first systems to support SSE probably had a "natural" 128-bit (16-Byte) size or alignment. An implementation that did not support misaligned accesses would be both simpler and faster (lower latency). The unaligned cases were limited to explicit load instructions (e.g., MOVUPS), which (initially) paid a performance penalty relative to the aligned load instructions (e.g., MOVAPS).
Later generations relaxed the alignment restrictions, partially because they could throw more transistors at the problem of dealing with the misaligned cases. One of the first things that happened is that MOVUPS became as fast as MOVAPS in the case where the data was actually aligned, so the MOVAPS instruction was no longer generated by the compilers.
With the Sandy Bridge core (supporting AVX), the L1 Data Cache had 8 banks, each 8 Bytes wide, so a 128-bit load required accessing 2 banks (for the aligned case) or 3 banks (for the unaligned case). Two loads per cycle could be sustained if they accessed different banks. There were modest penalties for 128-bit loads that crossed cache line boundaries and fairly large penalties for 32-Byte loads that crossed cache line boundaries.
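The bank arithmetic above can be sketched in a few lines of C. This is a simplified model that assumes only what is stated here (8-byte-wide banks, consecutive 8-byte chunks mapping to consecutive banks, and an access that stays within one cache line); the function name banks_touched is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BANK_WIDTH 8u  /* bytes per bank, as described for Sandy Bridge */

/* Hypothetical helper: number of 8-byte banks covered by an access of
   `size` bytes starting at byte address `addr`, under the simplified
   model described in the lead-in. */
static unsigned banks_touched(uintptr_t addr, size_t size) {
    assert(size > 0);
    uintptr_t first = addr / BANK_WIDTH;
    uintptr_t last = (addr + size - 1) / BANK_WIDTH;
    return (unsigned)(last - first + 1);
}
```

Under this model a 16-byte load at offset 0 or 16 covers 2 banks, while the same load at offset 4 covers 3, which matches the aligned and unaligned counts given above.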
Haswell switched to a dual-port L1 Data Cache with 512-bit (64-Byte) interfaces, so it can support two unaligned loads of any size or alignment, as long as neither crosses a cache line boundary. When a load crosses a cache line boundary, both ports must be used to get the upper and lower parts of the requested data, and the throughput is limited to one load per cycle.
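Whether a given load falls into the slow case can be checked with simple address arithmetic. The sketch below assumes the 64-byte cache line size mentioned above; the function name crosses_cache_line is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* bytes per cache line, as stated above */

/* Hypothetical helper: returns 1 if an access of `size` bytes starting
   at byte address `addr` spans two cache lines, 0 if it fits in one. */
static int crosses_cache_line(uintptr_t addr, size_t size) {
    assert(size > 0 && size <= CACHE_LINE);
    return (addr % CACHE_LINE) + size > CACHE_LINE;
}
```

For example, a 16-byte load at offset 48 ends exactly at the line boundary and does not cross it, while the same load at offset 49 does. A 32-byte load crosses whenever its offset within the line is greater than 32.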