I see two instructions that perform virtually the same operation, vmovups and vmovapd, according to the Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3,3158,3083&techs=AVX&cats=Load). The only difference that matters to me here is what they expect about memory alignment.
I am very interested in understanding the performance implications of using one of them versus the other.
The Intel Software Developer's Manual (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf) doesn't give much information about this.
Basically, it only states:
"Software may see performance penalties when unaligned accesses cross cacheline boundaries, so reasonable attempts to align commonly used data sets should continue to be pursued."
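To make the quoted caveat concrete, here is a minimal sketch in C of how one might check whether a given access crosses a cache line boundary. The helper names `crosses_cache_line` and `is_aligned_32` are hypothetical, and the 64-byte line size passed in by the caller is an assumption that holds for many recent x86 processors but should be verified for the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if an access of `size` bytes starting at
   `addr` touches two different cache lines, 0 otherwise.
   `line_size` must be a power of two (64 is an assumed typical value). */
static int crosses_cache_line(uintptr_t addr, size_t size, size_t line_size)
{
    assert(size > 0 && line_size > 0 && (line_size & (line_size - 1)) == 0);
    uintptr_t first_line = addr / line_size;              /* line of first byte */
    uintptr_t last_line  = (addr + size - 1) / line_size; /* line of last byte  */
    return first_line != last_line;
}

/* Hypothetical helper: returns 1 if `addr` is 32-byte aligned, which is the
   alignment the aligned 256-bit AVX moves expect. */
static int is_aligned_32(uintptr_t addr)
{
    return (addr & 31u) == 0;
}
```

With a 64-byte line, a 32-byte access that starts on a 32-byte boundary never straddles two lines, while an unaligned one may or may not, depending on its offset within the line.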
Could someone point me to a resource with more detailed information on this topic?
In older processors, there was a significant performance penalty for using the unaligned versions, even when the data was actually aligned. These penalties disappeared several generations ago, and the penalties for unaligned accesses have been reduced in each processor generation.
The best place to find historical data on this is in Agner Fog's excellent "instruction tables" document, available at http://www.agner.org/optimize/instruction_tables.pdf
As an example from that reference, looking at the MOVAPS and MOVUPS instructions for 128-bit loads from memory, the tables show how the penalty for using the MOVUPS instruction on aligned addresses disappeared over successive generations:
- Pentium III: MOVAPS 1 instruction every 2 cycles, MOVUPS 1 instruction every 4 cycles
- Pentium M: MOVAPS 1 instruction every 2 cycles, MOVUPS 1 instruction every 2 cycles
- Merom/Wolfdale (Core 2): MOVAPS 1 instruction every 1 cycle, MOVUPS 1 instruction every 2 cycles
- Nehalem/Westmere: 1 instruction per cycle for either instruction
- Sandy Bridge and newer: 2 instructions per cycle for either instruction
In a few of the cases above, the throughput is the same, but there is a latency difference between the two versions -- this is sufficiently subtle that it can be hard to measure.
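As a rough illustration, the throughput figures listed above can be converted into peak load bandwidth in bytes per cycle. This is only arithmetic on the numbers already quoted, assuming 16 bytes per 128-bit load and ignoring every other bottleneck; the function name `bytes_per_cycle` is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: peak load bandwidth in bytes per cycle, given the
   bytes moved per instruction and a throughput of `instrs` instructions
   every `cycles` cycles. */
static double bytes_per_cycle(int bytes_per_instr, int instrs, int cycles)
{
    assert(cycles > 0);
    return (double)bytes_per_instr * instrs / cycles;
}
```

For example, `bytes_per_cycle(16, 1, 4)` gives 4 bytes per cycle for MOVUPS on the Pentium III, against 8 for MOVAPS via `bytes_per_cycle(16, 1, 2)`, while `bytes_per_cycle(16, 2, 1)` gives 32 bytes per cycle for either instruction on Sandy Bridge and newer.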