What are the performance implications of using the vmovups and vmovapd instructions?

Aketh_T_1 · 06-20-2018

Hi all,

I see two instructions that perform virtually the same operation, vmovups and vmovapd, according to the Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3,3158,3083&techs=AVX&cats=Load), except for the memory alignment they expect.

However, I am very interested in understanding the performance implications of using one versus the other.

The Intel Software Developer's Manual doesn't give much information on this (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf).

Basically, it only states:

"Software may see performance penalties when unaligned accesses cross cacheline boundaries, so reasonable attempts to align commonly used data sets should continue to be pursued."
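To make the quoted condition concrete, here is a minimal C sketch that tests whether a given access would cross a cache-line boundary. It assumes 64-byte cache lines (the usual size on current mainstream x86 processors, but an assumption here, not something the manual excerpt states), and the helper names `crosses_cache_line` and `is_aligned` are hypothetical, made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Adjust if the target differs. */
#define CACHE_LINE_BYTES 64u

/* Returns 1 if an access of `size` bytes starting at `addr`
 * touches two different cache lines, 0 otherwise. */
static int crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE_BYTES);
    uintptr_t first_line = addr / CACHE_LINE_BYTES;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_BYTES;
    return first_line != last_line;
}

/* Returns 1 if `addr` is a multiple of `alignment`.
 * `alignment` must be a power of two. */
static int is_aligned(uintptr_t addr, size_t alignment)
{
    return (addr & (alignment - 1)) == 0;
}
```

Under the 64-byte assumption, a 32-byte (256-bit) access that is 32-byte aligned can never cross a line, because 64 is a multiple of 32. An unaligned 32-byte access crosses a line whenever its offset within the line is greater than 32.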

Is there a resource someone could point me to that covers this topic in more detail?

Thanks,

Aketh

McCalpinJohn · 06-21-2018

In older processors, there was a significant performance penalty for using the unaligned versions, even when the data was actually aligned. These penalties disappeared several generations ago, and the penalties for unaligned accesses have been reduced in each processor generation.

The best place to find historical data on this is in Agner Fog's excellent "instruction tables" document, available at http://www.agner.org/optimize/instruction_tables.pdf

As an example from that reference, looking at the MOVAPS and MOVUPS instructions for 128-bit loads from memory, the tables show how the penalty for using the MOVUPS instruction on aligned addresses disappeared over successive generations:

Pentium III: MOVAPS 1 instruction every 2 cycles, MOVUPS 1 instruction every 4 cycles
Pentium M: MOVAPS 1 instruction every 2 cycles, MOVUPS 1 instruction every 2 cycles
Merom/Wolfdale (Core 2): MOVAPS 1 instruction every 1 cycle, MOVUPS 1 instruction every 2 cycles
Nehalem/Westmere: 1 instruction per cycle for either instruction
Sandy Bridge and newer: 2 instructions per cycle for either instruction

In a few of the cases above, the throughput is the same, but there is a latency difference between the two versions -- this is sufficiently subtle that it can be hard to measure.
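As a rough illustration of what the throughput figures above imply, the C sketch below converts "cycles per instruction" into peak load bytes per cycle for the 128-bit (16-byte) loads in the list. The function name `load_bytes_per_cycle` is hypothetical, and the result is only an upper bound from the instruction throughput alone: it says nothing about latency, cache misses, or line-crossing penalties.

```c
#include <assert.h>

/* The list above is for 128-bit loads, i.e. 16 bytes per instruction. */
#define LOAD_BYTES 16.0

/* Peak bytes loaded per cycle, given the reciprocal throughput
 * (cycles per instruction) from the list above. For example,
 * "2 instructions per cycle" is 0.5 cycles per instruction. */
static double load_bytes_per_cycle(double cycles_per_instruction)
{
    assert(cycles_per_instruction > 0.0);
    return LOAD_BYTES / cycles_per_instruction;
}
```

With the numbers listed above, MOVUPS on Pentium III (one instruction every 4 cycles) gives 4 bytes per cycle at best, while either instruction on Sandy Bridge and newer (0.5 cycles per instruction) gives 32 bytes per cycle.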