Intel® ISA Extensions

What are the performance implications of using the vmovups and vmovapd instructions?


Hi all,

I see two instructions that perform virtually the same operation, vmovups and vmovapd, according to the Intel Intrinsics Guide (,3158,3083&techs=AVX&cats=Load), differing only in what they expect about memory alignment.

However, I am very interested in understanding the performance implications of using one versus the other.

The Intel developer's guide doesn't give much information about this. Basically, it only states:

"Software may see performance penalties when unaligned accesses cross cacheline boundaries, so reasonable attempts to align commonly used data sets should continue to be pursued."

Is there a resource someone could point me to that covers this topic in detail?



1 Reply

In older processors, there was a significant performance penalty for using the unaligned versions of these instructions, even when the data was actually aligned. Those penalties disappeared several generations ago, and the penalties for genuinely unaligned accesses have been reduced in each processor generation.

The best place to find historical data on this is Agner Fog's excellent "instruction tables" document.

As an example from that reference, looking at the MOVAPS and MOVUPS instructions for 128-bit loads from memory, the tables show how the penalty for using the MOVUPS instruction on aligned addresses disappeared over time:

  • Pentium III: MOVAPS 1 instruction every 2 cycles, MOVUPS 1 instruction every 4 cycles
  • Pentium M: MOVAPS 1 instruction every 2 cycles, MOVUPS 1 instruction every 2 cycles
  • Merom/Wolfdale (Core 2): MOVAPS 1 instruction every 1 cycle, MOVUPS 1 instruction every 2 cycles
  • Nehalem/Westmere: 1 instruction per cycle for either instruction
  • Sandy Bridge and newer: 2 instructions per cycle for either instruction

In a few of the cases above, the throughput is the same, but there is a latency difference between the two versions. That difference is subtle enough that it can be hard to measure.
