Intel® ISA Extensions

movups and movupd, movaps and movapd

zhangxiuxia
Beginner
3,385 Views
According to the Intel instruction manual, movups is:
MOVUPS: Move Unaligned Packed Single-Precision Floating-Point

But it is also used to move unaligned packed double-precision floating-point values:

movups (%rdi,%r13,8), %xmm1 #1.5
movups (%rsi,%r13,8), %xmm0 #1.5
mulpd %xmm0, %xmm1 #8.18
addpd (%rdx,%r13,8), %xmm1 #8.18
movaps %xmm1, (%rdx,%r13,8) #1.5





Are movups and movupd, and likewise movaps and movapd, interchangeable?

After all, both instructions in each pair move the same 128 bits between memory and a register, or between two registers.
4 Replies
styc
Beginner
They are, but you save a byte with mov[au]ps.
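To illustrate the byte in question: the pd forms carry an extra 0x66 prefix in their machine encoding. The table below is a minimal sketch for one simple operand pair, loading from (%rdi) into %xmm0. The names `struct encoding`, `ENCODINGS` and `encoded_length` are made up for this illustration. Other operands add further bytes (REX, SIB, displacement) to both forms equally, so the one-byte difference should remain.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One encoded instruction: load 16 bytes from (%rdi) into %xmm0.
   The final byte 0x07 is the ModRM byte; it is identical in every row. */
struct encoding {
    const char   *mnemonic;
    unsigned char bytes[4];
    size_t        length;   /* number of valid bytes in `bytes` */
};

/* Illustrative table, not an exhaustive list of move encodings. */
static const struct encoding ENCODINGS[] = {
    { "movups", { 0x0F, 0x10, 0x07 },       3 },
    { "movupd", { 0x66, 0x0F, 0x10, 0x07 }, 4 },  /* 0x66 prefix selects the pd form */
    { "movaps", { 0x0F, 0x28, 0x07 },       3 },
    { "movapd", { 0x66, 0x0F, 0x28, 0x07 }, 4 },  /* same prefix, same extra byte */
};

/* Returns the encoded length in bytes, or 0 for an unknown mnemonic. */
static size_t encoded_length(const char *mnemonic)
{
    assert(mnemonic != NULL);
    for (size_t i = 0; i < sizeof ENCODINGS / sizeof ENCODINGS[0]; ++i) {
        if (strcmp(ENCODINGS[i].mnemonic, mnemonic) == 0)
            return ENCODINGS[i].length;
    }
    return 0;
}
```

Calling `encoded_length("movupd") - encoded_length("movups")` gives the size difference between the two forms for this operand pair.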
zhangxiuxia
Beginner
3,385 Views
I am sorry, I did not understand "save a byte with mov[au]ps".

Do you mean instruction size?
sirrida
Beginner
You can also use movdqu / movdqa.
For optimum speed you should use the instructions that match the data type, i.e. movups/movaps for single-precision floats, movupd/movapd for doubles and movdqu/movdqa for integers.
There are also some other instructions that do essentially the same job but behave differently with respect to cache usage, namely the non-temporal moves: movntps, movntpd, movntdq/movntdqa.
More exotic things such as pshufd xmmreg,xmmreg/mem,0xe4 (Intel notation) also do the job.
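As a rough sketch of the "same bits, different type" point, the three unaligned load/store intrinsic pairs from SSE/SSE2 can each copy the same 16 bytes. The helper names (`copy16_ps`, `copy16_pd`, `copy16_si128`, `variants_agree`) are invented for this example. Note that a compiler is free to emit a different move instruction than the intrinsic name suggests, as the listing in the question shows.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics (also provides the SSE ones) */
#include <string.h>

/* Copy 16 bytes with the unaligned load/store pair for each element type. */
static void copy16_ps(void *dst, const void *src)      /* single-precision view */
{
    _mm_storeu_ps((float *)dst, _mm_loadu_ps((const float *)src));
}

static void copy16_pd(void *dst, const void *src)      /* double-precision view */
{
    _mm_storeu_pd((double *)dst, _mm_loadu_pd((const double *)src));
}

static void copy16_si128(void *dst, const void *src)   /* integer view */
{
    _mm_storeu_si128((__m128i *)dst, _mm_loadu_si128((const __m128i *)src));
}

/* Returns 1 if all three variants copy the same 16 bytes unchanged. */
static int variants_agree(const void *src)
{
    _Alignas(16) unsigned char a[16], b[16], c[16];
    assert(src != NULL);
    copy16_ps(a, src);
    copy16_pd(b, src);
    copy16_si128(c, src);
    return memcmp(a, src, 16) == 0
        && memcmp(b, src, 16) == 0
        && memcmp(c, src, 16) == 0;
}
```

The result is identical in all three cases; the choice of type only affects which instruction the compiler is nudged toward, which is where the speed advice above comes in.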
Max_L
Employee

Right, using MOVUPS for any floating-point type, double or single (and for AES instructions too, by the way), is OK and recommended. MOVDQU should be used with integer types. MOVUPS is as fast as MOVAPS for _aligned_ data starting with Nehalem (aka Core i7 / Xeon 5500 etc.).

In AVX there is an interesting and important paradigm change, however: LD+OP instructions (an operation with a memory source operand) no longer generate alignment exceptions.

That is, in SSE:
ADDPS xmm0, [rsp+10] <=> MOVAPS xmm1, [rsp+10]; ADDPS xmm0, xmm1;
while in AVX:
VADDPS xmm0, xmm0, [rsp+10] <=> VMOVUPS xmm1, [rsp+10]; VADDPS xmm0, xmm0, xmm1;

So in AVX, to keep exception behavior uniform (more precisely, exception-free) and independent of the compiler's code generation, it is strongly recommended to avoid the VMOVAPS/VMOVDQA instructions and the _mm[256]_load_xx() intrinsics, and to always use the VMOVUPS/VMOVDQU instructions and the _mm[256]_loadu_xx() intrinsics instead. This is neutral for performance and will never surprise you (or your customer) with an exception (crash) if the data passed to the instructions occasionally happens to be misaligned.

Having said that, for the best performance results, please keep aligning your data.
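A minimal sketch of that advice, using the 128-bit SSE2 intrinsics so that it builds without AVX compiler flags (the 256-bit `_mm256_loadu_pd` / `_mm256_storeu_pd` forms follow the same pattern). It computes c[i] += a[i] * b[i], which is what the listing in the question appears to do. The function name `muladd_unaligned` and the even-length restriction are assumptions made for this example.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* c[i] += a[i] * b[i], two doubles per step.
   Only unaligned loads and stores are used, so the pointers need not be
   16-byte aligned. Assumption for this sketch: n is even. */
static void muladd_unaligned(const double *a, const double *b, double *c, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        __m128d vc = _mm_loadu_pd(c + i);
        _mm_storeu_pd(c + i, _mm_add_pd(_mm_mul_pd(va, vb), vc));
    }
}
```

Passing pointers that are offset by one double from a 16-byte boundary, for example `muladd_unaligned(a + 1, b + 1, c + 1, 4)`, works the same as passing aligned ones. With `_mm_load_pd` / `_mm_store_pd` the same call could fault.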

-Max
