I'm new to SSEx and want to perform integer calculations with SSE. All the examples I found on the net use floating-point arithmetic. The operations themselves are not a problem, for example PSUBD for 4 x DWORD. But I don't know how to load and store four DWORDs to and from an XMM register. For floats, MOVAPS XMMn, m128 is usually used for loading. Does this instruction load integer values correctly? I'm confused because the Intel documentation writes the name of the instruction as
Yes, I know these instructions: they load and store quadwords. For example, MOVDQA XMM0, MemAddr will load MemAddr to MemAddr+31 into the second dword, i.e. bits 32-63 of the XMM register; MemAddr+32 to MemAddr+63 into bits 0-31; MemAddr+64 to MemAddr+95 into bits 96-127; and MemAddr+96 to MemAddr+127 into bits 64-95 of the register.
In other words, DWORDs 0 and 1 will be interchanged, and DWORDs 2 and 3 too. Now I understand that this does not matter, because the store will put the data back in the right order.
Example:
    MOVDQA XMM0, Mem1   ; interchanges
    MOVDQA XMM1, Mem2   ; interchanges in the same way
    PADDD  XMM0, XMM1   ; adds the right pairs, i.e. equally interchanged
    MOVDQA Mem1, XMM0   ; interchanges back on store, therefore correct
But a shorter variant does not work correctly:
    MOVDQA XMM0, Mem1   ; interchanges
    PADDD  XMM0, Mem2   ; adds the wrong pairs, because one operand is interchanged and the other is not
    MOVDQA Mem1, XMM0   ; this time stores an invalid result
You're wrong. From the hardware perspective, numbers in memory are always in little-endian byte order, at least on x86 / x64 architectures. You can, of course, store them in big-endian order yourself, but that's a different story...
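To make the little-endian point concrete, here is a minimal C sketch. The helper name dword_bytes is hypothetical and exists only for illustration; it copies a 32-bit value into a byte buffer so the in-memory byte order can be inspected. It assumes a little-endian host such as x86 / x64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the in-memory bytes of a DWORD into out[0..3],
   lowest address first. */
void dword_bytes(uint32_t value, unsigned char out[4])
{
    assert(out != NULL);
    memcpy(out, &value, sizeof value);  /* byte-for-byte copy, no reordering */
}
```

On x86, calling dword_bytes(0x11223344u, buf) should leave 0x44 in buf[0] and 0x11 in buf[3]: the least significant byte sits at the lowest address. The same rule extends to a 128-bit load, which is why element 0 of an array ends up in the lowest bits of the register.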
There's absolutely no shuffling in MOVDQA / MOVDQU. In addition, these instructions don't move quadwords but double quadwords, "DQ" (128 bits, the whole XMM register). But that's irrelevant. Thanks to little-endian ordering you can use them for all data types (BYTE, WORD, ..., DQWORD) without any trouble.
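A quick way to check this yourself is with SSE2 intrinsics in C, sketched below. The function names add4_dwords and lowest_lane are hypothetical; the intrinsics are the standard ones from emmintrin.h, and a compiler will typically (though not necessarily) turn them into MOVDQA, PADDD and MOVD. All pointers are assumed to be 16-byte aligned, as MOVDQA requires.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: dst[i] = a[i] + b[i] for i = 0..3.
   All three pointers must be 16-byte aligned. */
void add4_dwords(int32_t *dst, const int32_t *a, const int32_t *b)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)a & 15) == 0);
    assert(((uintptr_t)b & 15) == 0);

    __m128i x = _mm_load_si128((const __m128i *)a);  /* aligned 128-bit load */
    __m128i y = _mm_load_si128((const __m128i *)b);  /* may be folded into a PADDD memory operand */
    _mm_store_si128((__m128i *)dst, _mm_add_epi32(x, y));  /* 4 x 32-bit add, aligned store */
}

/* Hypothetical helper: return the DWORD that lands in bits 0-31
   of the register after an aligned 128-bit load from src. */
int32_t lowest_lane(const int32_t *src)
{
    assert(((uintptr_t)src & 15) == 0);
    return _mm_cvtsi128_si32(_mm_load_si128((const __m128i *)src));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the result should be {11, 22, 33, 44}, and lowest_lane(a) should return 1, i.e. array element 0 sits in bits 0-31 with no interchange. Whether the second operand comes from a register or directly from memory makes no difference to the pairing.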
You can use their FP counterparts as well. On current CPUs (and very likely on all future ones) there's no real functional or performance difference between MOVAPS, MOVAPD and MOVDQA; ORPS, ORPD and POR; etc. The only difference is in the encoding, where the "PS" instructions are one byte shorter, which may slightly improve performance under some circumstances (but I haven't measured it yet :)). However, the Intel manuals advise using the FP variants for FP data and the integer variants for integers.
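The functional equivalence of the FP and integer moves can be sketched in C as well. The function name sub4_dwords_via_ps is hypothetical; it moves integer data with the single-precision load/store intrinsics and only reinterprets the register type, which costs no instruction. Note that the compiler is free to pick whichever move encoding it likes, so this shows that the data survives the "PS" path bit for bit, not which opcode is emitted.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE float load/store */
#include <stdint.h>

/* Hypothetical helper: dst[i] = a[i] - b[i] for i = 0..3, moving the integers
   through the float load/store intrinsics. Pointers must be 16-byte aligned. */
void sub4_dwords_via_ps(int32_t *dst, const int32_t *a, const int32_t *b)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)a & 15) == 0);
    assert(((uintptr_t)b & 15) == 0);

    /* Load as packed single precision, then reinterpret the bits as integers. */
    __m128i x = _mm_castps_si128(_mm_load_ps((const float *)a));
    __m128i y = _mm_castps_si128(_mm_load_ps((const float *)b));

    __m128i diff = _mm_sub_epi32(x, y);  /* 4 x 32-bit integer subtract */

    /* Reinterpret back and store with the packed single-precision store. */
    _mm_store_ps((float *)dst, _mm_castsi128_ps(diff));
}
```

With a = {-5, 0, 7, 100} and b = {1, 1, 1, 1}, the output should be {-6, -1, 6, 99}, the same as an all-integer version would give: the loads and stores only copy 128 bits and never interpret them as floats.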