I do not think there are constant registers in x86. When I define a const array, x86 accesses the constants from memory rather than encoding them as an immediate in the instruction. Is there any instruction that can assign a 128-bit/256-bit constant to an SSE/AVX register?
Are you talking about C programming? About facilities of some specific compiler?
How about a short example to make this specific?
chang-li wrote:
I do not think there are constant registers in x86. When I define a const array, x86 access these constants from a memory but not a direct constant in instruction. Any instructions can assign a 128bit/256bit constant to a SSE/AVX register?
You can access the XMMn registers with inline assembly. This is my preferred method of SSE-aware programming. To load an XMM register, I put an align 16 directive on my typedef structure, which holds single-precision or double-precision scalar values arranged in a 1D array, and I use the movaps instruction to load the XMMn register directly.
In C:
const unsigned char data[16] = {0xFF, 0x00, 0x01, 0xFF, 0xFF, 0x00, 0x01, 0xFF, 0xFF, 0x00, 0x01, 0xFF, 0xFF, 0x00, 0x01, 0xFF};
__m128i xmm0;
xmm0 = _mm_loadu_si128((const __m128i *)data);
In the generated assembly this becomes
movdqa xmm0, PQDWORD PTR [esi+4]
What I expected was
movdqa xmm0, 0xFF0001FFFF0001FFFF0001FFFF0001FF
I could not find this form in the instruction set.
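For reference, here is a minimal sketch of the two usual ways to spell such a constant in C. As far as I know there is no SSE/AVX move that takes a 128-bit immediate, so a compiler will typically place the 16 bytes in read-only data and emit a load either way. The helper names load_from_array, load_from_set and same_bits are made up for this illustration, and the 16-byte alignment is an assumption added so that an aligned load is legal.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* The 16 constant bytes from the post above; aligned (an assumption of this
   sketch) so that an aligned load can be used. */
static _Alignas(16) const unsigned char data[16] = {
    0xFF, 0x00, 0x01, 0xFF, 0xFF, 0x00, 0x01, 0xFF,
    0xFF, 0x00, 0x01, 0xFF, 0xFF, 0x00, 0x01, 0xFF};

/* Explicit load from the constant array. */
static __m128i load_from_array(void) {
    return _mm_load_si128((const __m128i *)data);
}

/* Same value written as a vector "literal"; _mm_setr_epi8 takes the bytes in
   memory order (lowest address first). The compiler is free to turn this
   into a load from a constant pool. */
static __m128i load_from_set(void) {
    return _mm_setr_epi8((char)0xFF, 0x00, 0x01, (char)0xFF,
                         (char)0xFF, 0x00, 0x01, (char)0xFF,
                         (char)0xFF, 0x00, 0x01, (char)0xFF,
                         (char)0xFF, 0x00, 0x01, (char)0xFF);
}

/* Returns 1 when all 16 bytes of a and b are equal. */
static int same_bits(__m128i a, __m128i b) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
```

Both helpers should yield the same register contents; looking at the compiler's assembly listing for each shows whether a memory operand was used.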
>>>What I expected is
movdqa xmm0, 0xFF0001FFFF0001FFFF0001FFFF0001FF
I could not find this form in assembly>>>
In MASM I can load an XMM register directly by using a declared primitive type with the DUP directive. When using inline assembly, you can load an XMM register directly by using only the array name, without the pointer dereference operator.
Do you have the correct inline assembly form of the instruction below?
movdqa xmm0, 0xFF0001FFFF0001FFFF0001FFFF0001FF
Sergey Kostrov wrote:
Iliya,
>>...In MASM I can load XMM register directly by using declared primitive type with DUP...
We really would like to see how you do it, please. Thanks in advance.
Here is the code, which calculates the cosine function by Taylor series expansion. Please bear in mind that this code is not optimized and runs slowly because of unnecessary stack accesses. I was able to load an XMM register directly without dereferencing a pointer. While reading the code, please look at the "coef" variables, which are initialized to the factorial denominators of the cosine series.
.XMM
 .STACK 4096
 .DATA
  argument REAL4 0.0,0.0,0.0,0.0
 step REAL4 0.01,0.01,0.01,0.01
 hi_bound REAL4 1.0,1.0,1.0,1.0
 lo_bound REAL4 0.0,0.0,0.0,0.0
 up_range REAL4 1.0
 lo_range REAL4 0.0
 one REAL4 1.0,1.0,1.0,1.0
 counter BYTE 147
 coef1 REAL4 2.0,2.0,2.0,2.0 ;2!
 coef2 REAL4 24.0,24.0,24.0,24.0 ;4!
 coef3 REAL4 720.0,720.0,720.0,720.0 ;6!
 coef4 REAL4 40320.0,40320.0,40320.0,40320.0 ;8!
 coef5 REAL4 3628800.0,3628800.0,3628800.0,3628800.0 ;10!
 coef6 REAL4 479001600.0,479001600.0,479001600.0,479001600.0 ;12!
 coef7 REAL4 87178291200.0,87178291200.0,87178291200.0,87178291200.0 ;14!
 coef8 REAL4 20922789888000.0,20922789888000.0,20922789888000.0,20922789888000.0 ;16!
 coef9 REAL4 6402373705728000.0,6402373705728000.0,6402373705728000.0,6402373705728000.0 ;18!
 coef10 REAL4 2432902008176640000.0,2432902008176640000.0,2432902008176640000.0,2432902008176640000.0 ;20!
 coef11 REAL4 1124000727777607680000.0,1124000727777607680000.0,1124000727777607680000.0,1124000727777607680000.0 ;22!
 coef12 REAL4 620448401733239439360000.0,620448401733239439360000.0,620448401733239439360000.0,620448401733239439360000.0;24!
 coef13 REAL4 403291461126605635584000000.0,403291461126605635584000000.0,403291461126605635584000000.0,403291461126605635584000000.0 ;26!
 loop_counter BYTE 50
 loop_counter2 BYTE 25
 loop_compare REAL4 0.5,0.5,0.5,0.5
 .DATA?
 result REAL4 147 DUP(?)
 com_lo REAL4 4 DUP(?)
 com_hi REAL4 4 DUP(?)
 start_time DWORD ?
 end_time DWORD ?
 value REAL4 ?
 upper REAL4 1.0
 lower  REAL4 0.0
 counter_compare REAL4 4 DUP(?)
 .CODE
 main PROC
 push ebp
 mov ebp,esp ; set up the frame pointer (destination operand comes first)
 sub esp,224
 mov cl,counter
 xor eax,eax
 xor ebx,ebx
 xorps xmm2,xmm2
 xorps xmm0,xmm0
 xorps xmm1,xmm1
 movups xmm5,argument
 movups xmm0,argument
 
 
 finit
 mWrite "Please enter a starting value for cosine calculation"
 call ReadFloat
 fst value
 call Crlf
 
 fld upper
 fcom value
 fnstsw ax
 sahf
 jnb error
 
 
 
 movss xmm5,value ; load the start value into lane 0 (movss from memory zeroes the upper lanes)
L1:
   movups xmm4,loop_compare
   movups xmm3,xmm5
   cmpps xmm3,xmm4,6
   mov eax,OFFSET counter_compare
   movups [eax],xmm3
   mov ebx,[eax]
   cmp ebx,0
   jne L2
   mov ebx,[eax+4]
   cmp ebx,0
   jne L2
   mov ebx,[eax+8]
   cmp ebx,0
   jne L2
   mov ebx,[eax+12]
   cmp ebx,0
   jne L2
   movups xmm4,loop_compare
   movups xmm3,xmm5
   cmpps xmm3,xmm4,1
   mov eax,OFFSET counter_compare
   movups [eax],xmm3
   mov ebx,[eax]
   cmp ebx,11111111111111111111111111111111b
   je L4
   mov ebx,[eax+4]
   cmp ebx,11111111111111111111111111111111b
   je L4
   mov ebx,[eax+8]
   cmp ebx,11111111111111111111111111111111b
   je L4
   mov ebx,[eax+12]
   cmp ebx,11111111111111111111111111111111b
   je L4
   xor eax,eax
   jz L3
 L2:
 mov cl,loop_counter
 xor eax,eax
 jz L3
 L4:
 mov cl,loop_counter2
 xor eax,eax
 jz L3
 L3:
 mov edx,OFFSET step
 movups xmm4,[edx]
 addps xmm5,xmm4
  
 movups xmm7,xmm5
 movups xmm0,one
 mulps xmm7,xmm7 ;x^2
 movups xmm6,xmm7
 movups [ebp-16],xmm7 ;store x^2
 movups xmm2,coef1
 rcpps xmm1,xmm2 ;1/coef1
 mulps xmm1,xmm7
 subps xmm0,xmm1 ;1-x^2/2! xmm0 accumulator
 movups xmm7,[ebp-16]
 mulps xmm7,xmm6 ;x^4
 movups [ebp-32],xmm7 ;store x^4
 movups xmm2,coef2
 rcpps xmm1,xmm2 ;1/coef2
 mulps xmm1,xmm7
 addps xmm0,xmm1 ;1-1x^2/2!+x^4/4!
 movups xmm7,[ebp-32]
 mulps xmm7,xmm6 ;x^6
 movups [ebp-48],xmm7 ;store x^6
 movups xmm2,coef3
 rcpps xmm1,xmm2 ;1/coef3
 mulps xmm1,xmm7
 subps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!
 movups xmm7,[ebp-48]
 mulps xmm7,xmm6 ;x^8
 movups [ebp-64],xmm7 ;store x^8
 movups xmm2,coef4
 rcpps xmm1,xmm2 ;1/coef4
 mulps xmm1,xmm7
 addps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!
 movups xmm7,[ebp-64]
 mulps xmm7,xmm6 ;x^10
 movups [ebp-80],xmm7 ;store x^10
 movups xmm2,coef5
 rcpps xmm1,xmm2 ;1/coef5
 mulps xmm1,xmm7
 subps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!
 movups xmm7,[ebp-80]
 mulps xmm7,xmm6 ;x^12
 movups [ebp-96],xmm7 ;store x^12
 movups xmm2,coef6 ; <--- XMM register is loaded directly from the initialized coef variable
 rcpps xmm1,xmm2 ;1/coef6
 mulps xmm1,xmm7
 addps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!+x^12/12!
 movups xmm7,[ebp-96]
 mulps xmm7,xmm6;x^14
 movups [ebp-112],xmm7 ;store x^14
 movups xmm2,coef7 ; <--- XMM register is loaded directly from the initialized coef variable
 rcpps xmm1,xmm2 ;1/coef7
 mulps xmm1,xmm7
 subps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!+x^12/12!-x^14/14!
 movups xmm7,[ebp-112]
 mulps xmm7,xmm6 ;x^16
 movups [ebp-128],xmm7 ;store x^16
 movups xmm2,coef8
 rcpps xmm1,xmm2 ;1/coef8
 mulps xmm1,xmm7
 addps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!+x^12/12!-x^14/14!+x^16/16!
 movups xmm7,[ebp-128]
 mulps xmm7,xmm6 ;x^18
 movups [ebp-144],xmm7;store x^18
 movups xmm2,coef9
 rcpps xmm1,xmm2 ;1/coef9
 mulps xmm1,xmm7
 subps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!+x^12/12!-x^14/14!+x^16/16!-x^18/18!
 movups xmm7,[ebp-144]
 mulps xmm7,xmm6 ;x^20
 movups [ebp-160],xmm7 ;store x^20
 movups xmm2,coef10
 rcpps xmm1,xmm2 ;1/coef10
 mulps xmm1,xmm7
 addps xmm0,xmm1 ;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!+x^12/12!-x^14/14!+x^16/16!-x^18/18!+x^20/20!
 movups xmm7,[ebp-160]
 mulps xmm7,xmm6 ;x^22
 movups [ebp-176],xmm7 ;store x^22
 movups xmm2,coef11
 rcpps xmm1,xmm2 ;1/coef11
 mulps xmm1,xmm7
 subps xmm0,xmm1;1-x^2/2!+x^4/4!-x^6/6!+x^8/8!-x^10/10!+x^12/12!-x^14/14!+x^16/16!-x^18/18!+x^20/20!-x^22/22! 
 movups xmm7,[ebp-176]
 mulps xmm7,xmm6 ;x^24
 movups [ebp-192],xmm7 ;store x^24
 movups xmm2,coef12
 rcpps xmm1,xmm2 ;1/coef12
 mulps xmm1,xmm7
 addps xmm0,xmm1 ; +x^24/24!
 movups xmm7,[ebp-192]
 mulps xmm7,xmm6 ;x^26
 movups xmm2,coef13
 rcpps xmm1,xmm2 ;1/coef13
 mulps xmm1,xmm7
 subps xmm0,xmm1
 mov ebx,OFFSET result 
 
 movups [ebx],xmm0
 fld  DWORD PTR[ebx]
 call WriteFloat
 call Crlf
 sub cl,1
 jnz L3
 xor eax,eax
 jz L5
 error:
 movups xmm5,argument
 xor eax,eax
 jz L3 
 L5:
 exit
 main ENDP
 END main
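For comparison, the same series can be sketched in C with SSE intrinsics. This is not the MASM routine above. It is a shorter variant under stated assumptions: it stops at the x^10 term, it stores the reciprocals of the factorials directly instead of computing them with rcpps (which only returns an approximate reciprocal), and it evaluates the polynomial in Horner form so that no powers need to be spilled to the stack. The name cos_taylor_ps is made up for this illustration.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* cos(x) ~ 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8! - x^10/10!
   evaluated for four packed floats at once. Intended for small |x|,
   for example the 0.0 .. 1.0 range used in the post above. */
static __m128 cos_taylor_ps(__m128 x) {
    /* Signed reciprocal factorials, highest order first. These still live
       in memory; the compiler loads them like any other constant. */
    static const float c[5] = {
        -1.0f / 3628800.0f, /* -1/10! */
         1.0f / 40320.0f,   /* +1/8!  */
        -1.0f / 720.0f,     /* -1/6!  */
         1.0f / 24.0f,      /* +1/4!  */
        -1.0f / 2.0f        /* -1/2!  */
    };
    const __m128 x2 = _mm_mul_ps(x, x); /* x^2 */
    __m128 acc = _mm_set1_ps(c[0]);
    /* Horner scheme in x^2: acc = acc * x^2 + next coefficient. */
    for (int i = 1; i < 5; ++i) {
        acc = _mm_add_ps(_mm_mul_ps(acc, x2), _mm_set1_ps(c[i]));
    }
    /* Final step adds the leading 1. */
    return _mm_add_ps(_mm_mul_ps(acc, x2), _mm_set1_ps(1.0f));
}
```

Calling cos_taylor_ps(_mm_set_ps(1.0f, 0.5f, 0.25f, 0.0f)) evaluates the four arguments 0.0, 0.25, 0.5 and 1.0 in lanes 0 to 3.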
 
My code works with movups
>>>It means, that a set of values 1.0,1.0,1.0,1.0 is moved from a memory (!) to xmm0 register. Once again, both instructions, that is movdqa and movups, can not work with constants by design>>>
I misunderstood the problem. In my code the load comes from memory, which is not what the thread starter asked about. The question is "how to load an XMM register with an immediate value".
>>>// _asm MOVDQA xmm0, 0xFFFFFFFFFFFFFFFF // error C2415: improper operand type>>>
Now I remember that I ran into the same situation when I tried to load XMM registers directly with an immediate value. The second test is my preferred method of loading a 1D vector, represented by a 2- or 4-element array, into an XMM register. You can also load the registers when passing structure members. For this, use the unaligned movups.
Sergey Kostrov wrote:
Exactly and I'd like to change my former statement to:
>>...both instructions, that is movdqa and movups, can not work with literal constants by design...
Sadly, Intel's processor designers decided not to allow loading SSEn registers with immediate values. I would like to know the reason for that decision.
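Even without a 128-bit immediate form, some constants can be built without a memory operand. The sketch below shows three commonly used idioms: all bits set by comparing a register with itself, 1.0f in every lane derived from that by two shifts, and a 32-bit immediate broadcast through a general-purpose register. The function names are made up for this illustration, and an optimizing compiler may still fold any of these into a single load from a constant pool, so check the generated assembly before relying on it.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* All 128 bits set: comparing a register with itself (pcmpeqd) is always
   true in every lane, and needs no memory operand. */
static __m128i all_ones(void) {
    __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);
}

/* 1.0f in every lane: 0xFFFFFFFF >> 25 = 0x7F, and 0x7F << 23 = 0x3F800000,
   which is the IEEE-754 bit pattern of 1.0f. */
static __m128 ones_ps(void) {
    __m128i v = _mm_slli_epi32(_mm_srli_epi32(all_ones(), 25), 23);
    return _mm_castsi128_ps(v);
}

/* Broadcast a 32-bit value to all four lanes: typically a mov of the
   immediate into a general-purpose register, then movd and pshufd. */
static __m128i broadcast32(int imm) {
    return _mm_shuffle_epi32(_mm_cvtsi32_si128(imm), 0);
}
```

The byte pattern from the original question repeats every 4 bytes (FF 00 01 FF in memory order), so on little-endian x86 it corresponds to broadcast32((int)0xFF0100FFu).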
>>one REAL4 1.0,1.0,1.0,1.0......Note: memory is allocated here / It is Not a literal constant
>>movups xmm0, one
It also assigns xmm0 the value of a vector with 4 components.
"how to load XMM register with the immediate value"
Yes, this is the exact question. It looks like the answer is no. So consider the following code:
movups xmm0,one
Because the data one is loaded from memory, the algorithm is not entirely register-based and may involve a cache access. Performance becomes harder to predict, and an SSE/AVX optimization may lose part of its benefit.
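One common way to limit that cost is to load the constant once, outside the hot loop, so that it stays in a register for all iterations. The sketch below illustrates this under stated assumptions: the function name scale_by_constant is made up, the element count is assumed to be a multiple of 4, and whether the constant really stays in a register depends on the compiler and on register pressure.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Multiply n floats in place by the constant c.
   Assumes n is a multiple of 4 (an assumption of this sketch). */
static void scale_by_constant(float *x, size_t n, float c) {
    /* Built once before the loop; the loop body then only reads and
       writes the array itself. */
    const __m128 k = _mm_set1_ps(c);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);          /* unaligned load, like movups */
        _mm_storeu_ps(x + i, _mm_mul_ps(v, k));  /* multiply and store back */
    }
}
```

With this structure the constant is fetched from memory at most once per call rather than once per iteration.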
Because of the design of those SSEn load/store instructions, there will be some performance penalty when the data is loaded from memory for the first time.
Loading an XMM register with a pre-initialized structure member:
SinVector sinvec1 = {-0.1666666,-0.1666666,-0.1666666,-0.1666666},*sinvec1ptr; sinvec1ptr = &sinvec1; // structure initialization
Loading the member:
movups xmm1,sinvec1
By using a custom typedef structure wrapping an array that is aligned on a 16-byte boundary, I can use movaps instead.
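A minimal sketch of that aligned-structure approach in C with intrinsics follows. The layout of SinVector is an assumption, since the original typedef is not shown in this thread, and load_sinvec is a made-up helper name. C11 _Alignas is used for the 16-byte alignment; compiler-specific attributes would work as well.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Assumed layout: four packed floats, forced onto a 16-byte boundary. */
typedef struct {
    _Alignas(16) float v[4];
} SinVector;

/* Pre-initialized coefficient vector, as in the post above. */
static const SinVector sinvec1 = {
    {-0.1666666f, -0.1666666f, -0.1666666f, -0.1666666f}};

/* Aligned load (movaps). The address must be 16-byte aligned, otherwise
   the instruction faults; the assert documents that requirement. */
static __m128 load_sinvec(const SinVector *p) {
    assert(((uintptr_t)p & 15u) == 0);
    return _mm_load_ps(p->v);
}
```

If the alignment cannot be guaranteed, _mm_loadu_ps (movups) is the safe choice.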