Intel® ISA Extensions

How to assign a constant?

CLi37
Beginner
4,548 Views

I do not think there are constant registers in x86. When I define a const array, x86 accesses these constants from memory rather than as a direct constant in the instruction. Is there any instruction that can assign a 128-bit/256-bit constant to an SSE/AVX register?

0 Kudos
46 Replies
SergeyKostrov
Valued Contributor II
1,120 Views
>>"how to load XMM register with the immediate value"
>>
>>yes, this is the exact question.

It looks like the answer is no. I think we're shifting our discussion to a different subject related to the quality of code generation. Are we really interested in that? Please take a look at a disassembled test case:

[ Intel C++ compiler ( it is more compact ) ]
...
// __m128 mmValue0 = { 1.0L, 2.0L, 3.0L, 4.0L };
0040471B  movaps xmm0,xmmword ptr [dValue+60h (5D8320h)]
00404722  movaps xmmword ptr [mmValue0],xmm0
// __m128 mmValue1 = { 5.0L, 6.0L, 7.0L, 8.0L };
00404726  movaps xmm0,xmmword ptr [dValue+70h (5D8330h)]
0040472D  movaps xmmword ptr [mmValue1],xmm0
// _asm MOVDQA xmm0, 0xFFFFFFFFFFFFFFFF   // Error C2415: improper operand type
// _asm MOVUPS xmm1, 0xFFFFFFFFFFFFFFFF   // Error C2415: improper operand type
// _asm MOVDQA xmm0, xmmword ptr [ mmValue0 ]
00404731  movdqa xmm0,xmmword ptr [mmValue0]
// _asm MOVUPS xmm1, xmmword ptr [ mmValue1 ]
00404736  movups xmm1,xmmword ptr [mmValue1]
// _asm MOVDQA xmmword ptr [ mmValue0 ], xmm1
0040473A  movdqa xmmword ptr [mmValue0],xmm1
// _asm MOVUPS xmmword ptr [ mmValue1 ], xmm0
0040473F  movups xmmword ptr [mmValue1],xmm0
...

[ Microsoft C++ compiler ]
...
// __m128 mmValue0 = { 1.0L, 2.0L, 3.0L, 4.0L };
0043686D  fld1
0043686F  fstp dword ptr [mmValue0]
00436872  fld  dword ptr [__real@40000000 (49AFDCh)]
00436878  fstp dword ptr [ebp-1Ch]
0043687B  fld  dword ptr [__real@40400000 (49AFD8h)]
00436881  fstp dword ptr [ebp-18h]
00436884  fld  dword ptr [__real@40800000 (49B064h)]
0043688A  fstp dword ptr [ebp-14h]
// __m128 mmValue1 = { 5.0L, 6.0L, 7.0L, 8.0L };
0043688D  fld  dword ptr [__real@40a00000 (49B060h)]
00436893  fstp dword ptr [mmValue1]
00436896  fld  dword ptr [__real@40c00000 (49B05Ch)]
0043689C  fstp dword ptr [ebp-3Ch]
0043689F  fld  dword ptr [__real@40e00000 (49B058h)]
004368A5  fstp dword ptr [ebp-38h]
004368A8  fld  dword ptr [__real@41000000 (49B054h)]
004368AE  fstp dword ptr [ebp-34h]
// _asm MOVDQA xmm0, 0xFFFFFFFFFFFFFFFF   // Error C2415: improper operand type
// _asm MOVUPS xmm1, 0xFFFFFFFFFFFFFFFF   // Error C2415: improper operand type
// _asm MOVDQA xmm0, xmmword ptr [ mmValue0 ]
004368B1  movdqa xmm0,xmmword ptr [mmValue0]
// _asm MOVUPS xmm1, xmmword ptr [ mmValue1 ]
004368B6  movups xmm1,xmmword ptr [mmValue1]
// _asm MOVDQA xmmword ptr [ mmValue0 ], xmm1
004368BA  movdqa xmmword ptr [mmValue0],xmm1
// _asm MOVUPS xmmword ptr [ mmValue1 ], xmm0
004368BF  movups xmmword ptr [mmValue1],xmm0
...

I agree that the question was answered completely. Thanks.
0 Kudos
TimP
Honored Contributor III
1,120 Views

Why not set /arch:SSE2 or /arch:AVX for the Microsoft compiler?  Mixing x87 and SSE2 will never produce high-quality code generation.

0 Kudos
Bernard
Valued Contributor I
1,120 Views

>>>I agree that the question was answered completely. Thanks.>>>

Agree with you that the question is answered completely.

0 Kudos
CLi37
Beginner
1,120 Views

Code generation is a separate problem; I do not think any C compiler can generate the best code. I am more concerned with the following asm code in a kernel that will be executed thousands and thousands of times. I want all memory accesses to be removed from the kernel function. For example,

init:

// _asm MOVDQA xmm, xmmword ptr [ mmValue0 ]
00404731 movdqa xmm8,xmmword ptr [mmValue0]

kernel:

use xmm8 as a constant

A solution requires using another group of registers, xmm8-xmm15 or ymm8-ymm15. Unfortunately, that requires a 64-bit x64 platform. Then code like the example from iliyapolak, and the optimized AVX IDCT code from Intel (http://software.intel.com/en-us/articles/using-intel-advanced-vector-extensions-to-implement-an-inverse-discrete-cosine-transform), can be faster.

If x86 had constant registers and allowed direct constant loading, there would be no need for further optimization.

0 Kudos
Bernard
Valued Contributor I
1,120 Views

>>>A solution requires to use another group of registers xmm8-xmm15 or ymm8-ymm15. Unfortunately this must be 64-bit x64 platform. So the code like the example of iliyapolak>>>

Even without the use of the additional XMM8-XMM15 and YMMn registers, my code which was presented as an example can be effectively optimized. If you are interested, please follow this link: http://software.intel.com/en-us/forums/topic/347470

0 Kudos
Bernard
Valued Contributor I
1,120 Views

>>>am more concerned on following asm code in a kernel that will be executed thousands and thousands times. I want to all memory access to be removed from kernel function>>>

What do you mean by "kernel" and "kernel function"? Are you referring to the kernel mode of operation, or to some kind of mathematical kernel (like a Gaussian) which has to operate on some data set?

0 Kudos
Bernard
Valued Contributor I
1,120 Views

>>>If x86 has constant registers and allow direct constant loading>>>

Afaik, so-called constant registers on some microarchitectures are read-only registers which hold constant values like zero or pi. I do not think that x86 GP and SSE registers can be classified as constant read-only registers.

0 Kudos
CLi37
Beginner
1,120 Views

"Even without the usage of additional XMM8-XMM15 and YMMn registers my code which was presented as an example can be effectively optimized.If you are intrested please follow this link http://software.intel.com/en-us/forums/topic/347470"

In your code all constants are still loaded from memory. Suppose the sin function is called a million times; these constants will then be loaded a million times. Compare that to loading the constants into registers once and you see the saving. Since you need to seal the sin function as a built-in, I do not know how to accelerate it through constant access.

"What do you mean by "kernel" and "kernel function"?Are you referring to the kernel mode of operation or maybe  some kind of mathematical kernel (like a gaussian) which has to operate on some data set."

The term "kernel" is borrowed from OpenCL. It is not the kernel mode of an OS; all of this discussion is OS independent.

"Afaik so called constant register on some microarchitecture are read-only registers which holds a constant values like zero or pi.I do not think that x86 gp and SSE registers can be classified as a constant read only registers."

Forget constant registers and literal constants; they do not exist for SSE/AVX. However, literal constants can be used with 32-bit (and 64-bit?) general registers. Here we see a boundary of SSE/AVX.

0 Kudos
SergeyKostrov
Valued Contributor II
1,120 Views
>>...Code generation is another problem. I do not think any C compilers can generate the best code.

Have you ever worked with the Watcom C/C++ compiler ( older versions 9.x or 10.x )? I regret to see that it is almost forgotten. Sorry for a small deviation from the subject of the thread.

>>...I am more concerned on following asm code in a kernel that will be executed thousands and thousands times. I want to all
>>memory access to be removed from kernel function...

But the code is in memory anyway (!), already cached in an L1 line, etc. I simply would like to refer you to some performance numbers; please take a look at:

Intel(R) 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-026, April 2012
APPENDIX C: INSTRUCTION LATENCY AND THROUGHPUT
Pages 746 and 748 for the movdqa instruction
Pages 756 and 758 for the movups instruction
0 Kudos
CLi37
Beginner
1,120 Views

"But the code is in memory anyway (!), already cached in L1 line, etc."

What does this mean? The instruction cache and the data cache are separate. Below is the data from Appendix C.

The movdqa and movups register-to-register operations are very fast on the Core architecture (latency 1) but slower on the P4 (latency 6).

1. Table C-9. Streaming SIMD Extension 2 128-bit Integer Instructions

0F_2H is the P4 Prescott.

Instruction       | Latency (0F_3H / 0F_2H) | Throughput (0F_3H / 0F_2H) | Execution Unit (0F_2H)
MOVDQA xmm, xmm   | 6 / 6                   | 1 / 1                      | FP_MOVE
MOVDQU xmm, xmm   | 6 / 6                   | 1 / 1                      | FP_MOVE

2. Table C-9a. Streaming SIMD Extension 2 128-bit Integer Instructions

Intel microarchitecture code name Westmere is represented by 06_25H, 06_2CH and 06_2FH. Intel microarchitecture code name Sandy Bridge is represented by 06_2AH.

CPUID column groups: (a) 06_2A/06_2D, (b) 06_25/2C/1A/1E/1F/2E/2F, (c) 06_17H/06_1DH, (d) 06_0FH

Instruction       | Latency (a / b / c / d) | Throughput (a / b / c / d)
MOVDQA xmm, xmm   | 1 / 1 / 1 / 1           | 0.33 / 0.33 / 0.33 / 0.33
MOVDQU xmm, xmm   | 1 / 1 / 1 / 1           | 0.33 / 0.33 / 0.33 / 0.5

0 Kudos
andysem
New Contributor III
1,120 Views

While you can't load immediate values into xmm registers, you can load immediates into general-purpose 32/64-bit registers and then use movd/movq (and a shuffle/broadcast if needed) to initialize xmm/ymm registers with them. I wonder why compilers (at least gcc) don't do that when they generate floating-point code that involves constants. Perhaps because it turns out to be slower than a regular load from memory?

0 Kudos
SergeyKostrov
Valued Contributor II
1,120 Views
>>...Perhaps, because it turns out to be slower than a regular load from memory?..

It is a good point, but it remains speculative until proven in a small test case:

3-step operation A: CONSTANT -> load into a general-purpose register -> load into an XMM register

vs.

3-step operation B: CONSTANT -> store to memory -> load into an XMM register
0 Kudos
Bernard
Valued Contributor I
1,120 Views

>>>In your code all constants are still loaded from memory. Suppose the sin function will be called million times these constants will be loaded million times. In contrast to load these constants once to registers you see the saving>>>

 

0 Kudos
Bernard
Valued Contributor I
1,120 Views
@chang-li It seems that I have a problem with posting my message. This message is a reply to your quoted sentence in my previous post.

My intention was to optimize the sine function calculation. It was done by coefficient precalculation and a Horner scheme implementation. As you pointed out in your response, the problem lies in keeping and loading the constant coefficients from memory. One solution is to use the remaining XMM registers solely for the purpose of coefficient storage. For a single sine function call that could be a good solution, albeit at reduced accuracy, but across millions of sine function calls the executing thread can be preempted and other floating-point code can be scheduled to run, thus overwriting the XMM registers.
0 Kudos
Bernard
Valued Contributor I
1,120 Views
>>>Forget constant register and literal constant they are not for SSE/AVX. However literal constant can be applied for 32-bit (and 64-bit?) general registers. We see the boundary of SSE/AVX>>> Completely agree with you.
0 Kudos
SergeyKostrov
Valued Contributor II
1,120 Views
Haven't we discussed everything, guys? Good Luck!
0 Kudos
tom_w_
Beginner
1,120 Views

informative post...:)

0 Kudos
SergeyKostrov
Valued Contributor II
1,120 Views
This is a short follow-up. I just found that some GCC-like C++ compilers have an option:
...
-fforce-addr - Copy memory address constants into registers before use
...
0 Kudos
Bernard
Valued Contributor I
1,120 Views

Is the -fforce-addr option also related to SIMD registers?

0 Kudos
SergeyKostrov
Valued Contributor II
1,061 Views
>>...Does -fforce-addr option is also related to SIMD registers?

Command line help for the MinGW C++ compiler doesn't specify it. The manual provides a little more information:
...
`-fforce-addr'
Force memory address constants to be copied into registers before
doing arithmetic on them. This may produce better code just as
`-fforce-mem' may.
...
but it is still not clear what registers will be used.
0 Kudos
CLi37
Beginner
1,061 Views

Sergey Kostrov wrote:

>>...Does -fforce-addr option is also related to SIMD registers?

Command line help for the MinGW C++ compiler doesn't specify it. A Manual provides a little bit more information:
...
`-fforce-addr'
Force memory address constants to be copied into registers before
doing arithmetic on them. This may produce better code just as
`-fforce-mem' may.
...
but it is still Not clear what registers will be used.

I guess it cannot be applied to AVX registers, because no such instructions exist.

0 Kudos
Reply