I do not think there are constant registers in x86. When I define a const array, the x86 code reads these constants from memory rather than encoding them directly in the instruction. Is there any instruction that can assign a 128-bit/256-bit constant to an SSE/AVX register?
Why not set /arch:SSE2 or /arch:AVX for the Microsoft compiler? Mixing x87 and SSE2 will never produce high-quality generated code.
>>>I agree that the question was answered completely. Thanks.>>>
I agree with you that the question has been answered completely.
Code generation is another problem. I do not think any C compiler can generate the best possible code. I am more concerned about the following asm code in a kernel that will be executed thousands and thousands of times. I want all constant loads from memory to be removed from the kernel function. For example,
init:
// _asm MOVDQA xmm, xmmword ptr [ mmValue0 ]
00404731 movdqa xmm8,xmmword ptr [mmValue0]
kernel:
use xmm8 as a constant
One solution is to use the additional registers xmm8-xmm15 or ymm8-ymm15. Unfortunately, these are available only on the 64-bit x64 platform. There, code such as iliyapolak's example and the optimized AVX IDCT code from Intel http://software.intel.com/en-us/articles/using-intel-advanced-vector-extensions-to-implement-an-inverse-discrete-cosine-transform can be made faster.
If x86 had constant registers and allowed constants to be loaded directly, there would be no need for further optimization.
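The init/kernel split described above could be sketched in C with SSE2 intrinsics roughly as follows. This is only an illustration under assumptions: the table contents, the function name add_constant and the add operation are hypothetical, and mmValue0 merely reuses the name from the asm snippet. Whether the compiler really keeps the constant in a register for the whole loop depends on register pressure (a 32-bit build has only xmm0-xmm7).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte-aligned constant table, standing in for mmValue0. */
static _Alignas(16) const int32_t mmValue0[4] = {3, 3, 3, 3};

/* Adds the constant to every element of data; n must be a multiple of 4. */
void add_constant(int32_t *data, size_t n)
{
    assert(n % 4 == 0);

    /* "init": the constant is loaded from memory once, outside the loop
       (corresponds to movdqa xmmN, xmmword ptr [mmValue0]). */
    const __m128i c = _mm_load_si128((const __m128i *)mmValue0);

    /* "kernel": the loop body touches memory only for the data itself;
       the constant is reused from the local variable c. */
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        v = _mm_add_epi32(v, c);
        _mm_storeu_si128((__m128i *)(data + i), v);
    }
}
```

Calling add_constant on an array of 8 integers adds 3 to each element with a single load of the constant. An optimizing compiler will often hoist such a load by itself, so it is worth checking the generated assembly before hand-tuning.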
>>>A solution requires to use another group of registers xmm8-xmm15 or ymm8-ymm15. Unfortunately this must be 64-bit x64 platform. So the code like the example of iliyapolak>>>
Even without using the additional XMM8-XMM15 and YMMn registers, the code I presented as an example can be optimized effectively. If you are interested, please follow this link: http://software.intel.com/en-us/forums/topic/347470
>>>am more concerned on following asm code in a kernel that will be executed thousands and thousands times. I want to all memory access to be removed from kernel function>>>
What do you mean by "kernel" and "kernel function"? Are you referring to the kernel mode of operation, or to some kind of mathematical kernel (like a Gaussian) that operates on a data set?
>>>If x86 has constant registers and allow direct constant loading>>>
AFAIK, the so-called constant registers on some microarchitectures are read-only registers that hold constant values such as zero or pi. I do not think that x86 GP and SSE registers can be classified as constant, read-only registers.
"Even without the usage of additional XMM8-XMM15 and YMMn registers my code which was presented as an example can be effectively optimized.If you are intrested please follow this link http://software.intel.com/en-us/forums/topic/347470"
In your code all constants are still loaded from memory. If the sin function is called a million times, these constants will be loaded a million times. Compare that with loading the constants into registers once, and you can see the saving. Since you need to seal the sin function as a built-in, I do not know how to accelerate it through constant access.
"What do you mean by "kernel" and "kernel function"?Are you referring to the kernel mode of operation or maybe some kind of mathematical kernel (like a gaussian) which has to operate on some data set."
The term "kernel" is borrowed from OpenCL. It is not the kernel mode of an OS. This whole discussion is OS independent.
"Afaik so called constant register on some microarchitecture are read-only registers which holds a constant values like zero or pi.I do not think that x86 gp and SSE registers can be classified as a constant read only registers."
Forget constant registers and literal constants; they are not available for SSE/AVX. However, literal constants can be used with 32-bit (and 64-bit?) general-purpose registers. This is where we see the limits of SSE/AVX.
"But the code is in memory anyway (!), already cached in L1 line, etc."
What does this mean? The instruction cache and the data cache are separate. Below is the data from Appendix C.
The movdqa and movups for register operation are super fast in Core architecture (Latency 1) but slower in P4 (Latency 6).
1. Table C-9. Streaming SIMD Extension 2 128-bit Integer Instructions
0F_2H is the P4 Northwood; 0F_3H is the P4 Prescott.
Instruction        Latency          Throughput       Execution Unit
CPUID              0F_3H   0F_2H    0F_3H   0F_2H    0F_2H
MOVDQA xmm, xmm    6       6        1       1        FP_MOVE
MOVDQU xmm, xmm    6       6        1       1        FP_MOVE
2. Table C-9a. Streaming SIMD Extension 2 128-bit Integer Instructions
Intel microarchitecture code name Westmere are represented by 06_25H, 06_2CH and 06_2FH. Intel microarchitecture code name Sandy Bridge are represented by 06_2AH.
Column groups (CPUID): A = 06_2A, 06_2D; B = 06_25/2C/1A/1E/1F/2E/2F; C = 06_17H, 06_1DH; D = 06_0FH
Instruction        Latency                  Throughput
CPUID              A     B     C     D      A      B      C      D
MOVDQA xmm,xmm     1     1     1     1      0.33   0.33   0.33   0.33
MOVDQU xmm,xmm     1     1     1     1      0.33   0.33   0.33   0.5
While you can't load immediate values into xmm registers, you can load immediates into general-purpose 32/64-bit registers and then use movd/movq (plus a shuffle/broadcast where needed) to initialize xmm/ymm registers with them. I wonder why compilers (at least gcc) don't do that when they generate floating-point code that involves constants. Perhaps because it turns out to be slower than a regular load from memory?
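A minimal sketch of the movd-plus-shuffle idea, written with SSE2 intrinsics rather than raw asm. The function names broadcast_u32 and lane_as_float are hypothetical helpers made up for this illustration. Note that the intrinsics only describe the data flow: when the argument is a compile-time constant, an optimizing compiler is free to fold the whole sequence back into an ordinary constant load from memory, so the generated assembly should be inspected.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */
#include <stdint.h>

/* Copies a 32-bit value from a general-purpose register into lane 0 of an
   xmm register (movd) and replicates it to all four lanes (pshufd). */
static __m128i broadcast_u32(uint32_t bits)
{
    __m128i v = _mm_cvtsi32_si128((int)bits);  /* movd xmm, r32 */
    return _mm_shuffle_epi32(v, 0);            /* pshufd xmm, xmm, 0 */
}

/* Hypothetical helper for inspection: broadcasts the bit pattern and
   returns the chosen lane reinterpreted as a float. */
float lane_as_float(uint32_t bits, int lane)
{
    assert(lane >= 0 && lane < 4);
    float out[4];
    _mm_storeu_ps(out, _mm_castsi128_ps(broadcast_u32(bits)));
    return out[lane];
}
```

For example, lane_as_float(0x3F800000u, 2) yields 1.0f, because 0x3F800000 is the IEEE 754 single-precision encoding of 1.0. The sequence costs at least two dependent instructions per constant, compared with a single load that usually hits the L1 data cache, which may be part of the answer to the question above.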
>>>In your code all constants are still loaded from memory. Suppose the sin function will be called million times these constants will be loaded million times. In contrast to load these constants once to registers you see the saving>>>
informative post...:)
Is the -fforce-addr option also related to SIMD registers?
Sergey Kostrov wrote:
>>...Does -fforce-addr option is also related to SIMD registers?
The command-line help for the MinGW C++ compiler doesn't specify it. The manual provides a little more information:
...
`-fforce-addr'
Force memory address constants to be copied into registers before
doing arithmetic on them. This may produce better code just as
`-fforce-mem' may.
...
but it is still not clear which registers will be used.
I guess it cannot be applied to AVX registers, because there are no such instructions.