When I compile the following C function containing some variants of x ^= x >> n:
===>
typedef unsigned int t; // or other integer types such as char or int
t test(t x)
{
    t x1;
    x = x ^ (x >> 1);
    x = (x >> 2) ^ x;
    x ^= x >> 3;
    x1 = x;
    x = x >> 4;
    x = x ^ x1;
    x1 = x;
    x = x >> 5;
    x = x1 ^ x;
    x1 = x;
    x >>= 6;
    x ^= x1;
    return x;
}
<===
with maximum optimization, most C compilers generate, for *all* variants of x ^= x >> n, code such as (registers may vary)
    mov edx,eax // make a copy of eax
    shr edx,5   // modify the copy
    xor eax,edx
These are three mutually dependent instructions, whereas my preferred variant is
    mov edx,eax // make a copy of eax
    shr eax,5   // modify the original register
    xor eax,edx
where the mov and the shr instructions can easily execute in parallel, since neither depends on the other's result.
I have tested GCC, AMD's C compiler, MS-C, and Intel-C with the same result.
A speed test revealed that my variant costs only about 2/3 of the time on an Intel Atom N450 and 3/4 of the time on an Intel i7. I'm quite sure that other processors behave similarly.
Code like this occurs quite often in a program. In many cases the compiler can avoid modifying a copy; such an optimization may simply swap which register holds which variable afterwards.
This trick is not new ... but why is it not applied? Have I missed something?
BTW: my similar report in the GCC Bugzilla has gone without comment for about 2 months now, see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48064
I noticed the same thing in my assignment macro-op fusion research. Ironically, your suggested software-level transformation would make my hardware-level optimization ineffective. Dealing with this at the hardware level potentially improves performance/watt beyond what can be achieved by reordering the instructions.
Obviously compilers should be capable of generating either code though. You may want to contact the LLVM team about this (http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev). They won't leave it unanswered for two months...
Good luck!
Thank you for your reply!
Macro-op fusion needs some magic in the CPU, and I would have bet that this kind of optimization had been state of the art for years, but my benchmarks told me otherwise...
Since a compiler should optimize for speed and size, a faster piece of code of equal size should always be preferred - strangely, in this fairly simple case at least the mentioned compilers don't do it!
I'm not sure which solution is more energy efficient: macro-op fusion, or superscalar execution of a mov instruction alongside another instruction such as shr.
For cases where copy-and-modify is unavoidable, macro-op fusion is certainly a good way to make it faster; mov instructions would effectively cost no time. Unfortunately it costs some die space and applies only to certain cases, i.e. it is probably not as versatile as superscalar execution.
My first experience with AVX programming told me that most mov instructions (movdqu, movdqa, movaps and the like) can be avoided by clever use of the new result register and the abolition of the dreadful alignment restrictions. This yields about 10% less code. Unfortunately I have had no access to a real AVX machine yet.
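For illustration (my example, not from the post): the VEX three-operand encoding lets the destination differ from both sources, so the register copy disappears, e.g.
    ; legacy SSE: destructive two-operand form needs a copy
    movaps xmm2,xmm0      ; save xmm0
    addps  xmm2,xmm1      ; xmm2 = xmm0 + xmm1, overwriting the copy
    ; AVX: non-destructive three-operand form, no mov needed
    vaddps xmm2,xmm0,xmm1 ; xmm2 = xmm0 + xmm1, both sources untouched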
With the brand-new Haswell instructions such as ANDN, BEXTR, RORX, SARX, SHLX, and SHRX, the mov instruction can be eliminated in some special cases like the one I mentioned, although in my case a variant of SHRX with a constant shift count is missing.
These instructions effectively work like assignment macro-op fusion.
In the meantime, the compiler's behavior should be treated as an optimizer bug...
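For illustration (my sketch): SHRX writes to a separate destination register, so the copy is folded into the shift itself; the only wrinkle is that the count must sit in a register, since SHRX has no immediate form:
    mov  ecx,5          ; SHRX takes the count in a register, no immediate variant
    shrx edx,eax,ecx    ; edx = eax >> 5, eax left untouched
    xor  eax,edx        ; eax ^= eax >> 5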