Intel® ISA Extensions
Use hardware-based isolation and memory encryption to provide more code protection in your solutions.

Copy and modify

sirrida
Beginner
540 Views
When I compile the following C function containing some variants of x ^= x >> n:

===>
typedef unsigned int t; // or other integer types such as char or int
t test(t x)
{
t x1;

x = x ^ (x >> 1);

x = (x >> 2) ^ x;

x ^= x >> 3;

x1 = x;
x = x >> 4;
x = x ^ x1;

x1 = x;
x = x >> 5;
x = x1 ^ x;

x1 = x;
x >>= 6;
x ^= x1;

return x;
}
<===

with maximum optimization most C-compilers generate code for *all* variants of x ^= x >> n such as (registers may vary)

mov edx,eax // make a copy of eax
shr edx,5 // modify the copy
xor eax,edx

These are 3 dependent statements whereas my preferred variant is

mov edx,eax // make a copy of eax
shr eax,5 // modify original register
xor eax,edx

where the mov and the shr command easily can be processed simultaneously.
I have tested GCC, AMD's C compiler, MS-C, and Intel-C with the same result.

A speed test revealed that my variant only costs about 2/3 of the time on Intel Atom N450 and 3/4 of the time on Intel i7. I'm quite sure that other processors behave similarly.

Code like this occurs quite often in a program. In many cases the compiler can avoid modifying a copy. It might happen that the association of registers to variables are exchanged by such optimization.
This trick is not new ... but why is it not applied? Have I missed something?

BTW: My similar inquiry in the GCC forum has not been commented on for about 2 months now, see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48064
0 Kudos
3 Replies
capens__nicolas
New Contributor I
540 Views
I noticed the same thing in my assignment macro-op fusion research. Ironically your suggested software-level transformation would make my hardware-level optimization uneffective. Dealing with this at the hardware level potentially improves performance/Watt beyond what can be achieved through reordering the instructions.

Obviously compilers should be capable of generating either code though. You may want to contact the LLVM team about this (http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev). They won't leave it unanswered for two months...

Good luck!
0 Kudos
sirrida
Beginner
540 Views
Thank you for your reply!

Macro op fusion needs some magic in the CPU and I would have bet that such a kind of optimization was state of the art for years but my benchmarks have told me otherwise...

Since a compiler should optimize for speed and size, a faster piece of code with equal size should always be preferred - strangely in this fairly simple case at least the mentioned compilers don't do it!

I'm not sure which solution is more energy preserving: Macro op fusion or superscalar execution of a mov command and another command such as shr.
For the cases where copy and modify is unavoidable, macro op fusion for sure is a good way to get it faster. Effectively mov commands would cost no time. Unfortunately this costs some die space and will only be applicable for some cases, i.e. is probably not as versatile as superscalar execution.

My first experience in AVX programming told me that most mov commands (movdqu, movdqa, movaps and the like) can be be circumvented by clever usage of the new result register and the abolition of the dreadful alignment restrictions. This yields to about 10% less code. Unfortunately I had no access to a real AVX machine yet.
0 Kudos
sirrida
Beginner
540 Views
With the brand new Haswell instructions like ANDN, BEXTR, RORX, SARX, SHLX, SHRX the mov command can be eliminated for some special cases like the one I mentioned, although in my case the variant of SHRX with constant shift is missing.
The commands effectively work like assignment macro-op fusion.
In the meantime the compiler's behavior should be seen as an optimizer bug...
0 Kudos
Reply