topic Copy and modify in Intel® ISA Extensions

Copy and modify

sirrida — Tue, 10 May 2011 22:59:39 GMT

When I compile the following C function containing some variants of x ^= x >> n:

===>
typedef unsigned int t; // or other integer types such as char or int
t test(t x)
{
t x1;

x = x ^ (x >> 1);

x = (x >> 2) ^ x;

x ^= x >> 3;

x1 = x;
x = x >> 4;
x = x ^ x1;

x1 = x;
x = x >> 5;
x = x1 ^ x;

x1 = x;
x >>= 6;
x ^= x1;

return x;
}
<===

with maximum optimization most C-compilers generate code for *all* variants of x ^= x >> n such as (registers may vary)

mov edx,eax // make a copy of eax
shr edx,5 // modify the copy
xor eax,edx

These are 3 dependent statements whereas my preferred variant is

mov edx,eax // make a copy of eax
shr eax,5 // modify original register
xor eax,edx

where the mov and the shr command easily can be processed simultaneously.
I have tested GCC, AMD's C compiler, MS-C, and Intel-C with the same result.

A speed test revealed that my variant only costs about 2/3 of the time on Intel Atom N450 and 3/4 of the time on Intel i7. I'm quite sure that other processors behave similarly.

Code like this occurs quite often in a program. In many cases the compiler can avoid modifying a copy. It might happen that the association of registers to variables are exchanged by such optimization.
This trick is not new ... but why is it not applied? Have I missed something?

BTW: My similar inquiry in the GCC forum has not been commented on for about 2 months now, see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48064

Copy and modify

capens__nicolas — Wed, 11 May 2011 11:09:38 GMT

I noticed the same thing in my assignment macro-op fusion research. Ironically your suggested software-level transformation would make my hardware-level optimization uneffective. Dealing with this at the hardware level potentially improves performance/Watt beyond what can be achieved through reordering the instructions.

Obviously compilers should be capable of generating either code though. You may want to contact the LLVM team about this (http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev). They won't leave it unanswered for two months...

Good luck!

Copy and modify

sirrida — Wed, 11 May 2011 19:10:35 GMT

Thank you for your reply!

Macro op fusion needs some magic in the CPU and I would have bet that such a kind of optimization was state of the art for years but my benchmarks have told me otherwise...

Since a compiler should optimize for speed and size, a faster piece of code with equal size should always be preferred - strangely in this fairly simple case at least the mentioned compilers don't do it!

I'm not sure which solution is more energy preserving: Macro op fusion or superscalar execution of a mov command and another command such as shr.
For the cases where copy and modify is unavoidable, macro op fusion for sure is a good way to get it faster. Effectively mov commands would cost no time. Unfortunately this costs some die space and will only be applicable for some cases, i.e. is probably not as versatile as superscalar execution.

My first experience in AVX programming told me that most mov commands (movdqu, movdqa, movaps and the like) can be be circumvented by clever usage of the new result register and the abolition of the dreadful alignment restrictions. This yields to about 10% less code. Unfortunately I had no access to a real AVX machine yet.

Copy and modify

sirrida — Sun, 12 Jun 2011 12:29:58 GMT

With the brand new Haswell instructions like ANDN, BEXTR, RORX, SARX, SHLX, SHRX the mov command can be eliminated for some special cases like the one I mentioned, although in my case the variant of SHRX with constant shift is missing.
The commands effectively work like assignment macro-op fusion.
In the meantime the compiler's behavior should be seen as an optimizer bug...