Assignment macro-op fusion

capens__nicolas · ‎03-18-2011

Hi all,

Are micro-instructions non-destructive? If so, wouldn't it make sense to fuse an assignment and dependent arithmetic instruction into one?

a = b; a += c; -> a = b + c;

This would make up for x86's lack of non-destructive instructions. Of course compilers would have to be made aware that pairing these instructions is faster, but that seems to be a simple case of defining a non-destructive instruction pattern which is implicitly encoded as two legacy instructions.

So I wonder if there's any reason not to do this...

Cheers,

Nicolas

capens__nicolas · ‎03-18-2011

Just to clarify, it would turn the following:

8B C3 mov eax, ebx
03 C1 add eax, ecx

into:

8B C3 03 C1 add eax, ebx, ecx

Nothing changes at the binary level. It just decodes it as one non-destructive micro-operation.
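
As a rough illustration of the decode step described above, here is a minimal Python sketch. It only understands the two register-to-register encodings from the example (8B /r for mov and 03 /r for add), and the names decode, fuse and the "add3" micro-op are made up for this illustration; nothing here reflects how a real decoder is built.

```python
# Toy model of the proposed fusion. Only register-direct "mov r32, r32" (8B /r)
# and "add r32, r32" (03 /r) are handled; "add3" is a hypothetical
# three-operand micro-op, not a real x86 instruction.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
OPCODES = {0x8B: "mov", 0x03: "add"}

def decode(code):
    """Decode a byte string into (mnemonic, dst, src) tuples."""
    if len(code) % 2:
        raise ValueError("expected opcode/ModRM byte pairs")
    ops = []
    for i in range(0, len(code), 2):
        opcode, modrm = code[i], code[i + 1]
        if opcode not in OPCODES:
            raise ValueError(f"unsupported opcode {opcode:#04x}")
        if modrm >> 6 != 0b11:
            raise ValueError("only register-direct operands are modelled")
        dst = REGS[(modrm >> 3) & 7]  # ModRM.reg is the destination
        src = REGS[modrm & 7]         # ModRM.rm is the source
        ops.append((OPCODES[opcode], dst, src))
    return ops

def fuse(ops):
    """Merge 'mov a, b' followed by 'add a, c' into 'add3 a, b, c'."""
    fused = []
    i = 0
    while i < len(ops):
        op = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if op[0] == "mov" and nxt and nxt[0] == "add" and nxt[1] == op[1]:
            dst, src1 = op[1], op[2]
            # "add a, a" after "mov a, b" reads the copied value, i.e. b.
            src2 = src1 if nxt[2] == dst else nxt[2]
            fused.append(("add3", dst, src1, src2))
            i += 2
        else:
            fused.append(op)
            i += 1
    return fused

print(fuse(decode(bytes.fromhex("8B C3 03 C1"))))
# [('add3', 'eax', 'ebx', 'ecx')]
```

The byte stream is untouched; only the list of decoded operations gets shorter when an adjacent pair writes the same destination.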

capens__nicolas · ‎03-23-2011

Anyone? I realize this complicates the decoding a bit, but it seems like a big win in performance and power consumption to me. Or are there some additional complications I'm currently not aware of?

Is there an easy way to assess the potential performance gain? I know how to get a compiler to generate optimal code for this, but is there some freely available x86 simulator which would allow evaluating this macro-op fusion?

Thomas_W_Intel · ‎03-23-2011

For FP instructions, AVX already provides a solution: the VEX prefix allows a non-destructive operand, for example VADDSD xmm1, xmm2, xmm3.

capens__nicolas · ‎03-23-2011

Quoting Thomas Willhalm (Intel)

For FP instructions, AVX already provides a solution: the VEX prefix allows a non-destructive operand, for example VADDSD xmm1, xmm2, xmm3.

I know. I'm specifically talking about the scalar integer instructions. In discussions about other architectures, people claimed that x86 is crippled by the lack of non-destructive instructions and will never be able to make up for it (without a drastic redesign or lots of extra hardware which consumes more power). But since it's already largely a RISC architecture internally anyway, I wondered whether simply executing a move and a dependent arithmetic operation as one micro-operation would make things more efficient at minimal cost.

sirrida · ‎05-11-2011

As I wrote in "Copy and modify", I wonder why this optimization is not state of the art. Why? Too expensive in terms of die space? Not worth the effort?
In some cases there is the possibility to circumvent the problem by making a copy but modifying the original (thereby letting original and copy change roles) and relying on superscalar execution.
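
To make the role-swapping workaround concrete, here is a toy latency model in Python. It assumes every instruction takes one cycle, unlimited issue width, perfect register renaming and no mov elimination; those are simplifications chosen for illustration, not measurements of any real processor, and finish_times is a made-up helper name.

```python
# Toy dataflow model: an instruction can start once all its source registers
# are ready and finishes one cycle later. All modelling choices are assumptions.
def finish_times(program):
    """program: list of (dst, [srcs]); returns the finish cycle of each instruction."""
    ready = {}  # register -> cycle at which its latest value is available
    times = []
    for dst, srcs in program:
        start = max((ready.get(r, 0) for r in srcs), default=0)
        ready[dst] = start + 1
        times.append(start + 1)
    return times

# mov eax, ebx ; add eax, ecx   (the add has to wait for the copy)
dependent = [("eax", ["ebx"]), ("eax", ["eax", "ecx"])]

# mov eax, ebx ; add ebx, ecx   (copy and original swap roles afterwards)
swapped = [("eax", ["ebx"]), ("ebx", ["ebx", "ecx"])]

print(finish_times(dependent))  # [1, 2]
print(finish_times(swapped))    # [1, 1]
```

In the swapped version the sum ends up in ebx and the old value in eax, so the code that follows has to use the two registers the other way round.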

A related case might be that some "complex" instructions such as jecxz, loop, enter (level 0) and leave are so slow even though their meaning is almost trivial; they are easily outperformed by a sequence of simpler instructions. Why? Probably for the same reasons that mov and an arithmetic instruction are not fused:

It seems that RISC-like instructions are still much faster than microcoded ones, and the effort to turn a fused or "complex" instruction into a new RISC-like operation is estimated to be too high for the expected gain.
Let's see what the next generations will bring...

sirrida · ‎06-12-2011

The next generation has just been presented as Haswell new instructions.
With the introduction of ANDN, BEXTR, RORX, SARX, SHLX, SHRX these new commands effectively solve our problem for some special cases, albeit with the aid of the compiler.

sirrida · ‎08-23-2011

It seems that performing the optimization described in "Copy and modify" is too complicated for the compiler writers, and that the hardware optimizer you describe is indeed more or less "mandatory".
Could it be that AMD, for example, already implemented your proposal, possibly years ago?
Has anyone done any benchmarking on processors other than the i7 and Atom N450?