Let me propose a replacement encoding for loading tiny values, "mov reg,const". By utilizing the unused and reserved form "lea reg,reg" (8d 11rrrxxx) I propose the following meaning:
rrr is the target register, and xxx shall be encoded as follows:
000: a signed byte follows, i.e. -128..127
001: 1
010: 2
011: maximum signed value (for 32 bit: 0x7fffffff)
100: a signed word follows, i.e. -32768..32767
101: minimum signed value (for 32 bit: 0x80000000)
110: -2
111: -1
Examples:
"mov eax,2": 8d c2 (2 bytes)
"mov eax,5": 8d c0 05 (3 bytes)
"mov eax,32767": 8d c4 ff 7f (4 bytes)
All these encodings are better than b8 xx xx xx xx (5 bytes).
Needless to say, the usual prefixes for word and qword operands should work as well. Example: "mov rax,-2": 48 8d c6 (3 bytes, ModRM = 11 000 110 for rax and -2) instead of 48 c7 c0 fe ff ff ff (7 bytes)
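To make the table above concrete, here is a minimal Python sketch of an encoder for the proposed form. It only follows the rules stated in this post; the function name `encode_mov_const`, the register numbering 0..7 (0 = eax/rax) and the restriction to the eight legacy registers are my own assumptions for illustration, and no existing assembler implements this.

```python
import struct

def encode_mov_const(reg, value, bits=32):
    """Encode 'mov reg,value' in the proposed '8d 11rrrxxx' form (hypothetical).

    reg   -- register number 0..7 (0 = eax/rax); extended registers are not handled
    value -- signed constant to load
    bits  -- operand size: 16, 32 or 64
    """
    if not 0 <= reg <= 7:
        raise ValueError("only registers 0..7 are covered by this sketch")
    prefixes = {16: b"\x66", 32: b"", 64: b"\x48"}  # operand-size prefix / REX.W
    if bits not in prefixes:
        raise ValueError("bits must be 16, 32 or 64")

    max_signed = (1 << (bits - 1)) - 1
    min_signed = -(1 << (bits - 1))
    # Constants that need no trailing bytes, keyed by value -> xxx field.
    fixed = {1: 0b001, 2: 0b010, max_signed: 0b011,
             min_signed: 0b101, -2: 0b110, -1: 0b111}

    if value in fixed:
        xxx, tail = fixed[value], b""
    elif -128 <= value <= 127:
        xxx, tail = 0b000, struct.pack("<b", value)   # signed byte follows
    elif -32768 <= value <= 32767:
        xxx, tail = 0b100, struct.pack("<h", value)   # signed word follows
    else:
        raise ValueError("value needs the ordinary mov reg,imm encoding")

    modrm = 0b11000000 | (reg << 3) | xxx             # mod=11, reg=rrr, r/m=xxx
    return prefixes[bits] + bytes([0x8D, modrm]) + tail

print(encode_mov_const(0, 2).hex(" "))      # 8d c2
print(encode_mov_const(0, 5).hex(" "))      # 8d c0 05
print(encode_mov_const(0, 32767).hex(" "))  # 8d c4 ff 7f
```

The fixed constants are checked first, so a value like 1 or -1 never spends a byte on an immediate.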
Since this kind of mov command is used very often, I think this would be a considerable improvement.
I really don't think the decoding complication is worth the instruction size reduction. This is the RISC vs. CISC debate all over again. And I'm pretty sure the transistors can be invested into something that offers greater benefit. Constants almost always get propagated by the compiler anyway, so these instructions don't occur nearly as often as you seem to think.
Another critically important thing to take into account is that you don't want to break compatibility without strong justification. There's a very real risk that several years from now people carelessly compile their code using an extension like this, and it would crash on a large number of systems without support for this extension. Intel would be blamed for adding instructions that really aren't that critical.
Please don't let this discourage you. You clearly care about performance and carefully studied the encoding formats. But try to look at the bigger picture and think long term. Personally I think multi-core and throughput computing poses much bigger challenges. Perhaps a 'voting' instruction and gather/scatter support help in these areas, but I'm sure other suggestions exist that are worth exploring...
Well, the need for a denser encoding seems to be there, since I have sometimes seen replacements for assigning small constants:
* 0: xor reg,reg (same register, 2 bytes): Specially optimized by the processor! This is the reason I left out 0.
* -1: or reg,-1 (3 bytes)
* -128..127: "push imm8 / pop reg" (3 bytes): Obviously very slow because of accessing memory.
The special case of 0 is *very* common and (almost?) always treated specially. In my experience assignments of small constants to registers occur quite often. Nearly all the arithmetic commands have a special notation for 8 bit immediate values: and,or,xor,cmp,add,adc,sub,sbb - and there is inc/dec...
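The byte counts quoted for these idioms can be checked by assembling them by hand. The following Python sketch emits the standard 32-bit encodings (31 /r, 83 /1 ib, 6a ib, 58+r, b8+r id); the helper names are made up for this example, and registers are numbered 0..7 with 0 = eax.

```python
def xor_self(reg):
    """xor reg,reg: opcode 31 with ModRM mod=11, reg and r/m both set to reg."""
    return bytes([0x31, 0xC0 | (reg << 3) | reg])

def or_minus_one(reg):
    """or reg,-1: opcode 83 /1 with a sign-extended 8-bit immediate."""
    return bytes([0x83, 0xC8 | reg, 0xFF])

def push_pop(reg, imm8):
    """push imm8 / pop reg: 6a ib followed by 58+r."""
    return bytes([0x6A, imm8 & 0xFF, 0x58 | reg])

def mov_imm32(reg, imm32):
    """mov reg,imm32: b8+r followed by a 4-byte little-endian immediate."""
    return bytes([0xB8 | reg]) + (imm32 & 0xFFFFFFFF).to_bytes(4, "little")

for name, code in [("xor eax,eax", xor_self(0)),
                   ("or eax,-1", or_minus_one(0)),
                   ("push 5 / pop eax", push_pop(0, 5)),
                   ("mov eax,5", mov_imm32(0, 5))]:
    print(f"{name:18} {code.hex(' '):16} {len(code)} bytes")
```

This only compares sizes; it says nothing about how fast each sequence runs.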
BTW: In the proposal I have replaced the constants 3 and -3 with the min/max values, which seem to be much more common.
As an example of the introduction of a new command: AMD has introduced lzcnt, which is quite easily replaced by a small sequence around bsr. I see no real advantage in this new command, i.e. it is seldom needed and easily replaced.
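To show what "a small sequence around bsr" has to do, here is a small Python model of the two operations for 32-bit values. `bsr32` and `lzcnt32` are illustrative names, not real intrinsics. The one fact the model relies on is that bsr gives no usable result for a zero source, so the replacement needs an extra zero check that lzcnt does not.

```python
def bsr32(x):
    """Model of bsr: index of the highest set bit, or None for x == 0
    (the real instruction leaves no usable result in that case)."""
    x &= 0xFFFFFFFF
    if x == 0:
        return None
    return x.bit_length() - 1

def lzcnt32(x):
    """Model of lzcnt built on bsr: count of leading zero bits in 32 bits."""
    x &= 0xFFFFFFFF
    if x == 0:
        return 32            # the zero case must be handled separately
    return 31 - bsr32(x)     # same as bsr32(x) ^ 31 for results in 0..31

print(lzcnt32(1), lzcnt32(0x00010000), lzcnt32(0x80000000), lzcnt32(0))
# 31 15 0 32
```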
My proposal does not break compatibility, since it uses an encoding which currently traps as an illegal opcode.
What do you mean by 'voting' instruction? Is there a place where I can vote for new instructions?
Your arguments apply to all new or changed commands. One must pay attention when using new commands. Nowadays compilers assume cmov and sse2 to be present. What if I use my old Pentium II laptop?
After all this is only a proposal, in other words a hint that things can be enhanced - and it is not us who decide if the proposal will be implemented. We'll see what happens - I hope that Intel and AMD folks are listening.
Indeed setting a register to zero is very common, and using a xor for that also helps because the register renamer recognizes that there's no dependency. But other values don't occur nearly as often, or get propagated into an arithmetic operation. As you noted yourself, arithmetic operations can work with 8-bit immediate values because they suffice in many cases.
Also, just because some compilers use clever tricks to set other register values using fewer instruction bytes, doesn't mean it's really all that necessary.
LZCNT is a different story because it actually replaces a sequence of 4 instructions. It's a whole lot faster, and I can imagine that for some very specific algorithms like compression, encryption and/or scientific computing it makes a significant difference. I bet supercomputing clients have a big say in which instructions are considered as an extension.
Indeed your proposal doesn't break compatibility, but using the same opcode for different purposes seems at least a bit problematic to me. Also note that the pre-decoder has to quickly determine the length of each instruction, so any deviation from the standard formats would likely complicate things and potentially have a performance impact on everything else. Since you're merely saving a few instruction bytes in rarely occurring code, I fear it's just not worth it.
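The length concern can be illustrated with a rough Python sketch of a length calculation for the proposed encoding only. Under the proposal, the number of trailing bytes depends on the low three ModRM bits (xxx) even though mod is 11. The function name is hypothetical, and the prefix handling is a simplifying assumption: only 0x66 and, as in 64-bit mode, REX bytes 0x40..0x4f are skipped.

```python
def proposed_length(code):
    """Length in bytes of one proposed '8d 11rrrxxx' instruction at the start of code."""
    i = 0
    # Assumption: only the operand-size prefix and REX bytes may precede the opcode.
    while code[i] == 0x66 or 0x40 <= code[i] <= 0x4F:
        i += 1
    if code[i] != 0x8D or (code[i + 1] >> 6) != 0b11:
        raise ValueError("not the proposed lea reg,reg form")
    xxx = code[i + 1] & 0b111
    # The xxx field alone decides whether an immediate follows:
    # 000 -> one signed byte, 100 -> one signed word, anything else -> none.
    trailing = {0b000: 1, 0b100: 2}.get(xxx, 0)
    return i + 2 + trailing

print(proposed_length(bytes.fromhex("8dc2")))      # 2
print(proposed_length(bytes.fromhex("8dc005")))    # 3
print(proposed_length(bytes.fromhex("8dc4ff7f")))  # 4
```

In software this is a single table lookup; whether it is equally cheap in a hardware pre-decoder is exactly the open question raised above.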
Regarding the "voting" instruction(s) I was thinking of the bakery algorithm, implemented centrally in hardware so it has a bounded latency. Anyway, I haven't really thought that through, but I'm quite sure that as the number of cores increases we need all the hardware assistance we can get to maximize the scaling efficiency. In my humble opinion that's going to be way more important than shaving off a byte or a cycle here and there.
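For readers who don't know it, the bakery algorithm is Lamport's ticket-based mutual exclusion scheme: each thread takes a number and the lowest number enters first. Below is a minimal software sketch in Python, purely to show the algorithm. It is not the hardware instruction speculated about above. The name `run_bakery` and its parameters are made up for this example, and the sketch assumes that reads and writes of list elements are seen in program order by all threads, which CPython provides in practice.

```python
import threading
import time

def run_bakery(n_threads=3, rounds=20):
    """Increment a shared counter under a bakery lock; returns the final count."""
    choosing = [False] * n_threads   # True while thread i is picking a ticket
    number = [0] * n_threads         # ticket of thread i, 0 = not interested
    shared = {"counter": 0}

    def lock(i):
        choosing[i] = True
        number[i] = 1 + max(number)  # take a ticket higher than any seen
        choosing[i] = False
        for j in range(n_threads):
            while choosing[j]:       # wait until j has finished picking
                time.sleep(0)
            # Wait while j holds a smaller ticket (ties broken by thread index).
            while number[j] != 0 and (number[j], j) < (number[i], i):
                time.sleep(0)

    def unlock(i):
        number[i] = 0

    def worker(i):
        for _ in range(rounds):
            lock(i)
            seen = shared["counter"]
            time.sleep(0)            # widen the race window on purpose
            shared["counter"] = seen + 1
            unlock(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["counter"]

print(run_bakery(3, 20))  # 60 if mutual exclusion holds
```

Each thread scans every other thread's state before entering, so the waiting cost grows with the number of participants.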