
Chris_S_3 · ‎04-16-2015

This question concerns Sandy Bridge, Haswell, and other Intel microarchitectures with a µop cache.

Since the pre-decode unit fetches 16-byte blocks, NOPs are needed for alignment purposes: it is better for basic blocks to start at a 16-byte-aligned address, and better for instructions not to straddle 16-byte boundaries. But NOPs consume resources (Optimization Manual 3.5.1.10). For example, a NOP (the XCHG EAX, EAX encoding, 90H) is decoded and saved as a µop in the µop cache. It is then eventually scheduled and retired.
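The alignment arithmetic in the paragraph above can be sketched in a few lines of C. This is only an illustration: the helper names `pad_to_16` and `crosses_16` are hypothetical, addresses are treated as plain integers, and the 16-byte block size is taken from the pre-decode fetch width mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Block size assumed from the 16-byte pre-decode fetch discussed above. */
#define FETCH_BLOCK 16u

/* Padding bytes needed so the next instruction starts on a 16-byte
   boundary (0 if addr is already aligned). Hypothetical helper. */
static unsigned pad_to_16(uint64_t addr)
{
    return (unsigned)((FETCH_BLOCK - (addr % FETCH_BLOCK)) % FETCH_BLOCK);
}

/* True if an instruction of len bytes (len >= 1) starting at addr
   straddles a 16-byte boundary. Hypothetical helper. */
static bool crosses_16(uint64_t addr, unsigned len)
{
    return (addr / FETCH_BLOCK) != ((addr + len - 1) / FETCH_BLOCK);
}
```

For instance, a 2-byte branch placed at 0x1f would straddle the boundary at 0x20, so `crosses_16(0x1f, 2)` is true, and `pad_to_16(0x1f)` is 1: a single byte of padding (a NOP, or a prefix on an earlier instruction) moves the branch to 0x20.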

So it would seem that avoiding NOPs would be a worthy goal for code living in the µop cache. Less execution port pressure, ... Less is less.

Length-changing prefixes (LCPs) can serve a similar alignment purpose; indeed, NOPs and LCPs are combined. However, LCPs (not REX) incur a penalty (3+ cycles) in the decode units (3.4.2.3). It would seem that for code living in the µop cache, once that LCP stall penalty has been paid in full, the saving would be one less NOP µop.

But then 3.4.2.3 also says:

If the LCP stall happens in a tight loop, it can cause significant performance degradation

At this point my mental model is getting a headache. Still, this could be a pre-Sandy Bridge admonition. After all, it assumes that an LCP stall is happening, and this should not be the case for code living in the µop cache.

Q1: For code living in the µop cache (loops etc.), is it better to avoid an aligning NOP in favor of an LCP on Sandy Bridge+?

Q2: What happens with aligning NOPs which come after an unconditional branch? Do they end up in the µop cache consuming resources? Also, I see a lot of 66 66 66 90 alignment code. Does this LCP NOP suffer unnecessary LCP stalls?

BTW I'm aware of the general recommendations against LCPs but I'm wondering if anything changed with Sandy Bridge.

Chris_S_3 · ‎04-16-2015

Well now, that didn't work AT ALL like I'd expected.

extern void nop_test(void);

int
main(int argc, char **argv)
{
    nop_test();
    return 0;
}

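global _nop_test
_nop_test:
   mov       RAX,1
   sal       RAX,36            ; loop counter = 2^36 iterations
   align     16                ; start the loop on a 16-byte boundary
loop:
   db        66h               ; operand-size prefix on the movzx below; remove it for the NOP variant
   movzx     EBP,BH
   shr       RBX,8
   lea       EBP,[EDX+8*EBP]
   sub       RAX,1
   align     16                ; emits a NOP only when the 66h prefix is removed
   jnz       loop
   ret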

On my Haswell MacBook Pro, with the 66H LCP this takes 1m6.817s and with a NOP it takes 0m37.114s.

0000000000000010   660fb6ef     movzbw   %bh, %bp
0000000000000014   48c1eb08     shrq   $0x8, %rbx
0000000000000018   678d2cea     leal   (%edx,%ebp,8), %ebp
000000000000001c   4883e801     subq   $0x1, %rax
0000000000000020   75ee     jne   0x10

vs

0000000000000010   0fb6ef     movzbl   %bh, %ebp
0000000000000013   48c1eb08     shrq   $0x8, %rbx
0000000000000017   678d2cea     leal   (%edx,%ebp,8), %ebp
000000000000001b   4883e801     subq   $0x1, %rax
000000000000001f   90     nop
0000000000000020   75ee     jne   0x10
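For scale, the two reported times can be converted to time per loop iteration. The sketch below is plain arithmetic under two assumptions: the loop runs 2^36 times (the counter set up by `mov RAX,1` / `sal RAX,36`), and process startup overhead is negligible. The function name `ns_per_iteration` is hypothetical, and because the clock frequency during the runs is not stated, no conversion to cycles is attempted.

```c
#include <stdint.h>

/* Average nanoseconds per loop iteration for a run that took `seconds`
   and executed 2^shift iterations. Hypothetical helper; ignores
   process startup overhead. */
static double ns_per_iteration(double seconds, unsigned shift)
{
    uint64_t iterations = (uint64_t)1 << shift;   /* e.g. shift = 36 */
    return seconds * 1e9 / (double)iterations;
}
```

With the figures above, `ns_per_iteration(66.817, 36)` comes to a little under 1 ns per iteration for the 66H version, and `ns_per_iteration(37.114, 36)` to roughly 0.54 ns for the NOP version.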

Chris_S_3 · ‎04-16-2015

But an address-size change (67H) on a register-only instruction seems to work: 0m3.248s vs 0m4.512s. Any explanations or caveats?

%use           smartalign
ALIGNMODE       generic,16
BITS           64

section           .text

global _nop_test
_nop_test:
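   mov       RAX,1
   sal       RAX,31            ; loop counter = 2^31 iterations
   mov       RBX,0
   mov       EDX,0
   align     16                ; start the loop on a 16-byte boundary
loop:
   db        67h               ; address-size prefix on the register-only movzx below
   movzx     EBP,BH
   sal       RBX,8
   lea       EBP,[EDX+8*EBP]
   sub       RAX,1
   align     16                ; no padding needed here: jnz already lands on a 16-byte boundary
   jnz       loop
   ret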

loop:
0000000000000020   670fb6ef     movzbl   %bh, %ebp
0000000000000024   48c1e308     shlq   $0x8, %rbx
0000000000000028   678d2cea     leal   (%edx,%ebp,8), %ebp
000000000000002c   4883e801     subq   $0x1, %rax
0000000000000030   75ee     jne   0x20

µops and nops and LCPs