Chris_S_3
Beginner
70 Views

µops and nops and LCPs

This question concerns Sandy Bridge, Haswell, and other Intel microarchitectures with a µop cache.

Since the pre-decode unit fetches 16-byte blocks, NOPs are often needed for alignment. It is better for basic blocks to start at a 16-byte-aligned address, and better for instructions not to straddle 16-byte boundaries. But NOPs consume resources (Optimization Manual 3.5.1.10). For example, the one-byte NOP (XCHG EAX, EAX) is decoded and saved as a µop in the µop cache. It is then eventually scheduled and retired.
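To make the alignment bookkeeping concrete, here is a minimal C sketch of the two checks described above. It is only an illustration: the helper names `pad_to_boundary` and `straddles_boundary` and the `FETCH_BLOCK` constant are hypothetical, and the 16-byte block size is simply the pre-decode fetch size assumed in this post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only. FETCH_BLOCK is the 16-byte
   pre-decode fetch size assumed in the post. */
enum { FETCH_BLOCK = 16 };

/* Number of padding bytes (NOPs or redundant prefixes) needed so that the
   next instruction starts on a FETCH_BLOCK boundary. */
static unsigned pad_to_boundary(uint64_t addr)
{
    return (unsigned)((FETCH_BLOCK - (addr % FETCH_BLOCK)) % FETCH_BLOCK);
}

/* True if an instruction of `len` bytes starting at `addr` has bytes in
   two different FETCH_BLOCK-sized blocks. */
static bool straddles_boundary(uint64_t addr, unsigned len)
{
    assert(len > 0);
    return (addr / FETCH_BLOCK) != ((addr + len - 1) / FETCH_BLOCK);
}
```

For example, `pad_to_boundary(0x1f)` is 1 (a single one-byte NOP reaches 0x20), and a 4-byte instruction at 0x1e straddles the boundary at 0x20 while the same instruction at 0x1c does not.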

So it would seem that avoiding NOPs would be a worthy goal for code living in the µop cache: less execution port pressure, and so on. Less is less.

Length-changing prefixes (LCPs) can serve a similar alignment purpose. Indeed, NOPs and LCPs are often combined. However, LCPs (not REX) incur a penalty of 3+ cycles in the decode units (3.4.2.3). It would seem that for code living in the µop cache, once that LCP stall penalty has been paid in full, the saving would be one fewer NOP µop.

But then 3.4.2.3 also says:

"If the LCP stall happens in a tight loop, it can cause significant performance degradation."

At this point my mental model is getting a headache. Still, this could be a pre-Sandy Bridge admonition. After all, it assumes that an LCP stall is happening, and this should not be the case for code living in the µop cache.

Q1: For code living in the µop cache (loops, etc.), is it better to avoid an aligning NOP in favor of an LCP on Sandy Bridge and later?

Q2: What happens with aligning NOPs that come after an unconditional branch? Do they end up in the µop cache, consuming resources? Also, I see a lot of 66 66 66 90 alignment code. Does this LCP-prefixed NOP suffer unnecessary LCP stalls?

BTW I'm aware of the general recommendations against LCPs but I'm wondering if anything changed with Sandy Bridge.

2 Replies
Chris_S_3

Well now, that didn't work AT ALL like I'd expected.

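/* Driver: calls the assembly test routine once and exits. */
extern void    nop_test(void);

int
main(int argc, char **argv)
{
    (void)argc;    /* unused */
    (void)argv;    /* unused */
    nop_test();
    return 0;
}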

global _nop_test
_nop_test:
    mov        RAX,1
    sal        RAX,36
    align    16
loop:
    db        66h
    movzx    EBP,BH
    shr        RBX,8
    lea        EBP,[EDX+8*EBP]
    sub        RAX,1
    align    16
    jnz        loop
    ret

On my Haswell MacBook Pro, this takes 1m6.817s with the 66H LCP and 0m37.114s with a NOP instead.

0000000000000010    660fb6ef            movzbw    %bh, %bp
0000000000000014    48c1eb08            shrq    $0x8, %rbx
0000000000000018    678d2cea            leal    (%edx,%ebp,8), %ebp
000000000000001c    4883e801            subq    $0x1, %rax
0000000000000020    75ee                jne    0x10

vs

0000000000000010    0fb6ef              movzbl    %bh, %ebp
0000000000000013    48c1eb08            shrq    $0x8, %rbx
0000000000000017    678d2cea            leal    (%edx,%ebp,8), %ebp
000000000000001b    4883e801            subq    $0x1, %rax
000000000000001f    90                  nop
0000000000000020    75ee                jne    0x10
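As a rough sanity check on those timings, the following C sketch converts a run time into average nanoseconds per loop iteration. It is an illustration under stated assumptions: the helper name `ns_per_iteration` is hypothetical, the iteration count is taken to be 1 << 36 (from the mov/sal pair that initializes RAX), and the reported time is assumed to be dominated by the loop rather than by process startup.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: average nanoseconds per loop iteration, where the
   loop counter starts at 1 << shift (as set up by mov/sal in the test). */
static double ns_per_iteration(double seconds, unsigned shift)
{
    assert(shift < 64);
    double iterations = (double)(UINT64_C(1) << shift);
    return seconds * 1e9 / iterations;
}
```

With 1m6.817s taken as 66.817 seconds, `ns_per_iteration(66.817, 36)` comes out at roughly 0.97 ns for the 66H version, and `ns_per_iteration(37.114, 36)` at roughly 0.54 ns for the NOP version. Converting these to cycles would require the actual core clock during the run, which is not stated here.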

Chris_S_3

But an address-size prefix (67H) on a register-only instruction seems to work: 0m3.248s vs 4.512s. Any explanations or caveats?

%use            smartalign
ALIGNMODE        generic,16
BITS            64

section            .text

global _nop_test
_nop_test:
    mov        RAX,1
    sal        RAX,31
    mov        RBX,0
    mov        EDX,0
    align    16
loop:
    db        67h
    movzx    EBP,BH
    sal        RBX,8
    lea        EBP,[EDX+8*EBP]
    sub        RAX,1
    align    16
    jnz        loop
    ret

loop:
0000000000000020    670fb6ef            movzbl    %bh, %ebp
0000000000000024    48c1e308            shlq    $0x8, %rbx
0000000000000028    678d2cea            leal    (%edx,%ebp,8), %ebp
000000000000002c    4883e801            subq    $0x1, %rax
0000000000000030    75ee                jne    0x20
