This question concerns Sandy Bridge, Haswell, and later Intel microarchitectures with a µop cache.
Since the pre-decode unit fetches 16-byte blocks, NOPs are necessary for alignment purposes. It is better for basic blocks to start at a 16-byte-aligned address, and better for instructions not to straddle 16-byte boundaries. But NOPs consume resources (Optimization Manual 3.5.1.10). For example, XCHG EAX, EAX (the one-byte NOP, 90h) is decoded and saved as a µop in the µop cache. It is then eventually scheduled and retired.
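The alignment arithmetic behind that paragraph can be sketched in C. This is only an illustration: the function names `pad_to_16` and `crosses_16` are made up for this example and do not come from the Optimization Manual or any assembler.

```c
#include <stdint.h>

/* Padding bytes needed to advance addr to the next 16-byte boundary
   (0 if addr is already aligned). */
unsigned pad_to_16(uint64_t addr)
{
    return (unsigned)((16u - (addr & 15u)) & 15u);
}

/* Nonzero when an instruction of len bytes (len >= 1) starting at addr
   spans two different 16-byte blocks. */
int crosses_16(uint64_t addr, unsigned len)
{
    return (addr >> 4) != ((addr + len - 1u) >> 4);
}
```

For instance, `pad_to_16(0x1f)` is 1 (a single one-byte NOP would do), and `crosses_16(0x1e, 4)` is nonzero because a 4-byte instruction at 0x1e runs into the block starting at 0x20.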
So it would seem that avoiding NOPs would be a worthy goal for code living in the µop cache. Less execution port pressure, ... Less is less.
Length-changing prefixes (LCPs) can serve a similar alignment purpose; indeed, NOPs and LCPs are combined. However, LCPs (REX excepted) incur a penalty of three or more cycles in the decode units (3.4.2.3). It would seem that for code living in the µop cache, once that LCP stall penalty has been paid in full, the saving would be one fewer NOP µop.
But then 3.4.2.3 also says:
"If the LCP stall happens in a tight loop, it can cause significant performance degradation."
At this point my mental model is getting a headache. Still, this could be a pre-Sandy Bridge admonition. After all, it assumes that an LCP stall is happening, and this should not be the case for code living in the µop cache.
Q1: For code living in the µop cache (loops, etc.), is it better to avoid an aligning NOP in favor of an LCP on Sandy Bridge and later?
Q2: What happens with aligning NOPs that come after an unconditional branch? Do they end up in the µop cache, consuming resources? Also, I see a lot of 66 66 66 90 alignment code. Does this LCP NOP suffer unnecessary LCP stalls?
BTW, I'm aware of the general recommendations against LCPs, but I'm wondering whether anything changed with Sandy Bridge.
Well now, that didn't work AT ALL like I'd expected.
extern void nop_test(void);   /* defined in the assembly file below */

int
main(int argc, char **argv)
{
    nop_test();
    return 0;
}
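global _nop_test
_nop_test:
        mov     RAX,1
        sal     RAX,36          ; loop count = 2^36
        align   16
loop:
        db      66h             ; operand-size prefix applied to the movzx below
        movzx   EBP,BH
        shr     RBX,8
        lea     EBP,[EDX+8*EBP]
        sub     RAX,1
        align   16              ; no padding with the 66h prefix; one NOP without it
        jnz     loop
        ret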
On my Haswell MacBook Pro, this takes 1m6.817s with the 66H LCP and 0m37.114s with a NOP.
0000000000000010 660fb6ef movzbw %bh, %bp
0000000000000014 48c1eb08 shrq $0x8, %rbx
0000000000000018 678d2cea leal (%edx,%ebp,8), %ebp
000000000000001c 4883e801 subq $0x1, %rax
0000000000000020 75ee jne 0x10
vs
0000000000000010 0fb6ef movzbl %bh, %ebp
0000000000000013 48c1eb08 shrq $0x8, %rbx
0000000000000017 678d2cea leal (%edx,%ebp,8), %ebp
000000000000001b 4883e801 subq $0x1, %rax
000000000000001f 90 nop
0000000000000020 75ee jne 0x10
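To put the two run times on a per-iteration footing, here is a small C sketch. It assumes the loop runs exactly 2^36 times (from `sal RAX,36`) and that the quoted times are dominated by the loop rather than process startup. The helper names are invented for this example, and the clock frequency passed to `cycles_per_iteration` is an assumption, since the thread does not state it.

```c
#include <stdint.h>

/* Average wall-clock nanoseconds per loop iteration, for a loop that
   runs 2^log2_iters times in `seconds` seconds (log2_iters < 64). */
double ns_per_iteration(double seconds, unsigned log2_iters)
{
    double iters = (double)((uint64_t)1 << log2_iters);
    return seconds * 1e9 / iters;
}

/* Hypothetical helper: converts nanoseconds to core cycles using an
   ASSUMED clock in GHz. The thread does not give the actual clock. */
double cycles_per_iteration(double ns, double assumed_ghz)
{
    return ns * assumed_ghz;
}
```

With the figures above, `ns_per_iteration(66.817, 36)` comes to roughly 0.97 ns for the 66H variant and `ns_per_iteration(37.114, 36)` to roughly 0.54 ns for the NOP variant, a ratio of about 1.8.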
But an address-size prefix (67H) on a register-only instruction seems to work: 0m3.248s vs 0m4.512s. Any explanations or caveats?
%use smartalign
ALIGNMODE generic,16
BITS 64
section .text
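global _nop_test
_nop_test:
        mov     RAX,1
        sal     RAX,31          ; loop count = 2^31
        mov     RBX,0
        mov     EDX,0
        align   16
loop:
        db      67h             ; address-size prefix applied to the movzx below
        movzx   EBP,BH
        sal     RBX,8
        lea     EBP,[EDX+8*EBP]
        sub     RAX,1
        align   16              ; no padding with the 67h prefix; one NOP without it
        jnz     loop
        ret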
loop:
0000000000000020 670fb6ef movzbl %bh, %ebp
0000000000000024 48c1e308 shlq $0x8, %rbx
0000000000000028 678d2cea leal (%edx,%ebp,8), %ebp
000000000000002c 4883e801 subq $0x1, %rax
0000000000000030 75ee jne 0x20
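One measurement caveat: the run times quoted in this thread look like whole-process times (the `0m3.248s` format suggests the shell's `time`, though that is an assumption), so they include program startup. A minimal C sketch of timing only the call itself follows. `cpu_seconds` and `sample_workload` are hypothetical names for this example; `sample_workload` is a stand-in so the block compiles on its own, and `nop_test` would be passed instead when linking against the assembly.

```c
#include <time.h>

/* Stand-in workload (hypothetical); pass nop_test instead when the
   assembly file is linked in. */
void sample_workload(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 1000000UL; ++i)
        x += i;
}

/* CPU seconds consumed by one call to fn, measured with the standard
   library's clock(). */
double cpu_seconds(void (*fn)(void))
{
    clock_t start = clock();
    fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

`clock()` is used here only because it is portable standard C; its resolution is coarse, which matters little for loops that run for seconds.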
