Micro-Op Retirement Constraint, IACA Prediction

JJoha8 · 12-08-2016

Hi,

While trying to get the L1 cache to deliver its peak bandwidth on Broadwell-EP, I came across a discrepancy between the documentation and IACA.
The Intel 64 and IA-32 Architectures Optimization Reference Manual says that there is a retirement limit of four micro-ops per cycle on Sandy Bridge (SNB). While it mentions that decoder throughput etc. has been improved over time, up to 6 uop/c on Skylake, nowhere does it say that the retirement limit has been lifted as well. Most of my measurements are in line with this. There is one measurement, however, in which the core seems to retire more than four micro-ops per cycle.

I'm using a standard STREAM triad to measure L1 bandwidth. My code is AVX vectorized and each loop iteration does eight AVX updates, i.e., it processes 32 double-precision elements (four per 256-bit register, matching the add rax, 0x20 increment). For the naive code, IACA predicts 12 cycles per iteration, which is correct:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - L1_stream_triad_avx.o
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 12.00 Cycles       Throughput Bottleneck: PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 4.0    0.0  | 4.0  | 12.0   8.0  | 12.0   8.0  | 8.0  | 1.0  | 1.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovapd ymm0, ymmword ptr [rsi+rax*8]
|   2    | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231pd ymm0, ymm8, ymmword ptr [rdx+rax*8]
|   2    |           |     | 1.0       |           | 1.0 |     |     |     | CP | vmovapd ymmword ptr [rcx+rax*8], ymm0
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovapd ymm1, ymmword ptr [rsi+rax*8+0x20]
|   2    |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231pd ymm1, ymm8, ymmword ptr [rdx+rax*8+0x20]
|   2    |           |     |           | 1.0       | 1.0 |     |     |     | CP | vmovapd ymmword ptr [rcx+rax*8+0x20], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovapd ymm2, ymmword ptr [rsi+rax*8+0x40]
|   2    | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231pd ymm2, ymm8, ymmword ptr [rdx+rax*8+0x40]
|   2    |           |     | 1.0       |           | 1.0 |     |     |     | CP | vmovapd ymmword ptr [rcx+rax*8+0x40], ymm2
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovapd ymm3, ymmword ptr [rsi+rax*8+0x60]
|   2    |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231pd ymm3, ymm8, ymmword ptr [rdx+rax*8+0x60]
|   2    |           |     |           | 1.0       | 1.0 |     |     |     | CP | vmovapd ymmword ptr [rcx+rax*8+0x60], ymm3
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovapd ymm4, ymmword ptr [rsi+rax*8+0x80]
|   2    | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231pd ymm4, ymm8, ymmword ptr [rdx+rax*8+0x80]
|   2    |           |     | 1.0       |           | 1.0 |     |     |     | CP | vmovapd ymmword ptr [rcx+rax*8+0x80], ymm4
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovapd ymm5, ymmword ptr [rsi+rax*8+0xa0]
|   2    |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231pd ymm5, ymm8, ymmword ptr [rdx+rax*8+0xa0]
|   2    |           |     |           | 1.0       | 1.0 |     |     |     | CP | vmovapd ymmword ptr [rcx+rax*8+0xa0], ymm5
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovapd ymm6, ymmword ptr [rsi+rax*8+0xc0]
|   2    | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231pd ymm6, ymm8, ymmword ptr [rdx+rax*8+0xc0]
|   2    |           |     | 1.0       |           | 1.0 |     |     |     | CP | vmovapd ymmword ptr [rcx+rax*8+0xc0], ymm6
|   1    |           |     | 1.0   1.0 |           |     |     |     |     | CP | vmovapd ymm7, ymmword ptr [rsi+rax*8+0xe0]
|   2    |           | 1.0 |           | 1.0   1.0 |     |     |     |     | CP | vfmadd231pd ymm7, ymm8, ymmword ptr [rdx+rax*8+0xe0]
|   2    |           |     |           | 1.0       | 1.0 |     |     |     | CP | vmovapd ymmword ptr [rcx+rax*8+0xe0], ymm7
|   1    |           |     |           |           |     | 1.0 |     |     |    | add rax, 0x20
|   1    |           |     |           |           |     |     | 1.0 |     |    | cmp rax, rdi
|   0F   |           |     |           |           |     |     |     |     |    | jl 0xffffffffffffff38
Total Num Of Uops: 42

There's a total of 8x3=24 load/store operations per iteration and only two AGUs (ports 2 and 3) capable of complex addressing, so the bound is 24/2=12 cycles per iteration. The measured value is 12.4 cycles per iteration.
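
The AGU argument above can be checked with a few lines of arithmetic. This is only a sketch: the helper name agu_bound_cycles is made up for illustration, and it assumes each memory operation occupies one of the two complex-addressing AGUs for exactly one cycle, which is the simple model used in the post.

```python
# Sketch: AGU-bound cycle estimate for the naive triad loop.
# All inputs are taken from the post; the helper name is hypothetical.

def agu_bound_cycles(updates: int, mem_ops_per_update: int, agu_ports: int) -> float:
    """Cycles per iteration if every memory op needs one cycle on one of `agu_ports` AGUs."""
    return updates * mem_ops_per_update / agu_ports

UPDATES = 8             # AVX updates per loop iteration
MEM_OPS_PER_UPDATE = 3  # two loads and one store, all with indexed addressing
AGU_PORTS = 2           # ports 2 and 3

predicted = agu_bound_cycles(UPDATES, MEM_OPS_PER_UPDATE, AGU_PORTS)
measured = 12.4         # cycles per iteration, as reported in the post
overhead = measured / predicted - 1.0

print(f"predicted: {predicted:.2f} cycles/iteration")
print(f"measured:  {measured:.2f} cycles/iteration ({overhead:.1%} above the bound)")
```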

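If I use a fast LEA to precompute the store address, the stores no longer need indexed addressing and can use the port 7 AGU. The prediction goes down, because the AGUs on ports 2 and 3 are no longer the bottleneck: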

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - L1_stream_triad_avx_lea.o
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 8.75 Cycles       Throughput Bottleneck: FrontEnd

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 4.0    0.0  | 4.0  | 8.0    8.0  | 8.0    8.0  | 8.0  | 1.5  | 1.5  | 8.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     |           |           |     | 1.0 |     |     |    | lea rbx, ptr [rcx+rax*8]
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rsi+rax*8]
|   2    |           | 1.0 |           | 1.0   1.0 |     |     |     |     |    | vfmadd231pd ymm0, ymm8, ymmword ptr [rdx+rax*8]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rbx], ymm0
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm1, ymmword ptr [rsi+rax*8+0x20]
|   2    | 0.9       | 0.1 |           | 1.0   1.0 |     |     |     |     |    | vfmadd231pd ymm1, ymm8, ymmword ptr [rdx+rax*8+0x20]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rbx+0x20], ymm1
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm2, ymmword ptr [rsi+rax*8+0x40]
|   2    | 0.1       | 0.9 |           | 1.0   1.0 |     |     |     |     |    | vfmadd231pd ymm2, ymm8, ymmword ptr [rdx+rax*8+0x40]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rbx+0x40], ymm2
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rsi+rax*8+0x60]
|   2    | 0.9       | 0.1 |           | 1.0   1.0 |     |     |     |     |    | vfmadd231pd ymm3, ymm8, ymmword ptr [rdx+rax*8+0x60]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rbx+0x60], ymm3
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm4, ymmword ptr [rsi+rax*8+0x80]
|   2    | 0.1       | 0.9 |           | 1.0   1.0 |     |     |     |     |    | vfmadd231pd ymm4, ymm8, ymmword ptr [rdx+rax*8+0x80]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rbx+0x80], ymm4
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm5, ymmword ptr [rsi+rax*8+0xa0]
|   2    | 1.0       |     |           | 1.0   1.0 |     |     |     |     |    | vfmadd231pd ymm5, ymm8, ymmword ptr [rdx+rax*8+0xa0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rbx+0xa0], ymm5
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm6, ymmword ptr [rsi+rax*8+0xc0]
|   2    |           | 1.0 |           | 1.0   1.0 |     |     |     |     |    | vfmadd231pd ymm6, ymm8, ymmword ptr [rdx+rax*8+0xc0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rbx+0xc0], ymm6
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm7, ymmword ptr [rsi+rax*8+0xe0]
|   2    | 1.0       |     |           | 1.0   1.0 |     |     |     |     |    | vfmadd231pd ymm7, ymm8, ymmword ptr [rdx+rax*8+0xe0]
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rbx+0xe0], ymm7
|   1    |           |     |           |           |     | 0.5 | 0.5 |     |    | add rax, 0x20
|   1    |           |     |           |           |     |     | 1.0 |     |    | cmp rax, rdi
|   0F   |           |     |           |           |     |     |     |     |    | jl 0xffffffffffffff3c
Total Num Of Uops: 43

IACA lists the FrontEnd as the bottleneck. Counting uops per iteration: 8 x (1 (AVX load) + 1 (AVX load) + 1 (AVX FMA) + 2 (AVX store)) + 1 (lea) + 1 (add) + 1 (macro-fused compare and branch) = 43 uops. At a retirement limit of four uop/c, that would correspond to 10.75 cycles per iteration, yet IACA predicts only 8.75 cycles. At first I thought IACA might be wrong, but the measured value is 9.76 cycles per iteration. That is clearly better than the light-speed estimate of 10.75 cycles implied by a four uop/c retirement limit. So apparently more than four uops can be retired per cycle in some cases. Is this true? Under what circumstances can the core retire more than four micro-ops per cycle? Or is something else happening here that I'm not seeing? And why is the measured value off by one cycle from the IACA prediction?
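
For reference, the uop count and the resulting bound can be reproduced as follows. This is a rough sketch using only the numbers quoted in this post; the dictionary keys and variable names are illustrative, and the model assumes every uop in the IACA "Num Of Uops" column consumes one slot of a four-wide limit.

```python
# Sketch: unfused uop count for the lea variant and the bound it implies.
# Per-update uop counts follow the breakdown given in the post.
UOPS_PER_UPDATE = {"load": 1, "load_for_fma": 1, "fma": 1, "store": 2}
LOOP_OVERHEAD = {"lea": 1, "add": 1, "cmp+jl (macro-fused)": 1}
UPDATES = 8          # AVX updates per loop iteration
WIDTH = 4            # assumed uops per cycle limit

total_uops = UPDATES * sum(UOPS_PER_UPDATE.values()) + sum(LOOP_OVERHEAD.values())
bound_cycles = total_uops / WIDTH

measured_cycles = 9.76                       # measured, from the post
implied_rate = total_uops / measured_cycles  # uops per cycle the measurement implies

print(f"total uops:   {total_uops}")
print(f"4-wide bound: {bound_cycles:.2f} cycles/iteration")
print(f"implied rate: {implied_rate:.2f} uops/cycle at {measured_cycles} cycles")
```

If the implied rate comes out above four, either the limit does not apply to this count or the count itself is too high for the domain in which the limit is enforced.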

Thanks

JJoha8 · 12-08-2016

I've noticed that IACA reports micro-op fusion (the ^ marker) for the stores in the lea version, where the store address generation goes to the new store AGU on port 7. Is it correct that in that case a store counts as only one uop? Then the loop would have 43 - 8 = 35 fused uops, which at four uop/c gives 35/4 = 8.75 cycles. In that case the IACA prediction would be correct, and the measured value of 9.76 cycles would be in line with it, since it no longer beats the bound.
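
A quick check of this hypothesis, again as a sketch built only on the figures in this thread: it assumes each of the eight micro-fused stores counts as a single uop against the four-wide limit and that nothing else in the loop fuses (the variable names are illustrative).

```python
# Sketch: fused-domain uop count, assuming each micro-fused store counts once.
UNFUSED_UOPS = 43    # "Total Num Of Uops" reported by IACA for the lea variant
FUSED_STORES = 8     # stores marked with ^ in the IACA listing
WIDTH = 4            # assumed uops per cycle limit

fused_uops = UNFUSED_UOPS - FUSED_STORES   # each fused store saves one slot
fused_bound = fused_uops / WIDTH

measured = 9.76                            # cycles per iteration, from the post
implied_fused_rate = fused_uops / measured

print(f"fused uops:   {fused_uops}")
print(f"4-wide bound: {fused_bound:.2f} cycles/iteration")
print(f"implied rate: {implied_fused_rate:.2f} fused uops/cycle")
```

Under these assumptions the bound matches the 8.75 cycles IACA reports, and the measured run stays below four fused uops per cycle.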