IACA gets memory read latency wrong?

Travis_D_ — Tue, 18 Oct 2016 00:45:51 GMT

Let's look at a very simple pointer chasing loop in IACA:

.loop:
	IACA_START
	mov rdx, [rdx]
	dec rax
	jne .loop
	IACA_END

This loop does nothing but chase the pointer in rdx and count down rax. Obviously it is limited by the latency of the mov rdx, [rdx] instruction, and ample documentation from Intel and elsewhere says that a simple load (i.e., one not using complex addressing and not targeting SSE/AVX registers) has a latency of 4 cycles. Finally, we can simply test it (for 1e9 iterations):

 Performance counter stats for './iaca-test':

     3,000,000,363      instructions:u            #    0.75  insns per cycle        
     4,007,163,191      cycles:u                                                    
     1,000,000,358      branches:u

So 4 cycles per iteration, as expected.
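For anyone who wants to check the arithmetic, here is a minimal sketch that derives the per-iteration figures from the counters above. The variable names are made up for illustration, and the iteration count is taken as the stated 1e9 rather than read from a counter.

```python
# Counter values copied from the perf output above.
instructions = 3_000_000_363
cycles = 4_007_163_191
iterations = 1_000_000_000  # assumption: exactly 1e9 loop iterations, as stated

cycles_per_iteration = cycles / iterations
insns_per_cycle = instructions / cycles

print(f"cycles per iteration: {cycles_per_iteration:.3f}")  # 4.007
print(f"insns per cycle:      {insns_per_cycle:.2f}")       # 0.75
```

The few million cycles beyond an even 4e9 are presumably startup and measurement overhead rather than anything in the loop itself.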

Yet IACA gets it wrong. For the same binary:

Throughput Analysis Report
--------------------------
Block Throughput: 5.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.0  | 0.5    0.5  | 0.5    0.5  | 0.0  | 0.0  | 0.5  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | mov rdx, qword ptr [rdx]
|   1    | 0.5       |     |           |           |     |     | 0.5 |     |    | dec rax
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffff2
Total Num Of Uops: 2

IACA seems to think the reads have a latency of 5 cycles. What's up with that? The -analysis LATENCY option makes the same mistake.

Matthias_H_Intel — Tue, 18 Oct 2016 04:59:23 GMT

Aren't you counting "dec rax" as well?

Shouldn't you put "IACA_END" directly after "mov" if you want to analyze only that line of code?

Travis_D_ — Tue, 18 Oct 2016 23:55:58 GMT

No, I didn't count "dec rax" (or the jne) because IACA isn't just adding up the latencies of all the instructions (that would be pretty pointless). IACA looks for the bottleneck in the marked range of instructions, treating them as the body of a loop. The bottleneck may be a loop-carried dependency (as it is here), pressure on a particular port, an overall limit such as the front-end limit of 4 fused-domain uops per cycle, or some other limit I haven't run into.

Here, only the pointer chase through [rdx] is on the critical path (notice the "CP" indicator in the instruction output), so only its latency is used for the throughput calculation. Through the magic of superscalar execution, the dec and the jne execute in parallel with the load, in its "shadow", and hence come "for free" (in fact, they are macro-fused, so together they take only one uop, as indicated in the output).
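To make that reasoning concrete, here is a rough sketch of the kind of "take the worst limit" model described above. This is not IACA's actual algorithm, only a simplified illustration; the function name loop_bound and its parameters are hypothetical, and the port numbers are copied from the first report.

```python
def loop_bound(carried_latency, port_cycles, fused_uops, issue_width=4):
    """Return (cycles per iteration, limiting factor) for a simplified loop model."""
    candidates = {
        "loop-carried dependency": carried_latency,
        "port pressure": max(port_cycles.values()),
        "front-end": fused_uops / issue_width,
    }
    # The loop can run no faster than its slowest limit.
    name = max(candidates, key=candidates.get)
    return candidates[name], name

# Per-port cycles and uop count copied from the first IACA report.
ports = {"p0": 0.5, "p2": 0.5, "p3": 0.5, "p6": 0.5}
uops = 2  # the mov, plus the macro-fused dec/jne

measured = loop_bound(4, ports, uops)  # 4-cycle load latency, as measured
reported = loop_bound(5, ports, uops)  # load latency IACA appears to assume

print(measured)
print(reported)
```

With either latency the loop-carried dependency dominates: port pressure and the front end each account for only 0.5 cycles per iteration, so the reported block throughput depends entirely on which load latency the tool assumes.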

You could, of course, move the IACA_END marker to just look at the mov, but the result is exactly the same (incorrectly reporting 5 cycles per access):

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - iaca-test
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 5.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.0    0.0  | 0.0  | 0.5    0.5  | 0.5    0.5  | 0.0  | 0.0  | 0.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | mov rdx, qword ptr [rdx]
Total Num Of Uops: 1