Let's look at a very simple pointer chasing loop in IACA:
.loop:
    IACA_START
    mov rdx, [rdx]
    dec rax
    jne .loop
    IACA_END
This loop does nothing but chase the pointer in rdx and count down rax. Obviously this loop is limited by the latency of the mov rdx, [rdx] instruction, and we know from ample documentation from Intel and elsewhere that a simple read (i.e., not using complex addressing, and not using SSE/AVX regs) has a latency of 4 cycles. Finally, we can simply test it (for 1e9 iterations):
Performance counter stats for './iaca-test':

    3,000,000,363      instructions:u      #    0.75  insns per cycle
    4,007,163,191      cycles:u
    1,000,000,358      branches:u

So 4 cycles per iteration, as we already expected.
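To make the arithmetic explicit, here is a minimal sketch that derives the per-iteration cost from the counters above. It assumes the branch count is a good proxy for the iteration count (one jne per iteration, plus a few hundred branches of startup noise); the variable names are just for illustration.

```python
# Counter values copied from the perf output above.
instructions = 3_000_000_363
cycles = 4_007_163_191
branches = 1_000_000_358  # assumption: roughly one taken jne per loop iteration

# Cycles spent per trip around the loop, and overall instructions per cycle.
cycles_per_iteration = cycles / branches
ipc = instructions / cycles

print(f"cycles per iteration:   {cycles_per_iteration:.2f}")
print(f"instructions per cycle: {ipc:.2f}")
```

The result sits just above 4 cycles per iteration, consistent with a 4-cycle load-to-use latency plus a small amount of startup overhead.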
Yet IACA gets it wrong. For the same binary:
Throughput Analysis Report
--------------------------
Block Throughput: 5.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.0  | 0.5    0.5  | 0.5    0.5  | 0.0  | 0.0  | 0.5  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | mov rdx, qword ptr [rdx]
|   1    | 0.5       |     |           |           |     |     | 0.5 |     |    | dec rax
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffff2
Total Num Of Uops: 2
IACA seems to think the reads have a latency of 5. What's up with that? The -analysis LATENCY option makes the same mistake.
Aren't you counting "dec rax" as well?
Shouldn't you put "IACA_END" directly after "mov" if you want to analyze only that line of code?
No, I didn't count "dec rax" (or the jne) because IACA isn't just adding up the latencies of all the instructions (that would be pretty pointless). IACA is looking for the bottleneck in the range of instructions marked, as if they were a loop. The bottleneck may be a loop-carried dependency (as it is here), pressure on a particular port, an overall limit such as the front-end limit of 4 fused-domain uops per cycle, or other ones I haven't seen.
Here, only the pointer chase through [rdx] is part of the critical path (notice the "CP" indicator in the instruction output), and so only its latency is used for the throughput calculation. Through the magic of superscalar execution, both the dec and the jne are executed in parallel and in the "shadow" of the other instruction, and hence come "for free" (in fact, they are macro-fused, so only take one uop together, as indicated in the output).
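As a rough sketch of that reasoning (a deliberately simplified model, not IACA's actual algorithm), the estimate is just the largest of the three bounds described above. The function name and the issue_width parameter are made up for illustration; the port pressures and uop count come from the IACA report for the full loop.

```python
def estimate_cycles_per_iteration(chain_latency, fused_uops, port_pressure, issue_width=4):
    """Simplified bottleneck model: the slowest of three limits wins.

    chain_latency: latency (cycles) of the loop-carried dependency chain
    fused_uops:    fused-domain uops issued per iteration
    port_pressure: mapping of port number -> cycles of pressure per iteration
    issue_width:   assumed front-end limit in fused-domain uops per cycle
    """
    bounds = {
        "latency": chain_latency,
        "front-end": fused_uops / issue_width,
        "port": max(port_pressure.values()),
    }
    bottleneck = max(bounds, key=bounds.get)
    return bounds[bottleneck], bottleneck


# Port pressure per iteration, taken from the report for the three-instruction loop.
port_pressure = {0: 0.5, 2: 0.5, 3: 0.5, 6: 0.5}

# With the measured 4-cycle load latency versus the 5 cycles IACA appears to assume.
measured = estimate_cycles_per_iteration(4, 2, port_pressure)
iaca = estimate_cycles_per_iteration(5, 2, port_pressure)

print("measured latency:", measured)
print("IACA's latency:  ", iaca)
```

In both cases the loop-carried load dominates: the front-end bound (2 uops / 4 per cycle) and the port bound (0.5 cycles) are far below the chain latency, so the dec and jne never show up in the estimate.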
You could, of course, move the IACA_END marker to just look at the mov, but the result is exactly the same (incorrectly reporting 5 cycles per access):
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - iaca-test
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 5.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.0    0.0  | 0.0  | 0.5    0.5  | 0.5    0.5  | 0.0  | 0.0  | 0.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | mov rdx, qword ptr [rdx]
Total Num Of Uops: 1