<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic No, I didn't count "dec rax" in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/IACA-gets-memory-read-latency-wrong/m-p/1117251#M6141</link>
    <description>&lt;P&gt;No, I didn't count "dec rax" (or the jne), because IACA isn't just adding up the latencies of all the instructions (that would be pretty pointless). IACA looks for the bottleneck in the marked range of instructions, treating it as a loop body. The bottleneck may be a loop-carried dependency (as it is here), pressure on a particular port, an overall limit such as the 4 fused-domain uops per cycle in the front-end, or something else I haven't run into.&lt;/P&gt;

&lt;P&gt;Here, only the pointer chase through [rdx] is part of the critical path (notice the "CP" indicator in the instruction output), so only its latency counts toward the throughput calculation. Thanks to superscalar execution, the dec and the jne execute in parallel with the load, in its "shadow", and hence come "for free" (in fact, they are macro-fused and take only one uop together, as indicated in the output).&lt;/P&gt;

&lt;P&gt;You could, of course, move the IACA_END marker to just look at the mov, but the result is exactly the same (incorrectly reporting 5 cycles per access):&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - iaca-test
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 5.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.0    0.0  | 0.0  | 0.5    0.5  | 0.5    0.5  | 0.0  | 0.0  | 0.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | mov rdx, qword ptr [rdx]
Total Num Of Uops: 1
&lt;/PRE&gt;

</description>
    <pubDate>Tue, 18 Oct 2016 23:55:58 GMT</pubDate>
    <dc:creator>Travis_D_</dc:creator>
    <dc:date>2016-10-18T23:55:58Z</dc:date>
    <item>
      <title>IACA gets memory read latency wrong?</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/IACA-gets-memory-read-latency-wrong/m-p/1117249#M6139</link>
      <description>&lt;P&gt;Let's look at a &lt;EM&gt;very &lt;/EM&gt;simple pointer chasing loop in IACA:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;.loop:
	IACA_START
	mov rdx, [rdx]
	dec rax
	jne .loop
	IACA_END&lt;/PRE&gt;

&lt;P&gt;This loop does nothing but chase the pointer in rdx and count down rax. Obviously this loop is limited by the latency of the mov rdx, [rdx] instruction, and we know from ample documentation from Intel and elsewhere that a simple read (i.e., one not using complex addressing and not using SSE/AVX regs) has a latency of 4 cycles. Finally, we can simply test it (for 1e9 iterations):&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt; Performance counter stats for './iaca-test':

     3,000,000,363      instructions:u            #    0.75  insns per cycle        
     4,007,163,191      cycles:u                                                    
     1,000,000,358      branches:u                                 &lt;/PRE&gt;

&lt;P&gt;So 4 cycles per iteration, as we already expected.&lt;/P&gt;
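
&lt;P&gt;For reference, the arithmetic behind that figure can be sketched in a few lines of Python. The counter values are copied from the perf output above; using the branch count as the iteration count is an assumption (one taken branch per iteration, plus a little startup noise).&lt;/P&gt;

```python
# Counter values copied from the perf output above.
instructions = 3_000_000_363
cycles = 4_007_163_191
branches = 1_000_000_358

# Assumption: one branch per loop iteration, so the branch count
# approximates the number of iterations (about 1e9).
iterations = branches

cycles_per_iteration = cycles / iterations
instructions_per_cycle = instructions / cycles

print(f"cycles per iteration: {cycles_per_iteration:.3f}")    # 4.007
print(f"instructions per cycle: {instructions_per_cycle:.2f}")  # 0.75
```

&lt;P&gt;The second number matches the 0.75 insns per cycle that perf reports: three instructions retire every four cycles.&lt;/P&gt;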

&lt;P&gt;Yet IACA gets it wrong. For the same binary:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;Throughput Analysis Report
--------------------------
Block Throughput: 5.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.5    0.0  | 0.0  | 0.5    0.5  | 0.5    0.5  | 0.0  | 0.0  | 0.5  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | mov rdx, qword ptr [rdx]
|   1    | 0.5       |     |           |           |     |     | 0.5 |     |    | dec rax
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffff2
Total Num Of Uops: 2&lt;/PRE&gt;

&lt;P&gt;IACA seems to think the reads have a latency of 5. What's up with that? The -analysis LATENCY option makes the same mistake.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Oct 2016 00:45:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/IACA-gets-memory-read-latency-wrong/m-p/1117249#M6139</guid>
      <dc:creator>Travis_D_</dc:creator>
      <dc:date>2016-10-18T00:45:51Z</dc:date>
    </item>
    <item>
      <title>aren't you counting "dec rax"</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/IACA-gets-memory-read-latency-wrong/m-p/1117250#M6140</link>
      <description>&lt;P&gt;Aren't you counting "dec rax" as well?&lt;/P&gt;

&lt;P&gt;Shouldn't you put "IACA_END" directly after "mov" if you want to analyze only that line of code?&lt;/P&gt;</description>
      <pubDate>Tue, 18 Oct 2016 04:59:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/IACA-gets-memory-read-latency-wrong/m-p/1117250#M6140</guid>
      <dc:creator>Matthias_H_Intel</dc:creator>
      <dc:date>2016-10-18T04:59:23Z</dc:date>
    </item>
    <item>
      <title>No, I didn't count "dec rax"</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/IACA-gets-memory-read-latency-wrong/m-p/1117251#M6141</link>
      <description>&lt;P&gt;No, I didn't count "dec rax" (or the jne), because IACA isn't just adding up the latencies of all the instructions (that would be pretty pointless). IACA looks for the bottleneck in the marked range of instructions, treating it as a loop body. The bottleneck may be a loop-carried dependency (as it is here), pressure on a particular port, an overall limit such as the 4 fused-domain uops per cycle in the front-end, or something else I haven't run into.&lt;/P&gt;

&lt;P&gt;Here, only the pointer chase through [rdx] is part of the critical path (notice the "CP" indicator in the instruction output), so only its latency counts toward the throughput calculation. Thanks to superscalar execution, the dec and the jne execute in parallel with the load, in its "shadow", and hence come "for free" (in fact, they are macro-fused and take only one uop together, as indicated in the output).&lt;/P&gt;
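
&lt;P&gt;A rough way to picture this is as a maximum over the candidate bottlenecks. The Python sketch below is a simplified model and not IACA's actual algorithm; the helper name estimate_block_cycles and its parameters are hypothetical, and the port numbers are taken from the three-instruction report earlier in the thread.&lt;/P&gt;

```python
def estimate_block_cycles(chain_latency, port_cycles, fused_uops, issue_width=4):
    """Rough lower bound on cycles per iteration (hypothetical helper).

    Simplified model, not IACA's real algorithm: the loop can run no
    faster than its slowest constraint.
    """
    dependency_bound = chain_latency             # loop-carried dependency chain
    port_bound = max(port_cycles.values())       # busiest execution port
    front_end_bound = fused_uops / issue_width   # fused-domain uops issued per cycle
    return max(dependency_bound, port_bound, front_end_bound)


# Port pressure copied from the report for mov + dec + jnz.
ports = {"0": 0.5, "2": 0.5, "3": 0.5, "6": 0.5}

# Load latency of 5 is what IACA appears to assume; 4 is what was measured.
iaca_estimate = estimate_block_cycles(5, ports, fused_uops=2)
measured_estimate = estimate_block_cycles(4, ports, fused_uops=2)
print(iaca_estimate, measured_estimate)  # 5 4
```

&lt;P&gt;With either load latency the dependency chain dominates the port and front-end terms (0.5 cycles each), which is why adding or removing the dec and jne does not change the reported throughput.&lt;/P&gt;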

&lt;P&gt;You could, of course, move the IACA_END marker to just look at the mov, but the result is exactly the same (incorrectly reporting 5 cycles per access):&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - iaca-test
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 5.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.0    0.0  | 0.0  | 0.5    0.5  | 0.5    0.5  | 0.0  | 0.0  | 0.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | mov rdx, qword ptr [rdx]
Total Num Of Uops: 1
&lt;/PRE&gt;

</description>
      <pubDate>Tue, 18 Oct 2016 23:55:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/IACA-gets-memory-read-latency-wrong/m-p/1117251#M6141</guid>
      <dc:creator>Travis_D_</dc:creator>
      <dc:date>2016-10-18T23:55:58Z</dc:date>
    </item>
  </channel>
</rss>

