<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Let's consider Intel Core i7 in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961531#M2502</link>
    <description>Let's consider &lt;STRONG&gt;Intel Core i7-3840QM&lt;/STRONG&gt; ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ):

Size of &lt;STRONG&gt;L3&lt;/STRONG&gt; Cache = 8MB ( shared between all cores for data &amp;amp; instructions )
Size of &lt;STRONG&gt;L2&lt;/STRONG&gt; Cache = 1MB ( 256KB per core / shared for data &amp;amp; instructions )
Size of &lt;STRONG&gt;L1&lt;/STRONG&gt; Cache = 256KB ( 32KB per core for data &amp;amp; 32KB per core for instructions )

&amp;gt;&amp;gt;#define N 1024
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;double A[N][N], B[N][N], C[N][N];
&amp;gt;&amp;gt;...

This means that, with 8-byte doubles:

- Size of &lt;STRONG&gt;A&lt;/STRONG&gt; matrix is 8MB ( 8,388,608 bytes )
- Size of &lt;STRONG&gt;B&lt;/STRONG&gt; matrix is 8MB ( 8,388,608 bytes )
- Size of &lt;STRONG&gt;C&lt;/STRONG&gt; matrix is 8MB ( 8,388,608 bytes )

As you can see, a single matrix, for example &lt;STRONG&gt;B&lt;/STRONG&gt;, already fills the whole &lt;STRONG&gt;L3&lt;/STRONG&gt; Cache in the best case (!). But the "core" of your calculations uses the &lt;STRONG&gt;A&lt;/STRONG&gt; and &lt;STRONG&gt;C&lt;/STRONG&gt; matrices as well:
...
C[i][j] += A[i][k] * B[k][j];
...
and I think this creates the problem you're observing.

It is not clear why you got a negative number for &lt;STRONG&gt;L2&lt;/STRONG&gt; Bound. Of course, none of these matrices could "fit" into the &lt;STRONG&gt;L2&lt;/STRONG&gt; or &lt;STRONG&gt;L1&lt;/STRONG&gt; Caches.

By the way, that is why the &lt;STRONG&gt;Loop-Blocking&lt;/STRONG&gt; optimization technique is recommended in such cases; it is described in the manual.</description>
    <pubDate>Tue, 05 Mar 2013 05:26:00 GMT</pubDate>
    <dc:creator>SergeyKostrov</dc:creator>
    <dc:date>2013-03-05T05:26:00Z</dc:date>
    <item>
      <title>Memory bound characterization on Ivy Bridge</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961530#M2501</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;
&lt;P&gt;I am confused by the memory-bound characterization on Ivy Bridge described in the&amp;nbsp;&lt;I&gt;Intel 64 and IA-32 Architectures Optimization Reference Manual, Appendix B.3.2.3&lt;/I&gt;: I measured a larger value for STALLS_L2_PENDING than for STALLS_L1D_PENDING.&amp;nbsp;Consequently, if I calculate &lt;EM&gt;%L2 Bound&lt;/EM&gt; as the manual describes, I get a &lt;STRONG&gt;negative number for &lt;EM&gt;%L2 Bound&lt;/EM&gt;.&lt;/STRONG&gt; Could anyone help me with this, please?&lt;/P&gt;
&lt;P&gt;This is the code segment I tried to characterize:&lt;/P&gt;
&lt;P&gt;#define N 1024&lt;/P&gt;
&lt;P&gt;double A[N][N], B[N][N], C[N][N];&lt;/P&gt;
&lt;P&gt;void code_to_monitor() {&lt;BR /&gt;&amp;nbsp; int i, j, k;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; for (i = 0; i &amp;lt; N; i++) {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; for (j = 0; j &amp;lt; N; j++) {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; A[i][j] = B[i][j] = i + j;&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; C[i][j] = 0.0;&lt;BR /&gt;&amp;nbsp; &amp;nbsp; }&lt;BR /&gt;&amp;nbsp; }&lt;/P&gt;
&lt;P&gt;&amp;nbsp; for (i = 0; i &amp;lt; N; i++) {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; for (j = 0; j &amp;lt; N; j++) {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; for (k = 0; k &amp;lt; N; k++) {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; C[i][j] += A[i][k] * B[k][j];&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; }&lt;BR /&gt;&amp;nbsp; &amp;nbsp; }&lt;BR /&gt;&amp;nbsp; }&lt;BR /&gt;}&lt;/P&gt;
&lt;P&gt;And these are the numbers I got from the experiments.&lt;/P&gt;
&lt;P&gt;CYCLE_ACTIVITY:STALLS_LDM_PENDING : 25129701285&lt;BR /&gt;CYCLE_ACTIVITY:STALLS_L1D_PENDING : 22822968083&lt;BR /&gt;CYCLE_ACTIVITY:STALLS_L2_PENDING : 24375543727&lt;BR /&gt;TOTAL CYCLES: 43885183166&lt;/P&gt;</description>
      <pubDate>Mon, 04 Mar 2013 22:56:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961530#M2501</guid>
      <dc:creator>Yunqi_Z_</dc:creator>
      <dc:date>2013-03-04T22:56:21Z</dc:date>
    </item>
    <item>
      <title>Let's consider Intel Core i7</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961531#M2502</link>
      <description>Let's consider &lt;STRONG&gt;Intel Core i7-3840QM&lt;/STRONG&gt; ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ):

Size of &lt;STRONG&gt;L3&lt;/STRONG&gt; Cache = 8MB ( shared between all cores for data &amp;amp; instructions )
Size of &lt;STRONG&gt;L2&lt;/STRONG&gt; Cache = 1MB ( 256KB per core / shared for data &amp;amp; instructions )
Size of &lt;STRONG&gt;L1&lt;/STRONG&gt; Cache = 256KB ( 32KB per core for data &amp;amp; 32KB per core for instructions )

&amp;gt;&amp;gt;#define N 1024
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;double A[N][N], B[N][N], C[N][N];
&amp;gt;&amp;gt;...

This means that, with 8-byte doubles:

- Size of &lt;STRONG&gt;A&lt;/STRONG&gt; matrix is 8MB ( 8,388,608 bytes )
- Size of &lt;STRONG&gt;B&lt;/STRONG&gt; matrix is 8MB ( 8,388,608 bytes )
- Size of &lt;STRONG&gt;C&lt;/STRONG&gt; matrix is 8MB ( 8,388,608 bytes )

As you can see, a single matrix, for example &lt;STRONG&gt;B&lt;/STRONG&gt;, already fills the whole &lt;STRONG&gt;L3&lt;/STRONG&gt; Cache in the best case (!). But the "core" of your calculations uses the &lt;STRONG&gt;A&lt;/STRONG&gt; and &lt;STRONG&gt;C&lt;/STRONG&gt; matrices as well:
...
C[i][j] += A[i][k] * B[k][j];
...
and I think this creates the problem you're observing.
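
For reference, the footprint of each matrix follows directly from N and the element size. Here is a minimal sketch, assuming a C compiler where sizeof(double) is 8 (as on x86-64); the helper names matrix_bytes and matrices_that_fit are hypothetical and only for illustration:

```c
/* Bytes occupied by an n-by-n matrix of double. */
unsigned long matrix_bytes(unsigned long n) {
    return n * n * sizeof(double);
}

/* Number of whole n-by-n matrices of double that fit into cache_bytes,
   ignoring code and any other data that compete for the same cache. */
unsigned long matrices_that_fit(unsigned long n, unsigned long cache_bytes) {
    return cache_bytes / matrix_bytes(n);
}
```

Calling matrices_that_fit(1024, 8UL * 1024 * 1024) gives the best-case count for an 8MB L3; in practice the cache is shared with everything else running on the chip, so the real number is lower.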

It is not clear why you got a negative number for &lt;STRONG&gt;L2&lt;/STRONG&gt; Bound. Of course, none of these matrices could "fit" into the &lt;STRONG&gt;L2&lt;/STRONG&gt; or &lt;STRONG&gt;L1&lt;/STRONG&gt; Caches.

By the way, that is why the &lt;STRONG&gt;Loop-Blocking&lt;/STRONG&gt; optimization technique is recommended in such cases; it is described in the manual.</description>
      <pubDate>Tue, 05 Mar 2013 05:26:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961531#M2502</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-03-05T05:26:00Z</dc:date>
    </item>
    <item>
      <title>Hi Sergey,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961532#M2503</link>
      <description>&lt;P&gt;Hi Sergey,&lt;/P&gt;
&lt;P&gt;Thanks for your reply. Actually, I'm not trying to optimize the matrix multiplication; it's just a piece of sample code to check whether the memory-bound characterization works well, and it gave me negative numbers.&lt;/P&gt;</description>
      <pubDate>Tue, 05 Mar 2013 05:32:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961532#M2503</guid>
      <dc:creator>Yunqi_Z_</dc:creator>
      <dc:date>2013-03-05T05:32:54Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt; I will get negative</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961533#M2504</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;&amp;nbsp;I will get negative number for &lt;EM&gt;%L2 Bound&amp;gt;&amp;gt;&amp;gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;This is the&amp;nbsp;formula used to calculate %L2 Bound : (CYCLE_ACTIVITY:STALLS_L1D_PENDING - CYCLE_ACTIVITY:STALLS_L2_PENDING) / CLOCKS&lt;/P&gt;
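&lt;P&gt;As a rough check, the formula can be evaluated with the counter values from the original post. This is only a sketch; the function name l2_bound is hypothetical and not taken from the manual:&lt;/P&gt;

```c
/* %L2 Bound as a fraction of clocks, following the formula quoted above:
   (STALLS_L1D_PENDING - STALLS_L2_PENDING) / CLOCKS. */
double l2_bound(double stalls_l1d_pending, double stalls_l2_pending, double clocks) {
    return (stalls_l1d_pending - stalls_l2_pending) / clocks;
}
```

&lt;P&gt;With STALLS_L1D_PENDING = 22822968083, STALLS_L2_PENDING = 24375543727 and TOTAL CYCLES = 43885183166, the numerator is -1552575644, so the result is roughly -0.035, that is about -3.5%.&lt;/P&gt;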
&lt;P&gt;Looking at your values, STALLS_L1D_PENDING is less than STALLS_L2_PENDING, so you are getting a negative result.&lt;/P&gt;</description>
      <pubDate>Tue, 05 Mar 2013 06:58:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961533#M2504</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-03-05T06:58:57Z</dc:date>
    </item>
    <item>
      <title>Yes, iliyapolak. That's why I</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961534#M2505</link>
      <description>&lt;P&gt;Yes, iliyapolak. That's why I'm confused.&lt;/P&gt;</description>
      <pubDate>Tue, 05 Mar 2013 08:44:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961534#M2505</guid>
      <dc:creator>Yunqi_Z_</dc:creator>
      <dc:date>2013-03-05T08:44:54Z</dc:date>
    </item>
    <item>
      <title>Maybe it should be this way.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961535#M2506</link>
      <description>&lt;P&gt;Maybe it should be this way.&lt;/P&gt;</description>
      <pubDate>Tue, 05 Mar 2013 11:16:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961535#M2506</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-03-05T11:16:04Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...Thanks for you reply.</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961536#M2507</link>
      <description>&amp;gt;&amp;gt;...Thanks for you reply. Actually I'm not trying to optimize the matrix multiplication...

I understood this and I think Intel software engineers should review that formula in the Intel 64 and IA-32 Architectures Optimization Reference manual.</description>
      <pubDate>Tue, 05 Mar 2013 13:44:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961536#M2507</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-03-05T13:44:28Z</dc:date>
    </item>
    <item>
      <title>The issue here is you have 3</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961537#M2508</link>
      <description>&lt;P&gt;The issue here is that you have 3 arrays, all coming from different cache levels. &amp;nbsp;B is definitely in the L3: it reduces the effective associativity of the L1 and the L2, so there is no way it can fit into the L1 or L2. &amp;nbsp;C is coming from the L1, or if not there then the L2; it is accessed sequentially, and you'd have to evict every set it maps to in order to find it in the L2, which is possible but unlikely. &amp;nbsp;A is partly in the L1 and the rest is in the L2; it's reused over and over and accessed sequentially. &amp;nbsp;&lt;/P&gt;
&lt;P&gt;These pending stats are only partially accurate in my experience. &amp;nbsp;If I want to know where I'm bound, I measure the HW prefetch activity from the L1 as well as all the L2 stats, which tell me about I-cache, L1D and HW prefetch activity. &amp;nbsp;You'll know then if you're L2 bound, and you might measure the demand request stream from the L3, just to get an idea of whether requests are not getting serviced by the L2 HW prefetcher and are making their way to the L3. &amp;nbsp;Still, SB can deliver 2.5 upc operating out of its L3 with 40-50 requests per thousand getting there (though this is with the HW prefetcher picking up on that pattern). &amp;nbsp;The problem for you is that B is striding by 8192 B and the HW prefetchers don't handle that pattern, so your demand requests are definitely getting to the L3. &amp;nbsp;Every 8 iterations on the K loop you need to fetch 1024 cachelines from the L3.&lt;/P&gt;
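&lt;P&gt;To make the stride concrete, here is a small sketch. It assumes 64-byte cache lines and an L1D with 64 sets (32KB, 8-way); this geometry and the helper names column_stride_bytes and set_index are assumptions for illustration, not measurements from this thread:&lt;/P&gt;

```c
/* Byte distance between B[k][j] and B[k+1][j] for a row-major,
   n-column matrix of double. */
unsigned long column_stride_bytes(unsigned long n) {
    return n * sizeof(double);
}

/* Cache set selected by a byte offset, for the given line size
   and number of sets (both are assumed parameters). */
unsigned long set_index(unsigned long offset, unsigned long line_bytes, unsigned long num_sets) {
    return (offset / line_bytes) % num_sets;
}
```

&lt;P&gt;With N = 1024 the stride is 8192 B, and under the assumed geometry set_index(k * 8192, 64, 64) is the same for every k, so a column walk of B keeps landing in a single L1D set.&lt;/P&gt;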
&lt;P&gt;perfwise&lt;/P&gt;</description>
      <pubDate>Wed, 06 Mar 2013 12:45:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-bound-characterization-on-Ivy-Bridge/m-p/961537#M2508</guid>
      <dc:creator>perfwise</dc:creator>
      <dc:date>2013-03-06T12:45:35Z</dc:date>
    </item>
  </channel>
</rss>

