<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic HitM I believe stands for hit in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052595#M4929</link>
    <description>&lt;P&gt;HitM I believe stands for hit modified; that is, another core owns a modified copy of the cache line.&lt;/P&gt;

&lt;P&gt;Sharing cache lines between threads on different sockets does not perform well.&amp;nbsp; However, BIOS upgrades during the time I was testing IvyTown made a significant improvement.&lt;/P&gt;

&lt;P&gt;I doubt you will be able to test this with any rigor unless you can set affinity and avoid dynamic scheduling.&lt;/P&gt;</description>
    <pubDate>Mon, 30 Jun 2014 23:34:32 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2014-06-30T23:34:32Z</dc:date>
    <item>
      <title>Shared vs Unshared L3 hits?</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052592#M4926</link>
      <description>&lt;P&gt;I'm new to performance monitoring and I want to make sure I understand everything that goes into calculating the L2 hit ratio.&lt;/P&gt;

&lt;P&gt;In the source code for the Intel PCM software, the L2 hit ratio is calculated as follows:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;uint64 hits = L2Hit;
uint64 all = L2Hit + L2HitM + L3UnsharedHit + L3Miss;
if (all) return double(hits) / double(all);&lt;/PRE&gt;
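
&lt;P&gt;For reference, here is how I read that calculation as a self-contained sketch. The struct and function names are placeholders of mine, not PCM's API, and returning 0.0 for an empty sample is my own assumption:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;
```cpp
// Hypothetical container for the four counter values used in the snippet above.
// The field names mirror the PCM variables, but this is not PCM's API.
struct L2Counters {
    unsigned long long l2_hit;          // L2Hit in the PCM snippet
    unsigned long long l2_hitm;         // L2HitM in the PCM snippet
    unsigned long long l3_unshared_hit; // L3UnsharedHit in the PCM snippet
    unsigned long long l3_miss;         // L3Miss in the PCM snippet
};

// L2 hit ratio = L2 hits divided by the sum of all four counters.
// Assumption: return 0.0 when nothing was counted, to avoid dividing by zero.
double l2_hit_ratio(L2Counters c) {
    const unsigned long long all =
        c.l2_hit + c.l2_hitm + c.l3_unshared_hit + c.l3_miss;
    if (all == 0) return 0.0;
    return double(c.l2_hit) / double(all);
}
```
&lt;/PRE&gt;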

&lt;P&gt;The variable name L3UnsharedHit seems to imply that there's something else called an L3SharedHit, which would presumably happen when a load request misses in the L2 but is present in a separate socket's L3 cache. Is there such a thing? Do modern processors with QPI derive any benefit from finding a cache line in another socket's L3, versus having to go out to memory?&lt;/P&gt;

&lt;P&gt;Also, I assume the variable L2HitM means the number of misses in the L2, but that doesn't make sense with the variable's name. I haven't been able to track down exactly what event number and umask that corresponds to in the PMU. Is there a better interpretation?&lt;/P&gt;

&lt;P&gt;Thanks for your time,&lt;BR /&gt;
	David&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jun 2014 20:51:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052592#M4926</guid>
      <dc:creator>dsf423</dc:creator>
      <dc:date>2014-06-30T20:51:58Z</dc:date>
    </item>
    <item>
      <title>None of this stuff is easy...</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052593#M4927</link>
      <description>&lt;P&gt;None of this stuff is easy... :-(&lt;/P&gt;

&lt;P&gt;There are two important caveats with any attempt to measure cache hit ratios on recent Intel processors:&lt;/P&gt;

&lt;P&gt;1. The primary event used to count hits (MEM_LOAD_UOPS_RETIRED.L2_HIT) only counts demand loads that miss the L1 and hit in the L2.&amp;nbsp;&amp;nbsp; It does not count prefetches that bring the data from the L2 to the L1 in advance of the load.&amp;nbsp; Whether that is what you want to count as a "hit rate" or not depends on whether you are thinking about spatial locality (prefetchability) or temporal locality (data re-use).&lt;/P&gt;

&lt;P&gt;This event also does not count store misses (RFOs) that miss in the L1 and hit in the L2.&lt;/P&gt;

&lt;P&gt;2. When using AVX 32-Byte loads, the MEM_LOAD_UOPS_RETIRED.L2_HIT counter never increments.&amp;nbsp;&amp;nbsp; Instead all L1 misses increment the MEM_LOAD_UOPS_RETIRED.HIT_LFB counter, which normally only increments when there are multiple loads that miss the L1 but point to (usually different parts of) the same cache line.&lt;/P&gt;

</description>
      <pubDate>Mon, 30 Jun 2014 21:41:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052593#M4927</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-06-30T21:41:22Z</dc:date>
    </item>
    <item>
      <title>Thanks Dr. McCalpin. Do you</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052594#M4928</link>
      <description>&lt;P&gt;Thanks Dr. McCalpin. Do you happen to have any general resources for doing this kind of work?&lt;/P&gt;

&lt;P&gt;Just out of curiosity- I'm trying to understand why a parallel program suffers performance degradation when I involve two processor sockets (as opposed to just one). For example, a program might run slower on 10 cores across two sockets than on five cores within a single socket. There are lots of general hand-wavy explanations, but I want to rigorously explain the behavior I see in this specific instance. Do you have any suggestions?&lt;/P&gt;

&lt;P&gt;Thanks again,&lt;/P&gt;

&lt;P&gt;David&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jun 2014 22:15:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052594#M4928</guid>
      <dc:creator>dsf423</dc:creator>
      <dc:date>2014-06-30T22:15:30Z</dc:date>
    </item>
    <item>
      <title>HitM I believe stands for hit</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052595#M4929</link>
      <description>&lt;P&gt;HitM I believe stands for hit modified; that is, another core owns a modified copy of the cache line.&lt;/P&gt;

&lt;P&gt;Sharing cache lines between threads on different sockets does not perform well.&amp;nbsp; However, BIOS upgrades during the time I was testing IvyTown made a significant improvement.&lt;/P&gt;

&lt;P&gt;I doubt you will be able to test this with any rigor unless you can set affinity and avoid dynamic scheduling.&lt;/P&gt;</description>
      <pubDate>Mon, 30 Jun 2014 23:34:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052595#M4929</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-06-30T23:34:32Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;Just out of curiosity- I'm</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052596#M4930</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;gt;Just out of curiosity- I'm trying to understand why a parallel program suffers performance degradation when I involve two processor sockets (as opposed to just one). For example, a program might run slower on 10 cores across two sockets than on five cores within a single socket.&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;

&lt;P&gt;You should also take NUMA distance into account.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Jul 2014 05:56:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052596#M4930</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-07-01T05:56:54Z</dc:date>
    </item>
    <item>
      <title>Having a shared L3 makes core</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052597#M4931</link>
      <description>&lt;P&gt;Having a shared L3 makes core-to-core data sharing very fast.&amp;nbsp;&amp;nbsp; It is not quite as fast as getting unshared data from the L3, but the L3 knows which core has the modified data and is able to arrange for a fast cache-to-cache transfer.&amp;nbsp;&amp;nbsp; Table 2-10 in section 2.2.5.1 of the Intel Optimization Reference Manual (document 248966-029, March 2014) says that a "clean" L3 hit has a latency of 26-31 cycles (I measure an average of ~35 cycles for pointer-chasing code), while a hit that is "dirty" in another L1 or L2 on the same chip has a reported latency of 60 cycles (20 ns at 3.0 GHz).&lt;/P&gt;

&lt;P&gt;Latency to modified cache lines in the other socket is much higher -- similar to the remote memory latency of ~135 ns.&amp;nbsp; This gives a latency ratio of between 6:1 and 7:1 in favor of the shared cache configuration.&lt;/P&gt;
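
&lt;P&gt;The arithmetic behind that ratio, as a rough sketch. The helper name is made up for illustration; the 60-cycle, 3.0 GHz and ~135 ns figures are the ones quoted above, and the remote number is only approximate:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;
```cpp
// Convert a latency in core cycles to nanoseconds at a given clock in GHz.
// Hypothetical helper, for illustration only.
constexpr double cycles_to_ns(double cycles, double ghz) {
    return cycles / ghz;
}

// Hit that is "dirty" in another core on the same chip: 60 cycles at 3.0 GHz.
constexpr double local_hitm_ns = cycles_to_ns(60.0, 3.0);  // 20 ns

// Modified line in the other socket: roughly the remote memory latency.
constexpr double remote_hitm_ns = 135.0;

// Remote-to-local intervention latency ratio, which falls between 6:1 and 7:1.
constexpr double latency_ratio = remote_hitm_ns / local_hitm_ns;
```
&lt;/PRE&gt;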

&lt;P&gt;The sharing can be either deliberate or accidental (false sharing).&amp;nbsp;&amp;nbsp; Given the intervention latency ratio of ~6.5:1, either case could account for 5 cores on 1 socket running faster than 10 cores on 2 sockets.&amp;nbsp;&amp;nbsp; It is often difficult to come up with an automated test to detect false sharing, but in general one looks for cache-to-cache transfer rates that increase very rapidly with thread count (much faster than a linear increase).&amp;nbsp; You should be able to see this in both the one-socket and two-socket systems, but due to the high ratio of intervention latency between the two cases,&amp;nbsp; you can have false sharing that is tolerable in the single-socket case and intolerable in the two-socket case.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Jul 2014 20:12:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052597#M4931</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-07-01T20:12:56Z</dc:date>
    </item>
    <item>
      <title>Thanks again everyone. I'll</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052598#M4932</link>
      <description>&lt;P&gt;Thanks again everyone. I'll do some digging with this and see what I come up with.&lt;/P&gt;

&lt;P&gt;David&lt;/P&gt;</description>
      <pubDate>Tue, 01 Jul 2014 20:30:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Shared-vs-Unshared-L3-hits/m-p/1052598#M4932</guid>
      <dc:creator>dsf423</dc:creator>
      <dc:date>2014-07-01T20:30:49Z</dc:date>
    </item>
  </channel>
</rss>

