<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic determine memory subsystem utilization in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984229#M3207</link>
    <description>&lt;P&gt;I am not an expert on memory, so I may be wrong. Please correct me if so.&lt;/P&gt;

&lt;P&gt;The bandwidth a program achieves (obtained, for example, from hardware counters) is not always an indicator of memory subsystem utilization. Different memory access patterns, different mixes of computation and loads/stores, different buffer sizes, and so on should load the memory subsystem differently.&lt;/P&gt;

&lt;P&gt;Is it possible to compare the throughput obtained from hardware counters with STREAM benchmark results and, based on the STREAM results, say something about the degree of memory subsystem utilization?&lt;/P&gt;

&lt;P&gt;Or can the degree of memory subsystem load be obtained from some hardware counters, or from other indicators such as those Linux provides?&lt;/P&gt;

&lt;P&gt;Sorry for the philosophical question.&lt;/P&gt;</description>
    <pubDate>Thu, 13 Feb 2014 17:30:09 GMT</pubDate>
    <dc:creator>SB17</dc:creator>
    <dc:date>2014-02-13T17:30:09Z</dc:date>
    <item>
      <title>determine memory subsystem utilization</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984229#M3207</link>
      <description>&lt;P&gt;I am not an expert on memory, so I may be wrong. Please correct me if so.&lt;/P&gt;

&lt;P&gt;The bandwidth a program achieves (obtained, for example, from hardware counters) is not always an indicator of memory subsystem utilization. Different memory access patterns, different mixes of computation and loads/stores, different buffer sizes, and so on should load the memory subsystem differently.&lt;/P&gt;

&lt;P&gt;Is it possible to compare the throughput obtained from hardware counters with STREAM benchmark results and, based on the STREAM results, say something about the degree of memory subsystem utilization?&lt;/P&gt;

&lt;P&gt;Or can the degree of memory subsystem load be obtained from some hardware counters, or from other indicators such as those Linux provides?&lt;/P&gt;

&lt;P&gt;Sorry for the philosophical question.&lt;/P&gt;</description>
      <pubDate>Thu, 13 Feb 2014 17:30:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984229#M3207</guid>
      <dc:creator>SB17</dc:creator>
      <dc:date>2014-02-13T17:30:09Z</dc:date>
    </item>
    <item>
      <title>Hello Black,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984230#M3208</link>
      <description>&lt;P&gt;Hello Black,&lt;/P&gt;

&lt;P&gt;So... is bandwidth a good measure of memory system utilization?&lt;/P&gt;

&lt;P&gt;I'm sure Dr. McCalpin will have a better answer but here goes.&lt;/P&gt;

&lt;P&gt;Here is maybe the easiest answer... if you are getting bandwidth close to STREAM then you are probably close to maxing out the 'achievable utilization'. You might have very few page faults per load and many misses outstanding per clocktick.&lt;/P&gt;

&lt;P&gt;At the other end of the spectrum are things like random memory latency benchmarks, where each thread has only one miss outstanding per clocktick and each load causes a page walk/miss. Large servers reading big databases can have similar performance. Depending on one's point of view, a cpu running a workload like this has hit its "achievable utilization" even though the bandwidth is much lower than STREAM. This is one reason why hyper-threading (HT) has helped servers so much... you have additional HT cpus able to have more misses outstanding per socket.&lt;/P&gt;

&lt;P&gt;So one could look at read/write bandwidth counters, include page misses per read/write, and look at latency events (from which you can also compute misses outstanding/clock) to try to get an idea of how effectively the memory subsystem is being used.&lt;/P&gt;
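
&lt;P&gt;As a rough illustration of the bandwidth half of that comparison, here is a minimal Python sketch. It assumes you already have counts of 64-byte cache-line reads and writes at the memory controller for some interval, plus a STREAM figure measured on the same machine. The function name and the sample counts are hypothetical; only the 39 GB/s figure comes from this thread.&lt;/P&gt;

```python
CACHE_LINE_BYTES = 64  # cache line size on the Intel processors discussed here

def bandwidth_fraction_of_stream(line_reads, line_writes, seconds, stream_gb_per_s):
    """Return (measured GB/s, fraction of the STREAM figure)."""
    bytes_moved = (line_reads + line_writes) * CACHE_LINE_BYTES
    measured_gb_per_s = bytes_moved / seconds / 1e9
    return measured_gb_per_s, measured_gb_per_s / stream_gb_per_s

# Hypothetical counts for a 1-second interval; 39 GB/s is the local-memory
# STREAM figure quoted elsewhere in this thread for one Xeon E5-2680 socket.
gbs, frac = bandwidth_fraction_of_stream(200_000_000, 100_000_000, 1.0, 39.0)
print(f"{gbs:.1f} GB/s, {frac:.0%} of STREAM")
```

&lt;P&gt;A low fraction does not by itself mean the memory subsystem is idle: a latency-bound workload can be at its achievable limit at a much lower bandwidth, which is why the page-miss and latency events matter as well.&lt;/P&gt;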

&lt;P&gt;Pat&lt;/P&gt;</description>
      <pubDate>Thu, 13 Feb 2014 18:11:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984230#M3208</guid>
      <dc:creator>Patrick_F_Intel1</dc:creator>
      <dc:date>2014-02-13T18:11:38Z</dc:date>
    </item>
    <item>
      <title>There are certainly plenty of</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984231#M3209</link>
      <description>&lt;P&gt;There are certainly plenty of applications for which the performance is limited by memory access, but which don't move a lot of data.&amp;nbsp;&amp;nbsp; STREAM has very simple access patterns that allow the generation of lots of concurrency with few hazards.&amp;nbsp;&amp;nbsp; Real application codes are seldom able to generate so much concurrency.&amp;nbsp;&amp;nbsp; Some typical concurrency limiters are TLB misses (there is only one page table walker per core), short vectors (don't allow the hardware prefetch engines to ramp up), indirect memory references (don't typically make good use of the HW prefetchers), too many data streams (the L2 HW prefetchers can only track accesses to a limited number of 4 KiB pages).&amp;nbsp;&amp;nbsp;&amp;nbsp; Code that is not vectorized can run out of out-of-order load slots before generating the maximum number of L1 Data Cache misses (although this is not much of a problem on Sandy Bridge and should be hard to do even on purpose on Haswell).&lt;/P&gt;
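
&lt;P&gt;The link between concurrency, latency, and bandwidth can be made concrete with Little's Law (data in flight equals latency times bandwidth). The following is a minimal sketch, not a measurement: it assumes the unloaded latency is an acceptable stand-in for latency under load, and it uses the local-memory figures quoted below in this post (79 ns, about 39 GB/s). The function names are hypothetical.&lt;/P&gt;

```python
CACHE_LINE_BYTES = 64

def lines_in_flight(latency_ns, bandwidth_gb_per_s):
    """Little's Law: bytes in flight = latency * bandwidth, expressed in cache lines."""
    bytes_in_flight = latency_ns * bandwidth_gb_per_s  # ns * GB/s = bytes
    return bytes_in_flight / CACHE_LINE_BYTES

def bandwidth_limit_gb_per_s(latency_ns, concurrent_lines):
    """Inverse form: bandwidth sustainable with a fixed number of misses in flight."""
    return concurrent_lines * CACHE_LINE_BYTES / latency_ns

print(round(lines_in_flight(79, 39), 1))          # lines needed for ~39 GB/s at 79 ns
print(round(bandwidth_limit_gb_per_s(79, 1), 2))  # one miss at a time, e.g. pointer chasing
```

&lt;P&gt;Under these assumptions, roughly 48 cache lines must be in flight per socket to sustain the STREAM rate, while a single thread with only one miss outstanding is limited to under 1 GB/s.&lt;/P&gt;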

&lt;P&gt;Of course in multi-socket systems there are additional latency and bandwidth considerations if the data is not all locally allocated.&amp;nbsp;&amp;nbsp; I measure 79 ns local memory latency on my Xeon E5-2680 systems (running at their max Turbo speed of 3.1 GHz) and 122 ns remote memory latency (open page).&amp;nbsp;&amp;nbsp; STREAM bandwidth drops from ~39 GB/s for local memory to under 15 GB/s for remote memory (using 8 threads pinned to one socket in both cases).&amp;nbsp; Performance drops a bit more if both chips are running using only the other chip's memory.&lt;/P&gt;</description>
      <pubDate>Thu, 13 Feb 2014 21:21:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984231#M3209</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-02-13T21:21:14Z</dc:date>
    </item>
    <item>
      <title>Thanks all for the</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984232#M3210</link>
      <description>&lt;P&gt;Thanks, all, for the interesting answers.&lt;/P&gt;

&lt;P&gt;The approach Pat proposed makes it possible to determine latency and throughput.&lt;BR /&gt;
	From changes in these parameters (especially latency) relative to an unloaded system, one can indirectly see the memory load. An increase in latency relative to an unloaded system at the same bandwidth indicates that the memory subsystem is loaded.&lt;/P&gt;

&lt;P&gt;However, today I came across the discussion &lt;A href="http://software.intel.com/en-us/forums/topic/456184" target="_blank"&gt;http://software.intel.com/en-us/forums/topic/456184&lt;/A&gt;. If I understand correctly, one of the main indicators of memory subsystem load is that "Reaching maximum bandwidth requires that the memory "pipeline" be filled with requests" &lt;A href="http://software.intel.com/en-us/user/545611"&gt;(c) John D. McCalpin&lt;/A&gt;.&lt;/P&gt;

&lt;P&gt;A naive question: is there perhaps an easy way to see how full the memory "pipeline" is through hardware counters, and so understand what percentage of the memory subsystem is occupied?&lt;/P&gt;

&lt;P&gt;Could LFB_hit be such an indicator?&lt;/P&gt;</description>
      <pubDate>Fri, 14 Feb 2014 15:25:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984232#M3210</guid>
      <dc:creator>SB17</dc:creator>
      <dc:date>2014-02-14T15:25:04Z</dc:date>
    </item>
    <item>
      <title>For Sandy Bridge and Ivy</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984233#M3211</link>
      <description>&lt;P&gt;For Sandy Bridge and Ivy Bridge processors, the "LFB_HIT" event ("MEM_LOAD_UOPS_RETIRED.HIT_LFB", Event D1H, Umask 40H) is incremented when a load misses the L1 Data Cache but finds that there is already a cache miss pending (i.e., a Line Fill Buffer has been allocated) for that cache line.&amp;nbsp; The LFB allocation could be due to a hardware prefetch, a software prefetch, or a demand miss to the same cache line.&amp;nbsp; For example, without hardware prefetching, a code consisting of SSE loads to contiguous 16 Byte addresses will issue four loads for each cache line.&amp;nbsp; Of these four loads, one (1/4) will miss in the L1 and allocate an LFB, while the other three (3/4, or 75%) will miss in the L1 but hit in the LFB (and therefore increment the counter for event D1H/40H).&amp;nbsp; This "HIT_LFB" value increases with increasing L1 prefetch effectiveness, but decreases with the size of the loads (i.e., there are only two aligned AVX loads per cache line, so in the absence of prefetching you would expect 50% L1 misses and 50% L1 miss with LFB hit).&amp;nbsp; So you need to know the distribution of the sizes of the loads to interpret this counter quantitatively.&lt;/P&gt;
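
&lt;P&gt;The arithmetic behind those percentages can be sketched as follows. This is only a model, assuming contiguous aligned loads, no prefetching, and that every later load to a line issues while the fill is still pending. The function name is hypothetical, and the 8-byte scalar case is an extrapolation of the same model rather than something stated above.&lt;/P&gt;

```python
from fractions import Fraction

CACHE_LINE_BYTES = 64

def expected_lfb_hit_fraction(load_bytes):
    """Fraction of contiguous, aligned loads that hit an LFB when nothing is prefetched.

    The first load to each line takes the L1 miss and allocates the LFB;
    the remaining loads to that line find the fill already pending.
    """
    loads_per_line = CACHE_LINE_BYTES // load_bytes
    return Fraction(loads_per_line - 1, loads_per_line)

for name, size in (("SSE 16-byte", 16), ("AVX 32-byte", 32), ("scalar 8-byte", 8)):
    print(name, expected_lfb_hit_fraction(size))
```

&lt;P&gt;A measured HIT_LFB fraction well above the value this model gives for the code's load size would then point to prefetches having allocated the LFBs ahead of the demand loads.&lt;/P&gt;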

&lt;P&gt;To get an idea of the average concurrency of L1 cache misses, Event 48H, Umask 01H increments with the number of outstanding L1D misses every cycle.&amp;nbsp; If it works correctly (which I have not checked), then you can simply divide by the cycle count to get the average number of L1 Data Cache misses outstanding over the measurement interval.&amp;nbsp; Setting the CMASK field to 1 will cause the counter to increment every cycle in which there is at least 1 outstanding cache miss.&amp;nbsp; Presumably this would also work for larger CMASK values, so you should be able to count how many cycles there are CMASK or more L1 Data Cache misses outstanding.&amp;nbsp; (I have not tested this either.)&amp;nbsp; Setting the CMASK field to 1 and the EDGE field to 1 will cause the counter to increment whenever there is a change from having 0 outstanding L1 cache misses to having at least 1 outstanding L1 cache miss.&amp;nbsp; This event can only be counted using PMC2, so each test will have to be a separate run.&lt;/P&gt;
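
&lt;P&gt;A sketch of how these three configurations might be encoded, assuming the architectural IA32_PERFEVTSELx layout (event select in bits 0-7, umask in bits 8-15, USR in bit 16, OS in bit 17, edge detect in bit 18, enable in bit 22, CMASK in bits 24-31). The helper name is hypothetical and the resulting values have not been tested on hardware.&lt;/P&gt;

```python
def perfevtsel(event, umask, cmask=0, edge=False, user=True, kernel=True):
    """Build an IA32_PERFEVTSELx value; powers of two stand in for bit shifts."""
    value = event + umask * 2**8
    value += int(user) * 2**16 + int(kernel) * 2**17  # USR and OS bits
    value += int(edge) * 2**18                        # edge detect
    value += 2**22                                    # EN: enable the counter
    value += cmask * 2**24                            # counter mask threshold
    return value

# Event 48H, Umask 01H: outstanding L1D misses, as described above.
occupancy = perfevtsel(0x48, 0x01)                     # misses outstanding, summed over cycles
busy_cycles = perfevtsel(0x48, 0x01, cmask=1)          # cycles with at least 1 miss outstanding
episodes = perfevtsel(0x48, 0x01, cmask=1, edge=True)  # transitions from 0 to at least 1

for name, value in (("occupancy", occupancy), ("busy_cycles", busy_cycles), ("episodes", episodes)):
    print(f"{name}: 0x{value:08X}")
```

&lt;P&gt;Because the event is restricted to PMC2, each of the three values would be programmed into that one counter in a separate run.&lt;/P&gt;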

&lt;P&gt;I don't think that there are any comparable counter events that could be used to estimate the average number of concurrent L2 cache misses.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;An indirectly related event is the number of cycles for which there are load misses pending and no uops dispatched.&amp;nbsp;&amp;nbsp; This is an indication (but not proof) that stalls are due to load latencies that are too high for the out-of-order mechanisms in the processor to handle.&amp;nbsp; This, in turn, may indicate inadequate memory concurrency.&amp;nbsp; Event A3H provides the ability to count cycles with L1D misses pending and with L2 load misses pending.&amp;nbsp; This event can also count cycles in which no uops were dispatched and you can combine these masks to count cycles with no dispatch in which there was at least one L1D miss pending or cycles with no dispatch in which there was at least one L2 load miss pending.&amp;nbsp; The UMASKs are modified slightly in Ivy Bridge, but I don't understand the differences (yet).&amp;nbsp;&amp;nbsp; If I recall correctly, at least one user reported some strange results from this event, so more experimentation is probably needed to understand what is actually going on.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Feb 2014 19:15:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984233#M3211</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-02-14T19:15:52Z</dc:date>
    </item>
    <item>
      <title>Of course as soon as I posted</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984234#M3212</link>
      <description>&lt;P&gt;Of course as soon as I posted the previous comment I found a performance counter event that provides some indication of concurrency beyond the L2 cache.&amp;nbsp; On Sandy Bridge, Ivy Bridge, and Haswell,&amp;nbsp; Event 60H "OFFCORE_REQUESTS_OUTSTANDING" looks like it can be used to compute the average number of outstanding reads that have missed in the L2 cache.&amp;nbsp; Like Event 48, the CMASK field can be used to count cycles in which at least 1 miss is outstanding.&amp;nbsp; Unlike Event 48, any of the performance counters can count this event, though the results are only valid if HyperThreading is disabled.&amp;nbsp; Also unlike Event 48, this event can count Demand Data Read misses, Demand Data Store Misses (RFOs), Demand Code Read misses, or "All" "cacheable data read transactions".&lt;/P&gt;
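
&lt;P&gt;Given raw counts from such a run, the derived quantities are simple ratios. A minimal sketch follows. The function name and the sample counts are hypothetical, and the three inputs are assumed to come from the plain event, the same event with CMASK=1, and an unhalted core cycle count over the same interval.&lt;/P&gt;

```python
def outstanding_miss_stats(occupancy_sum, cycles_with_miss, total_cycles):
    """Summarize an 'outstanding requests' occupancy event.

    occupancy_sum    -- event count with CMASK=0 (requests outstanding, summed over cycles)
    cycles_with_miss -- event count with CMASK=1 (cycles with at least one outstanding)
    total_cycles     -- unhalted core cycles over the same interval
    """
    return {
        "avg_outstanding": occupancy_sum / total_cycles,
        "busy_fraction": cycles_with_miss / total_cycles,
        # Average depth of the queue during the cycles when it is not empty.
        "avg_outstanding_when_busy": occupancy_sum / max(cycles_with_miss, 1),
    }

stats = outstanding_miss_stats(occupancy_sum=6_000_000, cycles_with_miss=1_500_000, total_cycles=2_000_000)
print(stats)
```

&lt;P&gt;As noted above, the Event 60H counts are only meaningful with HyperThreading disabled, so the same caveat applies to anything derived from them.&lt;/P&gt;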

&lt;P&gt;The Umask for Code Reads is missing in the documentation for Sandy Bridge, so it might be buggy there.&lt;/P&gt;

&lt;P&gt;The encoding for "all cacheable data read transactions" is unusual.&amp;nbsp; I would have expected the Umask to be the union of all of the sub-fields (01 || 02 || 04 = 07h), but 08h is used instead.&amp;nbsp;&amp;nbsp; The use of the word "demand" for Umask 01 (data read misses) and Umask 02 (code read misses) implies that L1 hardware prefetches are not counted.&amp;nbsp; For Umask 04 (RFOs = store misses) I think they have to be demand misses -- there is no indication in the performance optimization manual that the L1 hardware prefetchers perform prefetches for stores.&amp;nbsp;&amp;nbsp; The absence of the word "demand" in the description of Umask 08 (all cacheable data read transactions) leaves open whether RFO transactions are counted.&amp;nbsp;&amp;nbsp; Comparing these definitions with those of Event B0H also leaves it open as to whether L2 HW prefetches are counted.&amp;nbsp; Note that event B0h/Umask 08h provides no indication of whether the "prefetch" transactions include L1 HW prefetches, L2 HW prefetches, or both.&lt;/P&gt;

&lt;P&gt;Aside: Some documentation writers treat RFOs as "read" transactions and some do not.&amp;nbsp; RFO is an abbreviation for "Read For Ownership", so it could be counted as a "read", but since it is the result of a "store" instruction, it could also reasonably be excluded from that category.&amp;nbsp; It seems that folks who work in the core (including the L1) tend to *not* count store misses (RFOs) as "read" transactions, while folks out in the uncore consider any transaction that moves data to the core a "read" transaction.&amp;nbsp; The L2 is in between, and I have seen both conventions in common use.&amp;nbsp; Careful directed testing with the L1 and L2 prefetchers enabled and disabled would probably be required to understand exactly what is meant here.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Feb 2014 20:21:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984234#M3212</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2014-02-14T20:21:38Z</dc:date>
    </item>
    <item>
      <title>Thank you all for your</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984235#M3213</link>
      <description>&lt;P&gt;Thank you all for your interesting answers. I will go and try this.&lt;/P&gt;</description>
      <pubDate>Sat, 15 Feb 2014 16:24:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/determine-memory-subsystem-utilization/m-p/984235#M3213</guid>
      <dc:creator>SB17</dc:creator>
      <dc:date>2014-02-15T16:24:27Z</dc:date>
    </item>
  </channel>
</rss>

