SB17 · ‎02-13-2014

I am not an expert in memory, so I may be wrong. Please correct me if I am.

The bandwidth of a program (obtained, for example, from hardware counters) is not always an indicator of memory subsystem utilization. Different memory access patterns, different mixtures of computation and memory loads/stores, different buffer sizes, etc. must utilize the memory subsystem differently.

Is it possible to compare the throughput obtained from hardware counters with the STREAM benchmark result and, based on the STREAM result, estimate the degree of memory subsystem utilization?

Or can the degree of memory subsystem load be obtained from some hardware counters, or from other indicators such as those available in Linux?

Sorry for the philosophical question

Patrick_F_Intel1 · ‎02-13-2014

Hello Black,

So... is bandwidth a good measure of memory system utilization?

I'm sure Dr. McCalpin will have a better answer but here goes.

Here is maybe the easiest answer... if you are getting bw performance close to STREAM then you are probably close to maxing out the 'achievable utilization'. You might have very low page faults/load and high misses outstanding per clocktick.

At the other end of the spectrum are things like random memory latency benchmarks where each thread has only one miss outstanding per clocktick and each load causes a page walk/miss. Large servers reading big databases can have similar performance. Depending on one's point of view, a cpu running a workload like this has hit its "achievable utilization" even though the bw is much lower than STREAM. This is one reason why hyper-threading (HT) has helped servers so much... you have additional HT cpus able to have more misses outstanding per socket.

So one could look at read/write bw counters and include page misses per read/write and look at latency events (from which you can also compute misses outstanding/clock) to try and get an idea of how effectively the memory subsystem is being used.

Pat

McCalpinJohn · ‎02-13-2014

There are certainly plenty of applications for which the performance is limited by memory access, but which don't move a lot of data. STREAM has very simple access patterns that allow the generation of lots of concurrency with few hazards. Real application codes are seldom able to generate so much concurrency. Some typical concurrency limiters are TLB misses (there is only one page table walker per core), short vectors (don't allow the hardware prefetch engines to ramp up), indirect memory references (don't typically make good use of the HW prefetchers), too many data streams (the L2 HW prefetchers can only track accesses to a limited number of 4 KiB pages). Code that is not vectorized can run out of out-of-order load slots before generating the maximum number of L1 Data Cache misses (although this is not much of a problem on Sandy Bridge and should be hard to do even on purpose on Haswell).

Of course in multi-socket systems there are additional latency and bandwidth considerations if the data is not all locally allocated. I measure 79 ns local memory latency on my Xeon E5-2680 systems (running at their max Turbo speed of 3.1 GHz) and 122 ns remote memory latency (open page). STREAM bandwidth drops from ~39 GB/s for local memory to under 15 GB/s for remote memory (using 8 threads pinned to one socket in both cases). Performance drops a bit more if both chips are running using only the other chip's memory.
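
The latency and bandwidth figures above can be tied to concurrency with Little's Law (data in flight = latency × bandwidth). The following is only a rough sketch under stated assumptions: 64-byte cache lines, GB taken as 10^9 bytes, and the idle latencies quoted above used in place of the (higher) latency under load, so the results are lower bounds rather than measurements. The function name is made up for illustration.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines


def lines_in_flight(latency_ns, bandwidth_gb_per_s, line_bytes=CACHE_LINE_BYTES):
    """Little's Law: average number of cache lines that must be in flight
    to sustain the given bandwidth at the given latency."""
    # ns * GB/s = (1e-9 s) * (1e9 bytes/s) = bytes
    bytes_in_flight = latency_ns * bandwidth_gb_per_s
    return bytes_in_flight / line_bytes


# Figures quoted in the post; "under 15 GB/s" is used as 15, so the
# remote number is an upper estimate for that case.
local = lines_in_flight(79, 39)
remote = lines_in_flight(122, 15)

print(f"local memory:  about {local:.0f} cache lines in flight")
print(f"remote memory: about {remote:.0f} cache lines in flight")
```

With these inputs the local case needs roughly 48 cache lines in flight across the socket and the remote case roughly 29, which gives a sense of how much concurrency an application has to generate to approach STREAM-like bandwidth.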

SB17 · ‎02-14-2014

Thanks all for the interesting answers.

The approach Pat proposed makes it possible to determine latency and throughput.
Based on changes in these parameters (especially changes in latency) relative to an unloaded system, one can indirectly see the memory load. An increase in latency relative to an unloaded system at the same bandwidth indicates that the memory subsystem is loaded.

But today I happened to find the discussion http://software.intel.com/en-us/forums/topic/456184. If I understand correctly, one of the main indicators of memory subsystem load is "Reaching maximum bandwidth requires that the memory "pipeline" be filled with requests" (John D. McCalpin).

A naive question: is there perhaps an easy way to see, through hardware counters, how full the memory "pipeline" is, and so to understand what percentage of the memory subsystem is occupied?

Could LFB_hit be such an indicator?

McCalpinJohn · ‎02-14-2014

For Sandy Bridge and Ivy Bridge processors, the "LFB_HIT" event ("MEM_LOAD_UOPS_RETIRED.HIT_LFB", Event D1H, Umask 40H) is incremented when a load misses the L1 Data Cache but finds that there is already a cache miss pending (i.e., a Line Fill Buffer has been allocated) for that cache line. The LFB allocation could be due to a hardware prefetch, a software prefetch, or a demand miss to the same cache line. For example, without hardware prefetching, a code consisting of SSE loads to contiguous 16 Byte addresses will issue four loads for each cache line. Of these four loads, 1/4 will miss in the L1 while 3/4 (75%) will miss in the L1 but hit in the LFB (and therefore increment the counter for event D1H/40H). This "HIT_LFB" value increases with increasing L1 prefetch effectiveness, but decreases with the size of the loads (i.e., there are only two aligned AVX loads per cache line, so in the absence of prefetching you would expect 50% L1 misses and 50% L1 miss with LFB hit). So you need to know the distribution of the sizes of the loads to interpret this counter quantitatively.
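
The 75% and 50% figures above follow from simple arithmetic, which the sketch below generalizes to other load sizes. It assumes 64-byte cache lines, aligned contiguous loads, no hardware or software prefetching, and data that is not already in the L1; the function name is hypothetical.

```python
def expected_lfb_hit_fraction(load_bytes, line_bytes=64):
    """Fraction of loads expected to miss L1 but hit an already allocated
    Line Fill Buffer, for aligned contiguous loads with no prefetching.

    The first load to each cache line is the true L1 miss; the remaining
    loads to that line find the fill already pending.
    """
    if load_bytes <= 0 or line_bytes % load_bytes != 0:
        raise ValueError("load size must be positive and divide the line size")
    loads_per_line = line_bytes // load_bytes
    return (loads_per_line - 1) / loads_per_line


for name, size in [("64-bit scalar", 8), ("SSE", 16), ("AVX", 32)]:
    frac = expected_lfb_hit_fraction(size)
    print(f"{name:14s} {size:2d} B loads: {frac:.1%} expected LFB hits")
```

A measured HIT_LFB ratio well above the no-prefetch expectation for the code's load sizes would then suggest that prefetches are allocating the fill buffers before the demand loads arrive.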

To get an idea of the average concurrency of L1 cache misses, Event 48H, Umask 01H increments with the number of outstanding L1D misses every cycle. If it works correctly (which I have not checked), then you can simply divide by the cycle count to get the average number of L1 Data Cache misses outstanding over the measurement interval. Setting the CMASK field to 1 will cause the counter to increment every cycle in which there is at least 1 outstanding cache miss. Presumably this would also work for larger CMASK values, so you should be able to count how many cycles there are CMASK or more L1 Data Cache misses outstanding. (I have not tested this either.) Setting the CMASK field to 1 and the EDGE field to 1 will cause the counter to increment whenever there is a change from having 0 outstanding L1 cache misses to having at least 1 outstanding L1 cache miss. This event can only be counted using PMC2, so each test will have to be a separate run.
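
As a sketch of how the three variants of Event 48H described above could be combined, the function below turns raw counts into derived quantities. The function name, the dictionary keys, and the sample counts are all invented for illustration, and because the three counts have to come from separate runs, the arithmetic assumes those runs are repeatable enough to share one cycle count.

```python
def l1d_miss_concurrency(pending_sum, cycles_with_miss, miss_episodes, total_cycles):
    """Derive concurrency metrics from Event 48H / Umask 01H counts.

    pending_sum      -- plain count (sum of outstanding L1D misses per cycle)
    cycles_with_miss -- count with CMASK=1 (cycles with >= 1 miss outstanding)
    miss_episodes    -- count with CMASK=1 and EDGE=1 (0 -> >=1 transitions)
    total_cycles     -- unhalted core cycles over the same interval
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return {
        # Average number of L1D misses outstanding over the whole interval.
        "avg_outstanding": pending_sum / total_cycles,
        # Fraction of cycles with at least one L1D miss outstanding.
        "busy_fraction": cycles_with_miss / total_cycles,
        # Average number outstanding, counting only cycles that had any.
        "avg_outstanding_when_busy": (
            pending_sum / cycles_with_miss if cycles_with_miss else 0.0
        ),
        # Average length, in cycles, of a stretch with misses outstanding.
        "avg_episode_cycles": (
            cycles_with_miss / miss_episodes if miss_episodes else 0.0
        ),
    }


# Made-up counts, purely to show the arithmetic.
metrics = l1d_miss_concurrency(
    pending_sum=6_000_000,
    cycles_with_miss=1_500_000,
    miss_episodes=30_000,
    total_cycles=2_000_000,
)
for key, value in metrics.items():
    print(f"{key:26s} {value:.2f}")
```

Comparing "avg_outstanding_when_busy" against the number of L1 fill buffers per core would be one way to judge how close a code gets to the core's miss concurrency limit, subject to the caveat above that the event itself has not been validated.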

I don't think that there are any comparable counter events that could be used to estimate the average number of concurrent L2 cache misses.

An indirectly related event is the number of cycles for which there are load misses pending and no uops dispatched. This is an indication (but not proof) that stalls are due to load latencies that are too high for the out-of-order mechanisms in the processor to handle. This, in turn, may indicate inadequate memory concurrency. Event A3H provides the ability to count cycles with L1D misses pending and with L2 load misses pending. This event can also count cycles in which no uops were dispatched and you can combine these masks to count cycles with no dispatch in which there was at least one L1D miss pending or cycles with no dispatch in which there was at least one L2 load miss pending. The UMASKs are modified slightly in Ivy Bridge, but I don't understand the differences (yet). If I recall correctly, at least one user reported some strange results from this event, so more experimentation is probably needed to understand what is actually going on.

McCalpinJohn · ‎02-14-2014

Of course as soon as I posted the previous comment I found a performance counter event that provides some indication of concurrency beyond the L2 cache. On Sandy Bridge, Ivy Bridge, and Haswell, Event 60H "OFFCORE_REQUESTS_OUTSTANDING" looks like it can be used to compute the average number of outstanding reads that have missed in the L2 cache. Like Event 48, the CMASK field can be used to count cycles in which at least 1 miss is outstanding. Unlike Event 48, any of the performance counters can count this event, though the results are only valid if HyperThreading is disabled. Also unlike Event 48, this event can count Demand Data Read misses, Demand Data Store Misses (RFOs), Demand Code Read misses, or "All" "cacheable data read transactions".

The Umask for Code Reads is missing in the documentation for Sandy Bridge, so it might be buggy there.

The encoding for "all cacheable data read transactions" is unusual. I would have expected the Umask to be the union of all of the sub-fields (01 || 02 || 04 = 07h), but 08h is used instead. The use of the word "demand" for Umask 01 (data read misses) and Umask 02 (code read misses) implies that L1 hardware prefetches are not counted. For Umask 04 (RFOs = store misses) I think they have to be demand misses -- there is no indication in the performance optimization manual that the L1 hardware prefetchers perform prefetches for stores. The absence of the word "demand" in the description of Umask 08 (all cacheable data read transactions) leaves open whether RFO transactions are counted. Comparing these definitions with those of Event B0H also leaves it open as to whether L2 HW prefetches are counted. Note that event B0h/Umask 08h provides no indication of whether the "prefetch" transactions include L1 HW prefetches, L2 HW prefetches, or both.
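
To make the Umask discussion concrete, here is a minimal sketch that packs Event 60H with the Umask values mentioned above into a raw event-select value. It assumes the standard IA32_PERFEVTSELx layout (event select in bits 0-7, Umask in bits 8-15, USR bit 16, OS bit 17, EDGE bit 18, enable bit 22, CMASK in bits 24-31). The function and dictionary key names are hypothetical, and nothing here settles which transactions each Umask really counts.

```python
# Umask values for Event 60H as described in the post; key names are made up.
EVENT_60H_UMASKS = {
    "demand_data_read": 0x01,
    "demand_code_read": 0x02,
    "demand_rfo": 0x04,
    "all_data_read": 0x08,
}


def perfevtsel(event, umask, cmask=0, edge=False, user=True, kernel=True, enable=True):
    """Pack fields into a 32-bit IA32_PERFEVTSELx-style value."""
    for name, field in (("event", event), ("umask", umask), ("cmask", cmask)):
        if not 0 <= field <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
    value = event | (umask << 8) | (cmask << 24)
    value |= int(user) << 16    # USR: count at privilege levels 1-3
    value |= int(kernel) << 17  # OS: count at privilege level 0
    value |= int(edge) << 18    # EDGE: count transitions, not cycles
    value |= int(enable) << 22  # EN: enable the counter
    return value


# The union of the three demand sub-fields is 07h, not the 08h that is
# actually used for "all cacheable data read transactions".
union = (EVENT_60H_UMASKS["demand_data_read"]
         | EVENT_60H_UMASKS["demand_code_read"]
         | EVENT_60H_UMASKS["demand_rfo"])
print(f"union of demand umasks: {union:#04x}")

for name, umask in EVENT_60H_UMASKS.items():
    occupancy = perfevtsel(0x60, umask)           # sum of outstanding requests
    busy = perfevtsel(0x60, umask, cmask=1)       # cycles with >= 1 outstanding
    print(f"{name:18s} occupancy={occupancy:#010x} busy_cycles={busy:#010x}")
```

Dividing the occupancy count by the cycle count would give the average number of outstanding requests of that type, in the same way as for Event 48H.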

Aside: Some documentation writers treat RFOs as "read" transactions and some do not. RFO is an abbreviation for "Read For Ownership", so it could be counted as a "read", but since it is the result of a "store" instruction, it could also reasonably be excluded from that category. It seems that folks who work in the core (including the L1) tend to *not* count store misses (RFOs) as "read" transactions, while folks out in the uncore consider any transaction that moves data to the core a "read" transaction. The L2 is in between, and I have seen both conventions in common use. Careful directed testing with the L1 and L2 prefetchers enabled and disabled would probably be required to understand exactly what is meant here.

SB17 · ‎02-15-2014

Thank you all for your interesting answers. I will go and try this.

determine memory subsystem utilization