Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Calculating L2 <-> L3/MEM bandwidth on Intel Skylake SP

Thomas_G_4
New Contributor II

I'm trying to measure the L2 cache bandwidths and data volumes on Intel Skylake SP platforms. Commonly I use the event L2_LINES_IN.ALL for all cache lines brought into the L2 and either L2_LINES_OUT.NON_SILENT or L2_TRANS.L2_WB for evictions from the L2. With the non-inclusive L3 cache of the Intel Skylake SP architecture, these events no longer seem sufficient to measure this correctly.
The counts in both tests cover the Triad kernel only.

TEST1: STREAM benchmark on a single core with array size 655360 (5 MB), so everything fits into the L3 (28 MB); 10000 iterations:
Function    Best Rate MB/s  Avg time     Min time     Max time
Triad:          12799.9     0.001232     0.001229     0.001764
Runtime 12.3164 seconds
L2_LINES_IN.ALL 23718860  (123.2504 MB/s)
L2_LINES_OUT.NON_SILENT 2463170000 (12799.3764 MB/s)
L2_TRANS.L2_WB 2463168000  (12799.3661 MB/s)
Memory read bandwidth 17.5003 MB/s
Memory read data volume 0.2155 GB
Memory write bandwidth 10.3682 MB/s
Memory write data volume 0.1277 GB

TEST2: STREAM benchmark on a single core with array size 6553600 (50 MB); 1000 iterations:
Function    Best Rate MB/s  Avg time     Min time     Max time
Triad:          13768.0     0.011457     0.011424     0.011986
Runtime 13.2558 seconds
L2_LINES_IN.ALL 2327640000  (11238.0380 MB/s)
L2_LINES_OUT.NON_SILENT 2460752000 (11880.7138 MB/s)
L2_TRANS.L2_WB 2460752000  (11880.7138 MB/s)
Memory read bandwidth 9823.2177 MB/s
Memory read data volume 130.2144 GB
Memory write bandwidth 3463.2864 MB/s
Memory write data volume 45.9086 GB

It seems that the event L2_LINES_IN.ALL does not count all lines coming from the L3 in the first test. To be sure that the data was not coming from memory, I measured the memory data volume in parallel. Since the memory data volume is very low, the data should be in the L3 and loaded from there into the L2.
In the second test, all cache lines have to be loaded from memory and the results are somewhat reasonable.
Are there separate events for loaded cache lines coming from L3 and from memory? I don't want to program the L3 cache boxes or use the OFFCORE_RESPONSE events.

The description of the L2_LINES_OUT.NON_SILENT event says: "Counts the number of lines that are evicted by L2 cache when triggered by an L2 cache fill. Those lines can be either in modified state or clean state. Modified lines may either be written back to L3 or directly written to memory and not allocated in L3. Clean lines may either be allocated in L3 or dropped."
Is there any documentation on the cases in which the system decides to store modified lines in the L3 versus writing them back to memory? In which cases are clean lines allocated or dropped? Are clean lines dropped before eviction to the L3, or are they transferred and the L3 decides to drop them?
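
For reference, the MB/s numbers printed next to each event above are derived from the raw counts in the obvious way: count times the 64-byte line size, divided by the runtime (decimal MB). A minimal sketch of the conversion, using the TEST1 numbers:

#include <stdio.h>

/* Convert a raw cache-line event count into a bandwidth estimate.
 * Assumes 64-byte cache lines and decimal MB (1e6 bytes); this
 * reproduces the MB/s figures listed next to each event above. */
static double lines_to_mbs(unsigned long long lines, double runtime_s)
{
    return (double)lines * 64.0 / runtime_s / 1e6;
}

int main(void)
{
    /* TEST1 values */
    printf("L2_LINES_IN.ALL         %10.4f MB/s\n", lines_to_mbs(23718860ULL,   12.3164));
    printf("L2_LINES_OUT.NON_SILENT %10.4f MB/s\n", lines_to_mbs(2463170000ULL, 12.3164));
    return 0;
}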

McCalpinJohn
Honored Contributor III

I have been struggling with some of these counters as well...  The flexibility of the cache protocol and the limited detail in the event descriptions make it difficult to design tests to validate hypotheses about the behavior or the accuracy of the counts.

First question: Did you compile the STREAM benchmark with streaming stores or without?  For cache-contained data you definitely don't want streaming stores, and using streaming stores completely changes the way data moves around, so it requires a very different analysis. (Streaming stores may also impact the accuracy of the counters.)

Assuming that you compiled without streaming stores....

In the first set of results above, it looks like the L2 is writing back all clean and dirty data.  In this case there is no data in S state (only E or M), so the counts of L2_LINES_OUT.NON_SILENT and L2_TRANS.L2_WB should be the same (and they are) and the counts should correspond to reading (and victimizing) 3 arrays of data per iteration.  3 arrays of data is the same traffic that the Triad BW calculation assumes, so I would expect L2_LINES_OUT.NON_SILENT and L2_TRANS.L2_WB to have rates of 12.8 GB/s (and they do).  The L2_LINES_IN.ALL should be counting 17.07 GB/s (12.8 GB/s going into the L2 from the L3 plus 4.27 GB/s going into the L2 from the L1), and instead it is counting approximately zero.  That is not a good sign.
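
To spell out the arithmetic behind those expected rates (a sketch, assuming no streaming stores and that the reported Triad rate corresponds to three arrays of traffic per iteration, as STREAM assumes):

#include <stdio.h>

int main(void)
{
    double triad_mbs = 12799.9;   /* reported Triad rate = 3 arrays of traffic */

    /* Lines entering the L2 from outside (L3/DRAM):
     * read b + read c + RFO of a = 3 arrays. */
    double l2_in_outside = triad_mbs;           /* ~12.8 GB/s */

    /* Dirty L1 writebacks of a into the L2 add another array. */
    double l2_in_from_l1 = triad_mbs / 3.0;     /* ~4.27 GB/s */

    /* Victims leaving the L2 (clean b and c plus dirty a) = 3 arrays,
     * if the L2 writes back clean data non-silently. */
    double l2_out = triad_mbs;                  /* ~12.8 GB/s */

    printf("expected L2_LINES_IN.ALL rate : %8.2f MB/s\n", l2_in_outside + l2_in_from_l1);
    printf("expected L2_LINES_OUT rate    : %8.2f MB/s\n", l2_out);
    return 0;
}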

In the second set of results above, the DRAM read bandwidth is close to the value that you would expect for a case compiled with streaming stores -- 2/3 of 13768 MB/s is 9179 MB/s and you measured 9823 MB/s.   So your measured bandwidth is either 7% high (if the code uses streaming stores) or 29% low (if the code does not use streaming stores).  The DRAM write bandwidth is 25% too low, so I will assume that the code was compiled without streaming stores and that you are getting unwanted L3 hits because the arrays are not big enough to ensure that the L3 is fully flushed in each iteration.  (The STREAM run rules require that each array be 4x the size of the aggregate cache available, so you need to at least double the array sizes.  That is just a rule of thumb -- I often just jump immediately to N=80 million, which is the largest round number that allows all three arrays to fit in 32 bits.  For larger sizes you need to add "-mcmodel=medium" or modify the code to allocate the arrays dynamically.)   At least in this case you are getting plausible counts for L2_LINES_IN.ALL -- the 11.24 GB/s is about 20% low if the event is only supposed to count L2 lines in from L3+DRAM, or it is 40% low if the event is supposed to count L2 lines in from both the "outside" (L3+DRAM) and the "inside" (L1 Writebacks).  The L2_LINES_OUT.NON_SILENT and L2_TRANS.L2_WB are identical again, with values large enough to confirm that the L2 is writing back both clean and dirty data -- about 15% less than expected, but way higher than the 4.59 GB/s rate expected for just the dirty data (assuming no L3 hits).
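
The same sort of accounting for the DRAM side of the second test (again a sketch, assuming the reported Triad rate corresponds to three arrays of traffic and that there are no unwanted L3 hits):

#include <stdio.h>

int main(void)
{
    double triad_mbs = 13768.0;   /* reported Triad rate = 3 arrays of traffic */

    /* With streaming stores: DRAM reads = b + c = 2 arrays, writes = a = 1 array. */
    double rd_nt  = triad_mbs * 2.0 / 3.0;   /* ~9179 MB/s */
    double wr_nt  = triad_mbs / 3.0;         /* ~4589 MB/s */

    /* Without streaming stores: reads = b + c + RFO of a = 3 arrays,
     * writes = dirty a = 1 array. */
    double rd_reg = triad_mbs;               /* ~13768 MB/s */
    double wr_reg = triad_mbs / 3.0;         /* ~4589 MB/s */

    printf("streaming stores    : read %.0f MB/s, write %.0f MB/s\n", rd_nt,  wr_nt);
    printf("no streaming stores : read %.0f MB/s, write %.0f MB/s\n", rd_reg, wr_reg);
    printf("measured            : read 9823 MB/s, write 3463 MB/s\n");
    return 0;
}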

There is definitely a lot more work to do to understand this chip....

Krishnaswa_V_Intel

Can you double-check the event ID and umask that you are using for L2_LINES_IN.ALL? The event ID should be 0xf1 and the umask 0x1f. I have never seen a case where L2_LINES_IN.ALL is this far off; it must be due to some other error.
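
For anyone checking this with Linux perf_event, the raw encoding follows the IA32_PERFEVTSELx layout (event select in bits 7:0, umask in bits 15:8), so event 0xf1 with umask 0x1f becomes config 0x1ff1. A minimal sketch (not the exact tooling used in this thread):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type           = PERF_TYPE_RAW;
    attr.size           = sizeof(attr);
    attr.config         = (0x1fULL << 8) | 0xf1;  /* umask 0x1f, event 0xf1 -> 0x1ff1 */
    attr.disabled       = 1;
    attr.exclude_kernel = 1;

    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the kernel being measured here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) { perror("read"); return 1; }
    printf("L2_LINES_IN.ALL = %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}

The perf tool accepts the same encoding as -e r1ff1 or -e cpu/event=0xf1,umask=0x1f/.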

Thomas_G_4
New Contributor II

Hi,

Thanks for the analysis of the STREAM values; it helped me get a better picture of the chip. I used GCC 5.4.0 with the options -O3 -ftree-vectorize. There are no NT stores in the assembly.

And another thanks for the hint to double-check the umask. I had copied the configuration of the Skylake desktop parts, where the umask is just 0x7 and not 0x1f. With that change the values look more promising. Now the difficult part starts: validating the counts (and hopefully finding events that can differentiate between loads from the L3 and loads from memory). I'll post updates here.

Best,
Thomas

Travis_D_
New Contributor II

McCalpin, John wrote:

In the first set of results above, it looks like the L2 is writing back all clean and dirty data.  In this case there is no data in S state (only E or M), so the counts of L2_LINES_OUT.NON_SILENT and L2_TRANS.L2_WB should be the same (and they are)

Out of curiosity, in what scenario would NON_SILENT and L2_WB not be the same? Said another way, what is the difference between these two events?

S lines are dropped silently, right? M lines are written back. What about E lines? They can be evicted non-silently, but don't need to be written back?

McCalpinJohn
Honored Contributor III

From the Intel documentation it is clear that evictions can be "silent" or "non-silent", but I have not seen any documentation of which transactions fall into each category.   In the olden days this used to be a simple decision based on transaction type, but with modern Intel processors there is a fair likelihood that at least some transactions can be of either class, with the choice based on buffer occupancy or history-based predictors.

I would assume that a dirty L2 WB would always be non-silent -- especially if it is sent to the L3 (which is co-located with the CHA).

On at least one system that I have helped design, evictions of clean E-state lines were non-silent.  A "clean replacement notification" is sent to the directory so that it knows that no cache can have a dirty copy of the line.   My current interpretation is that SKX processors provide eviction notification to the (local) snoop filter on clean E-state victims, but that they do not provide notification to the home directory on clean E-state victims that belong to remote nodes.  Lots of hypotheses in this area can be tested, but it is important to be careful of context -- with dynamically adapting mechanisms, the same cache state transition may generate a completely different pattern of bus transactions depending on load and perhaps on history-based prediction mechanisms.

Notifications on evictions of S state lines are possible, but (in my experience) are not as widely used as notifications on evictions of clean E state lines.  The benefits of having a more up-to-date directory (snoop filter) have to be weighed against the overhead of the additional bus traffic.  In addition, some designs don't precisely track S state lines.  In a large NUMA system, for example, the tracking of S-state lines may be by "node", without keeping track of how many caches in that node have a copy of the line.  In such cases there may also be additional cache-to-cache copies of S state lines without notifying the directory, making it even harder to know when a line no longer has any shared copies in an entity tracked by a single bit.

I have had no luck finding any performance counter events that allow me to track writes to the L3 (measured at the L3) on SKX.  LLC_LOOKUPS.WRITE counts are several orders of magnitude too low in the three tests that I have done -- LLC writes due to Snoop Filter Evictions, LLC writes due to clean L2 victims, and LLC writes due to dirty L2 victims.  This makes some analyses harder to interpret unambiguously.....

Travis_D_
New Contributor II

Thanks Dr. McCalpin, your answer is helpful as always, even though, as you point out, we don't yet have a full picture of how it works.

You may find this question interesting. It was found that SKL shows WB events (not simply non-silent evictions) in a case where I'd expect no WBs: where a workload fit entirely in the L3 cache.

McCalpinJohn
Honored Contributor III

I have never had access to an SKL (client) processor, and have had limited access to client processors in previous generations.  (I had some Xeon E3-1270 (v1) processors that I did a fair amount of work with, but that has been 5-6 years and I don't remember very much.  I have access to Haswell-generation client processors in my Mac systems, but have not done any detailed performance analysis on these.)

Concerning the experiments reported at https://stackoverflow.com/questions/52565303/on-skylake-skl-why-are-there-l2-writebacks-in-a-read-only-workload-that-exceed.

  • For data sets larger than the L1 cache, it is critical to know whether the data was on 4KiB pages or 2MiB pages (i.e., "transparent huge pages").  
    • I usually use an anonymous mmap with an array size that is a multiple of 2MiB to encourage the use of 2MiB transparent huge pages, then I print the virtual address of the base of the array to see if it is 2MiB-aligned.  (A sketch of this allocation appears after this list.)
    • With 4KiB pages, 4 bits of the L2 cache index are subject to virtual-to-physical translation on Skylake client processors (256 KiB, 4-way associative = 64 KiB per way = 16 4KiB pages = 4 bits).  If the (random) 4KiB page addresses don't have a uniform distribution of these L2 cache index bits, then L2 overflows will occur prematurely.
  • The events L2_LINES_OUT.SILENT and L2_LINES_OUT.NON_SILENT are not included in the entry for this SKL processor (Table 19-4) in Chapter 19 of Volume 3 of the SWDM.   
    • The "silent" vs "non-silent" distinction could be relevant to the inclusive L3 of SKL as it is relevant to the inclusive Snoop Filter of SKX, but the inconsistency in the documentation gives some cause for concern.
  • I went through all the tables in Chapter 19 of the SWDM and looked up the Umasks for Event 0xF2 for both the "client" and "server" models for Nehalem/Westmere, Sandy Bridge/Ivy Bridge, Haswell/Broadwell, and Skylake.   I then repeated the exercise using the tables at http://download.01.org/perfmon/.
    • Client and server models Nehalem/Westmere and SNB/IVB all have the same definitions of the Umask bits, and there are no differences between Volume 3 of the SWDM and the 01.org site.
      • bit 3: dirty line displaced by a HW prefetch
      • bit 2: clean line displaced by a HW prefetch
      • bit 1: dirty line displaced by a demand access
      • bit 0: clean line displaced by a demand access
    • The event encoding clearly changes between NHM->IVB and Haswell, with my interpretation of the new encoding as:
      • bit 3: not used (maybe PF access?)
      • bit 2: demand access
      • bit 1: dirty victim line
      • bit 0: clean victim line
    • Curiously, Broadwell shares Umask 0x05 ("Clean L2 cache lines evicted by demand") with Haswell, but does not include Umask 0x06 in either reference.
    • SKX clearly changes the meaning of the Umask bits again:
      • Bit 2 refers to lines that were brought into the L2 cache by the HW Prefetcher(s), but which were never accessed before being evicted.
      • Bit 1 refers to non-silent victims (i.e., an eviction notification was sent to the Snoop Filter, or the specific WB transaction used provides comparable information to the Snoop Filter).
      • Bit 0 refers to silent victims (i.e., no eviction notification was sent to the Snoop Filter -- only possible for clean victims).
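
A minimal sketch of the anonymous-mmap allocation mentioned in the first bullet above (assumptions: Linux, 64-bit, MADV_HUGEPAGE available; the madvise call is only a hint, so the printed base address should be checked for 2MiB alignment):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Allocate 'bytes' rounded up to a multiple of 2 MiB via an anonymous mmap
 * and ask the kernel to back the region with transparent huge pages. */
static void *alloc_huge(size_t bytes)
{
    const size_t two_mib = 2UL << 20;
    size_t len = (bytes + two_mib - 1) & ~(two_mib - 1);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }
#ifdef MADV_HUGEPAGE
    madvise(p, len, MADV_HUGEPAGE);            /* hint: prefer 2MiB THP */
#endif
    /* Print the base address so the 2MiB alignment can be verified. */
    printf("array base %p, 2MiB-aligned: %s\n", p,
           ((uintptr_t)p & (two_mib - 1)) ? "no" : "yes");
    return p;
}

int main(void)
{
    /* N = 80 million doubles, as suggested earlier in this thread for STREAM. */
    double *a = alloc_huge(80000000UL * sizeof(double));
    a[0] = 1.0;                                /* touch the memory */
    return 0;
}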

There certainly remains a lot of work to do to understand these processors.....

HadiBrais
New Contributor III

Regarding the question posted in comment #7 by Travis D., I have repeated the experiments on a Coffee Lake processor but with the following changes:

  • Using 1GB pages instead of 4KB pages. Since the largest input size is 2^20 KB, all input sizes fit within a single 1GB page. This guarantees that the whole array is contiguous in the physical address space and is therefore evenly distributed over the L1, L2, and L3 cache sets.
  • Instead of using perf with the -D498 switch, I've used the LIKWID wrapper APIs to ensure that the core event counts are precisely measured over the loop of interest.
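
A minimal sketch of how the loop of interest can be scoped with the LIKWID marker API (macro names as in likwid.h / likwid-marker.h; the markers are only active when the code is compiled with -DLIKWID_PERFMON and run under likwid-perfctr -m):

#include <stdio.h>
#include <likwid.h>          /* newer LIKWID releases: <likwid-marker.h> */

#define N (1UL << 24)        /* 16M doubles = 128 MB, larger than the L3 */
static double a[N];

int main(void)
{
    LIKWID_MARKER_INIT;

    for (unsigned long i = 0; i < N; i++) a[i] = 1.0;   /* initialize / page in */

    double sum = 0.0;
    LIKWID_MARKER_START("readloop");                    /* counting starts here */
    for (unsigned long i = 0; i < N; i++) sum += a[i];
    LIKWID_MARKER_STOP("readloop");                     /* counting stops here */

    LIKWID_MARKER_CLOSE;
    printf("sum = %f\n", sum);
    return 0;
}

Built with something like gcc -O2 -DLIKWID_PERFMON readloop.c -llikwid and run with likwid-perfctr -C 0 -g <event set> -m ./a.out, only the region between the markers is attributed to "readloop".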

I've also disabled hyperthreading and all L1 and L2 hardware prefetchers to simplify the analysis.

Note that the L2_LINES_OUT.SILENT and L2_LINES_OUT.NON_SILENT events are documented for Skylake as shown in https://download.01.org/perfmon/index/. These also seem to work on my CFL processor.

I've measured the following core events:

  • L2_RQSTS_ALL_DEMAND_DATA_RD
  • L2_RQSTS_ALL_CODE_RD
  • L2_RQSTS_ALL_DEMAND_REFERENCES
  • L2_RQSTS_REFERENCES
  • L2_LINES_OUT_SILENT
  • L2_LINES_OUT_NON_SILENT
  • L2_TRANS_L2_WB
  • L2_LINES_IN_ALL
  • L2_LINES_IN_S
  • L2_LINES_IN_E
  • L2_LINES_IN_I

The last three events are measured in a separate run due to the limited number of core PMU counters.

I observed the following:

  • The counts of L2_LINES_OUT_SILENT, L2_LINES_OUT_NON_SILENT, L2_TRANS_L2_WB, and L2_LINES_IN_ALL are exactly the same as what Travis has observed on SKL. That is, L2_TRANS_L2_WB is equal to L2_LINES_OUT_NON_SILENT. The normalized L2_LINES_IN_ALL value is equal to 1 when the size of the array is larger than the L2 size (i.e., 256KB); the normalization is sketched after this list. Basically, the graph looks exactly the same as the one from the Stack Overflow post. So the question still stands: why are there so many L2_LINES_OUT_NON_SILENT events when the array size is larger than the L3?
  • L2_RQSTS_ALL_CODE_RD increases slightly but consistently with larger input sizes. However, the largest count is smaller than 1500, so it's negligible with respect to other event counts when the input size is 64KB or larger (i.e., doesn't fit in the L1D).
  • L2_LINES_IN_E is almost equal to L2_LINES_IN_ALL. This is expected because we know that no other core accesses the same lines. It shows that most of the lines brought from the L3 into the L2 are in the E state, and since the loop being measured does not modify any cache lines and the L3 is inclusive of the L2, it doesn't make sense for the L2 to write back any lines to the L3. So we expect to see only L2_LINES_OUT_SILENT events.
  • L2_RQSTS_ALL_DEMAND_REFERENCES is equal to L2_RQSTS_REFERENCES. This is expected because the L1 and L2 hardware prefetchers are disabled. (I'm not sure whether these count software prefetches, but this is irrelevant because there are no software prefetches.)
  • L2_RQSTS_ALL_DEMAND_REFERENCES is equal to the sum of L2_RQSTS_ALL_CODE_RD and L2_RQSTS_ALL_DEMAND_DATA_RD. This is expected because that's basically how the manual describes them, at least according to my understanding.
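
To make "normalized" in the first point concrete, the value is the raw count divided by the total number of 64-byte lines the loop touches, so 1.0 means one event per line per pass over the array. A small helper along those lines (a sketch; the helper name is made up):

#include <stdio.h>

/* Hypothetical helper: express a raw event count as "events per 64-byte
 * line touched", assuming the loop streams over the whole array once per
 * iteration. A value of 1.0 means one event per cache line per pass. */
static double normalize(unsigned long long count,
                        unsigned long long array_bytes,
                        unsigned long long iterations)
{
    double lines_touched = (double)(array_bytes / 64ULL) * (double)iterations;
    return (double)count / lines_touched;
}

int main(void)
{
    /* Example: a 1 GiB array read once, with one line-in event per line. */
    printf("%.3f\n", normalize(1ULL << 24, 1ULL << 30, 1));
    return 0;
}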

I thought of using other (offcore or uncore) performance events to count the number of L2 writebacks. To my knowledge, OFFCORE_RESPONSE cannot be used for this purpose. However, some uncore CBox events look useful:

  • UNC_CBO_CACHE_LOOKUP.WRITE_MESI: My understanding of this event is that it counts the number of cache lines written into the L3 slice that corresponds to the specified CBox from one of the private L2 caches.
  • UNC_CBO_CACHE_LOOKUP.READ_I: My understanding of this event is that it counts the number of cache line read requests to the L3 slice that corresponds to the specified CBox from one of the private L2 caches where the requested line was found in the I state (i.e., not found, i.e., an L3 read (or RFO?) miss).
  • UNC_ARB_TRK_REQUESTS.DRD_DIRECT: This ARB unit event occurs for each read request sent to the memory controller.
  • UNC_ARB_TRK_REQUESTS.WRITES: The documentation says that this event counts evictions. I think it refers to L3 writebacks to the memory controller.
  • UNC_ARB_TRK_REQUESTS.EVICTIONS: This event is only documented for client IVB and SNB.

I've made the following additional observations using these counters:

  • The sum of UNC_CBO_CACHE_LOOKUP.READ_I over all CBoxes is slightly smaller than the total number of load instructions retired. We would expect them to be equal, but they are within 10% of each other. I suspect that the L2 streamer prefetcher functionality that prefetches into the L3 cannot be disabled, or perhaps this is an effect of the replacement policy of the L3.
  • The UNC_CBO_CACHE_LOOKUP.WRITE_MESI count is never larger than 40 million for all input sizes. If my understanding of this counter is correct, this would be a significant inconsistency with respect to L2_LINES_OUT_NON_SILENT. According to WRITE_M, most writes are to lines that are already in the M state.
  • UNC_ARB_TRK_REQUESTS.EVICTIONS is very close to UNC_ARB_TRK_REQUESTS.WRITES and to the sum of UNC_CBO_CACHE_LOOKUP.WRITE_MESI over all CBoxes. That is, they are all under 40 million for all input sizes. So again, if my understanding is correct, the number of L3 evictions doesn't seem to correlate with the input size. If the L2 were indeed writing dirty lines back to the L3, then I think the number of L3 evictions should be close to the number of non-silent L2 evictions.
  • UNC_ARB_TRK_REQUESTS.DRD_DIRECT is about equal to the number of load instructions when the array doesn't fit in the L3. Alternatively, OFFCORE_RESPONSE can be used to show this.

I've made other side observations which I don't understand:

  • My CPU has 6 physical cores, but I can only use 5 CBoxes.
  • My CFL processor doesn't have an IMC PMU. However, I have another client HSW processor that seems to have an IMC PMU. I don't know where the IMC PMU is documented for client processors.