I'm trying to measure the L2 cache bandwidths and data volumes on Intel Skylake SP platforms. Commonly I use the events L2_LINES_IN.ALL for all cache lines brought into L2 and either L2_LINES_OUT.NON_SILENT or L2_TRANS.L2_WB for evicts from L2. With the non-inclusive L3 cache of the Intel Skylake SP architecture, the events seem to be not sufficient anymore to measure correctly.
The counts in both tests cover the Triad kernel only.
TEST1: STREAM benchmark on a single core with array size 655360 = (5MB), so everything fits into L3 (28MB) (10000 iterations):
Function Best Rate MB/s Avg time Min time Max time
Triad: 12799.9 0.001232 0.001229 0.001764
Runtime 12.3164 seconds
L2_LINES_IN.ALL 23718860 (123.2504 MB/s)
L2_LINES_OUT.NON_SILENT 2463170000 (12799.3764 MB/s)
L2_TRANS.L2_WB 2463168000 (12799.3661 MB/s)
Memory read bandwidth 17.5003 MB/s
Memory read data volume 0.2155 GB
Memory write bandwidth 10.3682 MB/s
Memory write data volume 0.1277 GB
TEST2: STREAM benchmark on a single core with array size 6553600 = (50MB) (1000 iterations):
Function Best Rate MB/s Avg time Min time Max time
Triad: 13768.0 0.011457 0.011424 0.011986
Runtime 13.2558 seconds
L2_LINES_IN.ALL 2327640000 (11238.0380 MB/s)
L2_LINES_OUT.NON_SILENT 2460752000 (11880.7138 MB/s)
L2_TRANS.L2_WB 2460752000 (11880.7138 MB/s)
Memory read bandwidth 9823.2177 MB/s
Memory read data volume 130.2144 GB
Memory write bandwidth 3463.2864 MB/s
Memory write data volume 45.9086 GB
It seems that the event L2_LINES_IN.ALL does not count all lines coming from L3 in the first test. To be sure that data is not coming from memory, I measured the memory data volume in parallel. Since memory data volume is very low, the data should be in L3 and loaded from there into the L2.
In the second test, all cache lines have to be loaded from memory and the results are somewhat reasonable.
Are there separate events for loaded cache lines coming from L3 and from memory? I don't want to program the L3 cache boxes or use the OFFCORE_RESPONSE events.
In the description of the L2_LINES_OUT.NON_SILENT event, it says: Counts the number of lines that are evicted by L2 cache when triggered by an L2 cache fill. Those lines can be either in modified state or clean state. Modified lines may either be written back to L3 or directly written to memory and not allocated in L3. Clean lines may either be allocated in L3 or dropped
Is there any documentation in which cases the system decides to store modified lines in L3 or to write them back to memory? In which cases are clean lines allocated or dropped? Are the clean lines dropped before eviction to L3 or are they transferred and the L3 decides to drop them?
I have been struggling with some of these counters as well... The flexibility of the cache protocol and the limited detail in the event descriptions makes it difficult to design tests to validate hypotheses about the behavior or the accuracy of the counts.
First question: Did you compile the STREAM benchmark with streaming stores or without? For cache-contained data you definitely don't want streaming stores, and using streaming stores completely changes the way data moves around, so it requires a very different analysis. (Streaming stores may also impact the accuracy of the counters.)
Assuming that you compiled without streaming stores....
In the first set of results above, it looks like the L2 is writing back all clean and dirty data. In this case there is no data in S state (only E or M), so the counts of L2_LINES_OUT.NON_SILENT and L2_TRANS.L2_WB should be the same (and they are) and the counts should correspond to reading (and victimizing) 3 arrays of data per iteration. 3 arrays of data is the same traffic that the Triad BW calculation assumes, so I would expect L2_LINES_OUT.NON_SILENT and L2_TRANS.L2_WB to have rates of 12.8 GB/s (and they do). The L2_LINES_IN.ALL should be counting 17.07 GB/s (12.8 GB/s going into the L2 from the L3 plus 4.27 GB/s going into the L2 from the L1), and instead it is counting approximately zero. That is not a good sign.
In the second set of results above, the DRAM read bandwidth is close to the value that you would expect for a case compiled with streaming stores -- 2/3 of 13768 MB/s is 9179 MB/s and you measured 9823 MB/s. So your measured bandwidth is either 7% high (if the code uses streaming stores) or 29% low (if the code does not use streaming stores). The DRAM write bandwidth is 25% too low, so I will assume that the code was compiled without streaming stores and that you are getting unwanted L3 hits because the arrays are not big enough to ensure that the L3 is fully flushed in each iteration. (The STREAM run rules require that each array be 4x the size of the aggregate cache available, so you need to at least double the array sizes. That is just a rule of thumb -- I often just jump immediately to N=80 million, which is the largest round number that allows all three arrays to fit in 32 bits. For larger sizes you need to add "-mcmodel=medium" or modify the code to allocate the arrays dynamically.) At least in this case you are getting plausible counts for L2_LINES_IN.ALL -- the 11.24 GB/s is about 20% low if the event is only supposed to count L2 lines in from L3+DRAM, or it is 40% low if the event is supposed to count L2 lines in from both the "outside" (L3+DRAM) and the "inside" (L1 Writebacks). The L2_LINES_OUT.NON_SILENT and L2_TRANS.L2_WB are identical again, with values large enough to confirm that the L2 is writing back both clean and dirty data -- about 15% less than expected, but way higher than the 4.59 GB/s rate expected for just the dirty data (assuming no L3 hits).
There is definitely a lot more work to do to understand this chip....
Can you double check the event id and umask that you are using for L2_LINES_IN.ALL. Event id should be 0xf1 and umask 0x1f. I have never seen any case where L2_LINES_IN.ALL is so off. It must be due to some other error
Thanks for the analysis of the STREAM values, it helped me to get a better picture of the chip. I used GCC 5.4.0 with options -O3 -ftree-vectorize. There are no nt-stores in the assembly.
And another thanks for the hint to double-check the umask. I copied the configuration of the Skylake Desktop and there the umask is just 0x7 and not 0x1f. With the changes the values look more promising. Now starts the difficult part trying to validate the counts (and hopefully find events that can differentiate between loads from L3 and loads from memory). I'll post updates here.
McCalpin, John wrote:
In the first set of results above, it looks like the L2 is writing back all clean and dirty data. In this case there is no data in S state (only E or M), so the counts of L2_LINES_OUT.NON_SILENT and L2_TRANS.L2_WB should be the same (and they are)
Out of curiosity, in what scenario would NON_SILENT and L2_WB not be the same, or said another way - what is the difference between these two events?
S lines are dropped silently, right? M lines are written back. What about E lines? They can be evicted non-silently, but don't need to be written back?
From the Intel documentation it is clear that evictions can be "silent" or "non-silent", but I have not seen any documentation of which transactions fall into each category. In the olden days this used to be a simple decision based on transaction type, but with modern Intel processors there is a fair likelihood that at least some transactions can be of either class, with the choice based on buffer occupancy or history-based predictors.
I would assume that a dirty L2 WB would always be non-silent -- especially if it is sent to the L3 (which is co-located with the CHA).
On at least one system that I have helped design, evictions of clean E state lines was non-silent. A "clean replacement notification" is sent to the directory so that it knows that no cache can have a dirty copy of the line. My current interpretation is that SKX processors provide eviction notification to the (local) snoop filter on clean E-state victims, but that they do not provide notification to the home directory on clean E-state victims that belong to remote nodes. Lots of hypotheses in this area can be tested, but it is important to be careful of context -- with dynamically adapting mechanisms, the same cache state transition may generate a completely different pattern of bus transactions depending on load and perhaps on history-based prediction mechanisms.
Notifications on evictions of S state lines are possible, but (in my experience) are not as widely used as notifications on evictions of clean E state lines. The benefits of having a more up-to-date directory (snoop filter) have to be weighed against the overhead of the additional bus traffic. In addition, some designs don't precisely track S state lines. In a large NUMA system, for example, the tracking of S-state lines may be by "node", without keeping track of how many caches in that node have a copy of the line. In such cases there may also be additional cache-to-cache copies of S state lines without notifying the directory, making it even harder to know when a line no longer has any shared copies in an entity tracked by a single bit.
I have had no luck finding any performance counter events that allow me to track writes to the L3 (measured at the L3) on SKX. LLC_LOOKUPS.WRITE counts are several orders of magnitude too low in the three tests that I have done -- LLC writes due to Snoop Filter Evictions, LLC writes due to clean L2 victims, and LLC writes due to dirty L2 victims. This makes some analyses harder to interpret unambiguously.....
Thanks Dr. McCalpin, your answer is helpful as always - even though we don't yet have a full picture of how it works as you point out.
You may find this question interesting. It was found that SKL shows WB events (not simply non-silent evictions) in a case where I'd expect no WBs: where a workload fit entirely in the L3 cache.
I have never had access to an SKL (client) processor, and have had limited access to client processors in previous generations. (I had some Xeon E3-1270 (v1) processors that I did a fair amount of work with, but that has been 5-6 years and I don't remember very much. I have access to Haswell-generation client processors in my Mac systems, but have not done any detailed performance analysis on these.)
Concerning the experiments reported at https://stackoverflow.com/questions/52565303/on-skylake-skl-why-are-there-l2-writebacks-in-a-read-on....
There certainly remains a lot of work to do to understand these processors.....
Regarding the question posted in comment #7 by Travis D., I have repeated the experiments on a Coffee Lake processor but with the following changes:
I've also disabled hyperthreading and all L1 and L2 hardware prefetchers to simplify the analysis.
Note that the L2_LINES_OUT.SILENT and L2_LINES_OUT.NON_SILENT events are documented for Skylake as shown in https://download.01.org/perfmon/index/. These also seem work on my CFL processor.
I've measured the following core events:
The last three events are measured in a separate run due to the limited number of core PMU counters.
I observed the following:
I thought of using other (offcore or uncore) performance events to count the number of L2 writebacks. To my knowledge, OFFCORE_RESPONSE cannot be used for this purpose. However, some uncore CBox events look useful:
I've made the following additional observations using these counters:
Ive made other "on the side" observations which I don't undertand: