Solved: Are there any available counters for LLC-out prefetch traffic on recent Architectures?

hiratz · ‎09-25-2018

Hi John and Other Intel experts,

1

Across the many Intel multicore architecture generations, many PMU events related to the L2 prefetchers are exposed, which is great! However, I did not find any events that count the prefetch traffic leaving the LLC, either per core or for the whole CPU. I believe such events would be very useful for analyzing off-chip memory bandwidth usage.

(Note: all cited PMU event names below are from Intel 64 and IA-32 Architectures SDM Volume 3B Order Number: 253669-067US May 2018)

"LLC_MISSES" does count the LLC-out traffic. However, as John pointed out in this post (https://groups.google.com/a/icl.utk.edu/forum/#!topic/ptools-perfapi/fLw-L6k-7j8), it does not count the traffic due to L2 hardware prefetches that miss the LLC (or L3). Another possible event is OFFCORE_RESPONSE_0. Some of its subfields look related to prefetching at the LLC: PF_LLC_DATA_RD, PF_LLC_RFO, PF_LLC_IFETCH (Table 18-16), but in fact they only count the prefetch traffic arriving at the LLC from the per-core L2 caches, not the traffic that leaves the LLC. Even so, I found some subfields in the early 2nd Generation Intel Core architecture that seem close to the LLC-out traffic (Table 19-16):

OFFCORE_RESPONSE.PF_LLC_CODE_RD.LLC_MISS.DRAM_N

OFFCORE_RESPONSE.PF_LLC_DATA_RD.LLC_MISS.DRAM_N

OFFCORE_RESPONSE.PF_LLC_RFO.LLC_MISS.DRAM_N

But I found that they disappeared in the architectures after the 2nd Generation. I am not sure whether they are "hidden but still available" or "removed because of some deficiencies". I just cannot find them in the SDM for recent architectures.

Some early processors, like the quad-core Intel Q9550, have only two cache levels and two separate LLC (L2) caches. There, the two events "L2_LINES_IN:DEMAND" and "L2_LINES_IN:PREFETCH" count the demand and prefetch traffic, respectively, between the LLC and memory. But for today's three-level cache architectures, it seems that counters for the prefetch traffic between the LLC and memory no longer exist.

So my first question is what the title says: Are there any available counters for LLC-out prefetch traffic on recent architectures, per logical CPU and/or for the whole CPU?

2

My second question is still about OFFCORE_RESPONSE_0. Besides the above-mentioned "PF_LLC_DATA_RD, PF_LLC_RFO, PF_LLC_IFETCH", there are also three subfields related to the L2 prefetchers: PF_DATA_RD, PF_RFO, PF_IFETCH (still Table 18-16), which count the prefetch traffic generated by the L2 prefetchers. On the other hand, the following events count the prefetch requests on the L2 cache side. (I take the event names from the 5th Generation (Table 19-7) as an example because this is the platform I'm using now; I will discuss the same events in later generations, which seem a little different, below.)

L2_RQSTS.L2_PF_HIT (event: 24, mask: 50) (Descriptions: Counts all L2 HW prefetcher requests that hit L2)

L2_RQSTS.L2_PF_MISS (event: 24, mask: 30) (Counts all L2 HW prefetcher requests that missed L2)

L2_RQSTS.ALL_PF (event: 24, mask: F8) (Counts all L2 HW prefetcher requests.)
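For readers who want to program the events listed above by hand, here is a minimal sketch of how an event/mask pair maps to a raw event code. It assumes the Linux `perf` raw-event syntax (`rUUEE`, unit mask in bits 8-15, event select in bits 0-7), which the thread itself does not mention; the helper name `raw_event` is made up for illustration.

```python
def raw_event(event: int, umask: int) -> str:
    """Return a perf-style raw event string such as 'r5024'.

    Assumption: Linux perf raw syntax, umask in bits 8-15 and
    event select in bits 0-7 of the config value.
    """
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in 8 bits")
    return f"r{(umask << 8) | event:04x}"


# Event/mask pairs exactly as quoted in the post (hexadecimal values).
L2_PF_EVENTS = {
    "L2_RQSTS.L2_PF_HIT": (0x24, 0x50),
    "L2_RQSTS.L2_PF_MISS": (0x24, 0x30),
    "L2_RQSTS.ALL_PF": (0x24, 0xF8),
}

if __name__ == "__main__":
    for name, (event, umask) in L2_PF_EVENTS.items():
        print(f"{name:22s} -> {raw_event(event, umask)}")
```

The resulting strings could then be passed to something like `perf stat -e <code>`, but whether a given code is actually implemented on a given processor is exactly the open question of this post.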

Comparing the above L2_RQSTS.xxx events with OFFCORE_RESPONSE_0:PF_xxx, do the following estimates hold in theory?

Sum of L2_RQSTS.L2_PF_MISS per logical cpu == PF_LLC_DATA_RD + PF_LLC_RFO + PF_LLC_IFETCH

Sum of L2_RQSTS.ALL_PF per logical cpu == PF_DATA_RD + PF_RFO + PF_IFETCH

(Note that a request counted by L2_RQSTS.L2_PF_HIT will be canceled and will not arrive at the LLC, if I understand this correctly just from the event names.)
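One way to test the two proposed estimates empirically is to collect both sets of counts over the same interval and compare them with a tolerance. The sketch below only does the arithmetic; the function names, the dictionary layout, and the 5% tolerance are assumptions for illustration, and collecting the counts is left to whatever tool is in use.

```python
def relative_gap(lhs: float, rhs: float) -> float:
    """Relative difference between two counts (0.0 when both are zero)."""
    denom = max(abs(lhs), abs(rhs))
    return 0.0 if denom == 0 else abs(lhs - rhs) / denom


def check_estimates(l2_pf_miss_per_cpu, l2_all_pf_per_cpu, offcore, tol=0.05):
    """Compare per-CPU L2_RQSTS sums against OFFCORE_RESPONSE_0 subfield sums.

    offcore maps subfield names (as written in the post) to counts.
    Returns {estimate_name: (relative_gap, within_tolerance)}.
    """
    llc_side = (offcore["PF_LLC_DATA_RD"] + offcore["PF_LLC_RFO"]
                + offcore["PF_LLC_IFETCH"])
    l2_side = offcore["PF_DATA_RD"] + offcore["PF_RFO"] + offcore["PF_IFETCH"]
    gaps = {
        "L2_PF_MISS vs PF_LLC_*": relative_gap(sum(l2_pf_miss_per_cpu), llc_side),
        "ALL_PF vs PF_*": relative_gap(sum(l2_all_pf_per_cpu), l2_side),
    }
    return {name: (gap, gap <= tol) for name, gap in gaps.items()}
```

A large gap on a real machine would suggest that the two event families do not count the same set of requests, which is the question being asked here.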

3

My third question is about the L2_RQSTS.xxx events I mentioned in section 2 above. If I understand their descriptions correctly, they count only the requests from the L2 prefetchers themselves and do not include prefetch requests from the L1 prefetchers. (Actually, the L1 prefetch requests that arrive at L2 are also viewed as "demand" requests even though they are not.)

However, these events seem to have been replaced with different ones in the newer 6th, 7th and 8th generation architectures, see below (Table 19-4):

L2_RQSTS.PF_HIT (event: 24, mask: D8) (Descriptions: Prefetches that hit L2.)

L2_RQSTS.PF_MISS (event: 24, mask: 38) (Requests from the L1/L2/L3 hardware prefetchers or load software prefetches that miss L2 cache.)

L2_RQSTS.ALL_PF (event: 24, mask: F8) (All requests from the L1/L2/L3 hardware prefetchers or load software prefetches.)

The first two have different masks and descriptions from those of the 4th and 5th generations. So my questions are as follows:

(1) Is it true that L2_RQSTS.PF_MISS counts all types of prefetch requests that miss L2, while L2_RQSTS.L2_PF_MISS counts only the L2 misses generated by the L2 hardware prefetchers themselves?

(2) Are L2_RQSTS.L2_PF_HIT/MISS still available on the 6th, 7th and 8th generation architectures? (Their masks are not reused by other events.)

(3) L2_RQSTS.ALL_PF keeps the same event and mask across the 4th through 8th generations. So does it have the same meaning throughout, that is, counting all types of prefetch requests? Or does it count only L2 HW prefetch requests on the 4th and 5th generations but all types of prefetch requests on the 6th, 7th and 8th (even though its event and mask do not change)? I have to say some of the descriptions are confusing.

4

In the post "Understanding L2 Miss performance counters" (https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/520331), there is an answer by Vish Viswanathan (Intel):

"

Xeon processors support 2 forms of L2 streaming prefetches. In one case, the data will be fetched into L2. In the other case, the data will only fetched into L3. This 2nd case is also known as LLC prefetch (or L3 prefetch) though it is still initiated by L2.

Haswell PMU has a bug and it can't count whether LLC prefetches hit in LLC or miss LLC. However, L2_RSQTS.MISS will count those. That is why you are seeing the difference. If you disable L2 prefetcher, then your numbers should match

"

He mentioned two working mechanisms for L2 streaming prefetchers. According to his descriptions, it looks like the data will bypass L3 and directly be fetched into L2 in the first case, and the data will be fetched into L3 in the second case. However, according to the following statements in Chapter 2.4.5.4 in "Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-040, April 2018",

"The streamer and spatial prefetcher prefetch the data to the last level cache. Typically data is brought also to the L2 unless the L2 cache is heavily loaded with missing demand requests."

"When cache lines are far ahead, it prefetches to the last level cache only and not to the L2. This method avoids replacement of useful cache lines in the L2 cache."

the data cannot bypass L3 and be fetched only into L2; only "fetched into both L3 and L2" and "fetched into only L3" exist. Did I misunderstand the above answer?

In addition, "However, L2_RSQTS.MISS will count those." seems to mean that it can count "whether LLC prefetches hit in LLC or miss LLC." from the previous statement. How can it differentiate the LLC prefetches that hit the LLC from the ones that miss? I'm so confused.

BTW, I think the name "LLC prefetch (or L3 prefetch)" in the answer is really confusing. It leaves the impression that there is an LLC hardware prefetcher that brings data into the LLC.

Thanks in advance

McCalpinJohn · ‎09-26-2018

There are a lot of questions here, and I am pretty sure I don’t understand exactly what you are asking….

L2 HW prefetch events in the core are generally limited to seeing whether the HW Prefetch hits or misses in the L2 cache. If it misses in the L2 cache (which is the desired case), the HW prefetch is placed on the ring bus and counted as a miss. Whether it hits or misses in the L3 requires a different type of tracking, such as the tracking provided by the OFFCORE_RESPONSE event or by the CBo counters.

The architectural event LLC MISSES has slightly different definitions on different platforms, making it a bit challenging to know whether it is actually accurate. Last week I found cases where it undercounts significantly on KNL, for example. On Xeon Scalable Processors, the description in Chapter 19 has a lot more words and says that the event includes L2 HW prefetches, but it is not one of the events that I typically monitor.

The OFFCORE_RESPONSE events have different capabilities and different bugs on different platforms. It is quite difficult to keep track of all the combinations.

A note on “Part 4” — in processors before Xeon Scalable processors, it is not possible for prefetches to “bypass the L3” because the L3 is inclusive. All fetches will put the data in the L3 cache, but only some of the L2 HW prefetches will bring the data into the L2 cache. The L2 HW prefetcher is extremely dynamic. In my experiments on Xeon E5 v3, at the beginning of each 4KiB page, the L2 HW prefetcher issues the “fetch to L2” version (which will also put the data in L3), but by the middle of the page, the L2 has reached a high enough level of buffer/queue utilization that it switches to “fetch to L3” for the remainder of the page. The details of how these prefetches are controlled are undocumented and it is not clear that it is possible to infer them from measurements.

By the way, I agree that the term “LLC Prefetch” (or “L3 Prefetch”) is confusing. It is an L2 HW prefetch that brings the data into the L3. These are commonly generated by the L2 “streamer” hardware prefetcher — I don’t know if they are also generated by the L2 “adjacent line” prefetcher.

hiratz · ‎09-26-2018

Hi John,

Thanks for your reply, and I'm sorry for not describing my questions clearly ...

Simply put, there is only one important question I want answered: is there any way to measure the prefetch traffic that leaves the LLC and arrives at memory? If there is, we can calculate the percentage of prefetch traffic in the total traffic (or bandwidth) between the LLC and memory, observe how much of this shared resource is occupied by prefetching, and apply some control. Furthermore, knowing the per-core prefetch traffic that leaves the LLC would be even better and would allow more fine-grained control.
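The metric described above is simple arithmetic once the two counts exist. A minimal sketch, assuming the counts are in units of cache lines and that a line is 64 bytes (true of the processors discussed in this thread); both function names are made up for illustration:

```python
CACHE_LINE_BYTES = 64  # line size on the Intel processors discussed here


def prefetch_share(prefetch_lines: int, total_lines: int) -> float:
    """Fraction of LLC-to-memory traffic that is due to prefetches."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return prefetch_lines / total_lines


def bandwidth_gb_per_s(lines: int, seconds: float,
                       line_bytes: int = CACHE_LINE_BYTES) -> float:
    """Convert a cache-line count over an interval to GB/s (10^9 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return lines * line_bytes / seconds / 1e9
```

The hard part, as the rest of the thread shows, is obtaining a trustworthy `prefetch_lines` value at all, especially per core.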

McCalpinJohn · ‎09-26-2018

Depending on the platform, you might be able to get the global (not per-thread) information about HW prefetches that miss in the L3 using the uncore performance counters.

For Xeon E5 v4 (Broadwell EP), you can use a combination of opcode matching and filtering to find lots of information. In the Xeon E5 v4 uncore performance monitoring guide (document 334291-001), Table 2-18 shows the filter register fields that allow you to select which L3 cache states to count (e.g., hit or miss), Table 2-19 shows how to turn on opcode matching, and Table 2-20 shows the opcodes that can be matched. It is not clear how many transaction types might be missing from this table... :-(

For Xeon Scalable processors, the opcodes are different and the non-inclusive L3 makes the behavior different as well. It looks like you should be able to get what you need there, but I have not had time to look through the details. The uncore performance monitoring guide is document 336274, but I think you need to search for it by name and not by number....

hiratz · ‎09-26-2018

Hi John,

Thanks for the documents. I read the CHA (Caching/Home Agent) monitoring part in them. For document 334291-001 (Xeon 2600 V4 Series), I noticed that each CBox is associated with an LLC slice and contains 4 general uncore counters (Table 2-13). If I understand it correctly, each CBox only collects statistics for its local slice, and we need to sum the counters across all CBoxes if we want LLC-level metrics like the LLC miss rate, right?

In Table 2-18, the item “state” in the filter0 register shows 7 states. But I know the Intel LLC uses the MESIF protocol to maintain coherence. What do the remaining two states (M’ and D) mean? Also, how can I identify a hit or a miss just from these states? Alternatively, could I calculate the miss rate in a different way by directly using the uncore events “LLC_LOOKUP” and “LLC_VICTIMS” in the table in Chapter 2.3.4?

Table 2-20 shows the opcode match values, and I found three that are related to prefetch: PrefRFO, PrefCode and PrefData. But their definitions look confusing. For example, the definition of PrefData is “Prefetch Data into LLC but don’t pass to L2. Includes hints”. Does it mean that it only counts the prefetched data brought into L3? If so, it counts just a fraction of the total prefetch traffic … Additionally, what does "hints" mean here?

For Skylake-SP (document 336274), there are no major changes. The names of the registers are changed to a new prefix “Cn_” instead of previous “CBo_”. In the definition of filter0 register (Table 2-54), it adds more states about SF. I guess it means “snoop filter”, right? The LLC states M’ and D I mentioned above are removed.

As for its opcode match values in Table 3-1, there are three events similar to “PrefRFO, PrefCode and PrefData”: “LLCPrefRFO”, “LLCPrefCode” and “LLCPrefData”, but there is no statement like “don’t pass to L2” in their definitions. I am not sure whether they have the same meaning as in the Xeon E5 v4 series.

McCalpinJohn · ‎09-27-2018

Congratulations, grasshopper, you are now well into the "on-your-own" zone for Intel performance counters....

(1) Yes, there are separate programmable performance counters for each CHA, so you need to sum the counts (or differences between successive counts) across all CHAs to get a value for the chip as a whole. Addresses are hashed around the chip, so there is no relation between "core 0" and "cha 0". The cores and CHAs are also numbered in completely different ways, so "core 0" and "CHA 0" may not even be next to each other (see my presentation at https://www.ixpug.org/documents/1524216121knl_skx_topology_coherence_2018-03-23.pptx).

(2) Different companies have different ways of talking about cache states. For these counters, Intel uses the "I" state to indicate a cache miss. (I think this is a really bad idea -- there is a world of difference between finding a tag match on a line that is in "Invalid" state and not finding a tag match (i.e., a "miss")). But for this case MESF are hits and I is miss.

(3) Table 2-18 in the Xeon E5 v4 uncore performance monitoring guide includes some weird states that don't seem to be otherwise documented. Sometimes you can find information in the other sections -- we don't have many Xeon E5 v4's, so I have not looked at this before. (The Xeon E5 v3 has 2 "M" states in the corresponding table -- I don't know what those mean either.)

(4) The "SF" filter entries on SKX are for the Snoop Filters. I have not tried these filter bits, but there are other CHA performance counter events related to the Snoop Filters that are consistent with other counter events.

(5) The PrefRFO, PrefCode, and PrefData entries in the opcode table for Xeon E5 v4 all include the "don't pass to L2" comment. The L2 HW prefetcher can issue prefetches to L3 or to L3+L2. It is not clear whether a "hardware prefetch to L3+L2" is included in any of the documented opcodes. On Skylake Xeon, the L3 is not inclusive, so it is possible to prefetch to the L2 without also putting the data in the L3. Some of my SKX systems have a BIOS option to enable prefetching into the L3 cache (and not the L2), but we don't usually enable this mode. (On SKX, the L3 is much smaller and usually serves as a "victim cache" for the L2, rather than as the big giant shared cache (as on Xeon E5 v1,2,3,4).)
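Points (1) and (2) above can be sketched as follows: take the difference between two successive reads of each CHA's counter, sum the differences across all CHAs, and form a miss ratio from an "I"-state lookup count over an all-states lookup count. The 48-bit counter width is an assumption to check against the uncore guide for the platform in use, and all function names are made up for illustration; reading the counters themselves is not shown.

```python
COUNTER_BITS = 48  # assumption: uncore counter width; verify per platform


def counter_delta(before: int, after: int, bits: int = COUNTER_BITS) -> int:
    """Difference between successive reads, tolerating one wraparound."""
    return (after - before) % (1 << bits)


def chip_total(before, after) -> int:
    """Sum per-CHA deltas to get a whole-chip count.

    before/after are equal-length sequences, one entry per CHA.
    Addresses are hashed across CHAs, so only the sum is meaningful.
    """
    if len(before) != len(after):
        raise ValueError("need one before/after pair per CHA")
    return sum(counter_delta(b, a) for b, a in zip(before, after))


def miss_ratio(i_state_lookups: int, all_state_lookups: int) -> float:
    """Miss ratio using the convention 'I' = miss, M/E/S/F = hit."""
    return i_state_lookups / all_state_lookups if all_state_lookups else 0.0
```

Because each CHA has only a few counters and one filter setting at a time, the "I"-state and all-states counts may need separate counters or separate runs.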

hiratz · ‎09-28-2018

Thanks John, your presentation is really helpful to me!

I agree with you that it is really confusing and even misleading to use the "I" state to indicate a cache miss. Since you are also not clear about the "M'" state and "D" state in Table 2-18 for Xeon E5 v4, should I ignore them (just do as you said "MESF are hits and I is miss")? I still have some concerns about its inaccurate counting without considering these two weird states ...

For Skylake-SP, according to your answer (5) and the following document (https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview), its LLC is definitely a victim cache just for demand requests from the cores. Previously I thought traffic into the LLC came only from cache lines evicted from L2, but now it obviously also accepts prefetch traffic from memory. The path should be: prefetch requests go to DRAM directly from L2 and come back to L3 or L2 (but not both) with the data.

I read your presentation carefully and have some thoughts and questions. I may misunderstand something. Hope you can point them out. Thanks!

About memory directory, is it a per-DRAM hardware structure sitting inside each IMC? If so, Skylake-SP should have two directories because there are two IMCs. And the number of the entries in each directory should be the same as the number of the LLC lines in its sub-NUMA cluster (SNC), right?

Quote (page 8): “tell the processor whether another socket might have a dirty copy ...”. Here “might” means a misjudgment could happen (the line is viewed as clean but actually is not). So how can correctness be guaranteed?

In another scenario, if a cache line has copies in multiple sockets and one of them is dirty, each memory directory in the DRAM attached to these sockets must have one entry for this cache line, and all these entry copies must be kept consistent, which should be implemented by some internal mechanism, right?

Quote (Page 10): “Remote reads can update the directory, forcing the entire cache line to be written back to DRAM”. I think here the “DRAM” should be remote DRAM and the remote directory update should cause all other directories (could be in different sockets) that have the cache line to be updated also (as I guess above).

Additionally, what does “Intervention” latency mean here?

Page 11: According to some papers I read and the document I linked above, I believe some intelligent mechanisms like cache block reuse detection or dead block elimination could be used at the microarchitecture level. So no users can control this.

Page 13: I think the number of entries in the snoop filter should equal the number of lines in L1+L2, right? When a new data block is brought into L2, it also needs to be added to the snoop filter and could replace an existing entry if the snoop filter is full. But this means the new block still needs to go to the filter in L3, which sounds contradictory to “it is directly brought into L2”. Did I misunderstand something?

Additionally, I think that if each L2 line eviction always brings in a new line, there should be no need to issue an update request to the snoop filter when the eviction happens, because of the mechanism I described in the last paragraph (if my guess is correct). If the L2 eviction does not bring in a new line (I cannot think of such a scenario), the filter update request would be necessary.

Page 16: When the LLC/CHA below the IMC tries to go up (Y direction) to another core, can it go directly (that is, bypass the hop of IMC) to that core with one hop or it cannot and needs to go to IMC first and then go to that core (with 2 hops)?

CAPID6 makes it possible to construct various core topologies that show different system performance. You found that 5 of them have slow STREAM copy performance, which is good work!

I haven’t read the material on KNL tile numbering yet and will probably read it later.

Finally, what exactly does "HitME" mean? None of the uncore documents describe its function in detail.

McCalpinJohn · ‎10-05-2018

I don't think that the LLC on the Xeon Scalable Processors is only a victim cache for "demand" accesses. The wording in https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview does not appear to be intended to distinguish between demand and hardware prefetch activity. (The document does not include the term "prefetch", for example.)

The interaction of the L2's with the shared L3 is complex and includes dynamically-adapting mechanisms that take into account transaction type, data home, (probably) buffer occupancy, and (possibly) history. From my limited testing, it appears that dirty L2 victims, L2 victims generated by snoop filter interventions, and L2 victims with remote home nodes have a very high probability of being sent to the L3 (rather than dropped if clean or sent to memory if dirty). Clean victims that are locally home are sometimes sent to the L3 and are sometimes dropped, depending on factors that are not completely clear. The heuristics might include history-based predictors of whether the line is likely to be re-used from the L3 before being victimized from the L3, but the details are not documented.

The memory directory information is almost certainly stored in spare bits in the ECC fields of each cache line. If a cache line is sent to another socket in the M or E states, the directory is updated to indicate that the remote socket must be snooped (because it might have written to the cache line). This typically causes the local memory controller to have to write the cache line back to local DRAM -- the data has not changed, but the directory bit(s) have been modified, so as soon as the memory controller needs to reuse the buffer for that line, it must generate a new ECC value (due to the modified directory bit(s)) and write the cache line back to local memory.

The directory bits can be cleared in a variety of scenarios. If a cache line is sent to a local core in M or E state, or if a remote socket writes back a dirty line, then the directory is updated to indicate that remote sockets do not need to be snooped on a read. An implementation might also send "clean eviction notifications" from the remote socket to the home if an E-state line is dropped without being modified. I don't think that Intel does this. These notifications would come too late to prevent the local memory controller from writing the line back to DRAM and would actually force the line to be read and written again.

If the directory bit is set, this simply means that local reads must snoop the remote chip. If the bit is set incorrectly (e.g., after a clean eviction of an E-state line), then the only penalty is a remote snoop latency.
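The directory behavior described in the last three paragraphs can be captured in a toy model of a single home line with one "snoop the remote socket" bit. This is only an illustration of the rules as stated in this post, not Intel's implementation; the class and method names are made up, and it models a two-socket system with no clean-eviction notifications.

```python
from dataclasses import dataclass


@dataclass
class DirectoryLine:
    """Toy model of the per-line memory directory bit at the home socket."""
    snoop_remote: bool = False  # set: local reads must snoop the remote socket

    def grant_remote(self, state: str) -> None:
        # A line sent to another socket in M or E might be written there.
        if state in ("M", "E"):
            self.snoop_remote = True

    def grant_local(self, state: str) -> None:
        # A line sent to a local core in M or E clears the bit.
        if state in ("M", "E"):
            self.snoop_remote = False

    def remote_dirty_writeback(self) -> None:
        # The remote socket returned its dirty copy; no snoop needed.
        self.snoop_remote = False

    def remote_clean_eviction(self) -> None:
        # Assumed: no notification reaches the home, so the bit stays set.
        pass

    def local_read_needs_remote_snoop(self) -> bool:
        return self.snoop_remote


if __name__ == "__main__":
    line = DirectoryLine()
    line.grant_remote("E")
    line.remote_clean_eviction()
    # The bit is now stale: the only cost is an unnecessary remote snoop.
    print(line.local_read_needs_remote_snoop())
```

The stale-bit case shows why "might have a dirty copy" is safe: the bit can be conservatively set, which costs latency, but it is never cleared while a remote socket could still hold a modified copy.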

hiratz · ‎10-18-2018

Hi John,

Thank you for the detailed explanation! I'm sorry about my late reply because I was so swamped recently!

I found a pretty interesting paper that reverse engineered the non-inclusive LLC in Skylake-X/SP and found that there is a hidden inclusive directory structure. This directory is also sliced and operates alongside the non-inclusive LLC. The relevant section is "REVERSE ENGINEERING THE DIRECTORY STRUCTURE IN INTEL SKYLAKE-X PROCESSORS" on page 6. Fig. 9 shows an overall diagram. (I am not sure whether the "directory structure" in this paper is exactly the same as the "memory directory" you mentioned in your slides.)

This is the link: http://iacoma.cs.uiuc.edu/iacoma-papers/ssp19.pdf (Attack Directories, Not Caches: Side-Channel Attacks in a Non-Inclusive World)

In addition, in "B. Slice Hash Function" in the Appendix, they reverse engineered part of the slice hash function in the Skylake-X processor.

Overall, a lot of internal details are revealed. Hopefully this is also useful to you (if you have not read it yet).

Best