CLX SP: Weird writeback of clean cache lines

AlexGhiti · ‎09-11-2020

Hi,

We noticed, in a dual-socket setup with *"directory mode" disabled*, some write-backs to DRAM of clean cache lines (i.e., lines that were only read and never modified by anyone). This "issue" is very common when all the HW prefetchers are enabled, and very rare but still present when they are all disabled.

Does anyone have any idea what causes those clean write-backs and how to prevent them?

I noticed two IDI opcodes that look relevant in "Intel Xeon Processor Scalable Memory Family Uncore Performance Monitoring": WbEFtoI and WbEFtoE.

Of course, I can come up with more details if needed.

Thanks,

McCalpinJohn · ‎09-14-2020

My first guess is that these are lines that the L2 was trying to evict to the L3, but which the L3 decided not to cache. We know the behavior of the caches is dynamic, but we don't have much idea of what controls the behavior, and (as this issue suggests), we only have a coarse idea of the options available to the caches.

In general terms, it does not make any difference whether such a clean eviction is dropped or written back to DRAM, but it is certainly possible that some detail of the implementation of the snoop filters or L3 or memory directories makes writing the line all the way back to memory the preferred option.

It is also possible that this occurs for some types of eviction operations and not for others. For example, clean lines evicted from the L2 due to Snoop Filter Evictions may behave differently than clean lines evicted from the L2 due to loads into the L2.

The processor family has the unusual characteristic that the associativity of the victim cache (11-way L3) is lower than the associativity of the Snoop Filter (12-way) or the L2 caches (16-way). This can lead to pathological conflict behavior (e.g., https://sites.utexas.edu/jdm4372/2019/01/07/sc18-paper-hpl-and-dgemm-performance-variability-on-intel-xeon-platinum-8160-processors/), which could easily result in unexpected cache transactions (either as a deliberate performance mitigation, or as the unexpected occurrence of a "corner case" transaction that exists for obscure reasons known only to the design team).

AlexGhiti · ‎09-16-2020

Hi John,

Thank you for your answer.

Can you expand on why you think this is an L2 cache line that was not accepted by the L3?

I struggle to understand when a useless write-back to DRAM could be the preferred option, and why I observe this behavior only in a dual-socket configuration. Maybe it is a corner case that is only visible in this configuration because of timing. And I don't see the point of having a sort of RFO that marks the line as Modified instead of Exclusive; the gain would be null, right?

"It is also possible that this occurs for some types of eviction operations and not for others. " => I agree with you, since evicting those cache lines by software does not trigger the issue. And disabling the prefetchers only mitigates it.

"In general terms, it does not make any difference whether such a clean eviction is dropped or written back to DRAM" => I agree with you that it is harmless for system coherence, but it is not harmless for us: our in-DRAM processors changed the content of those lines in DRAM, and the changes got overwritten by those clean write-backs.

I used the perf tool to monitor both the WbEFtoI and WbEFtoE opcodes and, though I have seen some, they do not seem to be the cause of the problem we encounter.

I'll definitely take a look at your presentation, which I had already come across, and I will keep digging into this issue. If anything comes to mind, please don't hesitate to share it.

Thanks again,

Alexandre Ghiti

McCalpinJohn · ‎09-25-2020

My guess that this is an L2->L3 victim that the L3 did not accept was just a guess -- it has the right semantics and I have worked on processor designs that allowed for a cache to ignore a transaction (and let it pass through to the next level) if the cache was "too busy". For the case of clean data, dropping the WB would be "coherent", but that also requires a decision, and it could have been easier to just let the data bypass.

In SKX, a clean WB from the L2 should update the Snoop Filter as well, so there are possibilities for more complex situations: both SF and L3 busy, only L3 busy, only SF busy.

There are also more caches involved! Even with memory directories enabled, the coherence engine at each tile is also managing the HITME cache, the IO Directory Cache (IODC), issues related to handling Direct Cache Access (DCA), and perhaps others. These are all minimally documented.

SKX includes new mechanisms to dynamically control whether L2 victims are cached by the L3, but I don't know that there has been any detailed disclosure of how these work. In principle, the decision could be made either by the L2 or the L3 -- or both.

The core performance counter event IDI_MISC.WB_DOWNGRADE counts the "number of lines that are dropped and not written back to the L3 as they are deemed less likely to be reused shortly". My inference here is that no data WB is generated by the L2, but that there would still be a "Clean Eviction Notice" sent to the Snoop Filter. This strongly suggests that the L2 has a role in the decision to send clean victims to the L3, but that does not mean that the L3 cannot also have a role.

For ECC-enabled systems, the DRAM "scrubber" may also rewrite clean lines. On SKX, the uncore guide notes that the IMC_WRITES event does not count writes due to ECC errors. The IMC has a counter for correctable errors, but I have never tried to test it. (We immediately service systems that throw more than a few correctable errors per day, so there is little incentive to study these errors in more detail.)

AlexGhiti · ‎10-15-2020

Hi John,

Thanks for your explanation.

ECC scrubber was a good idea but we disabled ECC on our system.

"SKX includes new mechanisms to dynamically control whether L2 victims are cached by the L3, but I don't know that there has been any detailed disclosure of how these work. In principle, the decision could be made either by the L2 or the L3 -- or both."

=> Ok I will try to find those.

What puzzles me most is that the problem occurs only in dual-socket: I can't reproduce those clean write-backs in a single-socket configuration. I may be completely wrong (and at this point I have not been able to prove it), but it seems as if directory mode is still partially enabled (remember that I disabled it), so that some cache lines trigger a clean WB in order to update the ECC bits that hold the cache line state.

I still have some experiments to run; I'll get back to you.

Thanks again for your ideas, and sorry for the delayed response,

Alexandre Ghiti

AlexGhiti · ‎11-10-2020

Hi John,

I failed to find the mechanisms that dynamically control the destination of L2 victims: where did you hear about them?

I also noticed that, even if I disable ALL the HW prefetchers (all the prefetchers in MSR 0x1A4, XPT/UPI, LLC) and invalidate everything that is read, some cache lines are still present in the cache at some point: do you know of any other mechanism that could fetch those lines behind my back?
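
As a point of reference for the MSR 0x1A4 settings mentioned above, here is a minimal Python sketch of how the disable mask can be built and decoded. It assumes the publicly disclosed bit layout (bits 0 to 3, where a set bit disables the corresponding prefetcher) applies to the CPU under test; the dictionary keys and helper names are illustrative, not Intel's. The sketch covers only those four bits; the XPT/UPI and LLC prefetchers are configured separately.

```python
# Assumed bit layout of MSR 0x1A4: a SET bit DISABLES the prefetcher.
# The key names below are illustrative labels, not official identifiers.
PREFETCHER_DISABLE_BITS = {
    "l2_hw_prefetcher": 0,
    "l2_adjacent_line_prefetcher": 1,
    "dcu_streamer_prefetcher": 2,
    "dcu_ip_prefetcher": 3,
}


def disable_mask(names):
    """Return the MSR value that disables the named prefetchers."""
    mask = 0
    for name in names:
        mask |= 1 << PREFETCHER_DISABLE_BITS[name]
    return mask


def decode_disabled(value):
    """Map each prefetcher name to True if its disable bit is set in value."""
    return {
        name: bool((value >> bit) & 1)
        for name, bit in PREFETCHER_DISABLE_BITS.items()
    }


if __name__ == "__main__":
    all_off = disable_mask(PREFETCHER_DISABLE_BITS)
    print(f"value disabling all four prefetchers: {all_off:#x}")
    print(decode_disabled(all_off))
```

On Linux, a value like this is typically written with `wrmsr` from msr-tools (for example `wrmsr -a 0x1a4 0xf`), which requires root and the `msr` kernel module.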

And a final question: have you ever seen the acronym "DBP"? It seems related to caches, but I can't find its meaning.

Thanks,

Alex

McCalpinJohn · ‎11-12-2020

The dynamic L2 victim control is implied by the documentation of the IDI_MISC.WB_UPGRADE and IDI_MISC.WB_DOWNGRADE core performance counter events in the Skylake Xeon/CascadeLake Xeon processors. (Table 19-3 in Volume 3 of the SWDM.)

Since Ivy Bridge, Intel mainstream processors have included a "next-page prefetcher" that operates on virtual addresses at the L1 Data Cache level. I am not aware of any documentation, but my testing suggests that when it predicts that memory accesses are going to continue from one 4KiB page to the next, it fetches one cache line from the next 4KiB virtual page into the L1 Data Cache. The primary benefits are to pre-load the TLB entry for the next 4KiB page and to "prime" the L2 HW prefetchers so that they will be able to start operating sooner. I don't know of any mechanism that disables this prefetcher without disabling caching completely (e.g., CR0.CD, or MTRR type of UC).

AlexGhiti · ‎11-19-2020

Hi John,

@McCalpinJohn wrote:
The dynamic L2 victim control is implied by the documentation of the IDI_MISC.WB_UPGRADE and IDI_MISC.WB_DOWNGRADE core performance counter events in the Skylake Xeon/CascadeLake Xeon processors. (Table 19-3 in Volume 3 of the SWDM.)

Ok, so you did not mean that anything was configurable, only that it is dynamic.

Since Ivy Bridge, Intel mainstream processors have included a "next-page prefetcher" that operates on virtual addresses at the L1 Data Cache level. I am not aware of any documentation, but my testing suggests that when it predicts that memory accesses are going to continue from one 4KiB page to the next, it fetches one cache line from the next 4KiB virtual page into the L1 Data Cache. The primary benefits are to pre-load the TLB entry for the next 4KiB page and to "prime" the L2 HW prefetchers so that they will be able to start operating sooner. I don't know of any mechanism that disables this prefetcher without disabling caching completely (e.g., CR0.CD, or MTRR type of UC).

I will come up with a test to confirm your hypothesis. It was one of ours too, alongside a side effect of the Management Engine or of speculative execution.

I finally (better late than never...) found a way to measure those clean write-backs. Our DRAM architecture is a bit special: BG0 and BG1 have DRAM whereas BG2 and BG3 don't. I observed the clean write-backs on BG0 and BG1 with our in-DRAM processors, in a test that only reads from BG0 and BG1 but writes to BG2.

So I measured the number of writes seen by the IMC on BG0 only, which reflects the number of clean writebacks:

# Events: event=0x4,umask=0xc   = cas_count_write
#         event=0x4,umask=0x3   = cas_count_read
#         event=0xb8,umask=0x11 = wr_cas_rank0 BG0
$ perf stat \
    -e uncore_imc_1/event=0x4,umask=0xc/ \
    -e uncore_imc_1/event=0x4,umask=0x3/ \
    -e uncore_imc_1/event=0xb8,umask=0x11/

Performance counter stats for 'system wide':

452,963,850 uncore_imc_1/event=0x4,umask=0xc/
3,697,263,307 uncore_imc_1/event=0x4,umask=0x3/
20 uncore_imc_1/event=0xb8,umask=0x11/
0 uncore_imc_1/event=0xb9,umask=0x11/

1826.846941108 seconds time elapsed

So you can see that the number of clean write-backs is only 20 in a test that ran for ~30 min. And again, those write-backs are only observed in a dual-socket configuration.
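
To put those counts in perspective, the following minimal Python sketch parses the counter lines shown above and derives a write-back rate. The parsing helper is an illustration written for this post, not part of perf, and the ratio is only indicative because the CAS counts cover the whole IMC while the 0xb8 event covers rank 0, BG0 only.

```python
# Counter lines and elapsed time copied from the perf stat output above.
PERF_OUTPUT = """\
452,963,850 uncore_imc_1/event=0x4,umask=0xc/
3,697,263,307 uncore_imc_1/event=0x4,umask=0x3/
20 uncore_imc_1/event=0xb8,umask=0x11/
0 uncore_imc_1/event=0xb9,umask=0x11/
"""
ELAPSED_SECONDS = 1826.846941108


def parse_counts(text):
    """Return {event_string: count} for lines of the form '<count> <event>'."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unrelated lines
        value, event = parts
        counts[event] = int(value.replace(",", ""))
    return counts


counts = parse_counts(PERF_OUTPUT)
clean_wb = counts["uncore_imc_1/event=0xb8,umask=0x11/"]
reads = counts["uncore_imc_1/event=0x4,umask=0x3/"]

wb_per_hour = clean_wb / ELAPSED_SECONDS * 3600
reads_per_wb = reads / clean_wb

if __name__ == "__main__":
    print(f"elapsed: {ELAPSED_SECONDS / 60:.1f} min")
    print(f"clean write-backs per hour: {wb_per_hour:.1f}")
    print(f"read CAS per clean write-back: {reads_per_wb:,.0f}")
```

With so few events per hour, any reproduction attempt needs a long run (or many repetitions) before a zero count can be trusted.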

I'll be back with more information.

Thanks again for your time.

Alex

McCalpinJohn · ‎11-12-2020

The only reference I have seen to "DBP" is in the BIOS Setup Guide for the Intel S2600 server: https://www.intel.com/content/dam/support/us/en/documents/server-products/Intel_Xeon_Processor_Scalable_Family_BIOS_User_Guide.pdf

The context of this option in the BIOS "Processor Configuration" menu suggests that it is related to caching, but that may be an incorrect inference.

I see the option in the BIOS configuration file on my S2600 server, where the setting is "Disabled".

The document refers to MSR 0x792, which is not accessible on my S2600 servers. It is possible that setting the option to "Enabled" and rebooting would make MSR 0x792 available, but without having some idea of what it does, I don't see any way to make forward progress....

A different approach that might get lucky would be to look in the Intel patent portfolio for a phrase that maps to "d---- b----- p-----". Pretty easy regex on plain text -- not sure how to do it on a PDF or HTML source....
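
A minimal Python sketch of that search, using only the standard library, might look like the following. The pattern, the helper names, and the sample sentence are all made up for illustration; HTML is reduced to text with `html.parser`, and a PDF would first have to be converted to plain text by an external tool (for example `pdftotext`) before the same function can be applied.

```python
import re
from html.parser import HTMLParser

# Three consecutive words starting with d, b, p (any case),
# separated by whitespace or hyphens.
DBP_PATTERN = re.compile(r"\bd\w*[\s-]+b\w*[\s-]+p\w*", re.IGNORECASE)


class _TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def find_dbp_candidates(text):
    """Return the sorted, de-duplicated, lower-cased d/b/p phrases in text."""
    normalized = " ".join(text.split())  # collapse line breaks and runs of spaces
    return sorted({m.group(0).lower() for m in DBP_PATTERN.finditer(normalized)})


if __name__ == "__main__":
    # Made-up sample; the phrase is a placeholder, not a guess at the acronym.
    sample_html = "<p>The cache uses <b>dummy buffer placeholder</b> in this made-up sentence.</p>"
    print(find_dbp_candidates(html_to_text(sample_html)))
```

Matches are non-overlapping, so a run such as "d b d b p" would be missed; for a one-off search of patent text that trade-off seems acceptable.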

When an MSR is not accessible, I usually assume that the MSR is related to a feature that the BIOS has disabled (e.g., HWP -- hardware-coordinated P-states), or that it is related to the setup of a feature that cannot be changed on a live system (e.g., mapping of physical addresses to DRAM channels/ranks/banks). (In the latter case there is usually (not always!) a different way to *read* the configuration information after the system has booted.)

AlexGhiti · ‎11-19-2020

Hi John,

@McCalpinJohn wrote:
The only reference I have seen to "DBP" is in the BIOS Setup Guide for the Intel S2600 server: https://www.intel.com/content/dam/support/us/en/documents/server-products/Intel_Xeon_Processor_Scalable_Family_BIOS_User_Guide.pdf
The context of this option in the BIOS "Processor Configuration" menu suggests that it is related to caching, but that may be an incorrect inference.
I see the option in the BIOS configuration file on my S2600 server, where the setting is "Disabled".
The document refers to MSR 0x792, which is not accessible on my S2600 servers. It is possible that setting the option to "Enabled" and rebooting would make MSR 0x792 available, but without having some idea of what it does, I don't see any way to make forward progress....
A different approach that might get lucky would be to look in the Intel patent portfolio for a phrase that maps to "d---- b----- p-----". Pretty easy regex on plain text -- not sure how to do it on a PDF or HTML source....

Thanks for taking the time to find something about that. I came across this term in some documentation; maybe it is irrelevant.

When an MSR is not accessible, I usually assume that the MSR is related to a feature that the BIOS has disabled (e.g., HWP -- hardware-coordinated P-states), or that it is related to the setup of a feature that cannot be changed on a live system (e.g., mapping of physical addresses to DRAM channels/ranks/banks). (In the latter case there is usually (not always!) a different way to *read* the configuration information after the system has booted.)

This comment raises another question: you're saying that some MSRs hold the mapping from physical addresses to DRAM addresses. Does that mean this mapping is configurable? I mean, of course not on a live system, but before the memory is initialized.

Thanks again,

Alex

McCalpinJohn · ‎11-19-2020

There are lots of configuration options for how physical addresses are mapped to DRAM controller, channel, rank, bank, row, column. Most of these are documented as being accessed in PCI configuration space, but sometimes these registers also have MSR aliases (not necessarily accessible after boot time).

On some systems the PCI configuration space bits that control physical address to DRAM mapping are hidden by the BIOS (grumble, grumble -- at least I was able to read the values using the BIOS shell before the OS booted)... More commonly they are readable, but not always documented in public.....