We noticed, in a dual-socket setup with *"directory mode" disabled*, write-backs to DRAM of clean cache lines (i.e., lines that were only read and never modified by anyone). This "issue" is very frequent when all the HW prefetchers are enabled, and rare but still present when they are all disabled.
Does anyone have any idea what causes these clean write-backs and how to prevent them?
I noticed in the "Intel Xeon Processor Scalable Memory Family Uncore Performance Monitoring" guide two IDI opcodes that look relevant: WbEFtoI and WbEFtoE.
Of course, I can provide more details if needed.
My first guess is that these are lines that the L2 was trying to evict to the L3, but which the L3 decided not to cache. We know the behavior of the caches is dynamic, but we don't have much idea of what controls the behavior, and (as this issue suggests), we only have a coarse idea of the options available to the caches.
In general terms, it does not make any difference whether such a clean eviction is dropped or written back to DRAM, but it is certainly possible that some detail of the implementation of the snoop filters or L3 or memory directories makes writing the line all the way back to memory the preferred option.
It is also possible that this occurs for some types of eviction operations and not for others. For example, clean lines evicted from the L2 due to Snoop Filter Evictions may behave differently than clean lines evicted from the L2 due to loads into the L2.
The processor family has the unusual characteristic that the associativity of the victim cache (11-way L3) is lower than the associativity of the Snoop Filter (12-way) or the L2 caches (16-way). This can lead to pathological conflict behavior (e.g., https://sites.utexas.edu/jdm4372/2019/01/07/sc18-paper-hpl-and-dgemm-performance-variability-on-inte...), which could easily result in unexpected cache transactions (either as a deliberate performance mitigation, or as the unexpected occurrence of a "corner case" transaction that exists for obscure reasons known only to the design team).
Thank you for your answer.
Can you expand on why you think that's an L2 cache line not accepted by the L3?
I struggle to understand when a useless write-back to DRAM could be the preferred option, and why I observe this behavior only in dual-socket configurations. Maybe it is a corner case that is only visible in this configuration because of timing. And I don't see the point of having a sort of RFO that marks the line as Modified instead of Exclusive; the gain would be nil, right?
"It is also possible that this occurs for some types of eviction operations and not for others." => I agree with you, since evicting those cache lines by software does not raise the issue, and disabling the prefetchers only mitigates it.
"In general terms, it does not make any difference whether such a clean eviction is dropped or written back to DRAM" => I agree that it is harmless for system coherence, but not for us: our in-DRAM processors modified the memory contents, which were then overwritten by those clean write-backs.
I used the perf tool to monitor both the WbEFtoI and WbEFtoE opcodes and, though I have seen some, they do not seem to be the cause of the problem we encounter.
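For reference, this is roughly how I filtered on those opcodes with perf. The event/umask are meant to encode CHA TOR inserts, and the opcode value below is only a placeholder; the real WbEFtoI/WbEFtoE encodings have to be taken from the uncore performance monitoring guide:

```shell
# Sketch: count CHA TOR inserts filtered by IDI opcode via Linux perf.
# OPC_WBEFTOI is a PLACEHOLDER -- substitute the actual opcode encoding
# from the SKX uncore performance monitoring guide before running.
OPC_WBEFTOI=0x0000   # placeholder, not the real encoding
EVENT="uncore_cha_0/event=0x35,umask=0x21,filter_opc0=${OPC_WBEFTOI}/"
CMD="perf stat -a -e ${EVENT} -- sleep 10"
echo "${CMD}"        # run manually as root on the target machine
```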
I'll definitely take another look at your write-up, which I had already come across, and I will keep digging into this issue. If anything comes to your mind, please don't hesitate to share.
My guess that this is an L2->L3 victim that the L3 did not accept was just a guess -- it has the right semantics and I have worked on processor designs that allowed for a cache to ignore a transaction (and let it pass through to the next level) if the cache was "too busy". For the case of clean data, dropping the WB would be "coherent", but that also requires a decision, and it could have been easier to just let the data bypass.
In SKX, a clean WB from the L2 should update the Snoop Filter as well, so there are possibilities for more complex situations: both SF and L3 busy, only L3 busy, only SF busy.
There are also more caches involved! Even with memory directories enabled, the coherence engine at each tile is also managing the HITME cache, the IO Directory Cache (IODC), issues related to handling Direct Cache Access (DCA), and perhaps others. All of these are, at best, minimally documented.
SKX includes new mechanisms to dynamically control whether L2 victims are cached by the L3, but I don't know that there has been any detailed disclosure of how these work. In principle, the decision could be made either by the L2 or the L3 -- or both.
The core performance counter event IDI_MISC.WB_DOWNGRADE counts the "number of lines that are dropped and not written back to the L3 as they are deemed less likely to be reused shortly". My inference here is that no data WB is generated by the L2, but that there would still be a "Clean Eviction Notice" sent to the Snoop Filter. This strongly suggests that the L2 has a role in the decision to send clean victims to the L3, but that does not mean that the L3 cannot also have a role.
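These core events should be countable directly; a minimal sketch, assuming a recent Linux perf that ships the Skylake core event names `idi_misc.wb_downgrade` and `idi_misc.wb_upgrade`:

```shell
# Count L2 clean victims that were dropped (downgrade) vs. sent to the
# L3 (upgrade). Event names assume the Skylake core PMU event list
# shipped with Linux perf.
CMD="perf stat -a -e idi_misc.wb_downgrade,idi_misc.wb_upgrade -- sleep 10"
echo "${CMD}"   # execute on the machine under test
```

Comparing the two counts should show how often the L2 decides to drop clean victims rather than forward them.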
For ECC-enabled systems, the DRAM "scrubber" may also rewrite clean lines. On SKX, the uncore guide notes that the IMC_WRITES event does not count writes due to ECC errors. The IMC has a counter for correctable errors, but I have never tried to test it. (We immediately service systems that throw more than a few correctable errors per day, so there is little incentive to study these errors in more detail.)