Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Directory coherence memory transfers

Chi__Chi_Ching
Beginner

Hi all,

I am using a dual-socket Xeon Platinum 8168 system with 3 UPI links. My application is relatively memory intensive, with roughly 2 bytes read for every byte written. When I use only one socket (by pinning threads to the cores of a single socket), the memory behavior is normal and PCM shows the expected memory controller usage.  However, when I use cores from both sockets this changes drastically: the read/write ratio drops to around 1.2:1, and the total memory controller throughput grows faster than the application performance. The writes in particular are roughly 2x higher than expected.

I believe this is somehow due to the directory-based coherence of Broadwell and Skylake multi-socket systems. When I use Early Snoop on our Broadwell-EP system, the 2-socket memory behavior is completely in line with the 1-socket results. When I use Home Snoop with Directory + OSB on Broadwell, I observe the same memory behavior as on Skylake. On Skylake, unfortunately, the non-directory-based coherence protocols have been removed.

When I run a single process per socket, the memory behavior is also in line with expectations, with a 2:1 read/write ratio and memory traffic proportional to the application throughput. I would like to better understand the cause of this extra memory traffic, to see what type of optimizations could improve the situation.

Best,

Chi

McCalpinJohn
Honored Contributor III

A few questions that might help narrow down what you are seeing:

(1) Do you have a reliable estimate of what the memory traffic should be on an SKX memory hierarchy?   (E.g., assuming a single thread and no cache conflicts.)  Are the observed values close to these expected values?

(2) Does the code expect to get re-use from data in the shared L3?  Would this re-use be expected to decrease if the L3 capacity were reduced (e.g., by sharing with other threads)?

(3) According to the Xeon Scalable Processor Uncore Performance Monitoring Guide, the CHA units have a counter for directory updates, with the comment that these updates cause writes to the memory controller.  Have you looked at this counter to see if the values are in the same order of magnitude as the increase in DRAM write traffic?  (Directory update events are also available in the M2M box -- I have not looked at this event in detail in either of these locations.)

The CHA DIR_UPDATE counter should be able to confirm or reject the hypothesis that directory updates are responsible for the increased memory traffic, but I would not assume that this is the only possibility.  The SKX cache hierarchy is fundamentally different from the BDW memory hierarchy, and other cache behavior could account for this difference.  For example, the L2 in SKX can evict dirty data to either the L3 or to DRAM.  The core performance counter event IDI_MISC.WB_UPGRADE counts lines that are evicted from the L2 into the L3, while IDI_MISC.WB_DOWNGRADE counts cache lines that are evicted from the L2 but not put in the L3.  Writebacks of dirty data from the L2 to the L3 can be counted at the CHA using the LLC_LOOKUP.WRITE event.
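
For what it's worth, the two core events can be counted with an ordinary "perf stat" attached to the process.  Something like the following should work if your perf version knows the Skylake-SP symbolic event names (I have not verified this exact syntax, and "./my_app" is just a placeholder for your binary):

    # L2 evictions kept in the L3 (wb_upgrade) vs. not put in the L3 (wb_downgrade)
    perf stat -e idi_misc.wb_upgrade,idi_misc.wb_downgrade ./my_app

If the WB_DOWNGRADE count is comparable to the increase in DRAM write traffic, then L2 evictions bypassing the L3 would be a candidate explanation.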

Streaming stores are not commonly generated by the compiler, but these can certainly result in increased DRAM writes if the write-combining buffers are prematurely flushed.  This should not be a significant problem with AVX512 stores (since the write-combining buffer is filled in a single instruction), but can be noticeable if the streaming stores that fill a single 64-Byte aligned write-combining buffer are spread out or if streaming stores to multiple streams are interleaved.
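
If you want to check the premature-flush possibility directly, comparing full-line against partial writes arriving at the memory controller should show it.  On a kernel whose perf supports the SKX uncore, something like the following might work (the symbolic names depend on your perf version's event list, otherwise the encodings are in the Uncore Performance Monitoring Guide; "./my_app" is just a placeholder):

    # Full-line vs. partial writes sent to the memory controller (uncore events are system-wide)
    perf stat -a --per-socket \
        -e unc_cha_imc_writes_count.full,unc_cha_imc_writes_count.partial \
        -- ./my_app

A significant PARTIAL count from an application that is supposed to issue only full-line streaming stores would point at write-combining buffers being flushed before they fill.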

Chi__Chi_Ching
Beginner

Hello John,

Thank you for your response; your insight is highly appreciated.

1) The memory traffic I observe with a single thread is in line with what I see with multiple threads up to a full socket on SKX, and also with multi-socket BDW using Early Snoop. "In line" here means that the traffic increases proportionally with the actual throughput of the application. The application scales well with both cores and hyper-threads. Only when using multiple sockets on BDW (with Home Snoop + Directory + OSB) or on SKX (any UPI configuration) is there more memory traffic than expected.

2) The code has reuse in the L3, mostly from shared loads. There is a high chance that the same memory locations are used by several threads close together in time (my guess is 2-4x reuse on average). The shared loads come from a big data structure (~300 MB), which is larger than the cache. Untimely reuse would result in extra memory reads, but it does not explain the additional writes. There is also some producer-consumer usage, but that is around 10 times less.

3) I have not looked at this counter yet, but it seems like a nice way to check whether the directory updates are the cause. Is PCM able to read this counter? I have used perf in the past to measure specific events listed in the Intel manuals, but it was always a hassle to get the exact syntax right. If you are able to assist with a command line for either PCM or perf, it would be greatly appreciated.

We are programming with SIMD intrinsics and generate the streaming stores manually; in fact, these are the only instructions in our application that should cause any memory writes. This behavior has been confirmed on all recent previous-generation Intel processors. We also use software prefetching for the reads. I tried disabling the prefetching, and separately the streaming stores, to see whether either was causing the extra memory traffic, but the traffic stayed more or less the same, just with lower performance.

Best,

Chi

McCalpinJohn
Honored Contributor III

Perf should be able to read the counter if your kernel is new enough to understand the SKX uncore.  Our primary production OS (CentOS 7.3) does not know how to access the SKX uncore, but CentOS 7.4 looks like it supports the SKX uncore.   Right now I am doing all my SKX performance counter work with direct MSR and PCI Config Space reads and writes.  (I agree with your comments on the difficulty of figuring out the syntax of "perf stat" commands to use uncore counters or any counters that require auxiliary information -- that is why I usually do everything myself.)
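
That said, if your kernel's perf does support the SKX uncore, something along these lines should be a reasonable starting point for the directory-update question (I have not tested this exact syntax; the symbolic event names depend on the event list shipped with your perf version, and "./my_app" is just a placeholder for your binary):

    # Directory updates at the CHA and at the M2M, plus total DRAM read/write traffic for scale
    perf stat -a --per-socket \
        -e unc_cha_dir_update.ha,unc_cha_dir_update.tor \
        -e unc_m2m_directory_update.any \
        -e unc_m_cas_count.rd,unc_m_cas_count.wr \
        -- ./my_app

The IMC CAS counts give the total DRAM traffic in cache lines, so you can see directly what fraction of the extra writes the directory updates could account for.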

I had no trouble programming the CHA counters and getting reasonable results for the events I was looking at (mostly snoop filter related).  This showed some rare cases of increased L2 evictions due to snoop filter conflicts, but these were mostly picked up by the L3 and very few went to memory.

The CHA counters look like the best place to start for this analysis -- some interesting events might be:

  • DIR_UPDATE.HA and DIR_UPDATE.TOR
  • IMC_WRITES_COUNT.FULL and IMC_WRITES_COUNT.PARTIAL
    • The description of this event notes that a remote RFO will result in a memory write to update the directory.  Of course there will eventually be an additional memory write when the dirty line is written back to memory.
  • LLC_VICTIMS.LOCAL_M and LLC_VICTIMS.REMOTE_M
    • The description of this event notes that it does not count "evict cleans", but that leaves me wondering why there are Umasks for LLC_VICTIMS in E and S state (which should be clean).    In any case, any M state LLC evictions should generate writes to memory that are not associated with streaming stores.
  • MISC.WC_ALIASING
    • The description of this event is obscure, but it looks like it can trigger extra writebacks, so it should probably be checked.
  • WB_PUSH_MTOI.LLC versus WB_PUSH_MTOI.MEM
    • It looks like this counts dirty cache lines that are invalidated in an L2 cache and sent to the Home Agent.  The Home Agent can keep the line in M state in the LLC or push it to memory.

Both the CHA and M2M units support opcode filtering.  Although it is not entirely clear what all of these events mean, it might be useful to run a sweep of the different memory write types to see if any of the non-standard ones jump out as being highly correlated with the increase in memory traffic.

On earlier systems (e.g., BDW), the behavior of the LLC slices was typically very well balanced, and it was possible to do a broad sweep by programming different events into different CBos.  I don't recommend this on SKX -- I see fairly large variations in counts across the CHA units even for homogeneous workloads, so I count the same events in all CHAs and then look for differences in the sums over all boxes.
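
If you end up doing this with perf, note that by default "perf stat" sums an uncore event over all of the CHA instances in a socket, and recent perf versions have a --no-merge option that reports each CHA box separately, which makes any imbalance easy to see.  Something like the following (again untested syntax, and "./my_app" is a placeholder):

    # Sum over all CHA boxes (default aggregation), reported per socket
    perf stat -a --per-socket -e unc_cha_dir_update.ha -- ./my_app

    # Per-CHA breakdown to see the variation across the individual boxes
    perf stat -a --no-merge -e unc_cha_dir_update.ha -- ./my_app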
