
Reading across NUMA nodes results in redundant memory writes

Jump to solution

Hello,

I have an issue with a dual-socket E5-6244 system. I'm writing code for a computation task that is memory-bound. When the code reads memory across NUMA nodes (e.g., node 0 reading from node 1), I observe redundant memory writes on node 1, roughly equal in size to the reads.

 

I have no idea what is going on. How can I avoid these redundant memory writes?

 

Many thanks in advance.

Shangyan

Accepted Solution
This is a "feature" of the Skylake Xeon and Cascade Lake Xeon processors in multi-socket configurations.

The processor implements a "memory directory" feature that hides one or more bits of "directory" information in the ECC of each cache line in DRAM. The bit is used to indicate whether another socket *might* have a modified copy of the cache line. When a remote socket reads a cache line, it is typically given the line in "Exclusive" state, which allows it to modify the line without any additional notifications. The "home" socket must assume that lines sent to other sockets in "Exclusive" state could be modified, so it must change the "memory directory" bit(s) for the cache line. This requires re-writing the entire line to DRAM, which is why your remote reads are accompanied by a comparable volume of writes on the home node.

This, and other features, are discussed in my presentation https://www.ixpug.org/documents/1524216121knl_skx_topology_coherence_2018-03-23.pptx

I don't know of any way to avoid this extra memory traffic on SKX/CLX. Some architectures (e.g., MIPS) have the option to prefetch cache lines in the "Shared" state, which would not set the "memory directory" bit, but Intel's prefetch instructions do not appear to offer this option.

The nature of the mechanism suggests an alternate approach, which I have not tested. If a cache line is first fetched by a *local* core, it will be granted to the *local* cache in "Exclusive" state. If a load to the same line then comes in from the other socket, the line will be downgraded to "Shared" state in the local cache, and a copy will be sent to the remote requester in "Shared" state. In this scenario, the remote socket is never granted Exclusive access, so the memory directory bit will not be set and the line will not need to be re-written to DRAM. Implementation would require careful synchronization to ensure that a block of lines that fits in the local cache is fully loaded before any remote loads are performed. Blocking to L2 size should be safe, but with more overhead, while blocking to L3-containable sizes will reduce overhead but introduce more uncertainty in the behavior. (That is, the L2 *usually* sends clean victims to the local L3, but sometimes it does not, and it is difficult to pin down the details or the ratio of victims written to L3 versus victims that are silently dropped.)

"Dr. Bandwidth"


Thanks for your reply! It's very helpful!

