Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Cross NUMA Latency in Xeon Skylake Gold

Cao__Henry
Beginner

Hi,

Recently I noticed a weird latency increase in reading shared memory in a single-producer-multiple-consumer setup.

Initially, I had:

NUMA 1 Core 14: writer (bounded), writing ~232 bytes to a named memory segment on NUMA 1

NUMA 1 Core 15: reader (spinning and pinned)

NUMA 1 Core 16: reader (spinning and pinned)

NUMA 1 Core 17: reader (spinning and pinned)

NUMA 1 Core 18: reader (spinning and pinned)

Let's say the write-then-read latency is X nanoseconds for all readers.
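
For reference, the interaction looks roughly like this minimal sketch (the segment name, message layout, and core numbers are only illustrative of my setup, not the actual code; error handling omitted):

// Sketch of the single-producer / multiple-consumer setup (illustrative only).
#include <atomic>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <unistd.h>

struct Slot {
    std::atomic<uint64_t> seq;   // bumped by the writer after each publish
    char payload[232];           // ~232-byte message
};

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static Slot* map_segment(bool create) {
    int fd = shm_open("/spmc_demo", create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (create) ftruncate(fd, sizeof(Slot));
    void* p = mmap(nullptr, sizeof(Slot), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return static_cast<Slot*>(p);
}

// Writer on core 14 (NUMA 1): copy the message in, then publish it.
void write_message(const char* msg, size_t len) {
    pin_to_core(14);
    Slot* s = map_segment(true);
    std::memcpy(s->payload, msg, len);
    s->seq.fetch_add(1, std::memory_order_release);
}

// Reader on one of cores 15-18 (NUMA 1): spin until a new sequence number shows up.
void read_messages(int core, char* out) {
    pin_to_core(core);
    Slot* s = map_segment(false);
    uint64_t last = s->seq.load(std::memory_order_acquire);
    for (;;) {
        uint64_t cur = s->seq.load(std::memory_order_acquire);
        if (cur == last) continue;       // nothing new yet, keep spinning
        std::memcpy(out, s->payload, sizeof(s->payload));
        last = cur;
    }
}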

Then, as soon as I added a reader on core 11 (NUMA 0), the write-then-read latency for the readers on cores 15-18 jumped by ~800 nanoseconds.

I understand that transferring data across NUMA nodes introduces higher write-then-read latency for the reader on core 11, but I don't expect, nor understand, why the latency for the readers on cores 15-18 also went up significantly.  I wonder what happened behind the scenes.

Is it because cache coherence requires the readers to "wait" until the same cache lines are available to all readers (including the one on core 11)?

Any insight is appreciated.

McCalpinJohn
Honored Contributor III

I have not worked through this case in detail, but a sharp increase in latency on SKX processors is not surprising -- the implementation (especially the "memory directory" feature) is optimized to minimize local access latency at the expense of remote access latency.  The "memory directory" feature is discussed (in terms of bandwidth, rather than latency) in the forum thread at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/804142#comment-1933205

The single-producer/single-consumer case on a single chip with an inclusive L3 and local memory is discussed at http://sites.utexas.edu/jdm4372/2016/11/22/some-notes-on-producerconsumer-communication-in-cached-processors/

The scenario becomes significantly more complex with cross-chip accesses and memory directories.  Behavior (and performance) are likely to be quite different depending on the "home" location of the memory locations involved.

Your case may (or may not) be triggering operation of the "HitME" cache, which is intended to accelerate snooping for highly contended ("migratory") cache lines.  The HitME cache is not particularly well documented, but is probably similar to the design described in https://patents.google.com/patent/US8631210. A few words from a third party are included in the paper at https://www.semanticscholar.org/paper/An-analysis-of-core-and-chip-level-architectural-in-Hofmann-Hager/055359bf3807f067db7b3518540b719759df2388

There is a lot of implicit information in the "Intel Xeon Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual" (document 336274-001), especially in the "CHA Performance Monitoring Overview" in Chapter 2 and the "Packet Matching Reference(s)" in Chapter 3.  

From a performance perspective, sometimes data replication (one copy in the memory of each socket) can reduce worst-case latency (though atomicity becomes more challenging).
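
As a rough sketch of what that replication could look like within a single process, using libnuma (the names and sizes are illustrative only; a named shared segment would instead need its pages bound to each node, e.g. with mbind()):

// Sketch: keep one copy of the published data in each socket's local memory.
// Build with: g++ -O2 replicate.cpp -lnuma
#include <cstring>
#include <numa.h>
#include <sched.h>

struct Replicas {
    void*  copy[2];   // one buffer per NUMA node (assumes a 2-socket system)
    size_t size;
};

Replicas make_replicas(size_t size) {
    Replicas r{{nullptr, nullptr}, size};
    if (numa_available() < 0) return r;       // kernel/library has no NUMA support
    r.copy[0] = numa_alloc_onnode(size, 0);   // physically backed on node 0
    r.copy[1] = numa_alloc_onnode(size, 1);   // physically backed on node 1
    return r;
}

// The writer updates both copies; each reader polls the copy that is local to
// its own node, so its reads stay "home" accesses.  Keeping the two copies
// consistent (atomicity) is the hard part, as noted above.
void publish(Replicas& r, const void* msg, size_t len) {
    std::memcpy(r.copy[0], msg, len);
    std::memcpy(r.copy[1], msg, len);
}

void* local_copy(Replicas& r) {
    return r.copy[numa_node_of_cpu(sched_getcpu())];
}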

Cao__Henry
Beginner

Thanks for all the info.

McCalpin, John (Blackbelt) wrote:

I have not worked through this case in detail, but a sharp increase in latency on SKX processors is not surprising -- the implementation (especially the "memory directory" feature) is optimized to minimize local access latency at the expense of remote access latency.  The "memory directory" feature is discussed (in terms of bandwidth, rather than latency) in the forum thread at https://software.intel.com/en-us/forums/software-tuning-performance-opti...

I replied to your first post there.

I read an article that mentions the same thing -- "the implementation (especially the "memory directory" feature) is optimized to minimize local access latency at the expense of remote access latency".  What I don't get is why the readers doing "home access" also get the ~800 ns penalty....

McCalpinJohn
Honored Contributor III

What I don't get is why the readers doing "home access" also get the ~800 ns penalty....

I don't understand exactly what is happening in your case, but the memory directories can certainly influence local access.  The memory directory bit tells the processor whether a core in another socket *might* have a dirty copy of the line.   If the first reader of the cache line is remote, then it will receive the data in E state (which is allowed to become dirty), so the bit must be set.   Subsequent local reads will have to snoop the other socket (and wait for the result) if this bit is set.  If the system is lightly loaded, the snoop to the remote socket can be sent in parallel with the local memory request (after the local L3 & SnoopFilter miss is confirmed).  This gives the lowest latency, but increases the load on the UPI links.   If the system is heavily loaded (especially the UPI links), the snoop to the remote socket can be deferred until the coherence agent receives the cache line and checks the directory bit.  If the directory bit is clear, no remote snoop is required (thus saving UPI bandwidth), but if the directory bit is set, a remote snoop is required, and the requesting processor cannot use the data from local memory until the response from the remote snoop is received. 
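
Schematically, the decision looks something like this (purely an illustration of the description above -- the real protocol is not public, and every name here is invented):

// Illustrative pseudo-logic for a local read that misses the L3/Snoop Filter
// on a machine with memory directories (schematic only).
struct Line {
    bool dir_bit;   // memory directory bit: remote socket *might* hold a dirty copy
};

// Stubs standing in for the real DRAM read and UPI snoop.
static Line read_local_memory()   { return Line{true}; }
static void snoop_remote_socket() { /* cross-socket UPI round trip */ }

// Returns the number of cross-socket snoops this miss generated.
int local_read_after_llc_miss(bool upi_lightly_loaded) {
    if (upi_lightly_loaded) {
        // Speculative path: snoop the other socket in parallel with the local
        // DRAM read -- lowest latency, but UPI traffic on every miss.
        snoop_remote_socket();
        read_local_memory();
        return 1;
    }
    // Deferred path: read local DRAM first, then check the directory bit that
    // is stored alongside the data.
    Line line = read_local_memory();
    if (!line.dir_bit)
        return 0;   // bit clear: use the local data right away, no UPI traffic
    // Bit set: a core in the other socket *might* have a dirty copy, so the
    // local data cannot be used until the remote snoop response arrives.
    snoop_remote_socket();
    return 1;
}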

Your observation of an 800 ns increase is much larger than I would expect from a single remote snoop, but I have not worked through the implications of having so many readers.   It is possible that there is some serialization in the CHA.  Varying the number of local readers and seeing if that changes the latency adder when adding a remote reader could shed some light on what is happening.   Uncore performance counters in the CHA, UPI link layer, and M3UPI blocks might also be helpful -- but there is no guarantee that Intel has documented enough events (and implemented them correctly) to be able to understand what is happening.....

Cao__Henry
Beginner

Sorry to come back so late.

I finally got time to write some test code.

Writer pinned to core 15 (NUMA 1), publishing 50,000 messages with a 5 ms gap in between and at most 4 messages per second (with some randomness from rand()).

Readers pinned to cores 13, 14, 16, 17, 18, and 19 (all NUMA 1), reading all messages in a tight loop but only using approximately half of them for the latency statistics.

Latency is measured from the time the writer writes the first byte to the shared memory segment to the time the reader has read in all the bytes.  All timestamps are taken from the TSC.

All cores in NUMA 0 have been isolated.
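
The timestamping is essentially this (a sketch; the real test has the shared-memory mapping and core pinning from before around it):

// Sketch of the latency measurement, everything in TSC ticks.
#include <x86intrin.h>   // __rdtsc()
#include <atomic>
#include <cstdint>
#include <cstring>

struct Msg {
    std::atomic<uint64_t> seq;
    uint64_t write_tsc;          // stamped by the writer just before it starts writing
    char     payload[232];
};

// Writer: stamp, copy the payload, then release the new sequence number.
void publish(Msg& m, const char* data, size_t len, uint64_t n) {
    m.write_tsc = __rdtsc();
    std::memcpy(m.payload, data, len);
    m.seq.store(n, std::memory_order_release);
}

// Reader: spin until the expected message appears, copy all bytes out, and
// return the write-then-read latency in TSC ticks.
uint64_t consume(Msg& m, uint64_t expected, char* out) {
    while (m.seq.load(std::memory_order_acquire) != expected) { /* spin */ }
    std::memcpy(out, m.payload, sizeof(m.payload));
    return __rdtsc() - m.write_tsc;
}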

Median latency in TSC ticks:

single reader: 432

2 readers: 712

3 readers: 704

4 readers: 824

5 readers: 864

6 readers: 1048

In general it is going up, though the sample size may be too small (~25,000 samples each).

I haven't yet added a test case with a reader on the foreign NUMA node.

Though, based on the results above, it seems there is some sort of contention, or the path for transferring data between cores is getting saturated.

Cao__Henry
Beginner

So I repeated the test cases above, but for each test case I moved one reader to NUMA 0 (foreign) and isolated cores 5-11 as well.

single reader: 2000 (foreign)

2 readers: 2488 (foreign and local)

3 readers: 2472 (foreign), 2608 (local)

4 readers: 2784 (foreign), 3010 (local)

5 readers: 2736 (foreign), 2984 (local)

6 readers: 3152 (foreign), 3420 (local)

The TSC runs at ~3.6 GHz, so it seems like a single foreign reader slows everyone down by ~500 ns (for example, with 2 readers the local reader goes from 712 ticks ≈ 198 ns to 2488 ticks ≈ 691 ns, roughly 490 ns worse).  Thoughts?

 

McCalpinJohn
Honored Contributor III

232 bytes is probably 4 cache lines (with 64-byte lines, 232 bytes spans at least 4 of them).  It would be interesting to see how the scaling looks with a single-cache-line payload.   That would tell you, for example, if there is something serializing the four cache line transfers....

With a single cache line being transferred, it might be easier to understand the results of performance counter monitoring.....  
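
For example, a single-line payload could be as simple as this (a sketch, assuming 64-byte cache lines):

// Sketch: a payload that occupies exactly one 64-byte cache line, so each
// publish transfers a single line instead of the four that ~232 bytes needs.
#include <cstdint>

struct alignas(64) OneLineMsg {
    uint64_t seq;         // publication counter
    uint64_t write_tsc;   // writer timestamp (TSC)
    char     data[48];    // 64 - 2*8 = 48 bytes of payload
};
static_assert(sizeof(OneLineMsg) == 64, "must fit in exactly one cache line");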

Cao__Henry
Beginner

Hi John,

Thanks for your suggestion.  I will give it a try.  Which performance counters do you suggest recording/sampling with perf (or another tool)?  I am on kernel 3.10, so some newer performance counter events (such as the one for false sharing) might not be available.

 

And now, I have reread your second post:

McCalpin, John (Blackbelt) wrote:

If the first reader of the cache line is remote, then it will receive the data in E state (which is allowed to become dirty), so the bit must be set.   Subsequent local reads will have to snoop the other socket (and wait for the result) if this bit is set.


I thought that when a writer (affinitized to NUMA 1) writes a cache line of data destined for main memory on NUMA 1, the cache line would be available in the L3 cache of the same socket, no?  And if it is in the L3 cache on the same socket, then even if the first reader of that cache line is remote (on the other socket / NUMA 0), I thought the slower readers on the same socket (NUMA 1) would fetch it from the local L3 into their L1, no?

McCalpinJohn
Honored Contributor III

After a core writes to a cache line, the only valid copy of the line in the system is in that core's L1 Data Cache.   Other caches (L2, L3, Snoop Filter, etc.) may have an entry that points to the writing core's cache, but they won't have the data (because it has not been written back yet).

There are too many special cases to summarize here -- each individual specific scenario is complicated enough, even if the protocol were completely published (which it is not).  At the lowest levels, understanding any non-trivial sequence of cache transactions on an Intel processor has to be treated as a research project.  Understanding what is happening requires combining "official" documentation, "implicit" documentation (e.g., in the performance counter descriptions), "potentially relevant" documentation (e.g., published patents), and a customized set of microbenchmarks and performance counter observations.  This is not always enough, but sometimes the results are satisfactory....
