What is the actual behavior of L3 cache coherency across NUMA nodes on Intel CPUs?

ykhan2019

I wrote a simple test with two threads. One thread accesses a 32-MB array in a loop from the local node (the node where the array resides); the other thread accesses the same array from the remote node.
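For readers who want to reproduce the setup, here is a minimal sketch of the structure of such a test. It is not my actual test program. The names `pin_to_cpu`, `sweep`, `pick_cpus` and `run_test` are made up for illustration, the 64-byte stride is an assumed cache-line size, and picking the first and last allowed CPU is only a guess at which CPUs sit on different nodes (check `lscpu` for the real layout). It also relies on the default first-touch page placement. Python's GIL serialises the two loops, so this only shows the shape of the experiment; a real measurement would use the same structure in C.

```python
import os
import threading

ARRAY_BYTES = 32 * 1024 * 1024  # 32-MB array, as in the post
STRIDE = 64                     # assumed cache-line size in bytes


def pin_to_cpu(cpu):
    """Pin the calling thread to one CPU (Linux only; no-op elsewhere)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})


def sweep(buf, passes, stride=STRIDE):
    """Read one byte per assumed cache line, `passes` times; return a checksum."""
    view = memoryview(buf)
    total = 0
    for _ in range(passes):
        total += sum(view[::stride])
    return total


def pick_cpus():
    """Hypothetical choice: first and last allowed CPU (may share a node)."""
    if hasattr(os, "sched_getaffinity"):
        cpus = sorted(os.sched_getaffinity(0))
        return cpus[0], cpus[-1]
    return 0, 0


def run_test(size, local_cpu, remote_cpu, passes):
    """Run a local and a remote sweep over one shared buffer."""
    shared, results = {}, {}
    ready = threading.Event()

    def local_worker():
        pin_to_cpu(local_cpu)
        # Allocated and zeroed after pinning, so under first-touch placement
        # the pages should land on the local thread's node.
        shared["buf"] = bytearray(size)
        ready.set()
        results["local"] = sweep(shared["buf"], passes)

    def remote_worker():
        pin_to_cpu(remote_cpu)
        ready.wait()
        results["remote"] = sweep(shared["buf"], passes)

    threads = [threading.Thread(target=local_worker),
               threading.Thread(target=remote_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    local_cpu, remote_cpu = pick_cpus()
    print(run_test(ARRAY_BYTES, local_cpu, remote_cpu, passes=20))
```

The counters would then be collected by attaching `perf stat` to the running process, as in the results below.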

According to the test results below, it seems that data from the remote NUMA node is never cached in the local L3 cache. In other words, remote accesses can only be served by REMOTE_DRAM or REMOTE_CACHE_FWD.

Intel SDM description of REMOTE_DRAM and REMOTE_CACHE_FWD:

[Image: gzL0c.png]

My guess is that bus snooping is implemented across NUMA nodes for the L1/L2 caches, but not for the L3 cache. The Intel SDM uses related phrases such as "cross package snoop" and "remote cache" without defining them in detail. Where can I find a detailed specification, or other evidence, that confirms or refutes this conclusion?

 

 

Environment

 

  • Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz
  • 2 NUMA nodes
  • L3 cache size: 38 MiB

 

perf stat results

 

  1. Local access

 

 Performance counter stats for process id '2482844':

               112      mem_load_l3_miss_retired.remote_dram                                     (48.94%)
                97      mem_load_l3_miss_retired.remote_fwd                                     (65.96%)
           2166976      LLC-load-misses           #   41.82% of all LL-cache hits     (65.96%)
           5181401      LLC-loads                                                     (67.76%)
               678      node-load-misses                                              (34.04%)
           2036055      node-loads                                                    (32.24%)

       1.308500500 seconds time elapsed

 

  2. Remote access

 

 Performance counter stats for process id '2482844':

           1143866      mem_load_l3_miss_retired.remote_dram                                     (49.40%)
           2920330      mem_load_l3_miss_retired.remote_fwd                                     (66.31%)
          18063660      LLC-load-misses           #   45.68% of all LL-cache hits     (66.68%)
          39543836      LLC-loads                                                     (67.08%)
           4500262      node-load-misses                                              (33.32%)
          13366604      node-loads                                                    (32.92%)

       2.331103404 seconds time elapsed

 

  3. Remote access after stopping the local thread's access loop

 

 Performance counter stats for process id '2482844':

           6171182      mem_load_l3_miss_retired.remote_dram                                     (48.65%)
               124      mem_load_l3_miss_retired.remote_fwd                                     (65.77%)
           6350280      LLC-load-misses           #   76.26% of all LL-cache hits     (66.19%)
           8326881      LLC-loads                                                     (67.51%)
           6076443      node-load-misses                                              (33.81%)
              1480      node-loads                                                    (32.49%)

       1.399664053 seconds time elapsed
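To make the comparison between the three runs easier to read, here is a small sketch that computes, for each run, what fraction of the remote-served L3-miss loads were cache forwards rather than remote DRAM reads. The counter values are copied from the output above; the dictionary keys and the helper name `fwd_share` are just illustrative.

```python
# Counter values copied from the three perf stat runs in this post.
RUNS = {
    "local": {"remote_dram": 112, "remote_fwd": 97},
    "remote": {"remote_dram": 1143866, "remote_fwd": 2920330},
    "remote_after_stop": {"remote_dram": 6171182, "remote_fwd": 124},
}


def fwd_share(run):
    """Fraction of remote-served L3-miss loads that were cache forwards."""
    total = run["remote_dram"] + run["remote_fwd"]
    return run["remote_fwd"] / total if total else 0.0


for name, run in RUNS.items():
    total = run["remote_dram"] + run["remote_fwd"]
    print(f"{name:18s} remote loads {total:>9d}  forwarded {fwd_share(run):6.1%}")
```

With these numbers, roughly 72% of the remote-served loads in the second run are forwards, while in the third run (local thread stopped) almost all of them come from remote DRAM. The local run has only about 200 remote loads in total, so its ratio is not meaningful.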

 
