Software Tuning, Performance Optimization & Platform Monitoring

Memory bandwidth on a NUMA system

Ilya_M_1 — Thu, 03 Dec 2015 16:52:58 GMT

Hi,

I'm looking into memory performance results on a Xeon E5-2620 v3 system with 2 NUMA nodes and 2 QPI links between them. With NUMA enabled in the BIOS, the Memory Latency Checker tool reports 44 GB/s local throughput and 6 GB/s remote, and the remote figure looks too low.

                Numa node
Numa node            0       1
       0        44266.2  6004.0
       1         5980.9 44311.9

With NUMA disabled (which results in cache line interleaving AFAIU), the combined throughput is ~40 GB/s. PCM shows increased QPI traffic in this mode. So I would expect the figure to be somewhere in the middle between 44 GB/s and 6 GB/s with NUMA on.

        Memory node
 Socket      0       1
     0  39537.2 39588.7
     1  39515.2 39527.0

Any ideas?
I'm also curious to know how the tool (mlc) measures the bandwidth? Does it rely on PMU counters, or does it just count the memory ops from the standpoint of a client?

Thanks,
Ilya

Krishnaswa_V_Intel — Thu, 03 Dec 2015 17:07:24 GMT

Hi,

Low-core-count parts have lower remote memory b/w in the default configuration, which uses early snoop mode. However, if you change the snooping mode to "Home snoop" in the BIOS, you will see much higher remote memory b/w. There will be a slight increase in latency, but b/w will improve.

Vish

McCalpinJohn — Thu, 03 Dec 2015 20:28:56 GMT

I also saw unusually low numbers for remote accesses on a low-core-count Haswell EP (Xeon E5-2603 v3), but I no longer have access to the system to check the BIOS snoop configuration. The 2603 is even slower than the 2620 -- lower core clock rates and lower DRAM clock rates, with the Intel Memory Latency Checker delivering only 4.7 GB/s between sockets for the default (all reads) case.

The performance varies a bit with access pattern, but not by huge amounts, so the non-NUMA results you obtained are almost certainly not possible. If the data were actually interleaved between the sockets, the 39.5 GB/s you got in the second case would mean almost 20 GB/s from the local memory plus almost 20 GB/s across QPI from the remote memory. This is more than 3x the value that was measured directly, so it seems implausible.
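To make the arithmetic in that argument concrete, here is a small sketch using the figures posted earlier in the thread. The even 50/50 split between sockets is the assumption being tested, not a measured fact, and the variable names are only illustrative.

```python
# Figures from the first post, in MB/s.
combined_numa_off = 39537.2   # socket 0 -> memory node 0 with NUMA disabled
remote_numa_on = 6004.0       # socket 0 -> node 1 with NUMA enabled

# Assumption: true interleaving sends half of the traffic to each socket.
implied_remote = combined_numa_off / 2
ratio = implied_remote / remote_numa_on

print(f"implied remote traffic: {implied_remote:.1f} MB/s")
print(f"ratio to measured remote bandwidth: {ratio:.2f}x")
```

Under that assumption the QPI path would have to carry a little under 20 GB/s, a bit more than three times the directly measured remote bandwidth.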

You will have to check your BIOS documentation to be sure, but I don't think that disabling NUMA generates cache-line interleaving. You are much more likely to get no NUMA information, which will lead to uncontrollable pseudo-random page placement.

It is difficult to tell exactly what the Intel Memory Latency Checker is doing for each test. I know that it internally changes the CPU frequency to the maximum value and enables the hardware prefetchers if they are disabled, so it might also be obtaining NUMA information from the hardware that the OS is not aware of.

To test performance for interleaved memory, I recommend using a code that is more transparent, such as the STREAM benchmark (http://www.cs.virginia.edu/stream/). If you have the Intel compilers you can configure this to use streaming stores or to avoid streaming stores. With gcc you will not get streaming stores (except at "-O3" where the compiler will replace the "Copy" kernel with an optimized library routine that uses streaming stores).

A useful set of numbers might come from:

icc -O3 -xAVX2 -openmp -ffreestanding -DSTREAM_ARRAY_SIZE=80000000 -opt-streaming-stores always stream.c -o stream.nta.exe
export OMP_NUM_THREADS=6
export KMP_AFFINITY=verbose,compact              # change to "verbose,compact,1" if HyperThreading is enabled
numactl --membind=0 ./stream.nta.exe             # all local accesses on socket 0, streaming stores enabled
numactl --membind=1 ./stream.nta.exe             # all threads on socket 0, all data on socket 1, with streaming stores
export OMP_NUM_THREADS=12
export KMP_AFFINITY=verbose,scatter
./stream.nta.exe                                 # use both sockets, all accesses should be local, with streaming stores
numactl --interleave=0,1 ./stream.nta.exe        # use both sockets, memory alternates between sockets by 4KiB page

icc -O3 -xAVX2 -openmp -ffreestanding -DSTREAM_ARRAY_SIZE=80000000 -opt-streaming-stores never stream.c -o stream.alloc.exe
export OMP_NUM_THREADS=6
export KMP_AFFINITY=verbose,compact              # change to "verbose,compact,1" if HyperThreading is enabled
numactl --membind=0 ./stream.alloc.exe           # all local accesses on socket 0, no streaming stores
numactl --membind=1 ./stream.alloc.exe           # all threads on socket 0, all data on socket 1, no streaming stores
export OMP_NUM_THREADS=12
export KMP_AFFINITY=verbose,scatter
./stream.alloc.exe                               # use both sockets, all accesses should be local, no streaming stores
numactl --interleave=0,1 ./stream.alloc.exe      # use both sockets, memory alternates between sockets by 4KiB page

If my testing on the Xeon E5-2603 v3 is any indication, your results using all the cores on a single socket and memory interleaved across the two chips should be somewhere in the range of 10 GB/s -- about 5 GB/s from local memory and about 5 GB/s over QPI from the other socket. This assumes a slightly lower QPI bandwidth than the 6 GB/s you reported, based on my observation that streaming stores are unusually slow on these low-frequency parts. On my Xeon E5-2603 v3, the Intel Memory Latency Checker showed a pattern like:

Using Read-only traffic type

             Memory node
 Socket         0         1
      0   26893.0    4713.0
      1    4688.6   27011.7

Using "-W5" (one read and one write)

             Memory node
 Socket         0         1
      0   34972.4    5613.6    <-- remote is about 20% faster than all reads
      1    5627.9   34708.2

Using "-W8" (one read and one non-temporal write)

             Memory node
 Socket         0         1
      0   22906.1    3740.9    <-- remote is about 20% slower than all reads
      1    3765.2   22984.6
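The "about 20%" annotations and the roughly 10 GB/s interleaved estimate can be checked with a few lines of arithmetic. The sketch below uses the socket 0 to node 1 figures from the tables above. The `interleaved_estimate` function is a deliberately simple, hypothetical model (an even page split where the slower half sets the pace), not something taken from MLC or STREAM.

```python
# Remote (socket 0 -> node 1) bandwidth in MB/s from the three tables above.
remote = {"read-only": 4713.0, "-W5": 5613.6, "-W8": 3740.9}

def change_vs_reads(kind):
    """Percentage change in remote bandwidth relative to the read-only case."""
    return 100.0 * (remote[kind] / remote["read-only"] - 1.0)

def interleaved_estimate(local_mb_s, remote_mb_s):
    """Hypothetical model: half the pages are local and half remote, and the
    slower half sets the pace, so total throughput is twice the slower rate."""
    return 2.0 * min(local_mb_s, remote_mb_s)

for kind in ("-W5", "-W8"):
    print(f"{kind}: {change_vs_reads(kind):+.1f}% vs read-only")

# Assumed input: about 5 GB/s over QPI, as suggested in the post.
print(interleaved_estimate(local_mb_s=26893.0, remote_mb_s=5000.0))
```

With these inputs the model lands on 10000 MB/s, matching the "about 5 GB/s local plus about 5 GB/s over QPI" reasoning.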

Krishnaswa_V_Intel — Thu, 03 Dec 2015 21:51:57 GMT

Dr. McCalpin,

MLC does not change the frequency. It only enables all the hw prefetchers for bandwidth measurements and disables them if latency is measured. In the case of measuring cross-socket b/w, it will use all the cores on a socket and allocate memory on the other socket. Basically, a thread will attach itself to a core on the other socket, allocate memory, do a first touch (to ensure the memory comes from the other socket), and then pin to cores on the first socket and access the allocated memory. This will result in all references going to the remote socket, assuming NUMA is enabled.
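The allocate, first-touch, re-pin sequence described above can be sketched roughly as follows. This is not MLC's code: it is a single-threaded, Linux-only illustration that assumes the default first-touch page placement policy and no automatic NUMA balancing, and the function name and parameters are made up for the example.

```python
import mmap
import os

def touch_then_access(alloc_cpus, access_cpus, size=64 * mmap.PAGESIZE):
    """Allocate and first-touch a buffer on one CPU set, then read it from another.

    alloc_cpus and access_cpus are caller-supplied sets of CPU ids; which ids
    belong to which socket is machine-specific and is not discovered here.
    Returns the number of pages read back.
    """
    original = os.sched_getaffinity(0)
    try:
        # Step 1: run on the socket that should own the memory.
        os.sched_setaffinity(0, alloc_cpus)
        buf = mmap.mmap(-1, size)              # anonymous mapping, not yet backed
        for off in range(0, size, mmap.PAGESIZE):
            buf[off] = 1                       # first touch: one write per page

        # Step 2: move to the socket that will do the measuring.
        os.sched_setaffinity(0, access_cpus)
        pages = sum(buf[off] for off in range(0, size, mmap.PAGESIZE))
        buf.close()
        return pages
    finally:
        os.sched_setaffinity(0, original)      # restore the caller's affinity
```

On a two-socket machine one would pass a CPU from the remote socket as `alloc_cpus` and a CPU from the local socket as `access_cpus`; the actual CPU numbering has to be read from the system (for example with `numactl --hardware`). A real bandwidth test would also use many threads and a buffer far larger than the caches.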

However, if NUMA is disabled, then the system does 64-byte interleaving across sockets. That is, address x would be on socket 0 while address x+64 would be on socket 1. MLC does not do anything differently in this case. It would still have threads pinned to the other socket, allocate memory, and do a first touch. However, the memory allocated would come equally from both sockets, as NUMA is disabled.
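As an illustration of what 64-byte interleaving means for a buffer, the sketch below maps addresses to sockets with a plain modulo on the cache-line index. That mapping is an assumption made for clarity; the thread only states that x and x+64 land on different sockets, and real address decoding may be more involved.

```python
LINE_BYTES = 64  # cache-line size used as the interleave granularity

def socket_for_address(addr, sockets=2, line_bytes=LINE_BYTES):
    """Socket owning the cache line that contains addr (simple modulo model)."""
    return (addr // line_bytes) % sockets

def lines_per_socket(start, length, sockets=2):
    """Count cache lines per socket for a line-aligned range [start, start+length)."""
    counts = [0] * sockets
    for addr in range(start, start + length, LINE_BYTES):
        counts[socket_for_address(addr, sockets)] += 1
    return counts

print(socket_for_address(0x1000), socket_for_address(0x1000 + 64))
print(lines_per_socket(0, 4096))
```

In this model every 4 KiB page is split evenly, 32 lines per socket, so first touch has no way of steering a page to one socket.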

The result that CodeMonkey is reporting for NUMA disabled is still possible, because the resources used to track requests are allocated differently depending on whether NUMA is on or off. Basically, the caching agent has resources that are used to track outstanding requests (misses). Requests to remote memory take longer than requests to local memory, and we typically want to provide more local memory b/w. Based on various considerations, the resources are divided differently between local and remote requests, and this division differs between NUMA enabled and NUMA disabled.

Vish

Ilya_M_1 — Sun, 06 Dec 2015 14:28:48 GMT

Vish, John,

Thanks for your detailed replies!
I disabled Early Snooping in the BIOS and the bandwidth skyrocketed, at the cost of around a 10% latency penalty, as Vish noted.

Intel(R) Memory Latency Checker - v3.0
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1
       0          85.7   128.9
       1         128.0    85.6

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 86216.5
3:1 Reads-Writes : 86969.6
2:1 Reads-Writes : 88117.0
1:1 Reads-Writes : 87380.4
Stream-triad like: 80178.1

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1
       0       43272.7 25336.9
       1       25346.7 43259.0

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency  Bandwidth
Delay   (ns)     MB/sec
==========================
00000   150.63   86501.6
00002   150.88   86481.6
00008   149.23   86652.2
00015   148.95   86548.9
00050   115.06   73755.5
00100   104.84   56534.0
00200    96.78   31569.2
00300    93.22   21907.8
00400    91.18   16864.0
00500    90.30   13768.8
00700    89.12   10153.8
01000    88.35    7381.2
01300    87.46    5870.2
01700    87.36    4674.6
02500    87.43    3419.3
03500    86.27    2664.8
05000    86.09    2091.0
09000    85.86    1494.9
20000    85.89    1082.7

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 31.8
Local Socket L2->L2 HITM latency 36.5
Remote Socket LLC->LLC HITM latency (data address homed in writer socket)
                   Reader Numa Node
Writer Numa Node       0       1
               0       -    77.8
               1    78.0       -
Remote Socket LLC->LLC HITM latency (data address homed in reader socket)
                   Reader Numa Node
Writer Numa Node       0       1
               0       -    76.8
               1    77.0       -
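For a quick before/after comparison, the node-to-node matrices from the first post (default early snoop) and from this run (home snoop) can be parsed and compared as in the sketch below. The `parse_matrix` helper is hypothetical and expects only the numeric rows, not MLC's full output.

```python
def parse_matrix(rows_text):
    """Turn 'node value value' rows into a list of float rows (node id dropped)."""
    return [[float(v) for v in line.split()[1:]]
            for line in rows_text.strip().splitlines()]

# Bandwidths in MB/s, rows = requesting node, columns = memory node.
early_snoop = parse_matrix("""
0 44266.2 6004.0
1 5980.9 44311.9
""")
home_snoop = parse_matrix("""
0 43272.7 25336.9
1 25346.7 43259.0
""")
# Idle latencies in ns from the home snoop run.
idle_latency_ns = parse_matrix("""
0 85.7 128.9
1 128.0 85.6
""")

remote_gain = home_snoop[0][1] / early_snoop[0][1]
remote_share_early = early_snoop[0][1] / early_snoop[0][0]
remote_share_home = home_snoop[0][1] / home_snoop[0][0]
latency_ratio = idle_latency_ns[0][1] / idle_latency_ns[0][0]

print(f"remote bandwidth gain: {remote_gain:.2f}x")
print(f"remote/local bandwidth: {remote_share_early:.2f} -> {remote_share_home:.2f}")
print(f"remote/local idle latency (home snoop): {latency_ratio:.2f}")
```

The early snoop idle latencies were not posted, so the roughly 10% latency penalty cannot be recomputed from the thread's data.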

Thanks,
Ilya

Ilya_M_1 — Mon, 21 Dec 2015 13:21:12 GMT

Is there a way to check from within Linux which snooping mode is configured in the BIOS, e.g. via an MSR or some other mechanism?

Thanks,
Ilya

McCalpinJohn — Mon, 21 Dec 2015 20:11:28 GMT

I don't know of any programmatic way of checking this definitively. I usually just check the local memory latency and guess which mode it is in....

I seem to recall that some of the uncore counters behave differently with home snoop vs early snoop. In the QPI Link Layer counters, the RxL_FLITS_G1.SNOOP count is zero when home snoop is enabled (because the chip receives read requests rather than snoop requests in this mode). If it gets non-zero counts when in "early snoop" mode, then it could be used as a differentiator.... I don't have a machine that I can test this on right now...