Hi,
I'm looking into memory performance results on a Xeon E5-2620 v3 system with 2 NUMA nodes and 2 QPI links in between. With NUMA enabled in the BIOS, the Memory Latency Checker tool reports 44 GB/s local throughput but only 6 GB/s remote, which looks too low.
                Numa node
Numa node            0          1
       0       44266.2     6004.0
       1        5980.9    44311.9
With NUMA disabled (which results in cache-line interleaving, AFAIU), the combined throughput is ~40 GB/s, and PCM shows increased QPI traffic in this mode. So I would expect the NUMA-disabled figure to fall somewhere in the middle between the 44 GB/s and 6 GB/s measured with NUMA on.
                Memory node
Socket               0          1
     0         39537.2    39588.7
     1         39515.2    39527.0
Any ideas?
I'm also curious how the tool (mlc) measures the bandwidth. Does it rely on PMU counters, or does it just count the memory operations issued from the client side?
Thanks,
Ilya
Hi,
Low core count parts have lower remote memory b/w in the default configuration, which uses Early Snoop mode. However, if you change the snoop mode to "Home Snoop" through the BIOS, you will see much higher remote memory b/w. There will be a slight increase in latency, but b/w will improve.
Vish
I also saw unusually low numbers for remote accesses on a low-core-count Haswell EP (Xeon E5-2603 v3), but I no longer have access to the system to check the BIOS snoop configuration. The 2603 is even slower than the 2620 -- lower core clock rates and lower DRAM clock rates, with the Intel Memory Latency Checker delivering only 4.7 GB/s between sockets for the default (all reads) case.
The performance varies a bit with access pattern, but not by huge amounts, so the non-NUMA results you obtained are almost certainly not possible. If the data were actually interleaved between the sockets, the 39.5 GB/s you got in the second case would mean almost 20 GB/s from the local memory plus almost 20 GB/s across QPI from the remote memory. This is more than 3x the value that was measured directly, so it seems implausible.
You will have to check your BIOS documentation to be sure, but I don't think that disabling NUMA generates cache-line interleaving. You are much more likely to get no NUMA information, which will lead to uncontrollable pseudo-random page placement.
It is difficult to tell exactly what the Intel Memory Latency Checker is doing for each test. I know that it internally changes the CPU frequency to the maximum value and enables the hardware prefetchers if they are disabled, so it might also be obtaining NUMA information from the hardware that the OS is not aware of.
To test performance for interleaved memory, I recommend using a code that is more transparent, such as the STREAM benchmark (http://www.cs.virginia.edu/stream/). If you have the Intel compilers you can configure this to use streaming stores or to avoid streaming stores. With gcc you will not get streaming stores (except at "-O3" where the compiler will replace the "Copy" kernel with an optimized library routine that uses streaming stores).
A useful set of numbers might come from:
icc -O3 -xAVX2 -openmp -ffreestanding -DSTREAM_ARRAY_SIZE=80000000 -opt-streaming-stores always stream.c -o stream.nta.exe
export OMP_NUM_THREADS=6
export KMP_AFFINITY=verbose,compact # change to "verbose,compact,1" if HyperThreading is enabled
numactl --membind=0 ./stream.nta.exe # all local accesses on socket 0, streaming stores enabled
numactl --membind=1 ./stream.nta.exe # all threads on socket 0, all data on socket 1, with streaming stores
export OMP_NUM_THREADS=12
export KMP_AFFINITY=verbose,scatter
./stream.nta.exe # use both sockets, all accesses should be local, with streaming stores
numactl --interleave=0,1 ./stream.nta.exe # use both sockets, memory alternates between sockets by 4KiB page
icc -O3 -xAVX2 -openmp -ffreestanding -DSTREAM_ARRAY_SIZE=80000000 -opt-streaming-stores never stream.c -o stream.alloc.exe
export OMP_NUM_THREADS=6
export KMP_AFFINITY=verbose,compact # change to "verbose,compact,1" if HyperThreading is enabled
numactl --membind=0 ./stream.alloc.exe # all local accesses on socket 0, no streaming stores
numactl --membind=1 ./stream.alloc.exe # all threads on socket 0, all data on socket 1, no streaming stores
export OMP_NUM_THREADS=12
export KMP_AFFINITY=verbose,scatter
./stream.alloc.exe # use both sockets, all accesses should be local, no streaming stores
numactl --interleave=0,1 ./stream.alloc.exe # use both sockets, memory alternates between sockets by 4KiB page
If my testing on the Xeon E5-2603 v3 is any indication, your results using all the cores on a single socket and memory interleaved across the two chips should be somewhere in the range of 10 GB/s -- about 5 GB/s from local memory and about 5 GB/s over QPI from the other socket. This assumes slightly lower QPI bandwidth than the 6 GB/s you reported, based on my observation that streaming stores are unusually slow on these low-frequency parts. On my Xeon E5-2603 v3, the Intel Memory Latency Checker showed a pattern like:
Using Read-only traffic type
                Memory node
Socket               0          1
     0         26893.0     4713.0
     1          4688.6    27011.7

Using "-W5" (one read and one write)
                Memory node
Socket               0          1
     0         34972.4     5613.6   <-- remote is about 20% faster than all reads
     1          5627.9    34708.2

Using "-W8" (one read and one non-temporal write)
                Memory node
Socket               0          1
     0         22906.1     3740.9   <-- remote is about 20% slower than all reads
     1          3765.2    22984.6
Dr. McCalpin,
MLC does not change the frequency. It only enables all the hw prefetchers for bandwidth measurements and disables them when latency is measured. In the case of measuring cross-socket b/w, it will use all the cores on a socket and allocate memory on the other socket. Basically, a thread will attach itself to a core on the other socket, allocate memory, and do a first touch (to ensure the memory comes from the other socket); the threads then pin to cores on the first socket and access the allocated memory. This results in all references going to the remote socket, assuming NUMA is enabled.
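A minimal sketch of that allocate-remote / first-touch / access-local pattern might look like the following (this is only an illustration, not MLC's actual code; the node numbers, buffer size, and use of libnuma are assumptions):

/* Sketch of the pattern described above.  NOT MLC source.
 * Build with: gcc -O2 first_touch.c -lnuma        (file name is illustrative) */
#include <numa.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_BYTES (1UL << 30)          /* 1 GiB, much larger than the LLC */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    /* Step 1: run on the remote socket (node 1) and first-touch the buffer,
     * so that with NUMA enabled the pages are physically placed on node 1.  */
    numa_run_on_node(1);
    char *buf = malloc(BUF_BYTES);
    if (!buf) return 1;
    memset(buf, 1, BUF_BYTES);         /* first touch allocates the pages   */

    /* Step 2: move to the local socket (node 0) and stream through the
     * buffer; every miss now has to cross QPI to the memory on node 1.     */
    numa_run_on_node(0);
    volatile uint64_t sum = 0;
    for (size_t i = 0; i < BUF_BYTES; i += 64)   /* one load per cache line */
        sum += buf[i];

    printf("checksum = %llu\n", (unsigned long long)sum);
    free(buf);
    return 0;
}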
However, if NUMA is disabled, then the system does 64-byte interleaving across sockets. That is, address x would be on socket 0 while address x+64 would be on socket 1. MLC does not do anything differently in this case: it still has threads pin to the other socket, allocate memory, and do a first touch. However, the allocated memory comes equally from both sockets because NUMA is disabled.
The result that CodeMonkey is reporting with NUMA disabled is still possible, because the resources used to track requests are allocated differently depending on whether NUMA is on or off. Basically, the caching agent has resources that are used to track outstanding requests (misses). Requests to remote memory take longer than requests to local memory, and we typically want to provide more local memory b/w. Using various considerations, those resources are divided differently between local and remote requests, and the division differs between NUMA enabled and disabled.
Vish
Vish, John,
Thanks for your detailed replies!
I disabled Early Snoop in the BIOS and the remote bandwidth skyrocketed, at the cost of around a 10% latency penalty, as Vish noted.
Intel(R) Memory Latency Checker - v3.0
Measuring idle latencies (in ns)...
                Numa node
Numa node            0          1
       0          85.7      128.9
       1         128.0       85.6

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 86216.5
3:1 Reads-Writes : 86969.6
2:1 Reads-Writes : 88117.0
1:1 Reads-Writes : 87380.4
Stream-triad like: 80178.1

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0          1
       0       43272.7    25336.9
       1       25346.7    43259.0

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Latency Bandwidth
Delay (ns) MB/sec
==========================
00000 150.63 86501.6
00002 150.88 86481.6
00008 149.23 86652.2
00015 148.95 86548.9
00050 115.06 73755.5
00100 104.84 56534.0
00200 96.78 31569.2
00300 93.22 21907.8
00400 91.18 16864.0
00500 90.30 13768.8
00700 89.12 10153.8
01000 88.35 7381.2
01300 87.46 5870.2
01700 87.36 4674.6
02500 87.43 3419.3
03500 86.27 2664.8
05000 86.09 2091.0
09000 85.86 1494.9
20000  85.89    1082.7

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 31.8
Local Socket L2->L2 HITM latency 36.5
Remote Socket LLC->LLC HITM latency (data address homed in writer socket)
                        Reader Numa Node
Writer Numa Node             0        1
            0                -     77.8
            1             78.0        -

Remote Socket LLC->LLC HITM latency (data address homed in reader socket)
                        Reader Numa Node
Writer Numa Node             0        1
            0                -     76.8
            1             77.0        -
Thanks,
Ilya
Is there a way to check, from within Linux, which snoop mode is configured in the BIOS? E.g., via an MSR or anything else?
Thanks,
Ilya
I don't know of any programmatic way of checking this definitively. I usually just check the local memory latency and guess which mode it is in....
I seem to recall that some of the uncore counters behave differently with home snoop vs early snoop. In the QPI Link Layer counters, the RxL_FLITS_G1.SNOOP count is zero when home snoop is enabled (because the chip receives read requests rather than snoop requests in this mode). If it gets non-zero counts when in "early snoop" mode, then it could be used as a differentiator.... I don't have a machine that I can test this on right now...
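For the latency-based guess, something as simple as a pointer chase through a buffer much larger than the LLC gives a usable idle-latency estimate. This is only a rough sketch, not what MLC does: the buffer size, iteration count, and the numactl binding in the comment are arbitrary choices, and TLB misses on 4 KiB pages will push the numbers somewhat above MLC's.

/* Rough local-latency check: build a random cyclic pointer chain over the
 * cache lines of a 256 MiB buffer and time dependent loads through it.
 * Run pinned locally, e.g.:  numactl --cpunodebind=0 --membind=0 ./a.out   */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINES (1UL << 22)              /* 4M cache lines = 256 MiB buffer   */
#define ITERS (1UL << 24)              /* enough loads for a stable average */

int main(void)
{
    size_t *next  = malloc(LINES * 64);            /* 8 size_t slots per line */
    size_t *order = malloc(LINES * sizeof *order);
    if (!next || !order) return 1;

    /* Random cyclic permutation of the cache lines so the hardware
     * prefetchers cannot hide the DRAM latency.                             */
    for (size_t i = 0; i < LINES; i++) order[i] = i;
    for (size_t i = LINES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < LINES; i++)
        next[order[i] * 8] = order[(i + 1) % LINES] * 8;
    free(order);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < ITERS; i++)
        p = next[p];                   /* dependent loads: one miss per step */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per load (p=%zu)\n", ns / ITERS, p);
    return 0;
}

Since the local idle latency is noticeably higher in home snoop mode than in early snoop mode (Ilya measured roughly a 10% difference above), a rough number like this is usually enough to tell the two apart.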