Intel Memory Latency Checker V3.0 released

Thomas_W_Intel · ‎11-11-2015

We are happy to announce that v3.0 of Intel MLC is now available at http://www.intel.com/software/mlc.

Highlights of the new release include:

Support for client processors
New measurement of cache-to-cache data transfer latencies
Support for measuring latencies and bandwidth to persistent memory
Allocate memory based on NUMA topology. This allows Intel® MLC to measure latencies on all the NUMA nodes on a processor including Cluster-on-Die configuration where there are 4 NUMA nodes on a 2-socket system, or NUMA nodes which have only memory without any compute resources.
Options to use 256-bit and 512-bit loads and stores in generating bandwidth traffic
Fine-grained control for read/write ratios, buffer sizes, NUMA node, etc. on a per-thread basis

Travis_D_ · ‎02-01-2017

Handy tool.

Here's an interesting note:

When I use ./mlc --idle_latency, I get a figure of around 95-100 ns on my Core i7-6700HQ, which seems a bit high. If I just spin up another process that does nothing but hot loop (e.g., while [ true ]; do true; done in bash, or just a tight loop in C) in another terminal window, my latency numbers improve dramatically to 55 ns or so. I've seen this with other benchmarks, where running something else concurrently speeds up the benchmark, but never the nearly 2x speedup this shows.

Unlike earlier threads about "spinners" helping out latency (and to a lesser extent, bandwidth), this is a single-socket laptop.

I guess maybe the uncore is ramping down or something between accesses, but having something hot on another core keeps it active.
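For anyone who wants to reproduce this without leaving an infinite loop running, here is a minimal sketch of a time-bounded spinner. The function name spin and the duration argument are my own choices for illustration. It only keeps one core busy and makes no attempt to pin itself to a particular CPU.

```python
import time


def spin(seconds: float) -> int:
    """Busy-loop on one core for roughly `seconds`; return the iteration count."""
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1  # no sleep and no I/O, so the core never goes idle
    return iterations
```

Calling spin(30.0) in a second terminal while ./mlc --idle_latency runs in the first should be roughly equivalent to the bash loop above.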

It's weird because it kind of violates one of the main inequalities of multi-threading: running something on N cores is going to speed it up by at most N, never more. Here, by running things on 2 cores you might get stuff done 3.5 times as fast as 1 core. Wow!

McCalpinJohn · ‎02-02-2017

I have not worked with MLC on single-socket systems, but I have seen cases where a latency test does not generate enough uncore traffic to convince the power control unit to increase the uncore frequency to maximum. In some cases the uncore kicks up to full speed only if the cores are running at high frequency, but if the cores are limited to low frequency, the uncore stays at its minimum frequency.

If you have the msr-tools package installed, you may be able to control the range of available uncore frequencies with MSR 0x620 as discussed in the forum post at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/600913#comment-1872473 ; I have not checked to see if that MSR works on the "client" uncore -- it works on all of my recent 2-socket Xeon E5 systems.
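As a rough sketch of how a value read from that MSR could be interpreted: the decoder below assumes the layout I understand to apply on the Xeon E5 parts (maximum ratio in bits 6:0, minimum ratio in bits 14:8, each in units of a 100 MHz reference clock). That layout is an assumption here, the function name is made up for illustration, and it may not hold on client parts. The sample value is invented, not read from real hardware.

```python
def decode_uncore_ratio_limit(msr_value: int, bclk_mhz: int = 100) -> dict:
    """Split a raw MSR 0x620 value into min/max uncore frequencies in MHz.

    Assumed layout (not verified on client parts):
      bits 6:0  -> maximum uncore ratio
      bits 14:8 -> minimum uncore ratio
    """
    max_ratio = msr_value & 0x7F
    min_ratio = (msr_value >> 8) & 0x7F
    return {"min_mhz": min_ratio * bclk_mhz, "max_mhz": max_ratio * bclk_mhz}


# Hypothetical raw value: min ratio 0x0C (12), max ratio 0x1E (30).
print(decode_uncore_ratio_limit(0x0C1E))  # {'min_mhz': 1200, 'max_mhz': 3000}
```

With msr-tools, the raw value would typically come from something like rdmsr -p 0 0x620 run as root, which prints the register contents in hex.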

Travis_D_ · ‎02-02-2017

Thanks, Dr. McCalpin, for the useful answer as always (Intel should really give you a stipend or something). It could indeed be the uncore frequency staying low. The difference is quite significant, and as I recall there were a lot of benchmarks released around the time of the Skylake launch showing poorer latency on Skylake and blaming it on DDR4. In fact, a large component of that might be that the tests ran into this behavior.

I forgot the easiest way to see this effect: simply note that the "loaded" latency (for moderate load levels) is much better than the "idle" latency! Here's a test I ran just now, where I got a 99.2 ns "idle" latency, but loaded latencies as good as 60 ns. The latency is still much better (~72 ns) even when the other threads are pushing ~15 GB/s through the memory controller.

Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
	Memory node
Socket	     0	
     0	  99.2	

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :	22206.9	
3:1 Reads-Writes :	22605.9	
2:1 Reads-Writes :	26390.8	
1:1 Reads-Writes :	31431.2	
Stream-triad like:	23773.8	

Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
	Memory node
 Socket	     0	
     0	18723.0	

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	166.17	  17601.9
 00002	211.50	  18073.5
 00008	150.48	  20254.5
 00015	137.81	  22579.3
 00050	109.73	  23631.3
 00100	 72.77	  15474.0
 00200	 72.11	   9517.0
 00300	 66.20	   6924.4
 00400	 63.58	   5711.3
 00500	 62.49	   4838.9
 00700	 61.66	   3829.5
 01000	 63.99	   2973.8
 01300	 60.93	   2589.2
 01700	 60.56	   2248.6
 02500	 60.47	   1873.3
 03500	 60.09	   1647.4
 05000	 60.40	   1469.5
 09000	 59.75	   1299.0
 20000	 60.27	   1165.4
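To make the idle-versus-loaded comparison less manual, here is a small sketch that pulls the rows out of a loaded-latency table like the one above and reports the best latency. It assumes the output format shown here (a line of "=" characters followed by three numeric columns). The function name and the SAMPLE excerpt are mine, with the three rows copied from the run above.

```python
SAMPLE = """\
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  166.17    17601.9
 00100   72.77    15474.0
 09000   59.75     1299.0
"""


def parse_loaded_latency(text: str) -> list:
    """Return (inject_delay, latency_ns, bandwidth_mb_s) tuples from MLC output."""
    rows = []
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("====="):
            in_table = True  # data rows begin after the separator line
            continue
        if not in_table:
            continue
        fields = stripped.split()
        if len(fields) != 3:
            break  # a blank or differently shaped line ends the table
        try:
            rows.append((int(fields[0]), float(fields[1]), float(fields[2])))
        except ValueError:
            break
    return rows


rows = parse_loaded_latency(SAMPLE)
best = min(rows, key=lambda r: r[1])  # row with the lowest latency
idle_ns = 99.2  # idle latency reported in the same run
print(f"best loaded latency {best[1]} ns at inject delay {best[0]}; idle {idle_ns} ns")
```

If the best loaded latency comes out well below the idle figure, as it does here, that is a hint the idle measurement is being hurt by frequency scaling rather than by the memory itself.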

Travis_D_ · ‎02-02-2017

Dr. McCalpin was on to something. I checked turbostat while the ./mlc --idle_latency test was running, and indeed the core frequency is very low, hovering almost constantly around 1050 MHz, far from the nominal 2600 MHz and further still from the 3500 MHz max turbo frequency.

OK, so what's up with that? I was using the default intel_pstate cpufreq driver, which defaults to the powersave governor. Note that this driver mostly lets the CPU do its own thing with respect to power management (that is, it enables "hardware power management", aka HWP), but still provides hints. I tried flipping this over to performance with cpupower -c 0,1,2,3 frequency-set -g performance, and boom! Good latency figures of ~56 ns! It also brought all the bandwidth figures for the different read/write ratios closer to the max of ~32 GB/s, so apparently some of the patterns triggered frequency reductions more than others.
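Before rerunning a latency test, it can be worth confirming which governor each CPU is actually using. A minimal sketch, assuming the usual Linux cpufreq sysfs layout (cpuN/cpufreq/scaling_governor under /sys/devices/system/cpu); the function name is my own, and the root directory is a parameter only so the function can be pointed at a test directory.

```python
from pathlib import Path


def read_governors(cpu_root: str = "/sys/devices/system/cpu") -> dict:
    """Map each CPU directory name (e.g. 'cpu0') to its current cpufreq governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = path.parts[-3]  # .../cpu0/cpufreq/scaling_governor -> 'cpu0'
        governors[cpu_name] = path.read_text().strip()
    return governors


for cpu, governor in read_governors().items():
    print(cpu, governor)
```

On a system without cpufreq support the function simply returns an empty dict, so the loop prints nothing.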

Turbostat showed that the cpu running the bench was ticking along near the max turbo frequency of 3.5 GHz. Perhaps this also had an effect on the uncore, as Dr. McCalpin mentioned: in fact, maybe that's the primary effect as I wouldn't expect the core frequency to make a big difference on a pure memory latency test, except indirectly if it affects the uncore speed.

Here are the results with the "performance" governor:

Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
	Memory node
Socket	     0	
     0	  55.9	

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :	30128.9	
3:1 Reads-Writes :	29259.1	
2:1 Reads-Writes :	29527.6	
1:1 Reads-Writes :	31969.6	
Stream-triad like:	27566.5	

Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
	Memory node
 Socket	     0	
     0	30229.2	

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	112.58	  29276.1
 00002	112.58	  29129.4
 00008	111.95	  28380.3
 00015	111.06	  27529.2
 00050	 94.05	  25262.2
 00100	 68.94	  17691.0
 00200	 63.16	  10947.1
 00300	 60.94	   8049.6
 00400	 60.03	   6470.3
 00500	 59.64	   5468.0
 00700	 58.87	   4297.0
 01000	 58.01	   3389.0
 01300	 57.92	   2880.5
 01700	 58.02	   2468.0
 02500	 57.64	   2045.9
 03500	 59.36	   1741.7
 05000	 60.62	   1521.4
 09000	 66.87	   1211.6
 20000	 59.60	   1190.8

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency	23.9
Local Socket L2->L2 HITM latency	28.4

I didn't actually dig into the MSRs to see what power management hints the intel_pstate driver changed, but they evidently made a huge difference. There is an argument to be made that the power management is much too aggressive here: the idea is that energy-efficient turbo can ramp down core and/or uncore speeds when some heuristic (based on memory stalls) determines that the higher speeds aren't useful, but here it gets it wrong, as latency increases massively. Real code that was latency limited would take 50% longer or more.
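To put a number on that, here is the back-of-the-envelope arithmetic using the two idle latencies measured above (99.2 ns with powersave, 55.9 ns with performance). It assumes a purely latency-bound workload whose runtime scales linearly with memory latency, which real code will only approximate; the function name is mine.

```python
def slowdown_percent(slow_ns: float, fast_ns: float) -> float:
    """Extra runtime, in percent, if runtime scales linearly with latency."""
    return (slow_ns / fast_ns - 1.0) * 100.0


extra = slowdown_percent(99.2, 55.9)
print(f"about {extra:.0f}% longer")  # about 77% longer
```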

Travis_D_ · ‎02-02-2017

FWIW, it isn't the case that the powersave governor just keeps the frequency universally low. If you run a CPU-intensive bench (i.e., not full of memory stalls), that governor has no issue keeping the CPU at its turbo frequency. So it's definitely an efficiency heuristic kicking in...