We are happy to announce that v3.0 of Intel MLC is now available at http://www.intel.com/software/mlc.
Highlights of the new release include:
Support for client processors
New measurement of cache-to-cache data transfer latencies
Support for measuring latencies and bandwidth to persistent memory
Allocate memory based on NUMA topology. This allows Intel® MLC to measure latencies on all the NUMA nodes on a processor including Cluster-on-Die configuration where there are 4 NUMA nodes on a 2-socket system, or NUMA nodes which have only memory without any compute resources.
Options to use 256-bit and 512-bit loads and stores in generating bandwidth traffic
Fine-grained control for read/write ratios, buffer sizes, NUMA node, etc. on a per-thread basis
Here's an interesting note:
When I run ./mlc --idle_latency, I get a figure of around 95-100 ns on my Core i7-6700HQ, which seems a bit high. If I just spin up another process that does nothing but hot loop (e.g., "while [ true ]; do true; done" in bash, or just a tight loop in C) in another terminal window, my latency numbers improve dramatically to 55 ns or so. I've seen this with other benchmarks, where running something else concurrently speeds up the benchmark, but never with anything like the nearly 2x speedup this shows.
Unlike earlier threads about "spinners" helping out latency (and to a lesser extent, bandwidth), this is a single-socket laptop.
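For anyone who wants to reproduce the "spinner" effect without a shell loop, here is a minimal sketch in Python. The function name `spin` and its parameters are made up for illustration; pinning relies on `os.sched_setaffinity`, which is only available on Linux, and this has not been checked against mlc itself.

```python
import os
import time


def spin(seconds, cpu=None):
    """Busy-loop for `seconds`; optionally pin to one logical CPU first (Linux only)."""
    if cpu is not None and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1  # no sleep and no I/O, so the core never goes idle
    return iterations


# Hypothetical usage: run spin(60.0, cpu=1) in one terminal,
# then start ./mlc --idle_latency in another while it is still spinning.
```

The idea is simply to keep one other core fully busy for the duration of the measurement, which is what the bash loop does.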
I guess maybe the uncore is ramping down or something between accesses, but having something hot on another core keeps it active.
It's weird because it kind of violates one of the main inequalities of multi-threading: running something on N cores is going to speed it up by at most N, never more. Here, by running things on 2 cores you might get stuff done 3.5 times as fast as 1 core. Wow!
I have not worked with MLC on single-socket systems, but I have seen cases where a latency test does not generate enough uncore traffic to convince the power control unit to increase the uncore frequency to maximum. In some cases the uncore kicks up to full speed only if the cores are running at high frequency; if the cores are limited to low frequency, the uncore stays at its minimum frequency.
If you have the msr-tools package installed, you may be able to control the range of available uncore frequencies with MSR 0x620 as discussed in the forum post at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring...; I have not checked to see if that MSR works on the "client" uncore -- it works on all of my recent 2-socket Xeon E5 systems.
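As a rough illustration of what a value read from MSR 0x620 might mean, here is a small decoder sketch. It assumes the bit layout documented for Xeon server parts (bits 6:0 hold the maximum ratio, bits 14:8 the minimum ratio, both in units of 100 MHz); whether the client uncore uses the same layout is exactly the open question above. The function name and the example value are hypothetical.

```python
def decode_uncore_ratio_limit(msr_value):
    """Split a raw MSR 0x620 value into (min_mhz, max_mhz).

    Assumed layout (Xeon server documentation): bits 6:0 = max ratio,
    bits 14:8 = min ratio, each in units of 100 MHz.
    """
    max_ratio = msr_value & 0x7F
    min_ratio = (msr_value >> 8) & 0x7F
    return min_ratio * 100, max_ratio * 100


# Hypothetical raw value, e.g. as printed by `rdmsr -p 0 0x620` from msr-tools.
example = 0x0C1E
print(decode_uncore_ratio_limit(example))  # (1200, 3000)
```

If the minimum and maximum come back equal, the uncore is effectively pinned to one frequency and cannot ramp down between accesses.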
Thanks, Dr. McCalpin, for the useful answer as always (Intel should really give you a stipend or something). It could indeed be the uncore frequency staying low. The difference is quite significant, and as I recall there were a lot of benchmarks released around the time of the Skylake launch showing poorer latency on Skylake and blaming it on DDR4. In fact, a large component of that might be that the tests ran into this behavior.
I forgot the easiest way to see this effect: simply note that the "loaded" latency (for moderate load levels) is much better than the "idle" latency! Here's a test I ran just now, where I got a 99.2 ns "idle" latency, but loaded latencies as good as 60 ns. The latency is even much better (~72 ns) when the other threads are pushing ~15 GB/s through the memory controller.
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
        Memory node
Socket       0
     0    99.2

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :  22206.9
3:1 Reads-Writes :  22605.9
2:1 Reads-Writes :  26390.8
1:1 Reads-Writes :  31431.2
Stream-triad like:  23773.8

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
        Memory node
Socket       0
     0  18723.0

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency  Bandwidth
Delay   (ns)     MB/sec
==========================
 00000  166.17   17601.9
 00002  211.50   18073.5
 00008  150.48   20254.5
 00015  137.81   22579.3
 00050  109.73   23631.3
 00100   72.77   15474.0
 00200   72.11    9517.0
 00300   66.20    6924.4
 00400   63.58    5711.3
 00500   62.49    4838.9
 00700   61.66    3829.5
 01000   63.99    2973.8
 01300   60.93    2589.2
 01700   60.56    2248.6
 02500   60.47    1873.3
 03500   60.09    1647.4
 05000   60.40    1469.5
 09000   59.75    1299.0
 20000   60.27    1165.4
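To make the idle-versus-loaded comparison less of an eyeball exercise, something like the following sketch could pull the loaded-latency rows out of the mlc output and compare the best one against the idle figure. The regular expression and function name are my own assumptions about the row shape (five-digit delay, latency, bandwidth), not anything mlc documents.

```python
import re

# Assumed row shape: 5-digit inject delay, latency in ns, bandwidth in MB/sec.
ROW = re.compile(r"\b(\d{5})\s+(\d+\.\d+)\s+(\d+\.\d+)")


def parse_loaded_latency(text):
    """Return (inject_delay, latency_ns, bandwidth_mb_s) tuples from loaded-latency rows."""
    return [(int(d), float(lat), float(bw)) for d, lat, bw in ROW.findall(text)]


# A few rows copied from the run above.
sample = """
00000 166.17 17601.9
00100 72.77 15474.0
00500 62.49 4838.9
09000 59.75 1299.0
"""
rows = parse_loaded_latency(sample)
best = min(rows, key=lambda r: r[1])  # row with the lowest latency
idle_ns = 99.2                        # idle latency from the same run
print(best)                           # (9000, 59.75, 1299.0)
print(round(idle_ns / best[1], 2))    # 1.66
```

A ratio well above 1 is the red flag here: an idle system should not have worse latency than a lightly loaded one.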
Dr. McCalpin was on to something. I checked turbostat while the --idle_latency test was running, and indeed the core frequency is very low, hovering almost constantly around 1050 MHz, far from the nominal 2600 MHz, and further still from the 3500 MHz max turbo frequency.
OK, so what's up with that? I was using the default intel_pstate cpufreq driver, which defaults to the powersave governor. Note that this driver mostly lets the CPU do its own thing with respect to power management (that is, it enables "hardware power management", aka HWP), but still provides hints. I tried flipping this over to performance with cpupower -c 0,1,2,3 frequency-set -g performance, and boom! Good latency figures of ~56 ns! It also brought all the bandwidth figures for the different read/write ratios closer to the max of ~32 GB/s, so apparently some of the patterns triggered frequency reductions more than others.
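Before trusting a latency number, it may be worth confirming which governor each CPU is actually using. Here is a sketch that reads the standard Linux cpufreq sysfs files; the helper names are hypothetical, and the `cpu_root` parameter exists only so the function can be pointed at a different directory.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each cpuN directory name to its current cpufreq scaling governor."""
    governors = {}
    for gov_file in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def all_performance(governors):
    """True only if at least one CPU was found and every one reports 'performance'."""
    return bool(governors) and all(g == "performance" for g in governors.values())


# On a machine without cpufreq sysfs files this prints an empty dict and False.
current = read_governors()
print(current)
print(all_performance(current))
```

Note that changing the governor for only some logical CPUs leaves the rest on the old setting, so a per-CPU check like this can catch a partial change.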
Turbostat showed that the cpu running the bench was ticking along near the max turbo frequency of 3.5 GHz. Perhaps this also had an effect on the uncore, as Dr. McCalpin mentioned: in fact, maybe that's the primary effect as I wouldn't expect the core frequency to make a big difference on a pure memory latency test, except indirectly if it affects the uncore speed.
Here are the results with the "performance" governor:
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
        Memory node
Socket       0
     0    55.9

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :  30128.9
3:1 Reads-Writes :  29259.1
2:1 Reads-Writes :  29527.6
1:1 Reads-Writes :  31969.6
Stream-triad like:  27566.5

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
        Memory node
Socket       0
     0  30229.2

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency  Bandwidth
Delay   (ns)     MB/sec
==========================
 00000  112.58   29276.1
 00002  112.58   29129.4
 00008  111.95   28380.3
 00015  111.06   27529.2
 00050   94.05   25262.2
 00100   68.94   17691.0
 00200   63.16   10947.1
 00300   60.94    8049.6
 00400   60.03    6470.3
 00500   59.64    5468.0
 00700   58.87    4297.0
 01000   58.01    3389.0
 01300   57.92    2880.5
 01700   58.02    2468.0
 02500   57.64    2045.9
 03500   59.36    1741.7
 05000   60.62    1521.4
 09000   66.87    1211.6
 20000   59.60    1190.8

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency    23.9
Local Socket L2->L2 HITM latency    28.4
I didn't actually dig into the MSRs to see what power management hints the driver changed, but they evidently made a huge difference. There is an argument to be made that the power management is much too aggressive here: the idea is that energy-efficient turbo can ramp down core and/or uncore speeds when some heuristic (based on memory stalls) determines that the higher speeds aren't useful, but here it gets it wrong, as latency increases massively. Real code that was latency limited would take 50% longer or more.
FWIW, it isn't the case that the powersave governor just keeps the frequency universally low. If you run a CPU-intensive bench (i.e., not full of memory stalls), that governor has no issue keeping the CPU at its turbo frequency. So it's definitely an efficiency heuristic kicking in...
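For a rough sense of scale, here is the arithmetic using the two idle latencies reported in the runs above. It assumes a workload whose run time is entirely a chain of dependent memory accesses, which is the worst case; the function name is just for illustration.

```python
def slowdown_percent(slow_ns, fast_ns):
    """Extra run time, in percent, of a purely latency-bound workload."""
    return (slow_ns / fast_ns - 1.0) * 100.0


idle_powersave_ns = 99.2    # idle latency from the powersave run
idle_performance_ns = 55.9  # idle latency from the performance run
print(round(slowdown_percent(idle_powersave_ns, idle_performance_ns), 1))  # 77.5
```

Real code spends only part of its time waiting on DRAM, so the observed penalty would be smaller than this upper bound, depending on the fraction of time that is latency bound.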