A new version of Intel Memory Latency Checker v2.0 (Intel MLC) has recently been posted at http://www.intel.com/software/mlc
Apart from the unloaded memory latency, Intel MLC can now measure memory bandwidth and loaded latencies as well.
Results on my Xeon E5-2680 systems are plausible, though it is not completely clear what is being measured in all cases, and the results often show significantly better performance than I have been able to obtain....
One item you should be aware of -- the tester shows a memory latency of 13.0 ns on an Intel Atom C2750 (Silvermont) system. I think it is safe to say that this is not a valid DRAM latency measurement. I have only done a little bit of testing on that system, but I noticed that one code that I use to test latency also appears to be unexpectedly activating a hardware prefetcher -- meaning that the Silvermont has a smarter hardware prefetcher than Sandy Bridge EP. With the hardware prefetchers disabled, I get latency values of about 104 ns, which seems reasonable for a low-power system running at ~2.4 GHz. Interestingly, the value presented by the Memory Latency Checker tool is exactly 1/8 of the value that I get with HW prefetch disabled.
Argh -- the stupid forum ate my last reply.
Short answer -- "mlc --idle_latency -r" returns ~53 ns on the C2750 -- better, but still way off of the ~100 ns that I get with my most reliable test codes.
Some odd results on my Xeon E5-2680 systems:
1. Running "mlc" with no arguments gives a latency matrix of 67.0 ns local and 116.8 ns remote, while the explicit tests "mlc --idle_latency" and "mlc --idle_latency -c1 -i8" give the more reasonable results of 79.5 ns and 135.6 ns (agreeing with my pointer-chasing tests). Both results are repeatable with very little variability.
2. Running "mlc" with no arguments gives a bandwidth matrix of ~46 GB/s local and ~21.3 GB/s remote, but running "mlc --bandwidth_matrix" gives ~45 GB/s local and ~17.0 GB/s remote. Both claim to be running "Read-only" traffic (and both results are repeatable with very little variability).
Thanks a lot for running the test with the random pattern on the C2750. MLC should probably print out that this processor is not supported.
I tried to reproduce your numbers on one of our Intel Xeon E5-2680 processors. I could not reproduce the latency discrepancy, but I see a somewhat similar effect for the latency matrix. I will look into it.
Hi Dr. McCalpin, I was looking into the issues that you posted and have some suggestions for you to try.
> "mlc --idle_latency -r" returns ~53 ns on the C2750 -- better, but still way off of the ~100 ns that I get with my most reliable test codes
The -r option alone is not sufficient. Can you please add -l128 as well? This uses a 128-byte stride and avoids the adjacent-sector prefetch.
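To make the "random pattern plus 128-byte stride" idea concrete, here is a minimal sketch of how a pointer-chasing test can lay out its chain. This is not MLC's implementation; the function names are made up for illustration, and Python is only used to show the access order (an actual latency test would walk the chain in C or assembly). It uses Sattolo's algorithm so the chain is one single cycle that touches every slot exactly once.

```python
import random

def build_chase_chain(num_slots, seed=0):
    """Return a list nxt where nxt[i] is the slot visited after slot i.

    Sattolo's algorithm produces a permutation made of one single cycle,
    so a walk from any slot visits every slot before it repeats.
    """
    rng = random.Random(seed)
    nxt = list(range(num_slots))
    for i in range(num_slots - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def byte_offsets(chain, stride_bytes=128):
    """Map slot indices to byte offsets, one slot per 128-byte stride."""
    return [slot * stride_bytes for slot in chain]

def cycle_length(chain, start=0):
    """Count the steps needed to come back to the starting slot."""
    steps, slot = 1, chain[start]
    while slot != start:
        slot = chain[slot]
        steps += 1
    return steps
```

With one slot per 128 bytes, no two accesses land in the same adjacent 64-byte line pair, which is the point of the -l128 suggestion.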
> Running "mlc" with no arguments gives a latency matrix of 67.0 ns local and 116.8 ns remote, while the explicit tests "mlc --idle_latency" and "mlc --idle_latency -c1 -i8" give the more reasonable results of 79.5 ns and 135.6 ns
Due to aggressive power management in newer generations of processors, you need to have at least one core really active on each of the sockets to get the best latency. Otherwise, snoops take longer to respond. In the default case with no arguments, we automatically launch dummy threads (running a busy loop) on each socket, and that is why we get the best possible latency. But when the user runs ./mlc --idle_latency, we expect the user to provide all the options, including the -p option, to minimize the complexity of the code. Please refer to section 6.1 in the readme.pdf, which is included in the MLC package that you downloaded.
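For anyone who wants to try the "keep one core busy" idea outside of MLC, a spinner can be as simple as the sketch below. This is not the dummy thread that MLC launches; the function name and the CPU number in the usage note are hypothetical, and the pinning call (os.sched_setaffinity) is only available on Linux.

```python
import os
import time

def spin(seconds, cpu=None):
    """Busy-loop for `seconds`, optionally pinned to one logical CPU.

    Returns the number of loop iterations, which is only useful as a
    sanity check that the loop actually ran.
    """
    if cpu is not None and hasattr(os, "sched_setaffinity"):
        # Pin the calling process to the requested logical CPU (Linux only).
        os.sched_setaffinity(0, {cpu})
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations
```

For example, spin(60, cpu=8) would keep logical CPU 8 busy for a minute, assuming CPU 8 sits on the socket you want to keep awake; check your own topology before picking a number.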
> Running "mlc" with no arguments gives a bandwidth matrix of ~46 GB/s local and ~21.3 GB/s remote, but running "mlc --bandwidth_matrix" gives ~45 GB/s local and ~17.0 GB/s remote
We found the issue and will be posting a fix soon. Thanks for bringing this to our attention.
Thanks for all the great feedback
Disabling C1E is not sufficient. Unless you keep at least one core active all the time, the uncore frequency will not scale up, resulting in increased snoop response times.
I have not tested this by disabling the C1E state, but the documentation that I have read implies that disabling C1E should be enough to keep the uncore frequency at its previous value. The document "Intel Xeon Processor E5-1600/E5-2600/E5-4600 Product Families: Datasheet - Volume 1" (326508-002, May 2012), page 95 says "No additional power reduction actions are taken in the C1 state. However, if the C1E substate is enabled, the processor automatically transitions to the lowest supported clock frequency, followed by a reduction in voltage."
I have gone back and re-run my latency tests with a "spinner" program on the other socket to keep at least one core active. I used the cpufreq interface to manually set the core frequencies on each chip to each of the allowed values (1200, 1400, 1600, 1800, 2000, 2200, 2400, 2700, 2701 (i.e. 3100) MHz) and used the fixed function counter in the uncore UBOX (MSR 0xC09) to verify that the uncore frequency tracked the frequency of the "spinning" core in all cases.
With the hardware prefetchers disabled, I ran a standard pointer-chasing latency test with strides of 64B, 128B, and 256B for all frequency combinations. Taking the minimum over the various strides shows a smooth increase in latency from 67.0 ns (both chips at 3.1 GHz) to 79.6 ns (the "active" chip at 3.1 GHz and the "passive" chip at 1.2 GHz) -- an increase of almost 19%. The 79.6 ns exactly matches my previous latency measurements, which had no "spinning" process on the other chip. The full set of results is consistent with a model that puts 24-25 cycles of the total latency in the domain of the "passive" chip's clock frequency.
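The 24-25 cycle figure can be checked from just the two endpoint measurements quoted above. The sketch below assumes the simplest possible model, latency = fixed part + N cycles at the passive chip's frequency; the function and variable names are mine, and only the two endpoints are used rather than the full set of frequency combinations.

```python
def passive_domain_cycles(lat_fast_ns, lat_slow_ns, f_fast_ghz, f_slow_ghz):
    """Solve for N in: latency = base + N / f_passive (cycles / GHz = ns)."""
    return (lat_slow_ns - lat_fast_ns) / (1.0 / f_slow_ghz - 1.0 / f_fast_ghz)

def predicted_latency_ns(base_ns, cycles, f_passive_ghz):
    """Latency predicted by the two-term model for a given passive frequency."""
    return base_ns + cycles / f_passive_ghz

# Endpoints from the post: 67.0 ns with the passive chip at 3.1 GHz,
# 79.6 ns with the passive chip at 1.2 GHz.
cycles = passive_domain_cycles(67.0, 79.6, 3.1, 1.2)   # about 24.7 cycles
base_ns = 67.0 - cycles / 3.1                          # part that does not scale
increase_pct = (79.6 / 67.0 - 1.0) * 100.0             # about 18.8 %
```

Calling predicted_latency_ns(base_ns, cycles, f) for the intermediate frequencies gives the values this simple model would expect, which can then be compared against the measured ones.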
Higher local latency should reduce local memory bandwidth, and indeed I observed a 4%-6% increase in single-thread read bandwidth when I added the "spinner" process to the other chip.
Of course remote bandwidth is going to be much more sensitive to the uncore frequency on the remote chip. I measured a 23% increase in remote read bandwidth when I added the "spinner" process to the chip serving the DRAM. My read-only bandwidth results are still much lower than those reported by the Intel MLC, but this was a single-threaded test case so I still have a lot of work to do....
Dr. McCalpin, I am glad to see the latency numbers are making sense for you with the 'spinner' on the other socket. Regarding single-threaded b/w, you can use the ./mlc --peak_bandwidth -m1 option to measure it. The -m option is basically a mask where each bit set represents the logical cpu# where you want the threads to run. Currently, we don't run the spinner automatically for b/w tests (as we typically run b/w threads on all cores) but do so only for latency tests. So, you need to manually run the spinner if you test single-thread b/w.
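As a quick illustration of the mask convention described here (one bit per logical cpu#), the following sketch builds the integer mask from a list of CPU numbers. The helper name is hypothetical, and this thread does not say whether MLC expects the mask in hex or decimal for larger values, so check the readme before passing anything other than -m1.

```python
def cpu_mask(cpus):
    """Return an integer with bit n set for every logical CPU n in `cpus`."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("logical CPU numbers must be non-negative")
        mask |= 1 << cpu
    return mask
```

So cpu_mask([0]) is 1, which corresponds to the -m1 in the command above, and hex(cpu_mask([0, 8])) is '0x101'.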
Thanks for this informative discussion. Other than keeping a core active, are there other mechanisms that prevent uncore frequency scaling and so minimize latency, such as tuning the uncore ratio limits (min, max), etc.? I am specifically interested in minimizing latency for PCI Express devices streaming data to host memory.
I don't believe that the uncore frequency can be controlled directly -- it appears to be set by the hardware to match the frequency of the fastest core.
So to keep the uncore at maximum frequency I can think of four options:
rdmsr -p 9 -d 0xC09; sleep 10; rdmsr -p 9 0xC09 and see how many uncore clocks are accrued during the 10 second sleep. (The UBOX fixed function cycle counter must be enabled by setting bit 22 in MSR 0xC08 before doing this test.)
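Turning the two counter readings into a frequency is just a subtraction and a division, sketched below. The function name is mine, and the 48-bit counter width used for the wraparound handling is an assumption that should be checked against the uncore performance monitoring documentation for your part.

```python
def uncore_mhz(count_start, count_end, seconds, counter_bits=48):
    """Average uncore frequency in MHz from two readings of the cycle counter.

    counter_bits is an assumed counter width; the modulo keeps the delta
    correct if the counter wrapped once between the two readings.
    """
    delta = (count_end - count_start) % (1 << counter_bits)
    return delta / seconds / 1e6
```

For example, if the two rdmsr readings differ by 31,000,000,000 over the 10 second sleep, uncore_mhz(0, 31_000_000_000, 10.0) gives 3100.0 MHz. If the second reading is printed in hex, convert it first with int(text, 16).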
The requirement that the C1E state be enabled to remain within specifications seems a bit strange, given that there is no requirement that cores ever be left idle. It is probably required for Energy Star certification or something similar.
Of course what you really want is none of the above -- you want the uncore frequency to drop only when the impact on performance is actually small. If the rate of probes from the other chip is relatively low, then increasing the latency of those probes by ~19% could easily be tolerable. Even if the rate of probes is high, increasing the latency may be tolerable -- it depends on whether the codes running on the other chip are generating enough concurrency to tolerate the added latency. On the other hand, if there is a high rate of memory access, then the uncore should remain at a high frequency even if none of the local cores are active. I don't see any way to force this behavior automatically, so a software approach that keeps a core active on the chip servicing the memory requests seems like the most precise approach.
Dr. McCalpin, thank you for the detailed response. As you've suggested, I would expect that disabling C1E should be sufficient to minimize latency due to uncore frequency scaling, so I was surprised to read in this thread that a spinner thread was still necessary. If I need to keep a CPU spinning to minimize latency, is there even a reason to disable C1E? Presumably the package would never sleep in this case.
In my specific use case, I have PCI Express devices writing to host memory while the CPU cores are relatively idle, and I want to ensure that these PCIe devices can stream predictably to prevent device-side overflow conditions, etc. So, if it were possible to force a given floor for the uncore frequency, or to force it to its max ratio, that would be desirable for my use case.
Regarding the warranty/reliability implications of disabling C1E, I've seen these references also. I have always assumed that Intel's reliability assumptions include a certain percentage of idle time (i.e. c-state residency), though I can't confirm.
I re-ran the Intel Memory Latency Checker "peak_bandwidth" test using a single thread with and without a "spinner" on the other socket.
In this case the spinner thread made a larger difference in bandwidth than I saw with my single-thread ReadOnly test case.
Bandwidth (MB/s)     Avg w/out     Avg w    Uplift
ALL Reads         :   12572.05  14173.15     12.7%
3:1 Reads-Writes  :   16598.55  17418.80      4.9%
2:1 Reads-Writes  :   18088.45  18828.35      4.1%
1:1 Reads-Writes  :   20523.40  21712.05      5.8%
Stream-triad like :   10385.80  11411.90      9.9%
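For reference, the uplift column is simply the ratio of the two averages minus one. The short sketch below recomputes it from the numbers reported above; the dictionary layout and function name are mine, and the values are copied from the post.

```python
# (average without spinner, average with spinner), as reported above
results = {
    "ALL Reads":         (12572.05, 14173.15),
    "3:1 Reads-Writes":  (16598.55, 17418.80),
    "2:1 Reads-Writes":  (18088.45, 18828.35),
    "1:1 Reads-Writes":  (20523.40, 21712.05),
    "Stream-triad like": (10385.80, 11411.90),
}

def uplift_percent(without_spinner, with_spinner):
    """Percentage gain from adding the spinner on the other socket."""
    return (with_spinner / without_spinner - 1.0) * 100.0

uplifts = {name: round(uplift_percent(*pair), 1) for name, pair in results.items()}
```

The same helper can be reused for other with/without comparisons, such as the ReadOnly test case.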
My ReadOnly test case appears to be more heavily optimized for single-thread performance -- it delivered ~17.5 GB/s without the "spinner" on the other socket and ~18.3 GB/s with the "spinner" on the other socket. My extra optimizations include using large pages, splitting the read stream into two halves (to increase the number of pages that the L2 prefetcher could operate on), and adding software prefetches anywhere between 8 and 15 lines ahead of the current load for each of the two read streams.