Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Intel Memory Latency Checker v2.0 released

Thomas_W_Intel
Employee

A new version of the Intel Memory Latency Checker (Intel MLC), v2.0, has recently been posted at http://www.intel.com/software/mlc.

In addition to unloaded memory latency, Intel MLC can now also measure memory bandwidth and loaded latencies.
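For example, a few of the invocations that come up later in this thread (the one-line descriptions are based on those later posts):

       ./mlc                        # latency and bandwidth matrices with default settings
       ./mlc --idle_latency         # unloaded memory latency
       ./mlc --latency_matrix       # local and remote memory latency for each socket
       ./mlc --bandwidth_matrix     # local and remote memory bandwidth for each socket
       ./mlc --peak_bandwidth       # bandwidth for several read/write ratios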

Bernard
Valued Contributor I

Is this Linux only?

Thomas_W_Intel
Employee

Yes, Intel MLC is only available on Linux at the moment.

Bernard
Valued Contributor I

Thanks Thomas.

McCalpinJohn
Honored Contributor III

Interesting tool....

Results on my Xeon E5-2680 systems are plausible, though it is not completely clear what is being measured in all cases, and the results often show significantly better performance than I have been able to obtain....

One item you should be aware of -- the tester shows a memory latency of 13.0 ns on an Intel Atom C2750 (Silvermont) system.  I think it is safe to say that this is not a valid DRAM latency measurement.   I have only done a little bit of testing on that system, but I noticed that one code that I use to test latency also appears to be unexpectedly activating a hardware prefetcher -- meaning that the Silvermont has a smarter hardware prefetcher than Sandy Bridge EP.   With the hardware prefetchers disabled, I get latency values of about 104 ns, which seems reasonable for a low-power system running at ~2.4 GHz.   Interestingly, the value presented by the Memory Latency Checker tool is exactly 1/8 of the ~104 ns value that I measure with the HW prefetchers disabled.

Patrick_F_Intel1
Employee

Hello Dr. McCalpin,

You can't disable the hardware prefetchers on the Silvermont-based Atom chip.

Pat

Thomas_W_Intel
Employee

Do the results match your expectations better when you use random-access reads? You can trigger them like this:

./mlc --latency_matrix -r

 

McCalpinJohn
Honored Contributor III

Argh -- the stupid forum ate my last reply.

Short answer -- "mlc --idle_latency -r" returns ~53 ns on the C2750 -- better, but still way off of the ~100 ns that I get with my most reliable test codes.

McCalpinJohn
Honored Contributor III

Some odd results on my Xeon E5-2680 systems:

1. Running "mlc" with no arguments gives a latency matrix of 67.0 ns local and 116.8 ns remote, while the explicit tests "mlc --idle_latency" and "mlc --idle_latency -c1 -i8" give the more reasonable results of 79.5 ns and 135.6 ns (agreeing with my pointer-chasing tests).  Both results are repeatable with very little variability.

2. Running "mlc" with no arguments gives a bandwidth matrix of ~46 GB/s local and ~21.3 GB/s remote, but running "mlc --bandwidth_matrix" gives ~45 GB/s local and ~17.0 GB/s remote.  Both claim to be running "Read-only" traffic (and both results are repeatable with very little variability).

Thomas_W_Intel
Employee

Thanks a lot for running the test with the random pattern on the C2750. MLC should probably print out that this processor is not supported.

I tried to reproduce your numbers on one of our Intel Xeon E5-2680 processors. I could not reproduce the latency discrepancy, but I see a somewhat similar effect for the latency matrix. I will look into it. 

Arfi_M_
Beginner

A very well written and concise post, with lots of useful information.

Thanks for sharing.

Krishnaswa_V_Intel

Hi Dr. McCalpin, I was looking into the issues that you posted and have some suggestions for you to try:

      > "mlc --idle_latency -r" returns ~53 ns on the C2750 -- better, but still way off of the ~100 ns that I get with my most reliable test codes

The -r option alone is not sufficient. Can you please add -l128 as well? This would use a 128-byte stride and avoid the adjacent-sector prefetch.
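That is, something along these lines:

       ./mlc --idle_latency -r -l128     # random access pattern with a 128-byte stride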

     > Running "mlc" with no arguments gives a latency matrix of 67.0 ns local and 116.8 ns remote, while the explicit tests "mlc --idle_latency" and "mlc --idle_latency -c1 -i8" give the more reasonable results of 79.5 ns and 135.6 ns

Due to aggressive power management in newer generations of processors, you need to have at least one core really active on each of the sockets to get the best latency. Otherwise, snoop responses take longer. In the default case with no arguments, we automatically launch dummy threads (running a busy loop) on each socket, and that is why we get the best possible latency. But when the user runs ./mlc --idle_latency, we expect the user to provide all the options, including the -p option, to minimize the complexity of the code. Please refer to section 6.1 in the readme.pdf; you can find this file in the MLC package that you downloaded.

    >  Running "mlc" with no arguments gives a bandwidth matrix of ~46 GB/s local and ~21.3 GB/s remote, but running "mlc --bandwidth_matrix" gives ~45 GB/s local and ~17.0 GB/s remote

We found the issue and will be posting a fix soon. Thanks for bringing this to our attention.

Thanks for all the great feedback

McCalpinJohn
Honored Contributor III

Argh.... I did not realize that we did not disable C1E when we disabled all the other C states....   Time to update my numbers....

Krishnaswa_V_Intel

Disabling C1E is not sufficient. Unless you keep at least one core active all the time, the uncore frequency will not scale up, resulting in increased snoop response times.

McCalpinJohn
Honored Contributor III

I have not tested this by disabling the C1E state, but the documentation that I have read implies that disabling C1E should be enough to keep the uncore frequency at its previous value.  The document "Intel Xeon Processor E5-1600/E5-2600/E5-4600 Product Families: Datasheet - Volume 1" (326508-002, May 2012), page 95 says "No additional power reduction actions are taken in the C1 state.  However, if the C1E substate is enabled, the processor automatically transitions to the lowest supported clock frequency, followed by a reduction in voltage."

I have gone back and re-run my latency tests with a "spinner" program on the other socket to keep at least one core active.  I used the cpufreq interface to manually set the core frequencies on each chip to each of the allowed values (1200, 1400, 1600, 1800, 2000, 2200, 2400, 2700, and 2701 (i.e., 3100) MHz) and used the fixed-function counter in the uncore UBOX (MSR 0xC09) to verify that the uncore frequency tracked the frequency of the "spinning" core in all cases.
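For reference, the mechanics look roughly like this (a sketch, assuming the userspace cpufreq governor and the msr-tools rdmsr/wrmsr utilities are available; "cpu1"/"-p 1" here just stand for a core on the chip being adjusted):

       echo userspace > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
       echo 1200000 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_setspeed      # 1.2 GHz, value in kHz
       wrmsr -p 1 0xC08 $(( $(rdmsr -p 1 -d 0xC08) | (1 << 22) ))                # enable the UBOX fixed-function clock counter (bit 22)
       rdmsr -p 1 -d 0xC09; sleep 10; rdmsr -p 1 -d 0xC09                        # uncore clocks accrued over 10 seconds give the uncore frequency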

With the hardware prefetchers disabled, I ran a standard pointer-chasing latency test with strides of 64B, 128B, and 256B for all frequency combinations.  Taking the minimum over the various strides shows a smooth increase in latency from 67.0 ns (both chips at 3.1 GHz) to 79.6 ns (the "active" chip at 3.1 GHz and the "passive" chip at 1.2 GHz) -- an increase of almost 19%.   The 79.6 ns exactly matches my previous latency measurements, which had no "spinning" process on the other chip.  The full set of results is consistent with a model that puts 24-25 cycles of the total latency in the domain of the "passive" chip's clock frequency.

Since an increase in local latency should result in a decrease of local memory bandwidth, the reverse should also hold, and indeed I observed a 4%-6% increase in single-thread read bandwidth when I added the "spinner" process to the other chip.

Of course remote bandwidth is going to be much more sensitive to the uncore frequency on the remote chip.  I measured a 23% increase in remote read bandwidth when I added the "spinner" process to the chip serving the DRAM.  My read-only bandwidth results are still much lower than those reported by the Intel MLC, but this was a single-threaded test case so I still have a lot of work to do....

Krishnaswa_V_Intel

Dr. McCalpin, I am glad to see the latency numbers are making sense for you with the 'spinner' on the other socket. Regarding single-threaded b/w, you can use the ./mlc --peak_bandwidth -m1 option. The -m option is basically a mask where each bit that is set represents the logical CPU # on which you want the threads to run. Currently, we don't run the spinner automatically for b/w tests (as we typically run b/w threads on all cores) but do so only for latency tests. So, you need to run the spinner manually if you test single-thread b/w.
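For example (the CPU numbers here are just illustrations of the mask encoding described above):

       ./mlc --peak_bandwidth -m1     # mask 1: bit 0 set, thread runs on logical CPU 0
       ./mlc --peak_bandwidth -m4     # mask 4: bit 2 set, thread runs on logical CPU 2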

EGard
Beginner

Thanks for this informative discussion.  Other than keeping a core active, are there other mechanisms to prevent uncore frequency scaling and thereby minimize latency, such as tuning the uncore ratio limits (min, max), etc.?  I am specifically interested in minimizing latency for PCI Express devices streaming data to host memory.

 

McCalpinJohn
Honored Contributor III

I don't believe that the uncore frequency can be controlled directly -- it appears to be set by the hardware to match the frequency of the fastest core.

So to keep the uncore at maximum frequency I can think of four options:

  1. Run a "spinner" code on the chip.
  2. Disable C1E state in the BIOS.  Note that this may void your warranty!  The processor datasheet says that C1E state must be enabled for the processor to remain within specifications.
  3. Disable the use of "Halt" state in the operating system.   I don't remember the details, but there is a boot option that can be given to Linux kernels to cause them to use "busy-wait" instead of "halt" for idle cores.  This should prevent any cores from going into C1 state.  Obviously it will increase the power consumption significantly (maybe 2x?) on an otherwise idle chip compared to the package C1E state and will increase the power consumption dramatically (maybe 4x?) on an otherwise idle chip compared to the package C3 and higher-numbered states.
  4. (Some details missing:)  Disable the use of the C1E substate hint in the "intel_idle.c" C state control code and disable C1E auto-promotion in the POWER_CTL MSR.
    1. Volume 1 of the Xeon E5-1600/2600/4600 datasheet says that package C1E will only be entered if all cores have executed an MWAIT with the C1E sub-hint (or are at a lower-power C state) or if all cores are in C1 state and auto-promotion is enabled in the POWER_CTL MSR (MSR 0x1FC).
    2. Volume 3 of the SW developer's guide does not describe the bit fields of MSR_POWER_CTL for Sandy Bridge, but for Nehalem there is a description of the function of bit 1, and that description matches the "auto-promotion" functionality described in the Xeon E5 datasheet.   It would be easy enough to test whether this bit has the same function on the Sandy Bridge systems.  (First you would need to make sure that the MWAIT was not using the C1E substate hint, then you would need to measure the uncore frequency on an otherwise idle chip.  I use something like
       rdmsr -p 9 -d 0xC09; sleep 10; rdmsr -p 9 -d 0xC09
      and see how many uncore clocks are accrued during the 10-second sleep.  The UBOX fixed-function cycle counter must be enabled by setting bit 22 in MSR 0xC08 before doing this test; a sketch of this test follows the list.)
    3. This will increase power consumption somewhat relative to C1E, but not nearly as much as preventing C1 entirely by spin-waiting instead of executing "MWAIT" into the C1 state.
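A minimal sketch of that test, assuming the msr-tools rdmsr/wrmsr utilities and assuming that bit 1 of MSR 0x1FC really is the auto-promotion enable on these parts ("-p 9" just picks a core on the chip being checked):

       rdmsr -p 9 0x1FC                                              # inspect POWER_CTL
       wrmsr -p 9 0x1FC $(( $(rdmsr -p 9 -d 0x1FC) & ~(1 << 1) ))    # clear bit 1 to disable C1E auto-promotion

Then re-measure the uncore clock rate with the MSR 0xC08/0xC09 counter commands above while the chip is otherwise idle.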

The requirement that C1E state be enabled to remain within specifications seems a bit strange, given that there is no requirement that cores ever be left idle.  It is probably required for Energy Star certification or something similar.

Of course what you really want is none of the above -- you want the uncore frequency to drop only when the impact on performance is actually small.  If the rate of probes from the other chip is relatively low, then increasing the latency of those probes by ~19% could easily be tolerable. Even if the rate of probes is high, increasing the latency may be tolerable -- it depends on whether the codes running on the other chip are generating enough concurrency to tolerate the added latency.   On the other hand, if there is a high rate of memory access, then the uncore should remain at a high frequency even if none of the local cores are active.  I don't see any way to force this behavior automatically, so a software approach that keeps a core active on the chip servicing the memory requests seems like the most precise approach.

EGard
Beginner

Dr. McCalpin, thank you for the detailed response.  As you've suggested, I would expect that disabling C1E should be sufficient to minimize latency due to uncore frequency scaling, so I was surprised to read in this thread that a spinner thread was still necessary.  If I need to keep a CPU spinning to minimize latency, is there even a reason to disable C1E?  Presumably the package would never sleep in this case.

In my specific use case, I have PCI Express devices writing to host memory while the CPU cores are relatively idle, and I want to ensure that these PCIe devices can stream predictably to prevent device-side overflow conditions, etc.  So, if it were possible to force a given floor for the uncore frequency, or force it to its maximum ratio, that would be desirable for my use case.

Regarding the warranty/reliability implications of disabling C1E, I've seen these references also.  I have always assumed that Intel's reliability assumptions include a certain percentage of idle time (i.e. c-state residency), though I can't confirm.

Thanks again,

Eric

Patrick_F_Intel1
Employee

Note that the spinner process can use the lowest priority so that it doesn't interfere with anything else trying to actually get work done.
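For example, something along these lines works (core 8 is just a stand-in for a core on the socket whose uncore you want to keep awake; nice and taskset are standard Linux utilities):

       nice -n 19 taskset -c 8 sh -c 'while :; do :; done' &     # lowest-priority busy loop pinned to one core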

McCalpinJohn
Honored Contributor III

I re-ran the Intel Memory Latency Checker "peak_bandwidth" test using a single thread with and without a "spinner" on the other socket.

In this case, the spinner thread made a larger difference in bandwidth than I saw with my single-thread ReadOnly test case.

                      Avg w/out spinner   Avg w/ spinner   Uplift
                          (MB/s)              (MB/s)
ALL Reads        :      12572.05            14173.15        12.7%
3:1 Reads-Writes :      16598.55            17418.80         4.9%
2:1 Reads-Writes :      18088.45            18828.35         4.1%
1:1 Reads-Writes :      20523.40            21712.05         5.8%
Stream-triad like:      10385.80            11411.90         9.9%

My ReadOnly test case appears to be more heavily optimized for single-thread performance -- it delivered ~17.5 GB/s without the "spinner" on the other socket and ~18.3 GB/s with the "spinner" on the other socket.  My extra optimizations include using large pages, splitting the read stream into two halves (to increase the number of pages that the L2 prefetcher could operate on), and adding software prefetches anywhere between 8 and 15 lines ahead of the current load for each of the two read streams.
