Solved: Results Inconsistent with Memory Latency Checker

Prasoon_T_ · ‎06-23-2017

I have a dual socket Dell Server with 3.2 GHz E5-2667 v4's. Hyperthreading is off. There are two NUMA domains and I am using domain 0 which has even numbered cores. My compiler is gcc 4.8.5 and OS is openSUSE Leap 42.2.

Code for a single producer single consumer queue is attached in main.cpp. The end-to-end latency of this queue, as printed by the program, is over 300 tsc cycles which comes to around 100 ns. On the other hand, the Intel Memory Latency Checker ("mlc -e") gives latencies less than 10 ns, presumably for similar operations. How does one reconcile this factor of 10 difference in latency measurements?
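For reference, the cycles-to-time conversion above can be sketched as follows. This assumes the TSC ticks at the nominal 3.2 GHz quoted for this processor; the helper name `tsc_cycles_to_ns` is hypothetical and is not taken from the attached main.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Convert a TSC cycle count to nanoseconds.
// Assumption: tsc_ghz is the rate at which the TSC ticks (taken here to be
// the nominal 3.2 GHz), so cycles / GHz gives nanoseconds directly.
inline double tsc_cycles_to_ns(std::uint64_t cycles, double tsc_ghz) {
    assert(tsc_ghz > 0.0);
    return static_cast<double>(cycles) / tsc_ghz;
}
```

With these assumptions, `tsc_cycles_to_ns(300, 3.2)` gives 93.75 ns, which matches the "around 100 ns" figure.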

Would appreciate any help on this matter.

Prasoon

McCalpinJohn · ‎06-23-2017

As clearly explained in the documentation for the Intel Memory Latency Checker, the "-e" flag instructs the code to leave the hardware prefetchers enabled.

The documentation includes a section "Measuring idle latency with Random Access" that provides examples of how to run latency tests when you don't have permission to disable the HW prefetchers.
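To illustrate why access order matters when the prefetchers stay on, here is a minimal pointer-chasing sketch. It is not how mlc is implemented; the names `Node`, `make_sequential_cycle`, `make_random_cycle`, `chase`, and `ns_per_load` are hypothetical, and the 64-byte node size assumes a 64-byte cache line. Each load depends on the previous one, so the time per step approximates load latency. The random order is one common way to keep the hardware prefetchers from guessing the next address.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// One node per 64 bytes (assumed cache-line size), so consecutive nodes
// never share a line.
struct Node {
    std::size_t next;
    char pad[64 - sizeof(std::size_t)];
};

// Sequential order: node i points to node i+1. Easy for prefetchers.
std::vector<Node> make_sequential_cycle(std::size_t n) {
    std::vector<Node> nodes(n);
    for (std::size_t i = 0; i < n; ++i) nodes[i].next = (i + 1) % n;
    return nodes;
}

// Random order: Sattolo's algorithm turns the identity mapping into a
// permutation that is a single cycle through all n nodes.
std::vector<Node> make_random_cycle(std::size_t n, unsigned seed) {
    std::vector<Node> nodes(n);
    for (std::size_t i = 0; i < n; ++i) nodes[i].next = i;
    if (n < 2) return nodes;
    std::mt19937_64 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(nodes[i].next, nodes[pick(rng)].next);
    }
    return nodes;
}

// Follow the chain for `steps` dependent loads, starting at node 0.
std::size_t chase(const std::vector<Node>& nodes, std::size_t steps) {
    std::size_t idx = 0;
    for (std::size_t s = 0; s < steps; ++s) idx = nodes[idx].next;
    return idx;
}

// Average wall-clock nanoseconds per dependent load.
double ns_per_load(const std::vector<Node>& nodes, std::size_t steps) {
    assert(steps > 0 && !nodes.empty());
    auto t0 = std::chrono::steady_clock::now();
    volatile std::size_t sink = chase(nodes, steps);  // keep the loop alive
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}
```

To use it, build both tables with far more nodes than fit in the last-level cache (for example `1 << 22` nodes, which is 256 MiB) and compare `ns_per_load` for the two. With the prefetchers enabled, the sequential table would be expected to report a much lower figure than the random one.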

Prasoon_T_ · ‎06-23-2017

Thanks for the pointer, John. "./mlc -e --idle_latency -c4 -i2 -r" reports 78 ns latency. This seems to close the issue unless you have some more insights.

Prasoon

McCalpinJohn · ‎06-26-2017

78 ns is not an unreasonable number for the Xeon E5-2667 v4 for two reasons:

The frequency is high -- single-core Turbo is 3.6 GHz
It is also very likely to be a "single-ring" configuration -- i.e., just the left hand side of the block diagram in Figure 1-2 of the "Intel Xeon Processor E5 and E7 v4 Product Families Uncore Performance Monitoring Reference Manual" (Intel document 334291). This means fewer "hops" to get from the core to the L3 slice that owns the address, then to the (single) Home Agent, then back to the L3 and core.

Measuring "latency" is extraordinarily complex in current systems -- mostly because you need to understand lots of poorly documented microarchitectural details to be able to define "latency" clearly enough to create a methodology. You need to understand the snooping option selected in the BIOS, the uncore frequency controls, the core frequency controls, the hardware prefetcher mechanisms and controls (where available), and the mapping of physical addresses to the DRAM channels, DRAM ranks (if applicable), and DRAM banks.

If you get the opportunity to run this as the root user, you should try both the original version and the randomized version without the "-e" flag.