I have a dual-socket Dell server with two 3.2 GHz Xeon E5-2667 v4 processors. Hyperthreading is off. There are two NUMA domains, and I am using domain 0, which has the even-numbered cores. My compiler is gcc 4.8.5 and the OS is openSUSE Leap 42.2.
Code for a single-producer, single-consumer queue is attached in main.cpp. The end-to-end latency of this queue, as printed by the program, is over 300 TSC cycles, which comes to around 100 ns. On the other hand, the Intel Memory Latency Checker ("mlc -e") reports latencies of less than 10 ns, presumably for similar operations. How does one reconcile this factor-of-10 difference in latency measurements?
Would appreciate any help on this matter.
Prasoon
As clearly explained in the documentation for the Intel Memory Latency Checker, the "-e" flag instructs the code to leave the hardware prefetchers enabled.
The documentation includes a section "Measuring idle latency with Random Access" that provides examples of how to run latency tests when you don't have permission to disable the HW prefetchers.
Thanks for the pointer, John. "./mlc -e --idle_latency -c4 -i2 -r" reports 78 ns latency. This seems to close the issue unless you have some more insights.
Prasoon
78 ns is not an unreasonable number for the Xeon E5-2667 v4, for two reasons:
- The frequency is high -- single-core Turbo is 3.6 GHz
- It is also very likely to be a "single-ring" configuration -- i.e., just the left-hand side of the block diagram in Figure 1-2 of the "Intel Xeon Processor E5 and E7 v4 Product Families Uncore Performance Monitoring Reference Manual" (Intel document 334291). This means fewer "hops" to get from the core to the L3 slice that owns the address, then to the (single) Home Agent, then back to the L3 and core.
Measuring "latency" is extraordinarily complex in current systems -- mostly because you need to understand lots of poorly documented microarchitectural details to be able to define "latency" clearly enough to create a methodology. You need to understand the snooping option selected in the BIOS, the uncore frequency controls, the core frequency controls, the hardware prefetcher mechanisms and controls (where available), and the mapping of physical addresses to the DRAM channels, DRAM ranks (if applicable), and DRAM banks.
If you get the opportunity to run this as the root user, you should try both the original version and the randomized version without the "-e" flag.