I have a dual socket Dell Server with 3.2 GHz E5-2667 v4's. Hyperthreading is off. There are two NUMA domains and I am using domain 0 which has even numbered cores. My compiler is gcc 4.8.5 and OS is openSUSE Leap 42.2.
Code for a single-producer, single-consumer queue is attached in main.cpp. The end-to-end latency of this queue, as printed by the program, is over 300 TSC cycles, which comes to around 100 ns. On the other hand, the Intel Memory Latency Checker ("mlc -e") gives latencies of less than 10 ns, presumably for similar operations. How does one reconcile this factor-of-10 difference in latency measurements?
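For reference, the cycle-to-time conversion above can be sketched as a small helper. This is only an illustration: the function name `tsc_cycles_to_ns` is hypothetical, and it assumes the TSC ticks at a constant rate equal to the 3.2 GHz nominal frequency mentioned in the post, which may not hold on every system.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert a TSC cycle count to nanoseconds.
// Assumes the TSC ticks at a constant rate of tsc_ghz (cycles per ns),
// taken here to be the nominal core frequency quoted in the post.
inline double tsc_cycles_to_ns(std::uint64_t cycles, double tsc_ghz) {
    assert(tsc_ghz > 0.0);
    return static_cast<double>(cycles) / tsc_ghz;
}
```

With these assumptions, `tsc_cycles_to_ns(300, 3.2)` gives a little under 94 ns, consistent with the "around 100 ns" figure.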
Would appreciate any help on this matter.
Prasoon
As clearly explained in the documentation for the Intel Memory Latency Checker, the "-e" flag instructs the code to leave the hardware prefetchers enabled.
The documentation includes a section "Measuring idle latency with Random Access" that provides examples of how to run latency tests when you don't have permission to disable the HW prefetchers.
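To illustrate why random access matters when the hardware prefetchers stay enabled, here is a minimal sketch of a random pointer chase. It is not how the Intel Memory Latency Checker is implemented; the names `make_random_cycle` and `chase_ns_per_load` are hypothetical. The idea is that each load's address depends on the previous load, and the visiting order is a single random cycle, so a stride-based prefetcher has no regular pattern to follow.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Build one random cycle over n slots (Sattolo's algorithm).
// next[i] is the slot to visit after slot i; following next[] from any
// slot visits every slot exactly once before returning to the start.
std::vector<std::size_t> make_random_cycle(std::size_t n, unsigned seed) {
    assert(n >= 2);
    std::vector<std::size_t> next(n);
    for (std::size_t i = 0; i < n; ++i) next[i] = i;
    std::mt19937_64 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        // Pick j strictly below i; this is what guarantees a single cycle.
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Follow the chain for `steps` dependent loads, starting at slot 0, and
// return the average time per load in nanoseconds. The final index is
// written to *sink so the compiler cannot discard the loop.
double chase_ns_per_load(const std::vector<std::size_t>& next,
                         std::size_t steps, std::size_t* sink) {
    assert(steps > 0);
    std::size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();
    *sink = idx;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(steps);
}
```

To approximate memory latency rather than cache latency, the array would need to be much larger than the last-level cache, and a more careful version would place each node on its own cache line and account for TLB misses. Treat the result as a rough indication only.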
Thanks for the pointer, John. "./mlc -e --idle_latency -c4 -i2 -r" reports 78 ns latency. This seems to close the issue unless you have some more insights.
Prasoon
78 ns is not an unreasonable number for the Xeon E5-2667 v4, for two reasons:
- The frequency is high -- single-core Turbo is 3.6 GHz
- It is also very likely to be a "single-ring" configuration -- i.e., just the left-hand side of the block diagram in Figure 1-2 of the "Intel Xeon Processor E5 and E7 v4 Product Families Uncore Performance Monitoring Reference Manual" (Intel document 334291). This means fewer "hops" to get from the core to the L3 slice that owns the address, then to the (single) Home Agent, then back to the L3 and the core.
Measuring "latency" is extraordinarily complex in current systems -- mostly because you need to understand lots of poorly documented microarchitectural details to be able to define "latency" clearly enough to create a methodology. You need to understand the snooping option selected in the BIOS, the uncore frequency controls, the core frequency controls, the hardware prefetcher mechanisms and controls (where available), and the mapping of physical addresses to the DRAM channels, DRAM ranks (if applicable), and DRAM banks.
If you get the opportunity to run this as the root user, you should try both the original version and the randomized version without the "-e" flag.
