For Xeon Phi, I am trying to understand the best way to log the latency from each core to each of the MCDRAM controllers (8 EDCs). Can anyone suggest a good way to do this?
The differences are very small.
The results at http://sites.utexas.edu/jdm4372/2016/12/06/memory-latency-on-the-intel-xeon-phi-x200-knights-landing... show that the average latency for local MCDRAM access in SNC4 mode (which is limited to one pair of MCDRAM controllers) is only 3.5ns (2.3%) lower than the average latency when loading from all MCDRAM controllers.
The latency on KNL (and Skylake Xeon) varies as much for different addresses as it does for different cores, because physical addresses are hashed across the available coherence agents. It is possible to use performance counters to find a set of addresses that map to a specific CHA and to a specific EDC, then compute the dependent load latency for each core using addresses from that list. This will give a larger range of latencies than the usual approach of averaging over physical addresses that are mapped to many different CHAs.
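As a rough illustration of the dependent-load measurement described above, here is a minimal pointer-chasing sketch in C. It assumes Linux `clock_gettime()`; the buffer size and stride are illustrative placeholders, and it skips the CHA/EDC address-selection step (which requires the uncore performance counters), so it averages over whatever addresses the allocator returns. The function name `chase_latency_ns` is my own.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

/* Average latency (ns) of a chain of dependent loads over a buffer of
 * n pointers (n should be a power of two, larger than the caches).
 * Each load's address comes from the previous load, so the loads
 * cannot overlap and the time per iteration is the load latency. */
double chase_latency_ns(size_t n, size_t iters)
{
    size_t *buf = malloc(n * sizeof *buf);
    if (buf == NULL) return -1.0;

    /* Link the buffer into one long cycle with a large odd stride so
     * successive loads land in different cache lines. (A real KNL
     * experiment would instead chain only physical addresses known,
     * from the uncore counters, to map to one CHA and one EDC.) */
    const size_t stride = 4099;   /* odd, so co-prime with power-of-2 n */
    size_t idx = 0;
    for (size_t i = 0; i < n; i++) {
        size_t next = (idx + stride) % n;
        buf[idx] = next;
        idx = next;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < iters; i++)
        p = buf[p];               /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile size_t sink = p;     /* keep the chain from being elided */
    (void)sink;
    free(buf);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

To map latency per core, one would pin the measuring thread to each core in turn (e.g. with `sched_setaffinity` or `numactl`), and in flat mode place the buffer in MCDRAM (e.g. `numactl --membind`) before chasing it.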
But shouldn't the latency for MCDRAM be lower than that for DDR4, given the bandwidth it provides?
If the argument is the distance from the core to each of the EDCs, then shouldn't the MCDRAM latency be different for each EDC? That is not what you discuss in your blog.
There is no reason for MCDRAM latency to be lower than DDR4 latency. In fact, many of the features used to enable increased bandwidth also cause increased latency. Examples include the use of multiple channels (requiring longer average traversals across the chip), increased buffer sizes, etc. For MCDRAM in particular, there is the added latency of two SERDES in each direction, since the interface between the MCDRAM stacks and the KNL chip runs at a higher frequency than the embedded DRAM arrays on one side and the mesh on the other side.
In Knights Landing, memory latency is a function of the location of the core, the location of the CHA responsible for coherence for the physical address being accessed, and the location of the memory controller (EDC or MC) that owns the physical address being accessed. This is the minimum number of factors that apply. In an active system, the latency will also depend on the type of memory transaction (Read Shared, Read Exclusive, Read For Ownership), the state of the cache line in other L2 caches on the chip, contention on the address/acknowledge/data buses (which can take lots of forms, and which involves different paths through the mesh for each of the buses).
The results in my blog don't directly address the issue of latency to each memory controller because they average over all cores and all CHAs. The relatively small average latency differences between modes suggest that the cost of the "hops" is quite low in an idle system, but this is a mixture of hops from the core to the CHA, from the CHA to the memory controller, and from the memory controller back to the core. It would not be too hard to estimate the average number of hops in each mode and see if the average latency changes are consistent with a simple model of the latency including an integral number of uncore cycles per "hop".
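The hop-model estimate suggested above is simple arithmetic, sketched below. Everything here is an assumption for illustration: the 1.7 GHz uncore frequency is a nominal KNL value, the base latency and hop counts are placeholders rather than measurements, and the only number taken from the discussion above is the ~3.5 ns average difference between modes.

```c
/* Modeled latency: a fixed base plus an integral number of uncore
 * cycles per mesh hop, converted to nanoseconds. The parameters are
 * illustrative, not measured KNL values. */
double model_latency_ns(double base_ns, int hops,
                        int cycles_per_hop, double uncore_ghz)
{
    return base_ns + (double)(hops * cycles_per_hop) / uncore_ghz;
}

/* Given a measured latency difference between two modes and an assumed
 * uncore frequency, how many uncore cycles does the extra path cost? */
double implied_cycles(double delta_ns, double uncore_ghz)
{
    return delta_ns * uncore_ghz;
}
```

For example, the ~3.5 ns average difference between all-controller and SNC4-local access corresponds to `implied_cycles(3.5, 1.7)`, roughly 6 uncore cycles, which would be consistent with a small number of extra hops at one or two cycles each.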
The differences in throughput are much larger than the differences in latency, suggesting that the increased number of hops per transaction increases contention by a greater ratio than it increases latency. At one point I actually collected all the STREAM numbers for the different configurations, but apparently got distracted before putting them on the STREAM web site. Now I will need to go back and find the numbers again....