The MLC tool is quite practical. Can someone please explain how the idle latency and bandwidth figures are calculated and measured?
Are the idle latency and bandwidth figures the best possible (i.e., minimum and maximum, respectively) that can be attained on a platform, or are the samples averaged?
I would like to be able to obtain the same metrics not only for DRAM but also for all caches in the hierarchy. Is this feasible?
Is there a version of the MLC tool that runs natively on Intel Xeon Phi?
The latency values are reproducible using a standard pointer-chasing benchmark. My notes are at https://software.intel.com/en-us/forums/topic/517571#comment-1793620
The 67.0 ns I reported was the minimum, across a small number of trials (I usually use 5), of the average latency over 1 million loads. The latencies for the individual loads are going to vary slightly depending on TLB and DRAM behavior, but the tests are designed so that the majority of the loads are "optimal". For example, the first load into each 4 KiB page may show a slightly higher latency since the DRAM row has to be Activated first, but the remaining loads in the page won't experience this overhead, so it will be averaged almost to invisibility.
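To make the measurement concrete, here is a minimal sketch of the pointer-chasing idea and of the "minimum of the per-trial averages" statistic. It is written in Python only for readability: interpreter overhead dominates the timing, so the numbers it produces are not hardware latencies, and a real test would use the same structure in C or assembly. The function names and the fixed-stride layout (chosen so that consecutive loads stay within a page) are illustrative assumptions, not the exact code behind the figure above.

```python
import time


def build_chain(n, stride):
    """Return a list where slot i holds the index of the next slot to load.

    Assumption for illustration: a fixed stride, so consecutive loads stay
    close together. The chain visits every slot only if gcd(n, stride) == 1.
    """
    return [(i + stride) % n for i in range(n)]


def chase(chain, loads):
    """Perform `loads` dependent loads: each index comes from the previous load."""
    idx = 0
    for _ in range(loads):
        idx = chain[idx]
    return idx


def min_avg_latency_ns(chain, loads, trials=5):
    """Time several trials and return the smallest average time per load (ns)."""
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter_ns()
        chase(chain, loads)
        elapsed = time.perf_counter_ns() - start
        best = min(best, elapsed / loads)
    return best


if __name__ == "__main__":
    chain = build_chain(1 << 20, 1)
    print(f"min average per load: {min_avg_latency_ns(chain, 1_000_000):.1f} ns")
```

The key property is the data dependence: the address of each load is the value returned by the previous one, so the loads cannot overlap and the average time per load approximates the load-to-use latency.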
For the bandwidth tests, it is not possible to say that these are the maximum possible values, because there is no way to prove that the code used for each test is "optimal" for the platform. I suspect that the methodology is similar to STREAM -- run the test several times and report the best result across that ensemble. On a properly configured and idle system these values are typically quite stable -- most values are very similar, with a few slower results due to random OS interference. For example, on my Xeon E5-2680 systems (Sandy Bridge EP, 2-socket, 8 cores/socket, HT disabled, 3.1 GHz max all-core Turbo frequency sustained), even when using all 16 cores, the average time for 10 trials of each of the STREAM benchmark kernels was almost never more than 0.5% slower than the best time for that kernel.
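The STREAM-style reporting described here can be sketched as follows. This is a hedged illustration of "report the best of several trials", not MLC's actual method, which I have not seen. The function name and dictionary keys are hypothetical; skipping the first trial and counting MB as 10^6 bytes follow what the STREAM reference code does.

```python
def summarize_trials(times_s, bytes_moved, skip_first=True):
    """Summarize repeated timings of one bandwidth kernel.

    times_s     -- elapsed seconds for each trial
    bytes_moved -- bytes read plus bytes written per trial
    skip_first  -- drop the first (warm-up) trial, as STREAM does
    """
    if skip_first and len(times_s) > 1:
        times_s = times_s[1:]
    best = min(times_s)
    mean = sum(times_s) / len(times_s)
    return {
        # Best-of-N bandwidth in MB/s, with 1 MB = 10**6 bytes.
        "best_MBps": bytes_moved / best / 1e6,
        # How much slower the average trial was than the best one, in percent.
        "mean_slowdown_pct": (mean / best - 1.0) * 100.0,
    }


if __name__ == "__main__":
    # Made-up timings, for illustration only.
    print(summarize_trials([0.0120, 0.0100, 0.0101, 0.0100], 768_000_000))
```

A small `mean_slowdown_pct` (such as the 0.5% bound mentioned above) indicates that the best-of-N value is representative rather than a lucky outlier.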
On my Xeon E5-2680 systems, I have been able to get slightly higher results with the STREAM benchmark than MLC reports for the "Stream [sic] Triad-like" kernel. MLC reports values around 75872 MB/s, while I have been able to obtain values over 78270 MB/s (3.16% higher) by varying the array size, array offset, compilation options, etc. So the MLC result is a very good one, but not necessarily the best possible result. Without knowing the details, I would guess that the same applies to the other bandwidth kernels -- a broader systematic search of the implementation parameter space might give another 1% to 3% for some of those tests.
Knowing the maximum sustainable values for other levels of the memory hierarchy is certainly interesting as well, but unfortunately the "optimum" coding is often different for different levels of the cache hierarchy. I am still trying to understand the results I have obtained, but for STREAM Triad, I have measured up to 42 Bytes/cycle for L1-contained data (7/8 of the peak of 48 Bytes/cycle on the Sandy Bridge core), up to 14 Bytes/cycle for L2-contained data (quite a bit lower than the 32 Bytes/cycle peak), up to 8 Bytes/cycle for L3-contained data with a single core (also quite a bit lower than the 32 Bytes/cycle peak), and up to 56 Bytes/cycle for L3-contained data using all 8 cores (a 7x speedup over one core).
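For anyone who wants to reproduce this kind of number, the conversion from a Triad timing to Bytes/cycle, and the fractions of peak quoted above, can be sketched as below. The helper name is hypothetical, and the 24 bytes per element assumes 8-byte doubles with STREAM's usual counting for Triad (two loads plus one store per element).

```python
# STREAM's counting convention for Triad with 8-byte doubles:
# a[i] = b[i] + q * c[i]  ->  two loads and one store per element.
TRIAD_BYTES_PER_ELEMENT = 3 * 8


def bytes_per_cycle(elements, elapsed_s, core_hz):
    """Convert one Triad pass over `elements` array elements to Bytes/cycle."""
    return elements * TRIAD_BYTES_PER_ELEMENT / (elapsed_s * core_hz)


# (measured, peak) in Bytes/cycle for one core, taken from the figures above.
single_core = {"L1": (42, 48), "L2": (14, 32), "L3": (8, 32)}
fraction_of_peak = {lvl: m / p for lvl, (m, p) in single_core.items()}

# L3-contained Triad: 56 Bytes/cycle on 8 cores versus 8 Bytes/cycle on one.
l3_speedup = 56 / 8

if __name__ == "__main__":
    for level, frac in fraction_of_peak.items():
        print(f"{level}: {frac:.1%} of peak")
    print(f"L3 speedup with 8 cores: {l3_speedup:.0f}x")
```

Note that `core_hz` must be the frequency the core actually ran at during the test; with Turbo enabled this can differ from the nominal frequency.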