Hi,
We have a question about the Intel Memory Latency Checker.
We tested with the system configuration below and found that the bandwidth in the loaded-latency results drops very sharply between injection delays of 50 and 100.
Is there a reason for this?
What is a reasonable value for this test?
1. System Config:
CPU: Intel(R) Xeon(R) Silver 4110 Processor (11M Cache, 2.10 GHz)
Memory: Samsung RDIMM 32GB DDR4-2666 M393A4K40CB2-CTD
Speed: 2666 MHz
Size: 32GB x 12 pcs
2. System Status/Issue Detail:
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency  Bandwidth
Delay   (ns)     MB/sec
==========================
00000 130.31 135655.9
00002 130.34 135653.8
00008 130.37 135599.5
00015 129.67 135267.3
00050 123.71 131282.0
00100 98.94 68357.8   <--- pass criterion: "over 81XXX"
00200 94.56 42695.4
00300 92.29 29473.3
00400 90.85 22580.9
00500 90.16 18355.0
00700 88.79 13463.5
01000 87.81 9723.8
01300 87.24 7680.3
01700 86.54 6068.3
02500 85.56 4383.3
03500 84.99 3354.4
05000 84.20 2583.5
09000 83.00 1785.2
20000 81.96 1237.6
3. Freq. of Occurrence/Time to Fail/Failure Rate:
1) 100%
4. Steps to Reproduce:
1) Boot into OS
2) Run MLC test
5. Tool:
1) Intel(R) Memory Latency Checker v3.5
Thanks and best regards.
I am not sure that you understand the purpose of this test....
This test is intended to give an estimate of the latency under load for different "background" memory bandwidth "load" levels. The specific values of bandwidth for each level of "injection delay" are going to be impossible to predict -- what you are looking for in the results is how much the average load latency increases as the DRAM bandwidth utilization increases. In your case, under the heaviest load (zero injection delay), the latency only increases from the minimum value of 82 ns to 130 ns. This is a very small increase in latency under maximum load. For the Xeon Gold 6142 I see the latency increase from 80 ns to 175 ns (at max load), and for the Xeon Platinum 8160 I see the latency increase from 73 ns to 234 ns (at max load).
Your latency increase under load is relatively small because you don't have enough cores to push the DRAM bandwidth to saturation -- the highest bandwidth value in your results is just under 60% of peak BW, while the Xeon Gold 6142 (16 core) and Xeon Platinum 8160 (24 core) deliver maximum bandwidths of about 87% of peak.
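Here is a back-of-the-envelope check of that "just under 60%" figure. It is a sketch under assumptions not stated in the thread: a 2-socket system with one DIMM per channel, and memory running at 2400 MT/s (the Xeon Silver 4110's maximum supported memory speed, even with DDR4-2666 DIMMs installed):

```python
# Sketch: peak DRAM bandwidth vs. the best measured value in the table above.
# Assumptions (not from the thread): 2 sockets x 6 DDR4 channels = 12 channels,
# memory clocked at 2400 MT/s (the 4110's max supported speed).

CHANNELS = 2 * 6            # two sockets, six channels each
TRANSFER_RATE_MT_S = 2400   # Xeon Silver 4110 max supported memory speed
BYTES_PER_TRANSFER = 8      # 64-bit DDR4 channel

peak_gb_s = CHANNELS * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1e3  # ~230.4 GB/s
measured_gb_s = 135.656     # best bandwidth above, converted from MB/s

print(f"peak bandwidth   : {peak_gb_s:.1f} GB/s")
print(f"measured fraction: {measured_gb_s / peak_gb_s:.1%}")          # ~58.9%
```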
If you compute "latency (ns) * bandwidth (GB/s)", you get the average concurrency (in Bytes). Divide this by the number of cores and by 64 Bytes per cache line to get the number of outstanding cache misses per core. Your results show a maximum value of about 17, which is the same value that I see on the Xeon Platinum 8160. The Xeon Gold 6142 is only slightly different, with a maximum value of 19 outstanding cache lines per core. These numbers are consistent with other measurements of the maximum number of outstanding reads that a core can generate with the help of hardware prefetch. The Xeon Gold 6142 has enough cores (16 per chip) to reach asymptotic bandwidth (~220 GB/s) with injection delays of 50ns or less, and the average latency is then determined by the number of cacheline transfers outstanding. The Xeon Platinum 8160 (24 cores) has the same asymptotic bandwidth (~223 GB/s), but because it has more cores, it can reach this bandwidth level with injection delays of up to 100ns. Because there are more cores, more concurrent cacheline transfers can be in flight, so the average latency required for each transfer is proportionately larger.
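As a minimal sketch of that arithmetic (assuming the system under test has 2 sockets x 8 cores = 16 cores, which is not stated in the original post):

```python
# Little's Law applied to the first row of the original poster's table:
# concurrency (Bytes) = latency (seconds) * bandwidth (Bytes/second).

LATENCY_NS = 130.31         # loaded latency at zero injection delay
BANDWIDTH_MB_S = 135655.9   # matching bandwidth
CORES = 16                  # assumed: 2 sockets x 8 cores (Xeon Silver 4110)
CACHE_LINE_BYTES = 64

concurrency_bytes = (LATENCY_NS * 1e-9) * (BANDWIDTH_MB_S * 1e6)
lines_in_flight = concurrency_bytes / CACHE_LINE_BYTES
print(f"~{lines_in_flight / CORES:.1f} outstanding cache lines per core")  # ~17.3
```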
I have performed this analysis for all three processors, and they all show stable values of concurrency per core for injection delays that are less than the idle memory latency. This is the expected result. For injection delays slightly larger than the idle memory latency, the Xeon Platinum 8160 shows a negligible (<5%) drop in effective concurrency per core, while the Xeon Gold 6142 shows a much larger drop (~40%), and the Xeon Silver 4110 shows an even larger drop (~60%). The reasons for the decrease in effective concurrency are likely to be complex, and it may not be possible to understand the details from publicly available information. There could be differences due to the L2 HW prefetcher's dynamic response to load, due to energy/performance bias settings, due to an increase in branch misprediction rate as the injection delay is increased, or due to memory controller dynamic behavior under load (including limits on open-page duration).
The important results from this test on the Xeon Silver 4110 are: (1) there are not enough cores to drive the read bandwidth to more than about 59% of peak, and (2) the maximum average memory latency under maximum (read-only) memory load is only ~1.6 times the latency under very low loads.
Hello John,
I have one question about this particular command (loaded_latency).
My CPU is an Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz and I am using 32 cores (0-31).
I also ran "modprobe msr" first.
./mlc --loaded_latency -W3
Intel(R) Memory Latency Checker - v3.8
Command line parameters: --loaded_latency -W3
Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Inject  Latency  Bandwidth
Delay   (ns)     MB/sec
==========================
00000 4072.48 13492.0
00002 4036.75 13579.2
00008 4022.40 13577.8
00015 4016.09 13578.3
00050 3988.20 13592.5
00100 3985.03 13591.3
00200 3990.11 13594.1
00300 3986.65 13593.0
00400 3984.45 13594.2
00500 3989.70 13592.3
00700 3987.57 13595.0
01000 3985.83 13595.5
01300 3984.78 13596.2
01700 3981.87 13597.6
02500 3982.67 13597.7
03500 3962.43 13600.7
05000 3913.11 13606.8
09000 206.88 10955.7
20000 158.64 5213.1
Here I see the trend of decreasing latency with increasing memory bandwidth (not comparing injection delays 9000 & 20000).
Is this expected behavior?
Those latency numbers are very high and the corresponding bandwidth numbers are very low -- 1/10th of the original poster's values and about 13x smaller than I would expect for a 2-socket Xeon Gold 5220 system. I would assume that either the system was very busy while the test was run, or somehow the benchmark was forced to run on a single core.
The values given in the original post in this thread are in the expected range. For small injection delay I would expect your Xeon Gold 5220 bandwidth and latency to be higher than in the original post because your processor has a lot more cores.
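As a rough sanity check of that "13x" estimate (a sketch; the 70% efficiency figure is an assumption for mixed read/write traffic, not a measured value):

```python
# Sketch: expected vs. measured bandwidth for a 2-socket Xeon Gold 5220.
# Assumptions (mine, not from the thread): 2 sockets x 6 channels of
# DDR4-2666, and roughly 70% of peak achievable under the read/write
# mix that "-W3" requests.

CHANNELS = 2 * 6
TRANSFER_RATE_MT_S = 2666
BYTES_PER_TRANSFER = 8
ACHIEVABLE_FRACTION = 0.70  # hypothetical efficiency for mixed traffic

peak_gb_s = CHANNELS * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1e3  # ~256 GB/s
expected_gb_s = ACHIEVABLE_FRACTION * peak_gb_s                       # ~179 GB/s
measured_gb_s = 13.6                                                  # from the table above

print(f"expected / measured ~ {expected_gb_s / measured_gb_s:.0f}x")  # ~13x
```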
In general, the results of this "loaded memory latency" benchmark can be divided into three ranges:
- Near idle:
- For very high "inject delay", the memory is almost idle and the latency remains near the idle value.
- This should be in the 80 ns range for most Xeon Scalable processors (close to 70 ns for the faster models).
- Partially Loaded:
- As the injection delay is *decreased*, memory traffic (observed bandwidth) *increases*.
- As the memory traffic increases, queueing delays will be introduced in the memory subsystem, so the observed latency will increase steadily (a toy illustration follows this list).
- The nature of the queueing delays will depend on the processor implementation. I would expect larger queueing delays if the background workload includes both reads and writes (as the "-W3" specifies).
- Fully Loaded:
- If the processor has enough cores to generate enough cache misses to fully tolerate the memory latency, then you will reach some low value of injection delay for which the system delivers approximately asymptotic bandwidth.
- Lower values of injection delay will not increase the bandwidth (because it is already at its practical maximum), but they will increase latency -- more cache misses being serviced at the same bandwidth requires more time (i.e., more latency).
- The latency should be monotonically *increasing* as the injection delay is *decreased*, while the bandwidth stays approximately constant. Variations in bandwidth in this regime may or may not be monotonic, but they should be too small to be of practical importance.
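To make the "Partially Loaded" behavior concrete, here is a toy queueing sketch. It is not a model of any real memory controller (as noted above, the queueing behavior is implementation-dependent); it only illustrates why latency climbs steadily as bandwidth approaches saturation:

```python
# Toy illustration of the "Partially Loaded" regime using a simple
# M/M/1 queueing model: mean response time grows as 1 / (1 - utilization).
# The idle-latency constant is illustrative, not measured.

IDLE_LATENCY_NS = 80.0  # typical near-idle latency for Xeon Scalable parts

for utilization in (0.1, 0.3, 0.5, 0.7, 0.8, 0.9):
    loaded_latency_ns = IDLE_LATENCY_NS / (1.0 - utilization)
    print(f"utilization {utilization:4.0%} -> ~{loaded_latency_ns:6.1f} ns")
```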
