Hi,
We have a question about the Intel Memory Latency Checker.
We tested with the system configuration below and found that the bandwidth in the loaded-latency results drops very sharply between injection delays of 50 and 100.
Is there a reason for this?
What is a reasonable value for this test?
1. System Config:
CPU: Intel(R) Xeon(R) Silver 4110 Processor (11M Cache, 2.10 GHz)
Memory: Samsung RDIMM 32GB DDR4-2666 M393A4K40CB2-CTD
Speed: 2666 MHz
Size: 32GB x 12 pcs
2. System Status/Issue Detail:
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency  Bandwidth
Delay   (ns)     MB/sec
==========================
00000 130.31 135655.9
00002 130.34 135653.8
00008 130.37 135599.5
00015 129.67 135267.3
00050 123.71 131282.0
00100 98.94 68357.8   <--- pass criterion: "over 81XXX"
00200 94.56 42695.4
00300 92.29 29473.3
00400 90.85 22580.9
00500 90.16 18355.0
00700 88.79 13463.5
01000 87.81 9723.8
01300 87.24 7680.3
01700 86.54 6068.3
02500 85.56 4383.3
03500 84.99 3354.4
05000 84.20 2583.5
09000 83.00 1785.2
20000 81.96 1237.6
3. Freq. of Occurrence/Time to Fail/Failure Rate:
1) 100%
4. Steps to Reproduce:
1) Boot into OS
2) Run MLC test
5. Tool:
1) Intel(R) Memory Latency Checker v3.5
Thanks and best regards.
I am not sure that you understand the purpose of this test....
This test is intended to give an estimate of the latency under load for different "background" memory bandwidth "load" levels. The specific values of bandwidth for each level of "injection delay" are going to be impossible to predict -- what you are looking for in the results is how much the average load latency increases as the DRAM bandwidth utilization increases. In your case, under the heaviest load (zero injection delay), the latency only increases from the minimum value of 82 ns to 130 ns. This is a very small increase in latency under maximum load. For the Xeon Gold 6142 I see the latency increase from 80 ns to 175 ns (at max load), and for the Xeon Platinum 8160 I see the latency increase from 73 ns to 234 ns (at max load).
Your latency increase under load is relatively small because you don't have enough cores to push the DRAM bandwidth to saturation -- the highest bandwidth value in your results is just under 60% of peak BW, while the Xeon Gold 6142 (16 core) and Xeon Platinum 8160 (24 core) deliver maximum bandwidths of about 87% of peak.
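Here is a back-of-the-envelope check of that "just under 60%" figure. It is a sketch under assumptions not stated in the thread: a 2-socket system with one DIMM per channel, and memory running at 2400 MT/s (the Xeon Silver 4110's maximum supported memory speed, even with DDR4-2666 DIMMs installed):

```python
# Sketch: peak DRAM bandwidth vs. the best measured value in the table above.
# Assumptions (not from the thread): 2 sockets x 6 DDR4 channels = 12 channels,
# memory clocked at 2400 MT/s (the 4110's max supported speed).

CHANNELS = 2 * 6            # two sockets, six channels each
TRANSFER_RATE_MT_S = 2400   # Xeon Silver 4110 max supported memory speed
BYTES_PER_TRANSFER = 8      # 64-bit DDR4 channel

peak_gb_s = CHANNELS * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1e3  # ~230.4 GB/s
measured_gb_s = 135.656     # best bandwidth above, converted from MB/s

print(f"peak bandwidth   : {peak_gb_s:.1f} GB/s")
print(f"measured fraction: {measured_gb_s / peak_gb_s:.1%}")          # ~58.9%
```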
If you compute "latency (ns) * bandwidth (GB/s)", you get the average concurrency (in Bytes). Divide this by the number of cores and by 64 Bytes per cache line to get the number of outstanding cache misses per core. Your results show a maximum value of about 17, which is the same value that I see on the Xeon Platinum 8160. The Xeon Gold 6142 is only slightly different, with a maximum value of 19 outstanding cache lines per core. These numbers are consistent with other measurements of the maximum number of outstanding reads that a core can generate with the help of hardware prefetch. The Xeon Gold 6142 has enough cores (16 per chip) to reach asymptotic bandwidth (~220 GB/s) with injection delays of 50ns or less, and the average latency is then determined by the number of cacheline transfers outstanding. The Xeon Platinum 8160 (24 cores) has the same asymptotic bandwidth (~223 GB/s), but because it has more cores, it can reach this bandwidth level with injection delays of up to 100ns. Because there are more cores, more concurrent cacheline transfers can be in flight, so the average latency required for each transfer is proportionately larger.
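As a minimal sketch of that arithmetic (assuming the system under test has 2 sockets x 8 cores = 16 cores, which is not stated in the original post):

```python
# Little's Law applied to the first row of the original poster's table:
# concurrency (Bytes) = latency (seconds) * bandwidth (Bytes/second).

LATENCY_NS = 130.31         # loaded latency at zero injection delay
BANDWIDTH_MB_S = 135655.9   # matching bandwidth
CORES = 16                  # assumed: 2 sockets x 8 cores (Xeon Silver 4110)
CACHE_LINE_BYTES = 64

concurrency_bytes = (LATENCY_NS * 1e-9) * (BANDWIDTH_MB_S * 1e6)
lines_in_flight = concurrency_bytes / CACHE_LINE_BYTES
print(f"~{lines_in_flight / CORES:.1f} outstanding cache lines per core")  # ~17.3
```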
I have performed this analysis for all three processors, and they all show stable values of concurrency per core for injection delays that are less than the idle memory latency. This is the expected result. For injection delays slightly larger than the idle memory latency, the Xeon Platinum 8160 shows a negligible (<5%) drop in effective concurrency per core, while the Xeon Gold 6142 shows a much larger drop (~40%), and the Xeon Silver 4110 shows an even larger drop (~60%). The reasons for the decrease in effective concurrency are likely to be complex, and it may not be possible to understand the details from publicly available information. There could be differences due to the L2 HW prefetcher's dynamic response to load, due to energy/performance bias settings, due to an increase in branch misprediction rate as the injection delay is increased, or due to memory controller dynamic behavior under load (including limits on open-page duration).
The important results from this test on the Xeon Silver 4110 are: (1) there are not enough cores to drive the read bandwidth to more than about 59% of peak, and (2) the maximum average memory latency under maximum (read-only) memory load is only ~1.6 times the latency under very low loads.
Hello John,
I have one question about this particular command (loaded_latency).
My CPU is an Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz and I am using 32 cores (0-31).
I also ran "modprobe msr" first.
./mlc --loaded_latency -W3
Intel(R) Memory Latency Checker - v3.8
Command line parameters: --loaded_latency -W3
Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Inject  Latency  Bandwidth
Delay   (ns)     MB/sec
==========================
00000 4072.48 13492.0
00002 4036.75 13579.2
00008 4022.40 13577.8
00015 4016.09 13578.3
00050 3988.20 13592.5
00100 3985.03 13591.3
00200 3990.11 13594.1
00300 3986.65 13593.0
00400 3984.45 13594.2
00500 3989.70 13592.3
00700 3987.57 13595.0
01000 3985.83 13595.5
01300 3984.78 13596.2
01700 3981.87 13597.6
02500 3982.67 13597.7
03500 3962.43 13600.7
05000 3913.11 13606.8
09000 206.88 10955.7
20000 158.64 5213.1
Here I see the trend of decreasing latency with increasing memory bandwidth (not comparing injection delays 9000 & 20000).
Is this expected behavior?
Those latency numbers are very high and the corresponding bandwidth numbers are very low -- 1/10th of the original poster's values and about 13x smaller than I would expect for a 2-socket Xeon Gold 5220 system. I would assume that either the system was very busy while the test was run, or somehow the benchmark was forced to run on a single core.
The values given in the original post in this thread are in the expected range. For small injection delay I would expect your Xeon Gold 5220 bandwidth and latency to be higher than in the original post because your processor has a lot more cores.
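As a rough sanity check of that "13x" estimate (a sketch; the 70% efficiency figure is an assumption for mixed read/write traffic, not a measured value):

```python
# Sketch: expected vs. measured bandwidth for a 2-socket Xeon Gold 5220.
# Assumptions (mine, not from the thread): 2 sockets x 6 channels of
# DDR4-2666, and roughly 70% of peak achievable under the read/write
# mix that "-W3" requests.

CHANNELS = 2 * 6
TRANSFER_RATE_MT_S = 2666
BYTES_PER_TRANSFER = 8
ACHIEVABLE_FRACTION = 0.70  # hypothetical efficiency for mixed traffic

peak_gb_s = CHANNELS * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1e3  # ~256 GB/s
expected_gb_s = ACHIEVABLE_FRACTION * peak_gb_s                       # ~179 GB/s
measured_gb_s = 13.6                                                  # from the table above

print(f"expected / measured ~ {expected_gb_s / measured_gb_s:.0f}x")  # ~13x
```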
In general, the results of this "loaded memory latency" benchmark can be divided into three ranges:
- Near idle:
- For very high "inject delay", the memory is almost idle and the latency remains near the idle value.
- This should be in the 80 ns range for most Xeon Scalable processors (close to 70 ns for the faster models).
- Partially Loaded:
- As the injection delay is *decreased*, memory traffic (observed bandwidth) *increases*.
- As the memory traffic increases, queueing delays will be introduced in the memory subsystem, so the observed latency will increase steadily (a toy illustration follows this list).
- The nature of the queueing delays will depend on the processor implementation. I would expect larger queueing delays if the background workload includes both reads and writes (as the "-W3" specifies).
- Fully Loaded:
- If the processor has enough cores to generate enough cache misses to fully tolerate the memory latency, then you will reach some low value of injection delay for which the system delivers approximately asymptotic bandwidth.
- Lower values of injection delay will not increase the bandwidth (because it is already at its practical maximum), but they will increase latency -- more cache misses being serviced at the same bandwidth requires more time (i.e., more latency).
- The latency should be monotonically *increasing* as the injection delay is *decreased*, while the bandwidth stays approximately constant. Variations in bandwidth in this regime may or may not be monotonic, but they should be too small to be of practical importance.
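To make the "Partially Loaded" behavior concrete, here is a toy queueing sketch. It is not a model of any real memory controller (as noted above, the queueing behavior is implementation-dependent); it only illustrates why latency climbs steadily as bandwidth approaches saturation:

```python
# Toy illustration of the "Partially Loaded" regime using a simple
# M/M/1 queueing model: mean response time grows as 1 / (1 - utilization).
# The idle-latency constant is illustrative, not measured.

IDLE_LATENCY_NS = 80.0  # typical near-idle latency for Xeon Scalable parts

for utilization in (0.1, 0.3, 0.5, 0.7, 0.8, 0.9):
    loaded_latency_ns = IDLE_LATENCY_NS / (1.0 - utilization)
    print(f"utilization {utilization:4.0%} -> ~{loaded_latency_ns:6.1f} ns")
```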
