topic Re: Memory Latency Measurement Result in Software Tuning, Performance Optimization & Platform Monitoring

Memory Latency Measurement Result

Tianchen_Jerry_Wang — Sat, 20 Jan 2024 16:07:39 GMT

Hi Guys:

I was playing around with the Intel Memory Latency checker and later wanted to write my own version of the memory latency measurement program.

I know that we usually use pointer-chasing for memory latency measurement, but I want to try a simpler strategy of "flush cacheline--> record time --> mem read addr A --> finish record time".

I repeat the loop many times. From the results, I found three categories of latency:

80-100ns, 98% of the results
~150-300ns, 2% of the result
>> 1us, <0.1% of the result.

80-100ns seems a reasonable result for memory latency. The >>1us ones should mostly be caused by interrupts/page misses, etc.

What bothers me is those from 150-300us. They seem to happen periodically. Weakly aligned to cacheline size. The latency is too big for the DRAM close/open page policy difference, too small for the DRAM refresh interval, also too small for any interrupts.

I was suspecting that the "latency recording" would generate memory writing that interference with the DRAM latency". However, even after I remove this portion, from the "high_latency_ch0" stat it still shows ~2% of 150-300ns range.

Different machines behave slightly differently.

Here is my core function for measuring:

std::cout << "mem_latency experiment start" << std::endl; for (uint64_t i = 0; i < sample_count; i++){ clflushopt((void*)addr); mfence(); asm volatile ( "CPUID\n\t"/*serialize*/ "RDTSCP\n\t"/*read the clock*/ "mov %%edx, %0\n\t" "mov %%eax, %1\n\t" : "=r" (cycles_high), "=r"(cycles_low) :: "%rax", "%rbx", "%rcx", "%rdx"); *(volatile uint32_t*)addr; asm volatile ( "RDTSCP\n\t"/*read the clock*/ "mov %%edx, %0\n\t" "mov %%eax, %1\n\t" "CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: "%rax", "%rbx", "%rcx", "%rdx"); clflushopt((void*)addr); mfence(); start1 = ( ((uint64_t)cycles_high << 32) | cycles_low ); end1 = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 ); int32_t cycle_ch0 = static_cast<int32_t>((end1 - start1) - rdtsc_self_delay); sample_array_ch0[i] = cycle_ch0; //Comment out to see if latency recording cause DRAM latency interference high_latency_ch0 += (cycle_ch0 > (100 * cpu_ghz)); // addr = ori_addr + (i*4) % 4096; }

What is more interesting is, that sometimes, you could see some sort of alignment or pattern going on in the result. [Full csv in attachment]

I have tried to disable the Data-Dependent Prefetcher but it does not seem to be the reason. I also disabled the DCP and L2 Prefetcher in the BIOS, but it also does not seem to be related. [Well I am not sure if the prefetcher in BIOS is useful....]

Here is my CPU spec:

Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 46 bits physical, 48 bits virtual CPU(s): 14 On-line CPU(s) list: 0-13 Thread(s) per core: 1 Core(s) per socket: 14 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 79 Model name: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz Stepping: 1 CPU MHz: 1200.178 CPU max MHz: 3200.0000 CPU min MHz: 1200.0000 BogoMIPS: 3999.97 L1d cache: 448 KiB L1i cache: 448 KiB L2 cache: 3.5 MiB L3 cache: 35 MiB NUMA node0 CPU(s): 0-13 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported Vulnerability L1tf: Mitigation; PTE Inversion Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled Vulnerability Meltdown: Mitigation; PTI Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe s yscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni p clmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_d eadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ ppin ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d

Running out of methods already.... I was running it on Debian. I have only this user-level program running. Is it possible the backend kernel threads cause these....?

Thank you so much

Jerry

Re: Memory Latency Measurement Result

McCalpinJohn — Tue, 23 Jan 2024 01:21:09 GMT

Thanks for the interesting data! This was fun to think about.....

First I made a histogram of all of the results, showing a strong peak at 80 ns

a mean of 85.28 ns, and a standard deviation of 30.8 ns (including all data).

If I exclude all values greater than 121 ns the average drops to 80.14 ns and the standard

deviation drops to 4.2 ns.

Re-plotting the histogram on a log scale (for the counts), shows an approximately

constant pattern of counts averaging just under 1000 (out of 5 million total counts)

for each latency from about 120 ns to 350 ns. Between 350 and 390 ns the counts are

much lower (about 10), then increase again to about 250 between 380 and 400 ns. Counts

for latencies over 402 ns are all under 10.

The next thing I would want to look at is intervals between anomalously high

values. I can't do this from the data provided because of the code outside the

RDTSCP intervals (mostly CPUID).

In my experience using CPUID as a serialization operation is both excessively

disruptive to the processor pipeline and is not necessary for these sorts of

average timing benchmarks. RDTSCP is partially ordered, and in a long-running

test like this the OOO capability of the processor is not going to have a

significant impact on the latency measurements. The RDTSCP instruction does

take a variable number of cycles to complete, varying with both the Core

frequency and the cycle offset. If you are going to subtract off "rdtsc_self_delay"

I recommend that you run all the experiments with the both the core clock and the

uncore clock fixed to match the TSC clock. (Uncore clock is controlled by MSR 0x620

as described in Volume 4 of the SWDM.) Even fixing all these items may not be enough,

so you should check the RDTSCP overhead on your system to get a better feel for the

uncertainly introduced by this relatively long-running instruction. (On my only

Broadwell-EP system, a Xeon E5-2680 v4, I see RDTSCP overhead varying from 14 to 46

TSC cycles (with a 2.4 GHz TSC, the core running at 3.1 GHz and the uncore at 2.7 GHz).

Re: Memory Latency Measurement Result

Tianchen_Jerry_Wang — Thu, 25 Jan 2024 19:37:28 GMT

Hi McCalpinJohn:

Thank you so much for replying xD. I was worried that this weird situation might receive no reply at all xD. I will think about what I could change about the experiment according to your suggestions, and update you later. Thank you so much!

Best,

Jerry

Re: Memory Latency Measurement Result

Tianchen_Jerry_Wang — Fri, 26 Jan 2024 23:13:27 GMT

Hi McCalpinJohn:

I tried several things:

I fixed the core frequency to 2GHz, which is the TSC frequency on my machine.
I also fixed the uncore frequency by setting MSR 0x620 0x1b1b, which is 2.7GHz... I tried setting it to 0xa0a, which is 2GHz, The latency overall raises to an average of 106ns, but the pattern still exists.
I remove the CPUID instruction, and remove rdtsc_self_latency... because they should not cause a varience of more than 50ns...
Hmm I also draw a Histogram, but it seems we have way more than 1000 within 120ns to 350ns (x-Asix unit: "ns").
I also gather the interval between every consecutive high latency sample point and plot them in the Histogram (x-Asix unit: "sample points").

Thanks

Jerry

Re: Memory Latency Measurement Result

Tianchen_Jerry_Wang — Fri, 26 Jan 2024 23:46:33 GMT

Hi McCalpinJohn:

I just measure the time interval between every consecutive high latency measure, and the average is about 7us. This reminds me of the DRAM refresh interval. I believe DRAM issues refresh instructions roughly every 7us. This explains!

Every time DRAM needs to refresh certain rows, it will delay the memory op to the current bank.

Thank you

Jerry

Re: Memory Latency Measurement Result

McCalpinJohn — Sat, 27 Jan 2024 00:25:56 GMT

Leave it to me to make things harder than they need to be....

I did not notice that you were measuring latency for repeatedly loading the same address, so I was assuming the normal style of strided pointer-chasing and was looking for all sorts of complicated mechanisms that would explain variations in latency in that case.

In your case I think it is very simple -- you are seeing varying penalties due to random collisions with the memory refresh.

Both the magnitude of the added latency and the frequency of occurrence seem consistent with DDR4 refresh mechanisms. The details depend on the size of the DRAM die and the specific mode bit settings in the memory controller (e.g., Fine Granularity Refresh), but in rough terms:

The minimum delay between a REFresh command and the next ACTivate command for 8 Gbit die is 350 ns (from a Micron DDR4 device data sheet).
- The maximum delay to a load can be larger than this on both ends:
  - On the front end, the load can hit the memory controller before the REFresh cycle starts, but (due to previous "pushbacks" of the REFresh, the next REFresh cannot be delayed).
  - On the back end, after the REFRESH, there is a required ACTivate to open the target row, adding T_RCD to the latency. (Before the REFresh you might also see this, but at least there is a chance for open page hits in that case.)
- Worst case should be something like 350 ns (refresh) + 14 ns (T_RCD) + 65 ns (minimum observed latency) = 429 ns, which is very close to your observed worst case (only 3 of the 5 million loads showed higher latency).
If we assume no correlation between the Refresh timings and the load timings, you would expect to see a uniform distribution of added latency in the range of zero to the maximum overhead. This is what I see in the distribution of the original data in the range of 120 to 350 ns (with smaller probabilities for larger latency values).
At normal temperature the DRAM must be refreshed with an average interval of 7.8 microseconds, so a 350 ns delay corresponds to about 4.5% of the time. The original data showed 4.3% of the values at 100 ns or higher.

Other numbers apply to different DRAM die sizes, but I think this pretty clearly shows that something like this behavior is expected for standard DRAM refresh operations.