<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Memory Latency Measurement Result in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1566519#M8300</link>
    <description>&lt;P&gt;Leave it to me to make things harder than they need to be....&lt;/P&gt;&lt;P&gt;I did not notice that you were measuring latency for repeatedly loading the same address, so I was assuming the normal style of strided pointer-chasing and was looking for all sorts of complicated mechanisms that would explain variations in latency in that case.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In your case I think it is very simple -- you are seeing varying penalties due to random collisions with the memory refresh.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Both the magnitude of the added latency and the frequency of occurrence seem consistent with DDR4 refresh mechanisms. &amp;nbsp;The details depend on the size of the DRAM die and the specific mode bit settings in the memory controller (e.g., Fine Granularity Refresh), but in rough terms:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The minimum delay between a REFresh command and the next ACTivate command for an 8 Gbit die is 350 ns (from a Micron DDR4 device data sheet).&amp;nbsp;&lt;UL&gt;&lt;LI&gt;The maximum delay to a load can be larger than this on both ends:&lt;UL&gt;&lt;LI&gt;On the front end, the load can hit the memory controller before the REFresh cycle starts, but (due to previous "pushbacks" of the REFresh) the next REFresh cannot be delayed.&lt;/LI&gt;&lt;LI&gt;On the back end, after the REFresh, there is a required ACTivate to open the target row, adding T_RCD to the latency. 
&amp;nbsp;(Before the REFresh you might also see this, but at least there is a chance for open page hits in that case.)&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;Worst case should be something like 350 ns (refresh) + 14 ns (T_RCD) + 65 ns (minimum observed latency) = 429 ns, which is very close to your observed worst case (only 3 of the 5 million loads showed higher latency).&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;If we assume no correlation between the Refresh timings and the load timings, you would expect to see a uniform distribution of added latency in the range of zero to the maximum overhead. &amp;nbsp;This is what I see in the distribution of the original data in the range of 120 to 350 ns (with smaller probabilities for larger latency values).&lt;/LI&gt;&lt;LI&gt;At normal temperature the DRAM must be refreshed with an average interval of 7.8 microseconds, so a 350 ns delay corresponds to about 4.5% of the time. &amp;nbsp;The original data showed 4.3% of the values at 100 ns or higher.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Other numbers apply to different DRAM die sizes, but I think this pretty clearly shows that something like this behavior is expected for standard DRAM refresh operations.&lt;/P&gt;</description>
    <pubDate>Sat, 27 Jan 2024 00:25:56 GMT</pubDate>
    <dc:creator>McCalpinJohn</dc:creator>
    <dc:date>2024-01-27T00:25:56Z</dc:date>
    <item>
      <title>Memory Latency Measurement Result</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1564588#M8290</link>
      <description>&lt;P&gt;Hi Guys:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I was playing around with the Intel Memory Latency Checker and later wanted to write my own version of the memory latency measurement program.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I know that we usually use pointer-chasing for memory latency measurement, but I want to try a simpler strategy of "flush cacheline --&amp;gt; record time --&amp;gt; mem read addr A --&amp;gt; finish record time".&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I repeat the loop many times. From the results, I found three categories of latency:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;80-100ns, 98% of the results&lt;/LI&gt;&lt;LI&gt;~150-300ns, 2% of the results&lt;/LI&gt;&lt;LI&gt;&amp;gt;&amp;gt; 1us, &amp;lt;0.1% of the results.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;80-100ns seems a reasonable result for memory latency. The &amp;gt;&amp;gt;1us ones should mostly be caused by interrupts/page misses, etc.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What bothers me is those from 150-300ns. They seem to happen periodically, weakly aligned to the cacheline size. The latency is too big for the DRAM close/open page policy difference, too small for the DRAM refresh interval, and also too small for any interrupts.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I suspected that the "latency recording" would generate memory writes that interfere with the DRAM latency. However, even after I removed this portion, the "high_latency_ch0" stat still shows ~2% of the samples in the 150-300ns range.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Different machines behave slightly differently.&lt;/P&gt;&lt;P&gt;Here is my core function for measuring:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;    std::cout &amp;lt;&amp;lt; "mem_latency experiment start" &amp;lt;&amp;lt; std::endl;
    for (uint64_t i = 0; i &lt; sample_count; i++) {
        // Evict the target line so that the timed load has to go to DRAM.
        clflushopt((void*)addr);
        mfence();

        // Start timestamp: CPUID serializes, then RDTSCP reads the clock.
        // The "memory" clobber keeps the compiler from moving memory accesses across the asm.
        asm volatile (
            "CPUID\n\t"            /* serialize */
            "RDTSCP\n\t"           /* read the clock */
            "mov %%edx, %0\n\t"
            "mov %%eax, %1\n\t"
            : "=r" (cycles_high), "=r" (cycles_low)
            :: "%rax", "%rbx", "%rcx", "%rdx", "memory");

        // The timed load.
        *(volatile uint32_t*)addr;

        // End timestamp: RDTSCP waits for earlier loads, CPUID serializes afterwards.
        asm volatile (
            "RDTSCP\n\t"           /* read the clock */
            "mov %%edx, %0\n\t"
            "mov %%eax, %1\n\t"
            "CPUID\n\t"            /* serialize */
            : "=r" (cycles_high1), "=r" (cycles_low1)
            :: "%rax", "%rbx", "%rcx", "%rdx", "memory");

        clflushopt((void*)addr);
        mfence();

        // Combine the 32-bit halves into 64-bit TSC values.
        start1 = ( ((uint64_t)cycles_high &lt;&lt; 32) | cycles_low );
        end1 = ( ((uint64_t)cycles_high1 &lt;&lt; 32) | cycles_low1 );
        int32_t cycle_ch0 = static_cast&lt;int32_t&gt;((end1 - start1) - rdtsc_self_delay);
        sample_array_ch0[i] = cycle_ch0; // Comment out to see if latency recording causes DRAM latency interference
        high_latency_ch0 += (cycle_ch0 &gt; (100 * cpu_ghz)); // count samples above 100 ns
        // addr = ori_addr + (i*4) % 4096;
    }&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What is more interesting is that sometimes you can see some sort of alignment or pattern in the results. [Full csv in attachment]&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Tianchen_Jerry_Wang_0-1705765366865.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/50598iE39CACFB8FAFCFB2/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Tianchen_Jerry_Wang_0-1705765366865.png" alt="Tianchen_Jerry_Wang_0-1705765366865.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have tried disabling the Data-Dependent Prefetcher, but it does not seem to be the cause. I also disabled the DCP and L2 Prefetcher in the BIOS, but they do not seem to be related either. [Well, I am not sure if the prefetcher setting in the BIOS is useful....]&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here is my CPU spec:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      46 bits physical, 48 bits virtual
CPU(s):                             14
On-line CPU(s) list:                0-13
Thread(s) per core:                 1
Core(s) per socket:                 14
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              79
Model name:                         Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Stepping:                           1
CPU MHz:                            1200.178
CPU max MHz:                        3200.0000
CPU min MHz:                        1200.0000
BogoMIPS:                           3999.97
L1d cache:                          448 KiB
L1i cache:                          448 KiB
L2 cache:                           3.5 MiB
L3 cache:                           35 MiB
NUMA node0 CPU(s):                  0-13
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
                                    syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni
                                    pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
                                    tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti
                                    intel_ppin ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt
                                    xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am running out of ideas already.... I am running this on Debian, with only this user-level program running. Is it possible that background kernel threads cause these?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you so much&lt;/P&gt;&lt;P&gt;Jerry&lt;/P&gt;</description>
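      <![CDATA[
The arithmetic around the timed load in the post above can be separated from the inline assembly and checked on its own. The sketch below is not the poster's code: `combine_tsc`, `cycles_to_ns`, `classify_ns` and `LatencyBucket` are hypothetical names, and the 100 ns and 1000 ns thresholds are assumptions taken loosely from the three latency categories listed in the post.

```cpp
#include <cassert>
#include <cstdint>

// Combine the EDX:EAX halves returned by RDTSC/RDTSCP into one 64-bit count.
inline uint64_t combine_tsc(uint32_t high, uint32_t low) {
    return (static_cast<uint64_t>(high) << 32) | low;
}

// Convert a TSC cycle count to nanoseconds; tsc_ghz is cycles per nanosecond.
inline double cycles_to_ns(uint64_t cycles, double tsc_ghz) {
    assert(tsc_ghz > 0.0);
    return static_cast<double>(cycles) / tsc_ghz;
}

enum class LatencyBucket { Normal, Elevated, Outlier };

// Assumed thresholds: above 100 ns is "elevated", above 1000 ns is an "outlier".
inline LatencyBucket classify_ns(double ns) {
    if (ns > 1000.0) return LatencyBucket::Outlier;
    if (ns > 100.0) return LatencyBucket::Elevated;
    return LatencyBucket::Normal;
}
```

With a 2 GHz TSC, `cycles_to_ns(170, 2.0)` gives 85 ns, which `classify_ns` places in the normal bucket.
      ]]>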
      <pubDate>Sat, 20 Jan 2024 16:07:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1564588#M8290</guid>
      <dc:creator>Tianchen_Jerry_Wang</dc:creator>
      <dc:date>2024-01-20T16:07:39Z</dc:date>
    </item>
    <item>
      <title>Re: Memory Latency Measurement Result</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1565210#M8292</link>
      <description>&lt;P&gt;Thanks for the interesting data! This was fun to think about.....&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;First I made a histogram of all of the results, showing a strong peak at 80 ns, a mean of 85.28 ns, and a standard deviation of 30.8 ns (including all data). If I exclude all values greater than 121 ns, the average drops to 80.14 ns and the standard deviation drops to 4.2 ns.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Re-plotting the histogram on a log scale (for the counts) shows an approximately constant pattern of counts averaging just under 1000 (out of 5 million total counts) for each latency from about 120 ns to 350 ns. Between 350 and 390 ns the counts are much lower (about 10), then increase again to about 250 between 380 and 400 ns. Counts for latencies over 402 ns are all under 10.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The next thing I would want to look at is the intervals between anomalously high values. I can't do this from the data provided because of the code outside the RDTSCP intervals (mostly CPUID).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In my experience, using CPUID as a serialization operation is both excessively disruptive to the processor pipeline and unnecessary for these sorts of average timing benchmarks. RDTSCP is partially ordered, and in a long-running test like this the OOO capability of the processor is not going to have a significant impact on the latency measurements. The RDTSCP instruction does take a variable number of cycles to complete, varying with both the core frequency and the cycle offset. If you are going to subtract off "rdtsc_self_delay", I recommend that you run all the experiments with both the core clock and the uncore clock fixed to match the TSC clock. (Uncore clock is controlled by MSR 0x620, as described in Volume 4 of the SWDM.) Even fixing all these items may not be enough, so you should check the RDTSCP overhead on your system to get a better feel for the uncertainty introduced by this relatively long-running instruction. (On my only Broadwell-EP system, a Xeon E5-2680 v4, I see RDTSCP overhead varying from 14 to 46 TSC cycles, with a 2.4 GHz TSC, the core running at 3.1 GHz, and the uncore at 2.7 GHz.)&lt;/P&gt;</description>
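      <![CDATA[
The summary statistics quoted in this reply (mean and standard deviation with and without the values above 121 ns) can be reproduced with a small helper. This is a sketch under assumptions: `Summary` and `summarize` are hypothetical names, the samples are assumed to be already converted to nanoseconds, and the population form of the standard deviation is used because the reply does not say which form it used.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Summary {
    std::size_t count;  // number of samples at or below the cutoff
    double mean;        // arithmetic mean of those samples
    double stddev;      // population standard deviation of those samples
};

// Mean and standard deviation of all samples that do not exceed cutoff_ns.
inline Summary summarize(const std::vector<double>& ns, double cutoff_ns) {
    std::size_t n = 0;
    double sum = 0.0;
    for (double v : ns) {
        if (v <= cutoff_ns) { sum += v; ++n; }
    }
    if (n == 0) return {0, 0.0, 0.0};
    const double mean = sum / static_cast<double>(n);
    double sq = 0.0;
    for (double v : ns) {
        if (v <= cutoff_ns) sq += (v - mean) * (v - mean);
    }
    return {n, mean, std::sqrt(sq / static_cast<double>(n))};
}
```

Calling `summarize(samples, 121.0)` next to a call with a very large cutoff shows how much of the spread comes from the high-latency tail.
      ]]>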
      <pubDate>Tue, 23 Jan 2024 01:21:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1565210#M8292</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2024-01-23T01:21:09Z</dc:date>
    </item>
    <item>
      <title>Re: Memory Latency Measurement Result</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1566172#M8296</link>
      <description>&lt;P&gt;Hi&amp;nbsp;McCalpinJohn:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you so much for replying xD. I was worried that this weird situation might receive no reply at all xD. I will think about what I could change about the experiment according to your suggestions, and update you later. Thank you so much!&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Best,&lt;/P&gt;&lt;P&gt;Jerry&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jan 2024 19:37:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1566172#M8296</guid>
      <dc:creator>Tianchen_Jerry_Wang</dc:creator>
      <dc:date>2024-01-25T19:37:28Z</dc:date>
    </item>
    <item>
      <title>Re: Memory Latency Measurement Result</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1566509#M8298</link>
      <description>&lt;P&gt;Hi McCalpinJohn:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried several things:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;I fixed the core frequency to 2GHz, which is the TSC frequency on my machine.&lt;/LI&gt;&lt;LI&gt;I also fixed the uncore frequency by setting MSR 0x620 to 0x1b1b, which is 2.7GHz. I also tried setting it to 0xa0a, which is 2GHz: the overall latency rises to an average of 106ns, but the pattern still exists.&lt;/LI&gt;&lt;LI&gt;I removed the CPUID instruction and the rdtsc_self_delay subtraction, because they should not cause a variance of more than 50ns.&lt;/LI&gt;&lt;LI&gt;I also drew a histogram, but it seems we have way more than 1000 counts within 120ns to 350ns (x-axis unit: "ns").&lt;/LI&gt;&lt;LI&gt;I also gathered the intervals between consecutive high-latency sample points and plotted them in a histogram (x-axis unit: "sample points").&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Tianchen_Jerry_Wang_2-1706310758924.png" style="width: 738px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/50858i88788779A7856D9C/image-dimensions/738x570/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="738" height="570" role="button" title="Tianchen_Jerry_Wang_2-1706310758924.png" alt="Tianchen_Jerry_Wang_2-1706310758924.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Tianchen_Jerry_Wang_3-1706310789002.png" style="width: 736px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/50859i0D5892924AD9F05C/image-dimensions/736x515/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="736" height="515" role="button" 
title="Tianchen_Jerry_Wang_3-1706310789002.png" alt="Tianchen_Jerry_Wang_3-1706310789002.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Jerry&lt;/P&gt;</description>
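      <![CDATA[
The MSR 0x620 values mentioned in this post can be decoded with a few lines of bit arithmetic. The sketch below assumes the field layout as I understand it from Volume 4 of the Intel manual (maximum ratio in bits 6:0, minimum ratio in bits 14:8, both in units of 100 MHz); treat that layout as an assumption and check the manual for your part. `UncoreRatioLimit` and the helper functions are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct UncoreRatioLimit {
    unsigned max_ratio;  // assumed to live in bits 6:0
    unsigned min_ratio;  // assumed to live in bits 14:8
};

// Split a raw MSR 0x620 value into its two ratio fields.
inline UncoreRatioLimit decode_uncore_ratio_limit(uint64_t msr_value) {
    return { static_cast<unsigned>(msr_value & 0x7F),
             static_cast<unsigned>((msr_value >> 8) & 0x7F) };
}

// Convert a ratio to MHz, assuming one ratio step is 100 MHz.
inline unsigned ratio_to_mhz(unsigned ratio) { return ratio * 100u; }

// Build a raw value from a minimum and maximum ratio.
inline uint64_t encode_uncore_ratio_limit(unsigned min_ratio, unsigned max_ratio) {
    assert(min_ratio <= 0x7F && max_ratio <= 0x7F);
    return (static_cast<uint64_t>(min_ratio) << 8) | max_ratio;
}
```

Under this assumed encoding, 0x1b1b decodes to 27 x 100 MHz = 2.7 GHz, which matches the post, while 0xa0a decodes to 1.0 GHz and a 2.0 GHz setting would be 0x1414. If that holds, the rise in average latency to 106 ns may reflect an uncore running slower than intended.
      ]]>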
      <pubDate>Fri, 26 Jan 2024 23:13:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1566509#M8298</guid>
      <dc:creator>Tianchen_Jerry_Wang</dc:creator>
      <dc:date>2024-01-26T23:13:27Z</dc:date>
    </item>
    <item>
      <title>Re: Memory Latency Measurement Result</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1566515#M8299</link>
      <description>&lt;P&gt;Hi McCalpinJohn:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I just measured the time interval between consecutive high-latency measurements, and the average is about 7us. This reminds me of the DRAM refresh interval: I believe refresh commands are issued roughly every 7us. This explains it!&lt;/P&gt;&lt;P&gt;Every time the DRAM needs to refresh certain rows, it will delay the memory op to the current bank.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;&lt;P&gt;Jerry&lt;/P&gt;</description>
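      <![CDATA[
The interval measurement described in this post could be sketched as follows. It assumes that the start timestamp of every sample is recorded alongside its latency, which the original loop does not do; `high_latency_gaps` and `mean_gap_us` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Gaps (in TSC cycles) between the start times of consecutive samples
// whose latency exceeds threshold_cycles.
inline std::vector<uint64_t> high_latency_gaps(const std::vector<uint64_t>& start_tsc,
                                               const std::vector<uint64_t>& latency_cycles,
                                               uint64_t threshold_cycles) {
    assert(start_tsc.size() == latency_cycles.size());
    std::vector<uint64_t> gaps;
    bool have_prev = false;
    uint64_t prev = 0;
    for (std::size_t i = 0; i < start_tsc.size(); ++i) {
        const uint64_t latency = latency_cycles[i];
        if (latency > threshold_cycles) {
            if (have_prev) gaps.push_back(start_tsc[i] - prev);
            prev = start_tsc[i];
            have_prev = true;
        }
    }
    return gaps;
}

// Mean gap in microseconds; tsc_ghz is TSC cycles per nanosecond.
inline double mean_gap_us(const std::vector<uint64_t>& gaps, double tsc_ghz) {
    if (gaps.empty()) return 0.0;
    double sum = 0.0;
    for (uint64_t g : gaps) sum += static_cast<double>(g);
    return sum / static_cast<double>(gaps.size()) / (tsc_ghz * 1000.0);
}
```

Measuring gaps in time rather than in sample counts makes the result directly comparable to a refresh interval, since the duration of each loop iteration varies.
      ]]>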
      <pubDate>Fri, 26 Jan 2024 23:46:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1566515#M8299</guid>
      <dc:creator>Tianchen_Jerry_Wang</dc:creator>
      <dc:date>2024-01-26T23:46:33Z</dc:date>
    </item>
    <item>
      <title>Re: Memory Latency Measurement Result</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1566519#M8300</link>
      <description>&lt;P&gt;Leave it to me to make things harder than they need to be....&lt;/P&gt;&lt;P&gt;I did not notice that you were measuring latency for repeatedly loading the same address, so I was assuming the normal style of strided pointer-chasing and was looking for all sorts of complicated mechanisms that would explain variations in latency in that case.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In your case I think it is very simple -- you are seeing varying penalties due to random collisions with the memory refresh.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Both the magnitude of the added latency and the frequency of occurrence seem consistent with DDR4 refresh mechanisms. &amp;nbsp;The details depend on the size of the DRAM die and the specific mode bit settings in the memory controller (e.g., Fine Granularity Refresh), but in rough terms:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The minimum delay between a REFresh command and the next ACTivate command for an 8 Gbit die is 350 ns (from a Micron DDR4 device data sheet).&amp;nbsp;&lt;UL&gt;&lt;LI&gt;The maximum delay to a load can be larger than this on both ends:&lt;UL&gt;&lt;LI&gt;On the front end, the load can hit the memory controller before the REFresh cycle starts, but (due to previous "pushbacks" of the REFresh) the next REFresh cannot be delayed.&lt;/LI&gt;&lt;LI&gt;On the back end, after the REFresh, there is a required ACTivate to open the target row, adding T_RCD to the latency. 
&amp;nbsp;(Before the REFresh you might also see this, but at least there is a chance for open page hits in that case.)&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;Worst case should be something like 350 ns (refresh) + 14 ns (T_RCD) + 65 ns (minimum observed latency) = 429 ns, which is very close to your observed worst case (only 3 of the 5 million loads showed higher latency).&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;If we assume no correlation between the Refresh timings and the load timings, you would expect to see a uniform distribution of added latency in the range of zero to the maximum overhead. &amp;nbsp;This is what I see in the distribution of the original data in the range of 120 to 350 ns (with smaller probabilities for larger latency values).&lt;/LI&gt;&lt;LI&gt;At normal temperature the DRAM must be refreshed with an average interval of 7.8 microseconds, so a 350 ns delay corresponds to about 4.5% of the time. &amp;nbsp;The original data showed 4.3% of the values at 100 ns or higher.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Other numbers apply to different DRAM die sizes, but I think this pretty clearly shows that something like this behavior is expected for standard DRAM refresh operations.&lt;/P&gt;</description>
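      <![CDATA[
The two estimates in this reply are simple enough to write down as code. The sketch below assumes the JEDEC parameter names tRFC (REFresh to next ACTivate) and tREFI (average refresh interval) for the 350 ns and 7.8 microsecond figures; `RefreshModel` and its helpers are hypothetical names, and the values come from the reply, not from a new measurement.

```cpp
#include <cassert>

struct RefreshModel {
    double t_rfc_ns;   // REFresh to next ACTivate (350 ns for the 8 Gbit die in the reply)
    double t_rcd_ns;   // ACTivate to column command (14 ns in the reply)
    double t_refi_ns;  // average refresh interval (7800 ns in the reply)
    double base_ns;    // minimum observed load latency (65 ns in the reply)
};

// Worst case: the load waits for a full refresh, then for the row activation.
inline double worst_case_latency_ns(const RefreshModel& m) {
    return m.t_rfc_ns + m.t_rcd_ns + m.base_ns;
}

// Fraction of time the device is busy refreshing, which is also the expected
// fraction of loads delayed if load and refresh timings are uncorrelated.
inline double refresh_busy_fraction(const RefreshModel& m) {
    assert(m.t_refi_ns > 0.0);
    return m.t_rfc_ns / m.t_refi_ns;
}
```

With `RefreshModel{350.0, 14.0, 7800.0, 65.0}`, `worst_case_latency_ns` returns 429 and `refresh_busy_fraction` returns about 0.045, matching the figures in the reply.
      ]]>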
      <pubDate>Sat, 27 Jan 2024 00:25:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Measurement-Result/m-p/1566519#M8300</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2024-01-27T00:25:56Z</dc:date>
    </item>
  </channel>
</rss>

