<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Memory Latency Checker Measurement Results are Different in Different NUMA Nodes in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Checker-Measurement-Results-are-Different-in/m-p/1579149#M8320</link>
    <description>&lt;P&gt;The test environment is a dual-socket Intel(R) Xeon(R) Silver 4410Y system with NUMA enabled, and the configuration is as follows:&lt;/P&gt;&lt;P&gt;Each CPU has 12 physical cores and each physical core has 2 hyper-threads;&lt;/P&gt;&lt;P&gt;Clock frequency: 2.10GHz&lt;/P&gt;&lt;P&gt;L1 dcache: 1.1 MB&lt;/P&gt;&lt;P&gt;L1 icache: 768 KB&lt;/P&gt;&lt;P&gt;L2 cache: 48MB&lt;/P&gt;&lt;P&gt;L3 cache: 60MB&lt;/P&gt;&lt;P&gt;Local DDR Memory Type: DDR5-4800, Samsung;&lt;/P&gt;&lt;P&gt;Local CXL Memory Type: DDR4-3200, Micron;&lt;/P&gt;&lt;P&gt;Local DDR Memory Channel: 8&lt;/P&gt;&lt;P&gt;Local CXL Memory Channel: 2&lt;/P&gt;&lt;P&gt;Local DDR Memory Size: 32Gx8 per channel, 1 on each channel&lt;/P&gt;&lt;P&gt;Local CXL Memory Size: Single channel 16G, only 1 memory module, single channel&lt;/P&gt;&lt;P&gt;The software environment configuration is as follows:&lt;/P&gt;&lt;P&gt;OS: Linux 6.15.4&lt;/P&gt;&lt;P&gt;GCC: 11.4.0&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;I found two issues while testing my server with MLC:&lt;BR /&gt;&lt;BR /&gt;&lt;EM&gt;&lt;U&gt;&lt;STRONG&gt;Question 1:&lt;/STRONG&gt; &lt;/U&gt;&lt;/EM&gt;When the CPU on the NUMA0 node accesses the DDR memory on the NUMA1 node, the bandwidth gradually rises to its limit as the number of parallel threads increases, and the latency rises with it. Conversely, when the CPU on the NUMA1 node accesses the DDR memory on the NUMA0 node, the bandwidth also rises to its limit, but the latency barely changes, increasing only slightly. The two scripts are the same except that the bound CPU and memory are on different NUMA nodes. 
What may be the reason for this?&lt;BR /&gt;&lt;BR /&gt;&lt;U&gt;&lt;EM&gt;&lt;STRONG&gt;Question 2:&lt;/STRONG&gt; &lt;/EM&gt;&lt;/U&gt;Why does the CPU on the NUMA1 node see higher latency when accessing local memory than when accessing remote DDR?&lt;BR /&gt;&lt;BR /&gt;The monitoring tool shows that the CPU and memory binding is correct. The principle of the MLC latency test is to start a timer at the beginning of the test, then continuously execute load instructions over a buffer of the specified size, and record the number of load instructions executed during this stage and the total running time.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="pudding_art_0-1710054890617.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/52437i43537631B7A09B5D/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="pudding_art_0-1710054890617.png" alt="pudding_art_0-1710054890617.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="pudding_art_1-1710054898657.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/52438i77AC99298DDF1F7F/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="pudding_art_1-1710054898657.png" alt="pudding_art_1-1710054898657.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The test script that binds to CPUs on the NUMA1 node is as follows:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;#!/bin/bash&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;MLC&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;"./mlc"&lt;/SPAN&gt; 
&lt;SPAN&gt;# MLC command&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;NUMA&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;"numactl"&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;OPERATION&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;"R"&lt;/SPAN&gt; &lt;SPAN&gt;# Read-only as default&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;DRATION&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;5&lt;/SPAN&gt; &lt;SPAN&gt;# Run 5 seconds as default&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;LOW&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;12&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;BUFFER&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;200000&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;function&lt;/SPAN&gt; &lt;SPAN&gt;memorybw_core&lt;/SPAN&gt;&lt;SPAN&gt;()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;{&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;core_count&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;$1&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;MEM_BIND&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;$2&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;mask&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;0&lt;/SPAN&gt;&lt;SPAN&gt;x$(&lt;/SPAN&gt;&lt;SPAN&gt;echo&lt;/SPAN&gt;&lt;SPAN&gt; "obase=16;2^(${core_count}+1)-2^${LOW}" &lt;/SPAN&gt;&lt;SPAN&gt;|&lt;/SPAN&gt; &lt;SPAN&gt;bc&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;bw&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;$($NUMA -m${MEM_BIND} ${MLC} --loaded_latency -d0 -${OPERATION} -b${BUFFER} -t${DRATION} -m$mask &lt;/SPAN&gt;&lt;SPAN&gt;|&lt;/SPAN&gt; &lt;SPAN&gt;grep&lt;/SPAN&gt; &lt;SPAN&gt;00000&lt;/SPAN&gt; &lt;SPAN&gt;|&lt;/SPAN&gt; &lt;SPAN&gt;awk&lt;/SPAN&gt;&lt;SPAN&gt; '{print $2, $3}' &lt;/SPAN&gt;&lt;SPAN&gt;&amp;gt;&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt; 
numa1cpu_latency.txt)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;echo&lt;/SPAN&gt;&lt;SPAN&gt; $bw&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;}&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; MEM_BIND &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt; {&lt;/SPAN&gt;&lt;SPAN&gt;0..2}&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;do&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;echo&lt;/SPAN&gt; &lt;SPAN&gt;"Memory loc in NUMA${MEM_BIND}"&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;echo&lt;/SPAN&gt; &lt;SPAN&gt;"NUMA ${MEM_BIND}"&lt;/SPAN&gt; &lt;SPAN&gt;&amp;gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN&gt;numa1cpu_latency_bw.txt&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; i &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt; {&lt;/SPAN&gt;&lt;SPAN&gt;12..23}&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;do&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;bw&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;$(&lt;/SPAN&gt;&lt;SPAN&gt;memorybw_core&lt;/SPAN&gt;&lt;SPAN&gt; $i ${MEM_BIND})&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;echo&lt;/SPAN&gt; &lt;SPAN&gt;"#$i done!"&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;done&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;done&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The test script for CPU binding on the NUMA0 node is as follows:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;#!/bin/bash&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;MLC&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;"./mlc"&lt;/SPAN&gt; &lt;SPAN&gt;# MLC command&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;OPERATION&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;"R"&lt;/SPAN&gt; &lt;SPAN&gt;# Read-only as default&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;DRATION&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;5&lt;/SPAN&gt; &lt;SPAN&gt;# Run 5 seconds as 
default&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;BUFFER&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;200000&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;function&lt;/SPAN&gt; &lt;SPAN&gt;memorybw_core&lt;/SPAN&gt;&lt;SPAN&gt;()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;{&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;core_count&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;$1&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;mask&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;0&lt;/SPAN&gt;&lt;SPAN&gt;x$(&lt;/SPAN&gt;&lt;SPAN&gt;echo&lt;/SPAN&gt;&lt;SPAN&gt; "obase=16;2^${core_count}-1" &lt;/SPAN&gt;&lt;SPAN&gt;|&lt;/SPAN&gt; &lt;SPAN&gt;bc&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;bw&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;$(${MLC} --loaded_latency -d0 -${OPERATION} -b${BUFFER} -t${DRATION} -m$mask &lt;/SPAN&gt;&lt;SPAN&gt;|&lt;/SPAN&gt; &lt;SPAN&gt;grep&lt;/SPAN&gt; &lt;SPAN&gt;00000&lt;/SPAN&gt; &lt;SPAN&gt;|&lt;/SPAN&gt; &lt;SPAN&gt;awk&lt;/SPAN&gt;&lt;SPAN&gt; '{print $2, $3}' &lt;/SPAN&gt;&lt;SPAN&gt;&amp;gt;&amp;gt;&lt;/SPAN&gt;&lt;SPAN&gt; numa0cpu_latency_bw.txt)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;# -d : Specify load injection delay. A value of 0 for -d may provide maximum throughput. &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;# OPERATION: read-write ratio&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;# -t : Set time in seconds during which each measurement is captured.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;# -T : Specify this flag if only b/w is desired without latency values.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;# -m : Specify the mask value (in hex) of CPUs to run the bandwidth generation threads. CPU0 should be excluded from this mask \&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;# as it is used to run the latency measuring thread. 
If Intel Hyper Threading Technology is enabled, the other CPU that is part\&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;# of physical core 0 should also be omitted from this mask.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;echo&lt;/SPAN&gt;&lt;SPAN&gt; $bw&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;}&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;function&lt;/SPAN&gt; &lt;SPAN&gt;get_core_counts&lt;/SPAN&gt;&lt;SPAN&gt;()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;{&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;# total physical cores number&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;skt&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;$(&lt;/SPAN&gt;&lt;SPAN&gt;lscpu&lt;/SPAN&gt;&lt;SPAN&gt;|&lt;/SPAN&gt;&lt;SPAN&gt;grep&lt;/SPAN&gt;&lt;SPAN&gt; "Socket(s)" &lt;/SPAN&gt;&lt;SPAN&gt;|&lt;/SPAN&gt; &lt;SPAN&gt;awk&lt;/SPAN&gt; &lt;SPAN&gt;-F:&lt;/SPAN&gt;&lt;SPAN&gt; '{print $2}')&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;cps&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;$(&lt;/SPAN&gt;&lt;SPAN&gt;lscpu&lt;/SPAN&gt;&lt;SPAN&gt;|&lt;/SPAN&gt;&lt;SPAN&gt;grep&lt;/SPAN&gt;&lt;SPAN&gt; "Core(s) per socket" &lt;/SPAN&gt;&lt;SPAN&gt;|&lt;/SPAN&gt; &lt;SPAN&gt;awk&lt;/SPAN&gt; &lt;SPAN&gt;-F:&lt;/SPAN&gt;&lt;SPAN&gt; '{print $2}')&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;echo&lt;/SPAN&gt; &lt;SPAN&gt;$(($skt&lt;/SPAN&gt;&lt;SPAN&gt;*&lt;/SPAN&gt;&lt;SPAN&gt;$cps))&lt;/SPAN&gt; &lt;SPAN&gt;# sockets x cores per socket&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;}&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;function&lt;/SPAN&gt; &lt;SPAN&gt;get_single_core_counts&lt;/SPAN&gt;&lt;SPAN&gt;()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;{&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;echo&lt;/SPAN&gt; &lt;SPAN&gt;$(&lt;/SPAN&gt;&lt;SPAN&gt;lscpu&lt;/SPAN&gt;&lt;SPAN&gt;|&lt;/SPAN&gt;&lt;SPAN&gt;grep&lt;/SPAN&gt;&lt;SPAN&gt; 
"Core(s) per socket" &lt;/SPAN&gt;&lt;SPAN&gt;|&lt;/SPAN&gt; &lt;SPAN&gt;awk&lt;/SPAN&gt; &lt;SPAN&gt;-F:&lt;/SPAN&gt;&lt;SPAN&gt; '{print $2}')&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;}&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; MEM_BIND &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt; {&lt;/SPAN&gt;&lt;SPAN&gt;0..2}&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;do&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;echo&lt;/SPAN&gt; &lt;SPAN&gt;"Memory loc in NUMA${MEM_BIND}"&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;echo&lt;/SPAN&gt; &lt;SPAN&gt;"NUMA${MEM_BIND}"&lt;/SPAN&gt; &lt;SPAN&gt;&amp;gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN&gt;numa0cpu_latency_bw.txt&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; i &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt; &lt;SPAN&gt;$(&lt;/SPAN&gt;&lt;SPAN&gt;seq&lt;/SPAN&gt; &lt;SPAN&gt;1&lt;/SPAN&gt;&lt;SPAN&gt; $(&lt;/SPAN&gt;&lt;SPAN&gt;get_single_core_counts&lt;/SPAN&gt;&lt;SPAN&gt;))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;do&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;bw&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;$(&lt;/SPAN&gt;&lt;SPAN&gt;memorybw_core&lt;/SPAN&gt;&lt;SPAN&gt; $i $MEM_BIND)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;echo&lt;/SPAN&gt; &lt;SPAN&gt;"#$i done!"&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;done&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;done&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;What is the reason for this phenomenon? Is there a problem with my test script or do I not understand the principles of MLC testing? Please reply as soon as possible.&lt;/P&gt;</description>
    <pubDate>Sun, 10 Mar 2024 07:21:25 GMT</pubDate>
    <dc:creator>pudding_art</dc:creator>
    <dc:date>2024-03-10T07:21:25Z</dc:date>
    <item>
      <title>Memory Latency Checker Measurement Results are Different in Different NUMA Nodes</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Checker-Measurement-Results-are-Different-in/m-p/1579149#M8320</link>
      <pubDate>Sun, 10 Mar 2024 07:21:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Memory-Latency-Checker-Measurement-Results-are-Different-in/m-p/1579149#M8320</guid>
      <dc:creator>pudding_art</dc:creator>
      <dc:date>2024-03-10T07:21:25Z</dc:date>
    </item>
  </channel>
</rss>

