<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic An interesting issue when testing inter-socket latency on skylake processor in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/An-interesting-issue-when-testing-inter-socket-latency-on/m-p/1271481#M7858</link>
    <description>&lt;P&gt;&lt;SPAN&gt;Hi, all&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;I measured the inter-socket memory latency of a dual-socket Xeon Platinum 8180 server with lmbench’s lat_mem_rd. The command and its output were:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#999999"&gt;numactl -C 0 -m 1 ./lat_mem_rd -P 1 -N 5 -t 4096m 1024&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NickChiu_0-1617776222356.png" style="width: 162px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/16187iBE995BEA22583FE5/image-dimensions/162x279/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="162" height="279" role="button" title="NickChiu_0-1617776222356.png" alt="NickChiu_0-1617776222356.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT color="#999999"&gt;(The left column is the test buffer size; the right column is the measured latency in nanoseconds.)&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The odd thing was that, once the test buffer exceeded all levels of on-chip cache, the latency dropped by about 10 ns as the buffer size continued to increase. Increasing the number of iterations with -N, or the warm-up period with -W, did not make the drop go away.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Then I wrote a simpler benchmark to figure it out. My code initializes a fully random linked chain in which each node holds the pointer to the next node. The pointer-chasing part is exactly the same as in lmbench; the only difference is how the address pattern is initialized. My code’s result is:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NickChiu_1-1617776222357.png" style="width: 179px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/16188i5270BEFF6673CF22/image-dimensions/179x46/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="179" height="46" role="button" title="NickChiu_1-1617776222357.png" alt="NickChiu_1-1617776222357.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;154 ns, which is close to the peak of lmbench’s result. I then changed my code step by step to make it more like lmbench, to find out what caused the latency drop. I reproduced the drop simply by keeping the one-way chain access length within a specific range. Once the access length goes beyond a certain value, the measured latency becomes higher.&lt;/SPAN&gt;&lt;/P&gt;
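&lt;P&gt;&lt;SPAN&gt;A minimal sketch of the kind of random pointer chase described above, with the one-way access length exposed as a parameter. This is not the code behind the numbers in this post, and it is not lmbench’s code: the names build_random_cycle, chase and time_per_hop are hypothetical, and the single random cycle (built with Sattolo’s algorithm) is an assumption about how the chain is initialized. Python interpreter overhead dominates the timing, so only the structure carries over; a real measurement would need a C version pinned with numactl.&lt;/SPAN&gt;&lt;/P&gt;

```python
import random
import time


def build_random_cycle(n, seed=0):
    """Return next_index, where next_index[i] is the node visited after i.

    Sattolo's algorithm yields a permutation made of one single cycle,
    so a chase starting anywhere visits all n nodes before repeating.
    (Assumption: the original benchmark may build its chain differently.)
    """
    rng = random.Random(seed)
    next_index = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j is drawn from 0 .. i-1, never i itself
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index


def chase(next_index, start, steps):
    """Follow the chain for `steps` dependent hops and return the end node."""
    p = start
    for _ in range(steps):
        p = next_index[p]
    return p


def time_per_hop(next_index, steps):
    """Average wall-clock seconds per hop for one chase of length `steps`."""
    t0 = time.perf_counter()
    end = chase(next_index, 0, steps)
    elapsed = time.perf_counter() - t0
    return elapsed / steps, end


if __name__ == "__main__":
    nodes = 2 ** 16
    chain = build_random_cycle(nodes, seed=1)
    # `steps` plays the role of the one-way access length discussed above.
    for steps in (nodes // 8, nodes, 9 * nodes):
        per_hop, _ = time_per_hop(chain, steps)
        print(f"steps={steps} per_hop_ns={per_hop * 1e9:.1f}")
```

&lt;P&gt;&lt;SPAN&gt;Each hop depends on the previous load, so hops cannot overlap; that is why a chase like this measures latency rather than bandwidth.&lt;/SPAN&gt;&lt;/P&gt;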
&lt;P&gt;&lt;SPAN&gt;To verify this conclusion, I adjusted lmbench’s code:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NickChiu_2-1617776222361.png" style="width: 468px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/16189i3486AC261322D29A/image-dimensions/468x527/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="468" height="527" role="button" title="NickChiu_2-1617776222361.png" alt="NickChiu_2-1617776222361.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NickChiu_3-1617776222366.png" style="width: 469px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/16190i7318F97EDC64C754/image-dimensions/469x448/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="469" height="448" role="button" title="NickChiu_3-1617776222366.png" alt="NickChiu_3-1617776222366.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;I simply increased the one-way chain access length by a factor of 9 and modified the count parameter so that the latency is still calculated correctly. The result of the same command then became:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NickChiu_4-1617776222367.png" style="width: 192px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/16191iE641E24CCD152DC7/image-dimensions/192x275/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="192" height="275" role="button" title="NickChiu_4-1617776222367.png" alt="NickChiu_4-1617776222367.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The significant latency drop disappeared, and the result agreed with my code. I also ran Intel’s official tool, Memory Latency Checker (MLC):&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NickChiu_5-1617776222367.png" style="width: 329px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/16192i8443C4267D6096B0/image-dimensions/329x97/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="329" height="97" role="button" title="NickChiu_5-1617776222367.png" alt="NickChiu_5-1617776222367.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Its result matched the original lmbench.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;I did some more digging and found that I can eliminate this phenomenon by disabling “directory mode” in the BIOS → UPI configuration menu.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NickChiu_6-1617776222372.png" style="width: 400px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/16194i0E298BE69A5E2073/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="NickChiu_6-1617776222372.png" alt="NickChiu_6-1617776222372.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;With directory mode disabled, lmbench, Intel MLC, and my code all gave the same inter-socket latency: 169 to 170 ns.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NickChiu_7-1617776222379.png" style="width: 146px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/16195iBC6414D663B354E7/image-dimensions/146x219/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="146" height="219" role="button" title="NickChiu_7-1617776222379.png" alt="NickChiu_7-1617776222379.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NickChiu_8-1617776222379.png" style="width: 293px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/16193i5A027B6F60ABE78C/image-dimensions/293x82/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="293" height="82" role="button" title="NickChiu_8-1617776222379.png" alt="NickChiu_8-1617776222379.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NickChiu_9-1617776222379.png" style="width: 174px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/16196iB27C524BFFDC8BC9/image-dimensions/174x36/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="174" height="36" role="button" title="NickChiu_9-1617776222379.png" alt="NickChiu_9-1617776222379.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;So here’s the hypothesis: some mechanism associated with “directory mode” optimizes inter-socket latency. It “cheats” successfully on the most commonly used latency benchmarks, but for some reason it fails when those benchmarks are modified slightly.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Here’s a table summarizing all the data mentioned above:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NickChiu_10-1617776222380.png" style="width: 565px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/16197i358D6E6C53F0D6A9/image-dimensions/565x158/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="565" height="158" role="button" title="NickChiu_10-1617776222380.png" alt="NickChiu_10-1617776222380.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;As an extension of this topic, this mechanism appears to do even more under heavily loaded traffic:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="NickChiu_11-1617776222380.png" style="width: 451px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/16198i66EA4B469D1E072A/image-dimensions/451x151/is-moderation-mode/true?v=v2&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" width="451" height="151" role="button" title="NickChiu_11-1617776222380.png" alt="NickChiu_11-1617776222380.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;What I would like to discuss is: what could this mechanism be? Is there any chance that I could take advantage of it to optimize my code?&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 07 Apr 2021 06:48:46 GMT</pubDate>
    <dc:creator>NickChiu</dc:creator>
    <dc:date>2021-04-07T06:48:46Z</dc:date>
    <item>
      <title>An interesting issue when testing inter-socket latency on skylake processor</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/An-interesting-issue-when-testing-inter-socket-latency-on/m-p/1271481#M7858</link>
      <description></description>
      <pubDate>Wed, 07 Apr 2021 06:48:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/An-interesting-issue-when-testing-inter-socket-latency-on/m-p/1271481#M7858</guid>
      <dc:creator>NickChiu</dc:creator>
      <dc:date>2021-04-07T06:48:46Z</dc:date>
    </item>
  </channel>
</rss>

