<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Intermittent read and write bandwidth performance degradation under lnl iGPU in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Intermittent-read-and-write-bandwidth-performance-degradation/m-p/1701943#M8554</link>
    <description>&lt;H4&gt;1 Overview&lt;/H4&gt;&lt;P&gt;I have recently been measuring memory bandwidth on the iGPU, using my own kernels to measure the read, write, and copy bandwidth.&lt;/P&gt;&lt;P&gt;With warm-up runs followed by multiple consecutive kernel launches, the averaged bandwidth is very close to the physical bandwidth limit of the memory.&lt;/P&gt;&lt;P&gt;However, when each loop iteration &lt;STRONG&gt;sleeps for a period of time&lt;/STRONG&gt; before launching the kernel, the bandwidth decreases on &lt;STRONG&gt;the Lunar Lake (LNL) platform&lt;/STRONG&gt;. On the Raptor Lake (RPL) and Meteor Lake (MTL) platforms, there is no such decrease. I would like to understand why this happens on LNL but not on MTL or RPL.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;&lt;SPAN&gt;2 MRE&lt;/SPAN&gt;&lt;/H4&gt;&lt;P&gt;&lt;SPAN&gt;`opencl_bandwidth_test.cpp` is provided in the attachment.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="bash"&gt;g++ -std=c++11 opencl_bandwidth_test.cpp -o opencl_bandwidth_test -lOpenCL&lt;/LI-CODE&gt;&lt;P&gt;&lt;BR /&gt;The code tests the read, write, and copy bandwidth for 1 GB buffers with different sleep intervals between kernel launches (0 ms, 1 ms, 5 ms, 10 ms, 100 ms, and 500 ms).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;3 Results&lt;/H4&gt;&lt;P&gt;I ran this program on RPL, MTL, and LNL.&lt;/P&gt;&lt;H5&gt;&lt;U&gt;&lt;STRONG&gt;1) Raptor Lake Platform (normal)&lt;/STRONG&gt;&lt;/U&gt;&lt;/H5&gt;&lt;UL&gt;&lt;LI&gt;Desktop&lt;/LI&gt;&lt;LI&gt;CPU:&amp;nbsp;Intel(R) Core(TM) i5-14500&lt;/LI&gt;&lt;LI&gt;GPU: Xe LP&lt;/LI&gt;&lt;LI&gt;Memory: 2 * 16GB, DDR4 3200 MT/s, theoretical bandwidth ~50GB/s&lt;/LI&gt;&lt;LI&gt;Motherboard:&amp;nbsp;ASUSTeK COMPUTER INC., TX GAMING B760M WIFI D4,&amp;nbsp;Rev 1.xx&lt;/LI&gt;&lt;LI&gt;OS: Ubuntu 24.04.2 LTS + Linux 6.15.0&lt;/LI&gt;&lt;LI&gt;Kernel Driver: i915 + xe (both tested, same 
results)&lt;/LI&gt;&lt;LI&gt;Compute Runtime:&amp;nbsp;&lt;A href="https://github.com/intel/compute-runtime/releases/tag/25.18.33578.6" target="_blank" rel="noopener"&gt;25.18.33578.6&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;0 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;1 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;5 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;10 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;100 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;500 ms&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Read&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;38.56 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;37.50 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;33.06 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;35.97 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;36.67 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;36.68 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Write&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;40.09 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;39.00 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;38.36 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;39.37 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;39.16 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;38.99 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Copy *2&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;41.24 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;41.57 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;41.55 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;41.34 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;41.82 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;40.75 
GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Under this setup, the duration of the interval does not significantly affect the read and write bandwidth.&lt;/P&gt;&lt;H5&gt;&lt;U&gt;&lt;STRONG&gt;2) Meteor Lake Platform&amp;nbsp;&lt;/STRONG&gt;&lt;/U&gt;&lt;U&gt;&lt;STRONG&gt;(normal)&lt;/STRONG&gt;&lt;/U&gt;&lt;/H5&gt;&lt;UL&gt;&lt;LI&gt;Mini PC / NUC:&amp;nbsp;&lt;A href="https://www.asus.com/displays-desktops/nucs/nuc-mini-pcs/asus-nuc-14-pro-plus/techspec/" target="_blank" rel="noopener"&gt;ASUS NUC 14 Pro+&lt;/A&gt;&amp;nbsp;&lt;/LI&gt;&lt;LI&gt;CPU:&amp;nbsp;Intel(R) Core(TM) Ultra 9 185H&lt;/LI&gt;&lt;LI&gt;GPU: Xe LPG&lt;/LI&gt;&lt;LI&gt;Memory: 2 * 48 GB, DDR5 5600 MT/s, theoretical bandwidth ~90GB/s&lt;/LI&gt;&lt;LI&gt;Motherboard:&amp;nbsp;ASUSTeK COMPUTER INC., NUC14RVS, 60AS0080-MB4A01&lt;/LI&gt;&lt;LI&gt;OS:&amp;nbsp;Ubuntu 22.04.5 LTS + Linux&amp;nbsp;6.8.0-60-generic&lt;/LI&gt;&lt;LI&gt;Kernel Driver: i915&lt;/LI&gt;&lt;LI&gt;Compute Runtime:&amp;nbsp;&lt;A href="https://github.com/intel/compute-runtime/releases/tag/24.52.32224.5" target="_blank" rel="noopener"&gt;24.52.32224.5&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;TABLE border="1"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;0 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;1 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;5 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;10 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;100 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;500 ms&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Read&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;62.51 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;62.58 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;62.61 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;62.71 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;59.97 GB/s&lt;/TD&gt;&lt;TD 
width="14.285714285714286%"&gt;59.98 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Write&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;73.21 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;73.19 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;73.27 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;73.02 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;70.09 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;69.86 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Copy *2&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;69.65 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;69.66 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;69.50 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;69.62 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;68.28 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;68.27 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Under this setup, the duration of the interval does not significantly affect the read and write bandwidth.&lt;/P&gt;&lt;H5&gt;&lt;U&gt;&lt;STRONG&gt;3) Lunar Lake Platform (decreased)&lt;/STRONG&gt;&lt;/U&gt;&lt;/H5&gt;&lt;UL&gt;&lt;LI&gt;Laptop:&amp;nbsp;&lt;A href="https://www.asus.com/hk-en/laptops/for-home/zenbook/asus-zenbook-s-14-ux5406/techspec/" target="_blank" rel="noopener"&gt;ASUS Zenbook S14 (UX5406)&lt;/A&gt;&amp;nbsp;&lt;/LI&gt;&lt;LI&gt;CPU:&amp;nbsp;Intel(R) Core(TM) Ultra 7 258V&lt;/LI&gt;&lt;LI&gt;GPU: Xe2 LPG&lt;/LI&gt;&lt;LI&gt;Memory: 8 * 4 GB, LPDDR5 8533 MT/s, theoretical bandwidth ~130GB/s&lt;/LI&gt;&lt;LI&gt;Motherboard:&amp;nbsp;ASUSTeK COMPUTER INC., UX5406SA, 1.0&lt;/LI&gt;&lt;LI&gt;OS:&amp;nbsp;Ubuntu 24.10 + Linux&amp;nbsp;6.15.2-061502-generic&lt;/LI&gt;&lt;LI&gt;Kernel Driver: xe&lt;/LI&gt;&lt;LI&gt;Compute Runtime:&amp;nbsp;&lt;A href="https://github.com/intel/compute-runtime/releases/tag/25.13.33276.16" target="_blank" 
rel="noopener"&gt;25.09.32961.5&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;TABLE border="1"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;0 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;1 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;5 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;10 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;100 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;500 ms&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Read&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;89.60 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;89.18 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;87.78 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;79.25 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;71.78 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;73.11 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Write&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;83.74 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;82.61 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;79.80 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;78.42 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;67.18 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;66.85 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Copy *2&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;88.02 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;87.54 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;88.88 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;83.89 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;80.83 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;80.81 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Under this setup, the read and write bandwidth dropped from ~90 GB/s to ~70 GB/s, and copy bandwidth slightly 
decreased.&lt;/P&gt;&lt;H4&gt;4&amp;nbsp;Some Conjectures&lt;/H4&gt;&lt;OL&gt;&lt;LI&gt;The cause is presumably a difference in hardware architecture; otherwise the phenomenon would not occur only on LNL. There is very little publicly available information on the latest microarchitecture, so I am not sure whether it is related to LNL's 8 MB memory-side cache (system level cache, SLC). Could it be caused by cache coherence or cache contention on the SLC?&lt;/LI&gt;&lt;LI&gt;The issue appears to be specific to the iGPU: when the CPU performs `memset` intermittently, it can always reach up to 100 GB/s of memory bandwidth (on the LNL platform).&lt;/LI&gt;&lt;LI&gt;In the test code, the interval is implemented with a CPU sleep. However, if a time-consuming, compute-bound GPU kernel is used instead of the CPU sleep, so that the compute kernel and the bandwidth-test kernel execute alternately, a decrease in bandwidth is also observed.&lt;/LI&gt;&lt;LI&gt;I profiled the write kernel on LNL using VTune, but I could not see why it slowed down. It seems that XVE Thread Occupancy is higher when the bandwidth is lower.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Nagico_0-1751887150689.png" style="width: 999px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/67284iC2064CCECB05A63C/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Nagico_0-1751887150689.png" alt="Nagico_0-1751887150689.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 07 Jul 2025 11:28:31 GMT</pubDate>
    <dc:creator>Nagico</dc:creator>
    <dc:date>2025-07-07T11:28:31Z</dc:date>
    <item>
      <title>Intermittent read and write bandwidth performance degradation under lnl iGPU</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Intermittent-read-and-write-bandwidth-performance-degradation/m-p/1701943#M8554</link>
      <description>&lt;H4&gt;1 Overview&lt;/H4&gt;&lt;P&gt;I have recently been measuring memory bandwidth on the iGPU, using my own kernels to measure the read, write, and copy bandwidth.&lt;/P&gt;&lt;P&gt;With warm-up runs followed by multiple consecutive kernel launches, the averaged bandwidth is very close to the physical bandwidth limit of the memory.&lt;/P&gt;&lt;P&gt;However, when each loop iteration &lt;STRONG&gt;sleeps for a period of time&lt;/STRONG&gt; before launching the kernel, the bandwidth decreases on &lt;STRONG&gt;the Lunar Lake (LNL) platform&lt;/STRONG&gt;. On the Raptor Lake (RPL) and Meteor Lake (MTL) platforms, there is no such decrease. I would like to understand why this happens on LNL but not on MTL or RPL.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;&lt;SPAN&gt;2 MRE&lt;/SPAN&gt;&lt;/H4&gt;&lt;P&gt;&lt;SPAN&gt;`opencl_bandwidth_test.cpp` is provided in the attachment.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="bash"&gt;g++ -std=c++11 opencl_bandwidth_test.cpp -o opencl_bandwidth_test -lOpenCL&lt;/LI-CODE&gt;&lt;P&gt;&lt;BR /&gt;The code tests the read, write, and copy bandwidth for 1 GB buffers with different sleep intervals between kernel launches (0 ms, 1 ms, 5 ms, 10 ms, 100 ms, and 500 ms).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H4&gt;3 Results&lt;/H4&gt;&lt;P&gt;I ran this program on RPL, MTL, and LNL.&lt;/P&gt;&lt;H5&gt;&lt;U&gt;&lt;STRONG&gt;1) Raptor Lake Platform (normal)&lt;/STRONG&gt;&lt;/U&gt;&lt;/H5&gt;&lt;UL&gt;&lt;LI&gt;Desktop&lt;/LI&gt;&lt;LI&gt;CPU:&amp;nbsp;Intel(R) Core(TM) i5-14500&lt;/LI&gt;&lt;LI&gt;GPU: Xe LP&lt;/LI&gt;&lt;LI&gt;Memory: 2 * 16GB, DDR4 3200 MT/s, theoretical bandwidth ~50GB/s&lt;/LI&gt;&lt;LI&gt;Motherboard:&amp;nbsp;ASUSTeK COMPUTER INC., TX GAMING B760M WIFI D4,&amp;nbsp;Rev 1.xx&lt;/LI&gt;&lt;LI&gt;OS: Ubuntu 24.04.2 LTS + Linux 6.15.0&lt;/LI&gt;&lt;LI&gt;Kernel Driver: i915 + xe (both tested, same 
results)&lt;/LI&gt;&lt;LI&gt;Compute Runtime:&amp;nbsp;&lt;A href="https://github.com/intel/compute-runtime/releases/tag/25.18.33578.6" target="_blank" rel="noopener"&gt;25.18.33578.6&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;TABLE border="1" width="100%"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;0 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;1 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;5 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;10 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;100 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;500 ms&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Read&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;38.56 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;37.50 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;33.06 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;35.97 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;36.67 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;36.68 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Write&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;40.09 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;39.00 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;38.36 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;39.37 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;39.16 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;38.99 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Copy *2&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;41.24 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;41.57 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;41.55 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;41.34 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;41.82 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;40.75 
GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Under this setup, the duration of the interval does not significantly affect the read and write bandwidth.&lt;/P&gt;&lt;H5&gt;&lt;U&gt;&lt;STRONG&gt;2) Meteor Lake Platform&amp;nbsp;&lt;/STRONG&gt;&lt;/U&gt;&lt;U&gt;&lt;STRONG&gt;(normal)&lt;/STRONG&gt;&lt;/U&gt;&lt;/H5&gt;&lt;UL&gt;&lt;LI&gt;Mini PC / NUC:&amp;nbsp;&lt;A href="https://www.asus.com/displays-desktops/nucs/nuc-mini-pcs/asus-nuc-14-pro-plus/techspec/" target="_blank" rel="noopener"&gt;ASUS NUC 14 Pro+&lt;/A&gt;&amp;nbsp;&lt;/LI&gt;&lt;LI&gt;CPU:&amp;nbsp;Intel(R) Core(TM) Ultra 9 185H&lt;/LI&gt;&lt;LI&gt;GPU: Xe LPG&lt;/LI&gt;&lt;LI&gt;Memory: 2 * 48 GB, DDR5 5600 MT/s, theoretical bandwidth ~90GB/s&lt;/LI&gt;&lt;LI&gt;Motherboard:&amp;nbsp;ASUSTeK COMPUTER INC., NUC14RVS, 60AS0080-MB4A01&lt;/LI&gt;&lt;LI&gt;OS:&amp;nbsp;Ubuntu 22.04.5 LTS + Linux&amp;nbsp;6.8.0-60-generic&lt;/LI&gt;&lt;LI&gt;Kernel Driver: i915&lt;/LI&gt;&lt;LI&gt;Compute Runtime:&amp;nbsp;&lt;A href="https://github.com/intel/compute-runtime/releases/tag/24.52.32224.5" target="_blank" rel="noopener"&gt;24.52.32224.5&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;TABLE border="1"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;0 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;1 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;5 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;10 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;100 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;500 ms&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Read&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;62.51 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;62.58 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;62.61 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;62.71 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;59.97 GB/s&lt;/TD&gt;&lt;TD 
width="14.285714285714286%"&gt;59.98 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Write&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;73.21 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;73.19 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;73.27 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;73.02 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;70.09 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;69.86 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Copy *2&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;69.65 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;69.66 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;69.50 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;69.62 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;68.28 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;68.27 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Under this setup, the duration of the interval does not significantly affect the read and write bandwidth.&lt;/P&gt;&lt;H5&gt;&lt;U&gt;&lt;STRONG&gt;3) Lunar Lake Platform (decreased)&lt;/STRONG&gt;&lt;/U&gt;&lt;/H5&gt;&lt;UL&gt;&lt;LI&gt;Laptop:&amp;nbsp;&lt;A href="https://www.asus.com/hk-en/laptops/for-home/zenbook/asus-zenbook-s-14-ux5406/techspec/" target="_blank" rel="noopener"&gt;ASUS Zenbook S14 (UX5406)&lt;/A&gt;&amp;nbsp;&lt;/LI&gt;&lt;LI&gt;CPU:&amp;nbsp;Intel(R) Core(TM) Ultra 7 258V&lt;/LI&gt;&lt;LI&gt;GPU: Xe2 LPG&lt;/LI&gt;&lt;LI&gt;Memory: 8 * 4 GB, LPDDR5 8533 MT/s, theoretical bandwidth ~130GB/s&lt;/LI&gt;&lt;LI&gt;Motherboard:&amp;nbsp;ASUSTeK COMPUTER INC., UX5406SA, 1.0&lt;/LI&gt;&lt;LI&gt;OS:&amp;nbsp;Ubuntu 24.10 + Linux&amp;nbsp;6.15.2-061502-generic&lt;/LI&gt;&lt;LI&gt;Kernel Driver: xe&lt;/LI&gt;&lt;LI&gt;Compute Runtime:&amp;nbsp;&lt;A href="https://github.com/intel/compute-runtime/releases/tag/25.13.33276.16" target="_blank" 
rel="noopener"&gt;25.09.32961.5&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;TABLE border="1"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;0 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;1 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;5 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;10 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;100 ms&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;500 ms&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Read&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;89.60 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;89.18 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;87.78 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;79.25 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;71.78 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;73.11 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Write&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;83.74 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;82.61 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;79.80 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;78.42 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;67.18 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;66.85 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="14.285714285714286%"&gt;Copy *2&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;88.02 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;87.54 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;88.88 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;83.89 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;80.83 GB/s&lt;/TD&gt;&lt;TD width="14.285714285714286%"&gt;80.81 GB/s&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Under this setup, the read and write bandwidth dropped from ~90 GB/s to ~70 GB/s, and copy bandwidth slightly 
decreased.&lt;/P&gt;&lt;H4&gt;4&amp;nbsp;Some Conjectures&lt;/H4&gt;&lt;OL&gt;&lt;LI&gt;The cause is presumably a difference in hardware architecture; otherwise the phenomenon would not occur only on LNL. There is very little publicly available information on the latest microarchitecture, so I am not sure whether it is related to LNL's 8 MB memory-side cache (system level cache, SLC). Could it be caused by cache coherence or cache contention on the SLC?&lt;/LI&gt;&lt;LI&gt;The issue appears to be specific to the iGPU: when the CPU performs `memset` intermittently, it can always reach up to 100 GB/s of memory bandwidth (on the LNL platform).&lt;/LI&gt;&lt;LI&gt;In the test code, the interval is implemented with a CPU sleep. However, if a time-consuming, compute-bound GPU kernel is used instead of the CPU sleep, so that the compute kernel and the bandwidth-test kernel execute alternately, a decrease in bandwidth is also observed.&lt;/LI&gt;&lt;LI&gt;I profiled the write kernel on LNL using VTune, but I could not see why it slowed down. It seems that XVE Thread Occupancy is higher when the bandwidth is lower.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Nagico_0-1751887150689.png" style="width: 999px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/67284iC2064CCECB05A63C/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="Nagico_0-1751887150689.png" alt="Nagico_0-1751887150689.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 07 Jul 2025 11:28:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Intermittent-read-and-write-bandwidth-performance-degradation/m-p/1701943#M8554</guid>
      <dc:creator>Nagico</dc:creator>
      <dc:date>2025-07-07T11:28:31Z</dc:date>
    </item>
  </channel>
</rss>

