1 Overview
I have recently been testing iGPU memory bandwidth, using kernels I implemented myself to measure and analyze memory read, write, and copy bandwidth.
When I warm up, launch the kernel many times back to back, and take the average, the resulting bandwidth comes very close to the physical bandwidth limit of the memory.
However, if I sleep for a while before launching the kernel in each loop iteration, the bandwidth drops on the Lunar Lake (LNL) machine, whereas on the Raptor Lake (RPL) and Meteor Lake (MTL) platforms there is no drop in throughput. I am curious why this happens on LNL but not on MTL or RPL.
2 MRE
`opencl_bandwidth_test.cpp` is provided in the attachment; compile it with:
`g++ -std=c++11 opencl_bandwidth_test.cpp -o opencl_bandwidth_test -lOpenCL`
The program tests read, write, and copy bandwidth on 1 GB buffers, sleeping for different intervals before each kernel launch (0 ms, 1 ms, 5 ms, 10 ms, 100 ms, and 500 ms).
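Since the attachment is not inlined here, below is a minimal sketch of the shape of the copy test. The kernel and variable names are mine for illustration, error checking is omitted, and the iteration count is arbitrary; the attachment is authoritative. It sleeps for the given interval before each launch and derives bandwidth from OpenCL profiling timestamps:

```cpp
// Minimal sketch (assumed names; error checks omitted): sleep for a given
// interval, launch a copy kernel, and compute bandwidth from the kernel's
// profiling timestamps.
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <chrono>
#include <cstdio>
#include <initializer_list>
#include <thread>

static const char* kSrc = R"CLC(
__kernel void copy_buf(__global const float4* src, __global float4* dst) {
    size_t i = get_global_id(0);
    dst[i] = src[i];
}
)CLC";

int main() {
    const size_t bytes = 1ull << 30;              // 1 GB per buffer
    const size_t gws   = bytes / sizeof(cl_float4);

    cl_platform_id plat; clGetPlatformIDs(1, &plat, nullptr);
    cl_device_id dev;    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_queue_properties props[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, props, nullptr);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "copy_buf", nullptr);

    cl_mem src = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, nullptr, nullptr);
    cl_mem dst = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, nullptr, nullptr);
    clSetKernelArg(k, 0, sizeof(src), &src);
    clSetKernelArg(k, 1, sizeof(dst), &dst);

    // Warm-up launches so later measurements start from a steady state.
    for (int i = 0; i < 3; ++i)
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &gws, nullptr, 0, nullptr, nullptr);
    clFinish(q);

    for (int interval_ms : {0, 1, 5, 10, 100, 500}) {
        double sum_gbps = 0;
        const int iters = 10;
        for (int i = 0; i < iters; ++i) {
            std::this_thread::sleep_for(std::chrono::milliseconds(interval_ms));
            cl_event ev;
            clEnqueueNDRangeKernel(q, k, 1, nullptr, &gws, nullptr, 0, nullptr, &ev);
            clWaitForEvents(1, &ev);
            cl_ulong t0, t1;  // nanoseconds
            clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, nullptr);
            clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, nullptr);
            sum_gbps += 2.0 * bytes / (t1 - t0);  // copy moves 2x bytes; B/ns == GB/s
            clReleaseEvent(ev);
        }
        std::printf("interval %3d ms: copy %.2f GB/s\n", interval_ms, sum_gbps / iters);
    }
    return 0;
}
```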
3 Results
I ran this program on RPL, MTL, and LNL.
1) Raptor Lake Platform (normal)
- Desktop
- CPU: Intel(R) Core(TM) i5-14500
- GPU: Xe LP
- Memory: 2 × 16 GB, DDR4 3200 MT/s, theoretical bandwidth ~50 GB/s
- Motherboard: ASUSTeK COMPUTER INC., TX GAMING B760M WIFI D4, Rev 1.xx
- OS: Ubuntu 24.04.2 LTS + Linux 6.15.0
- Kernel Driver: i915 + xe (both tested, same results)
- Compute Runtime: 25.18.33578.6
| Interval | 0 ms | 1 ms | 5 ms | 10 ms | 100 ms | 500 ms |
|---|---|---|---|---|---|---|
| Read | 38.56 GB/s | 37.50 GB/s | 33.06 GB/s | 35.97 GB/s | 36.67 GB/s | 36.68 GB/s |
| Write | 40.09 GB/s | 39.00 GB/s | 38.36 GB/s | 39.37 GB/s | 39.16 GB/s | 38.99 GB/s |
| Copy (×2) | 41.24 GB/s | 41.57 GB/s | 41.55 GB/s | 41.34 GB/s | 41.82 GB/s | 40.75 GB/s |
Under this setup, the duration of the interval does not significantly affect the read and write bandwidth.
2) Meteor Lake Platform (normal)
- Mini PC / NUC: ASUS NUC 14 Pro+
- CPU: Intel(R) Core(TM) Ultra 9 185H
- GPU: Xe LPG
- Memory: 2 × 48 GB, DDR5 5600 MT/s, theoretical bandwidth ~90 GB/s
- Motherboard: ASUSTeK COMPUTER INC., NUC14RVS, 60AS0080-MB4A01
- OS: Ubuntu 22.04.5 LTS + Linux 6.8.0-60-generic
- Kernel Driver: i915
- Compute Runtime: 24.52.32224.5
| Interval | 0 ms | 1 ms | 5 ms | 10 ms | 100 ms | 500 ms |
|---|---|---|---|---|---|---|
| Read | 62.51 GB/s | 62.58 GB/s | 62.61 GB/s | 62.71 GB/s | 59.97 GB/s | 59.98 GB/s |
| Write | 73.21 GB/s | 73.19 GB/s | 73.27 GB/s | 73.02 GB/s | 70.09 GB/s | 69.86 GB/s |
| Copy (×2) | 69.65 GB/s | 69.66 GB/s | 69.50 GB/s | 69.62 GB/s | 68.28 GB/s | 68.27 GB/s |
Under this setup, the duration of the interval does not significantly affect the read and write bandwidth.
3) Lunar Lake Platform (bandwidth decreases)
- Laptop: ASUS Zenbook S14 (UX5406)
- CPU: Intel(R) Core(TM) Ultra 7 258V
- GPU: Xe2 LPG
- Memory: 8 × 4 GB, LPDDR5X 8533 MT/s, theoretical bandwidth ~130 GB/s
- Motherboard: ASUSTeK COMPUTER INC., UX5406SA, 1.0
- OS: Ubuntu 24.10 + Linux 6.15.2-061502-generic
- Kernel Driver: xe
- Compute Runtime: 25.09.32961.5
| Interval | 0 ms | 1 ms | 5 ms | 10 ms | 100 ms | 500 ms |
|---|---|---|---|---|---|---|
| Read | 89.60 GB/s | 89.18 GB/s | 87.78 GB/s | 79.25 GB/s | 71.78 GB/s | 73.11 GB/s |
| Write | 83.74 GB/s | 82.61 GB/s | 79.80 GB/s | 78.42 GB/s | 67.18 GB/s | 66.85 GB/s |
| Copy (×2) | 88.02 GB/s | 87.54 GB/s | 88.88 GB/s | 83.89 GB/s | 80.83 GB/s | 80.81 GB/s |
Under this setup, read and write bandwidth dropped from roughly 85–90 GB/s at short intervals to roughly 67–73 GB/s at 100–500 ms, and copy bandwidth decreased slightly as well.
4 Some Conjectures
- The cause is presumably a hardware architecture difference; otherwise the phenomenon would not be limited to LNL. There is very little public information on the latest microarchitecture, and I am not sure whether it is related to LNL's 8 MB memory-side cache (system-level cache, SLC). Could it be cache coherence or cache contention on the SLC?
- The issue seems strongly tied to the iGPU: when I intermittently run `memset` on the CPU instead (see the sketch after this list), it consistently sustains up to 100 GB/s of memory bandwidth on the LNL platform.
- The test code sleeps on the CPU between launches. But if a long-running, compute-bound GPU kernel is executed instead of the CPU sleep, alternating that compute kernel with the bandwidth-test kernel, the same bandwidth decrease is observed.
- I profiled the write kernel on LNL with VTune but could not see why it slows down. It does appear that XVE Thread Occupancy is higher when the bandwidth is lower.
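For the CPU-side cross-check in the second conjecture, here is a rough sketch of the kind of loop I mean (buffer size and intervals are illustrative, not the exact code I ran; a single `memset` thread may also need to be fanned out across cores to fully saturate the bus):

```cpp
// Rough sketch of the CPU cross-check: intermittently memset a 1 GB buffer
// and report the achieved fill bandwidth. Sizes/intervals are illustrative.
#include <chrono>
#include <cstdio>
#include <cstring>
#include <initializer_list>
#include <thread>
#include <vector>

int main() {
    const size_t bytes = 1ull << 30;       // 1 GB
    std::vector<char> buf(bytes);
    std::memset(buf.data(), 1, bytes);     // warm up: fault in all pages

    for (int interval_ms : {0, 10, 100, 500}) {
        std::this_thread::sleep_for(std::chrono::milliseconds(interval_ms));
        auto t0 = std::chrono::steady_clock::now();
        std::memset(buf.data(), interval_ms & 0xff, bytes);
        auto t1 = std::chrono::steady_clock::now();
        double sec = std::chrono::duration<double>(t1 - t0).count();
        std::printf("interval %3d ms: memset %.2f GB/s\n",
                    interval_ms, bytes / sec / 1e9);
    }
    return 0;
}
```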