Long Execution Time for First Frame in DPC++ Project, Different Results in Debug and Release Modes

-Light- · 05-27-2024

Hello, I recently ported a CUDA SGM project to DPC++ using oneAPI to run on an Intel GPU (Intel(R) Arc(TM) A370M Graphics). I have encountered two issues.

Issue 1: Long Execution Time for First Frame
The project involves a loop that runs 10 times, but I've noticed that the execution time for the first frame is significantly longer. How can I address this issue? Below is the partial code for the loop running 10 times:

Execution Results:

When using the integrated GPU (Intel(R) UHD Graphics 770), the execution time for the first run is 8596 ms, the second run is 47 ms, and the subsequent runs are around 26 ms.

Issue 2: Different Results in Debug and Release Modes

Additionally, I observed that the results of the project ported to DPC++ using oneAPI differ between Debug and Release modes in Visual Studio. Through experimentation, I found that the discrepancy occurs in the get_sub_group() function. Below is a snippet of the code I used for testing:

When running in Debug mode (without code optimization), the output is 16. However, when running in Release mode (with optimization levels O1 or O2), the output is 32. Despite setting the required subgroup size to 32 using [[intel::reqd_sub_group_size(32)]], the output differs between Debug and Release modes.

System Environment:

CPU: 12th Gen Intel(R) Core(TM) i7-12700H @ 2.30 GHz
GPU: Intel(R) Arc(TM) A370M Graphics
OS: Windows 10 IoT Enterprise 22H2

Thank you for your assistance.

Best regards,
Light

MikeDB · 05-28-2024

Hi,

I'm not a SYCL expert, but I'm posting this mostly as a thank-you for your observation that reqd_sub_group_size doesn't seem to be guaranteed in debug builds in Visual Studio 2022. I'm seeing the same effect with Visual Studio 2022 on Intel Iris Xe GPUs.

It's worth saying that Intel only officially tests and qualifies particular Visual Studio versions; at this point that is VS 2022 17.9.2.

Intel® Compilers compatibility with Microsoft Visual Studio* and...

According to

SYCL™ 2020 Specification (revision (khronos.org)

with reqd_sub_group_size the kernel should fail to execute if the sub-group size is not supported. I'd speculate that the kernel can't run with this sub-group size in a non-optimized build because of excessive register pressure (a larger sub-group size increases register pressure). It doesn't seem good that it silently uses a different size rather than producing an error. Presumably your code *might* still work if you used the run-time queried sub-group size, though it would perhaps be less optimal than using a compile-time constant. Another potential alternative is to use a specialization constant for the sub-group size, though I haven't tried this yet.
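To illustrate what "using the run-time queried sub-group size" could mean for code ported from CUDA warp shuffles, here is a minimal host-side model in plain C++, not your actual kernel. It assumes the kernel does a shuffle-down style tree reduction and that the sub-group size is a power of two. The name reduce_lanes is hypothetical. In a real SYCL kernel, the width would presumably come from sg.get_local_range()[0] and the lane exchange from sycl::shift_group_left, rather than from a vector.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of a shuffle-down tree reduction over one sub-group.
// Each element of `lanes` stands for the value held by one work-item.
// The starting offset is derived from the actual width instead of a
// hard-coded 16, so the same code is correct for widths 16 and 32.
// Assumption: the width is a power of two.
int reduce_lanes(std::vector<int> lanes) {
    if (lanes.empty()) return 0;
    for (std::size_t offset = lanes.size() / 2; offset > 0; offset /= 2) {
        // Lane i adds the value held by lane i + offset.
        for (std::size_t i = 0; i < offset; ++i) {
            lanes[i] += lanes[i + offset];
        }
    }
    return lanes[0];  // lane 0 ends up with the full sum
}
```

Calling reduce_lanes with 16 or with 32 values gives the full sum in both cases, whereas a loop that always starts at offset 16 would only be correct for a width of 32.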

I'm also seeing other behaviour differences in debug builds that are breaking tests that work fine in release builds. From my previous experience these may be differences in synchronisation or race conditions. It would be useful to know other potential reasons for behaviour differences between optimised and non-optimised SYCL builds as I'm still trying to find the source of my issues.

Re issue 1: I'd say it's likely because your kernel is being JIT compiled by the program/system/GPU driver before first execution. There are alternative compilation strategies, covered under Compilation Mode in the Data Parallel C++ second edition free ebook, though I haven't tried them yet.
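If the JIT explanation is right, it may help to report the first run separately from the steady-state runs when benchmarking. The following is a minimal sketch in standard C++. The helper names time_frames and steady_state_mean are hypothetical, and it assumes the callable blocks until the device work has finished (for example by waiting on the SYCL queue).

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Times `runs` calls of `frame` and returns each duration in milliseconds.
// Assumption: `frame` does not return until the GPU work is complete.
template <typename Frame>
std::vector<double> time_frames(Frame&& frame, std::size_t runs) {
    using clock = std::chrono::steady_clock;
    std::vector<double> ms;
    ms.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto t0 = clock::now();
        frame();
        const auto t1 = clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}

// Mean of every run except the first, which is assumed to include
// one-time costs such as JIT compilation of the kernels.
double steady_state_mean(const std::vector<double>& ms) {
    if (ms.size() < 2) return 0.0;
    double sum = 0.0;
    for (std::size_t i = 1; i < ms.size(); ++i) sum += ms[i];
    return sum / static_cast<double>(ms.size() - 1);
}
```

With this, ms[0] shows the one-time cost and steady_state_mean(ms) shows the per-frame cost. Running one untimed warm-up frame before the loop has the same effect on the reported numbers, although it does not remove the start-up delay itself.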

I've also found in my tests that running on a GPU that is also in use by the OS (e.g. for graphics output) causes contention and large variations in performance even when measured by the SYCL queue profiling services. I've added a pause of a second between runs to minimise this effect in my own tests and the timings are now much more consistent.

MikeDB · 05-28-2024

Issue 1 is likely happening because your kernel is being JIT compiled by the GPU driver on first execution. There are alternative compilation strategies - Data Parallel C++ second edition describes this stuff well IMO. Timings can also vary due to contention with OS usage of the GPUs. I've added a one-second delay between each run of my own code to minimize this timing variability.

I've also seen your issue 2 happening in my own code in debug builds on Visual Studio 2022 running on Iris Xe integrated GPU. I'd speculate that the non-optimized build can't be built and/or run with this sub-group size due to excessive register pressure, though it would be good to get some error feedback rather than changing the sub-group size without warning.

However, I'm also seeing other behavior differences in my debug builds that are breaking my unit tests, possibly caused by synchronization differences or race conditions, though I'm still investigating. IMO it's not great if debug builds break, as it makes it harder to debug problems that happen in release builds. Any suggestions for other potential causes of debug/release build behavior differences would be very welcome.

Officially Intel only qualifies particular Visual Studio versions - Intel® Compilers compatibility with Microsoft Visual Studio* and... currently 17.9.2.