Solved: How is occupancy calculated in the GPU hotspot analysis in VTune profiler for SYCL kernels?

SampathRachumallu · ‎05-16-2025

Hi,

I am trying to understand how occupancy is calculated. My initial assumption was that it is the ratio of the total number of XVE threads scheduled in a given time to the maximum number of XVE threads that could be scheduled.

But later I found that global memory latency, synchronization, and many other factors affect occupancy. So my understanding must be wrong, because regardless of those factors, the number of XVE threads that need to be scheduled to complete a given workload stays the same. Occupancy must therefore be related to overall execution time somehow, and I would like to understand that part.

I could not get a clear understanding from the documentation, so I wanted to check further:

https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide/2025-1/eu-threads-occupancy.html

Svetlana_K_Intel · ‎05-19-2025

Hi @SampathRachumallu, first of all, you might want to check https://oneapi-src.github.io/oneAPI-samples/Tools/GPU-Occupancy-Calculator/ to see what the "static" limiting factors for occupancy can be. Those are global size, local size, kernel SIMD width, and the amount of SLM used; on GPUs prior to PVC, the use of barriers could also limit occupancy.

Apart from this, VTune measures occupancy using time-based sampling, so what it really gets for a sample is: (sum of all the clocks when a thread was scheduled) / (sum of all the clocks * number of thread slots). So for a sample interval you can't really tell whether, e.g., 50% of the thread slots were busy all the time or 100% of the thread slots were busy 50% of the time.
There can also be some "dynamic" aspects affecting occupancy, e.g. super short threads that finish so fast that the scheduling overhead becomes visible.
And one last thing: VTune is currently not aware of whether the kernel runs in large GRF mode (the calculator has this option, by the way), so it still normalizes occupancy by the full number of thread slots. For such kernels, the highest occupancy VTune can report is just 50%.
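
To make the sampling formula above concrete, here is a minimal Python sketch. It is only an illustration of the arithmetic, not VTune's actual implementation: the function name `sampled_occupancy`, the slot count of 8, and the 1000-clock interval are all assumptions chosen for the example.

```python
def sampled_occupancy(busy_slots_per_clock, num_thread_slots):
    """Occupancy for one sample interval.

    busy_slots_per_clock[i] is the number of thread slots that had a
    thread scheduled during clock i (hypothetical input format).
    Result = (slot-clocks with a thread scheduled) / (clocks * slots).
    """
    total_clocks = len(busy_slots_per_clock)
    scheduled_slot_clocks = sum(busy_slots_per_clock)
    return scheduled_slot_clocks / (total_clocks * num_thread_slots)


SLOTS = 8      # assumed number of thread slots, for illustration only
CLOCKS = 1000  # assumed length of one sample interval, in clocks

# Case A: half of the slots are busy for the whole interval.
half_slots_all_time = [SLOTS // 2] * CLOCKS

# Case B: all slots are busy for the first half, then idle.
all_slots_half_time = [SLOTS] * (CLOCKS // 2) + [0] * (CLOCKS // 2)

occ_a = sampled_occupancy(half_slots_all_time, SLOTS)
occ_b = sampled_occupancy(all_slots_half_time, SLOTS)

# Case C: large GRF mode, assuming only half of the slots are usable.
# Even with every usable slot busy all the time, normalizing by the
# full slot count caps the reported value at 50%.
large_grf_fully_busy = [SLOTS // 2] * CLOCKS
occ_large_grf = sampled_occupancy(large_grf_fully_busy, SLOTS)

print(occ_a, occ_b, occ_large_grf)  # 0.5 0.5 0.5
```

Cases A and B produce the same number, which is why a single sample cannot distinguish them. Case C shows the large GRF ceiling under the stated assumption that half of the thread slots are available in that mode.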

yuzhang3_intel · ‎05-19-2025

To better understand the XVE Thread Occupancy metric, you can refer to the GPU occupancy section in the GPU tuning guide below:

https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2025-0/thread-mapping-and-gpu-occupancy.html
