Intel® oneAPI DPC++/C++ Compiler

Low GPU utilization in multithreaded application

wdx04
Beginner

Hi,

 

I'm porting some GPU algorithms from C++ AMP to DPC++/SYCL because C++ AMP has been deprecated by Microsoft.

Then I encountered some performance problems with the DPC++ compiler. 

The biggest problem is that when I submit multiple small SYCL kernels (with no data dependencies between them) to a queue inside a tbb::parallel_for loop, they seem to be executed sequentially. The average GPU utilization observed with GPU-Z is around 10% on the A750. The original C++ AMP algorithm, running in the same tbb::parallel_for loop, executes in parallel and achieves a GPU utilization of 80%-90%. Is this a limitation of the current compiler? If not, how can I find the cause of the problem?

Another problem is that the compiler's optimization results are not consistent across versions. When I upgraded the compiler from 2022.3 to 2023.0, the same SYCL kernel ran 50% faster on my Kaby Lake iGPU but 40% slower on my Arc A750 GPU. Are there any compiler options that optimize better for Arc-series GPUs?

My test environments:

HM175, i7-7700HQ, HD630 Graphics, 8Gx2 DDR4, Windows 10 21H2

Z170, i7-6700K, Arc A750 Graphics (without Resizable BAR), 8Gx4 DDR4, Windows 10 22H2

Thanks.

SeshaP_Intel
Moderator

Hi,


Thank you for posting in Intel Communities.

Could you please share the complete reproducer code and steps to reproduce so that we can investigate the issue more from our end?


Thanks and Regards,

Pendyala Sesha Srinivas


wdx04
Beginner

Hi,

Here is the reproducer code, together with the image files used for testing:

https://1drv.ms/u/s!ApduJQeHcf7MgfEllZDK2OZP03mErg?e=huqPik

Please compile the code with Visual Studio 2022 and the DPC++ compiler, then run the program with search.png and target.png in the same folder as the executable and monitor the GPU usage.

Note the program is part of a brute-force image template matching algorithm and requires OpenCV to compile.

You may also try the compiled executable package, which includes everything except the Visual C++ 2022 runtime:

https://1drv.ms/u/s!ApduJQeHcf7MgfEmn1m-vdH3MAKCtA?e=bUyt7x

Typical output of the program compiled using DPC++ 2023.0:

Device Name: Intel(R) Arc(TM) A750 Graphics
Number of Points: 135108
Number of Transforms: 9261
Best Score: 26153
Best Transform: 0.999714,0.0239087,181.7;-0.0239087,0.999714,193.4
Benchmarking Parallel execution...
Parallel Iteration time: 30.25ms.
Benchmarking Sequential execution...
Sequential Iteration time: 32.84ms.

(Note: When compiled using DPC++ 2022.3, the sequential execution time is ~22 ms, and the identical program in C++ AMP takes ~24.6 ms in parallel mode and ~30 ms in sequential mode.)

SeshaP_Intel
Moderator

Hi,

 

While trying to build your code, we are facing some errors in Visual Studio 17 2022. Please find the error screenshot below.

SeshaP_Intel_0-1674023082554.png

 

Could you please provide detailed build steps, along with any dependencies you have added, so that we can investigate and resolve this issue?

 

Thanks and Regards,

Pendyala Sesha Srinivas

wdx04
Beginner

No, I have never seen this kind of error before.

Please check:

1. Which OpenCV version are you using? I'm using OpenCV 4.5.5 built with vcpkg. The official 4.5.5 binary release should also be OK. You may comment out the lines of code using the Matx23f type inside the kernel. If the errors are gone, the problem must be related to OpenCV.

2. Which build configuration are you using? I'm using the x64 Release configuration. The platform toolset is set to Intel oneAPI DPC++ Compiler 2023 and the C++ language standard is set to C++17. Other configurations may not work.

SeshaP_Intel
Moderator

Hi,

 

We have tried OpenCV 4.7.0 and 4.5.5 with the x64 Release configuration, but we are facing linking errors related to OpenCV while building the code. Please provide the library dependencies if you have added any. Please find the error screenshot below.

SeshaP_Intel_0-1674821398124.png

Could you please help us resolve this issue?

 

Thanks and Regards,

Pendyala Sesha Srinivas

 

wdx04
Beginner

Hi,

 

The missing symbol "void Mat::copyTo(const _OutputArray&) const" is part of the OpenCV core module.

The import library containing this symbol depends on where you got the OpenCV package. For example:

OpenCV 4.5.5 official package: opencv\build\x64\vc15\lib\opencv_world455.lib

OpenCV 4.7.0 official package: opencv\build\x64\vc16\lib\opencv_world470.lib

Built from source with vcpkg, without [world] option: vcpkg\installed\x64-windows\lib\opencv_core.lib (and opencv_imgcodecs.lib)

Built from source with vcpkg, with [world] option: vcpkg\installed\x64-windows\lib\opencv_world.lib

 

Thanks.

 

SeshaP_Intel
Moderator

Hi,

 

We are able to reproduce your issue. I have tried to run the executable on Intel(R) UHD Graphics. Please find the below output screenshot.

SeshaP_Intel_0-1675241848737.png

Did you get the same output when running on HD630 Graphics? If not, please provide the output generated on HD630 Graphics.

 

Thanks and Regards,

Pendyala Sesha Srinivas

wdx04
Beginner

Hi,

 

Yes, I got the same output on the i7-7700HQ except for the device name and iteration times:

results.jpg

Here is the output from the C++AMP version of the program:

results2.jpg

The outputs when running on i7-6700K+A750:

results3.jpg

Thanks.

 

SeshaP_Intel
Moderator

Hi,


Thanks for providing the information.

Could you please provide the C++ AMP files (including the executable)?

How are you calculating the GPU utilization (%)?

Could you please confirm that you are expecting a better performance in Parallel Iteration time on Intel(R) Arc(TM) A750 Graphics too?


Thanks and Regards,

Pendyala Sesha Srinivas


wdx04
Beginner

Hi,

Here are the C++AMP files:

https://1drv.ms/u/s!ApduJQeHcf7MgfEnByQHECipWDq_5g?e=5VeLQi

I just monitor the GPU utilization with GPU-Z or Windows Task Manager.

Of course, I'm expecting better performance in both Parallel Iteration time and Sequential Iteration time on Intel(R) Arc(TM) A750 Graphics. As mentioned before, performance on the A750 was much better when compiled with the previous 2022.3 version of the DPC++ compiler.

Thanks.

SeshaP_Intel
Moderator

Hi,


We were able to reproduce your issue. We are working on this internally.

We will get back to you soon.


Thanks and Regards,

Pendyala Sesha Srinivas


SeshaP_Intel
Moderator

Hi,


Thanks for your patience.


Could you please try with the latest oneAPI 2023.1.0 release after installing the latest Intel® Graphics Driver 31.0.101.4255 for Intel® Arc™ Graphics and share the results with us?  

Please find the link below.

https://www.intel.com/content/www/us/en/download/726609/intel-arc-iris-xe-graphics-whql-windows.html?


Another reason for the low GPU occupancy is that the SYCL kernels are launched within tbb::parallel_for. This is counterintuitive: the purpose of SYCL is to achieve data parallelism, and launching kernels under thread parallelism introduces unnecessary thread waits and synchronizations. Could you please try launching the SYCL kernels directly instead?


Thanks and Regards,

Pendyala Sesha Srinivas


wdx04
Beginner

Hi,

 

I see some speedup with the new driver, but not with the new compiler.

Old Compiler, New Driver:

OCND.jpg

New Compiler, New Driver:

NCND.jpg

The C++AMP version also benefits from the new driver:

AMPND.jpg

My real-world application needs to run multiple small SYCL kernels concurrently, and it's impractical to merge them into large kernels. So I still need to find an efficient way to use the GPU in a threaded application.

 

Thanks

SeshaP_Intel
Moderator

Hi,


Right now, we do not have any mechanism to fuse small kernels. You can go through the latest GPU optimization guide and the latest version of the GPU Occupancy Calculator, which now supports Arc GPUs, to identify the right sub-group and work-group sizes for the application. Please refer to the link below for more details.

https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-1/thread-mapping-and-gpu-occupancy.html


Thanks and Regards,

Pendyala Sesha Srinivas


SeshaP_Intel
Moderator

Hi,


Could you please confirm whether we can close this thread from our end?


Thanks and Regards,

Pendyala Sesha Srinivas


wdx04
Beginner

Yes, please close this thread. Thanks.

SeshaP_Intel
Moderator

Hi,


Thanks for the confirmation. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.


Thanks and Regards,

Pendyala Sesha Srinivas

