Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

Low GPU utilization in multithreaded application

wdx04
Beginner
1,123 Views

Hi,

 

I'm porting some GPU algorithms from C++ AMP to DPC++/SYCL because Microsoft has deprecated C++ AMP.

Then I encountered some performance problems with the DPC++ compiler. 

The biggest problem is that when I submit multiple small SYCL kernels to a queue (with no data dependencies between them) inside a tbb::parallel_for loop, they seem to execute sequentially. The average GPU utilization observed with GPU-Z is around 10% on the A750. The original C++ AMP algorithm in the same tbb::parallel_for loop executes in parallel and achieves a GPU utilization of 80%-90%. Is this a limitation of the current compiler? If not, how can I find the cause of the problem?

Another problem is that the optimization quality of the current compiler does not seem very stable. When I upgraded the compiler from 2022.3 to 2023.0, the same SYCL kernel ran 50% faster on my Kaby Lake iGPU but 40% slower on my Arc A750 GPU. Are there any compiler options that optimize better for Arc-series GPUs?

My test environments:

HM175, i7-7700HQ, HD630 Graphics, 8Gx2 DDR4, Windows 10 21H2

Z170, i7-6700K, Arc A750 Graphics (without Resizable BAR), 8Gx4 DDR4, Windows 10 22H2

Thanks.

11 Replies
SeshaP_Intel
Moderator
1,096 Views

Hi,


Thank you for posting in Intel Communities.

Could you please share the complete reproducer code and steps to reproduce so that we can investigate the issue more from our end?


Thanks and Regards,

Pendyala Sesha Srinivas


wdx04
Beginner
1,081 Views

Hi,

Here are the reproducer code and image files used for testing:

https://1drv.ms/u/s!ApduJQeHcf7MgfEllZDK2OZP03mErg?e=huqPik

Please compile the code with Visual Studio 2022 and the DPC++ compiler, then run the program with search.png and target.png in the same folder as the executable and monitor the GPU usage.

Note the program is part of a brute-force image template matching algorithm and requires OpenCV to compile.

You may also try the compiled executable package with everything except the Visual C++ 2022 runtime:

https://1drv.ms/u/s!ApduJQeHcf7MgfEmn1m-vdH3MAKCtA?e=bUyt7x

Typical output of the program compiled using DPC++ 2023.0:

Device Name: Intel(R) Arc(TM) A750 Graphics
Number of Points: 135108
Number of Transforms: 9261
Best Score: 26153
Best Transform: 0.999714,0.0239087,181.7;-0.0239087,0.999714,193.4
Benchmarking Parallel execution...
Parallel Iteration time: 30.25ms.
Benchmarking Sequential execution...
Sequential Iteration time: 32.84ms.

(Note: When compiled with DPC++ 2022.3, the sequential execution time is ~22 ms, and the identical program in C++ AMP takes ~24.6 ms in parallel mode and ~30 ms in sequential mode.)
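
For anyone who wants to compare the two modes outside the reproducer, per-iteration times like the ones above can be collected with a small wall-clock harness. This is only a minimal sketch and not the reproducer's actual code: `average_iteration_ms` and the `iteration` callable are hypothetical names, and the callable stands in for one batch of kernel submissions.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Runs `iteration` `count` times and returns the mean wall-clock time per
// iteration in milliseconds.
// Assumption: `iteration` is a stand-in for one batch of kernel submissions
// and blocks until the device has finished (for SYCL, for example by calling
// queue::wait()). Otherwise only the submission overhead is measured.
double average_iteration_ms(const std::function<void()>& iteration,
                            std::size_t count) {
    using clock = std::chrono::steady_clock;
    if (count == 0) {
        return 0.0;  // nothing to measure
    }
    const auto start = clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        iteration();
    }
    const std::chrono::duration<double, std::milli> elapsed =
        clock::now() - start;
    return elapsed.count() / static_cast<double>(count);
}
```

Calling it once with the tbb::parallel_for variant and once with a plain loop would give two numbers comparable to the "Parallel" and "Sequential" lines, provided both callables wait for the GPU before returning.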

SeshaP_Intel
Moderator
1,007 Views

Hi,

 

While trying to build your code, we are facing some errors in Visual Studio 17 2022. Please find the error screenshot below.

SeshaP_Intel_0-1674023082554.png

 

Could you please provide detailed build steps so that we can investigate this issue, and list any dependencies you have added?

Could you please help us to resolve this issue?

 

Thanks and Regards,

Pendyala Sesha Srinivas

wdx04
Beginner
970 Views

No, I have never seen this kind of error before.

Please check:

1. Which OpenCV version are you using? I'm using OpenCV 4.5.5 built with vcpkg. The official 4.5.5 binary release should also be OK. You may comment out the lines of code using the Matx23f type inside the kernel. If the errors are gone, the problem must be related to OpenCV.

2. Which build configuration are you using? I'm using the x64 Release configuration. The platform toolset is set to Intel oneAPI DPC++ Compiler 2023 and the C++ language standard is set to C++17; other configurations may not work.

SeshaP_Intel
Moderator
898 Views

Hi,

 

We have tried OpenCV 4.7.0 and 4.5.5 with the x64 Release configuration. Please provide the library dependencies if you have added any. We are facing linking errors related to OpenCV while building the code. Please find the error screenshot below.

SeshaP_Intel_0-1674821398124.png

Could you please help us resolve this issue?

 

Thanks and Regards,

Pendyala Sesha Srinivas

 

wdx04
Beginner
879 Views

Hi,

 

The missing symbol "void Mat::copyTo(const _OutputArray&) const" is part of the OpenCV core module.

The import library containing this symbol depends on where you got the OpenCV package. For example:

OpenCV 4.5.5 official package: opencv\build\x64\vc15\lib\opencv_world455.lib

OpenCV 4.7.0 official package: opencv\build\x64\vc16\lib\opencv_world470.lib

Built from source with vcpkg, without [world] option: vcpkg\installed\x64-windows\lib\opencv_core.lib (and opencv_imgcodecs.lib)

Built from source with vcpkg, with [world] option: vcpkg\installed\x64-windows\lib\opencv_world.lib
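
The two official-package names above follow a pattern: the version digits are appended to "opencv_world". A small sketch of that pattern, in case it helps to pick the right file for another release; `opencv_world_lib` is a hypothetical helper, and it only covers the official packages, not the vcpkg names listed above.

```cpp
#include <string>

// Builds the import-library file name used by the official Windows packages
// listed above, e.g. version 4.5.5 -> "opencv_world455.lib".
// Assumption: other official releases follow the same naming pattern.
std::string opencv_world_lib(int major, int minor, int patch) {
    return "opencv_world" + std::to_string(major) + std::to_string(minor) +
           std::to_string(patch) + ".lib";
}
```

For example, `opencv_world_lib(4, 7, 0)` yields "opencv_world470.lib", which is the name to add to the linker's additional dependencies for the 4.7.0 package.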

 

Thanks.

 

SeshaP_Intel
Moderator
798 Views

Hi,

 

We are able to reproduce your issue. I have tried running the executable on Intel(R) UHD Graphics. Please find the output screenshot below.

SeshaP_Intel_0-1675241848737.png

Did you get the same output when running on HD630 Graphics? If not, please provide the output generated on HD630 Graphics.

 

Thanks and Regards,

Pendyala Sesha Srinivas

wdx04
Beginner
772 Views

Hi,

 

Yes, I got the same output on the i7-7700HQ except for the device name and iteration times:

results.jpg

Here is the output from the C++ AMP version of the program:

results2.jpg

The outputs when running on i7-6700K+A750:

results3.jpg

Thanks.

 

SeshaP_Intel
Moderator
570 Views

Hi,


Thanks for providing the information.

Could you please provide the C++ AMP files (including the executable)?

How are you measuring the GPU utilization (%)?

Could you please confirm that you are expecting a better performance in Parallel Iteration time on Intel(R) Arc(TM) A750 Graphics too?


Thanks and Regards,

Pendyala Sesha Srinivas


wdx04
Beginner
555 Views

Hi,

Here are the C++ AMP files:

https://1drv.ms/u/s!ApduJQeHcf7MgfEnByQHECipWDq_5g?e=5VeLQi

I just monitor the GPU utilization with GPU-Z or Windows Task Manager.

Of course, I'm expecting better performance in both Parallel Iteration time and Sequential Iteration time on Intel(R) Arc(TM) A750 Graphics. As mentioned before, performance on the A750 was much better when the program was compiled with the previous 2022.3 version of the DPC++ compiler.

Thanks.

SeshaP_Intel
Moderator
445 Views

Hi,


We were able to reproduce your issue. We are working on this internally.

We will get back to you soon.


Thanks and Regards,

Pendyala Sesha Srinivas

