Intel® oneAPI DPC++/C++ Compiler

Parallel processing execution time overhead problem

k_higashi
Beginner

Hello.

I ran a simple matrix operation in parallel with DPC++ on the GPU and found the following problem:

・Executing a DPC++ parallel_for incurs an overhead (offset) of about 140 ms.
・This overhead is independent of the matrix size.

Development Environment:

OS: Windows 10 Home (64bit)
CPU: Intel Core i7-1065G7 @ 1.30 GHz
GPU: Intel Iris Plus Graphics

Microsoft Visual Studio Professional 2019 Version 16.5.5
Intel oneAPI 2021.1 Beta08

 

To analyze this problem, I ran an additional experiment.

I executed the parallel processing (parallel_for) twice in the same program.
I have attached the source code used for the experiment (TestDPC_TimeOverhedProblem.zip).

The result shows that the overhead appears almost entirely in the first execution.

For example, when multiplying 256 x 256 matrices, the processing times are as follows:

1st execution: 142 ms (multiply.cpp, lines 47-64)
2nd execution: 2 ms (multiply.cpp, lines 78-108)

For comparison, the first execution does not perform any computation in the kernel (it is an empty, dummy kernel).
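For reference, this is roughly the structure of the test (a simplified sketch, not the attached code itself; the kernel and variable names here are made up, and it uses current SYCL 2020 syntax rather than the Beta08 syntax):

```cpp
#include <sycl/sycl.hpp>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
  constexpr size_t N = 256;
  std::vector<float> a(N * N, 1.0f), b(N * N, 2.0f), c(N * N, 0.0f);

  sycl::queue q{sycl::gpu_selector_v};  // queue on the Intel GPU

  sycl::buffer<float, 2> A(a.data(), sycl::range<2>{N, N});
  sycl::buffer<float, 2> B(b.data(), sycl::range<2>{N, N});
  sycl::buffer<float, 2> C(c.data(), sycl::range<2>{N, N});

  // Helper: submit a kernel, wait for it, and return the elapsed wall-clock time in ms.
  auto time_submission = [&](auto submit) {
    auto t0 = std::chrono::steady_clock::now();
    submit();
    q.wait();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
  };

  // 1st launch: empty kernel that does no computation (the "dummy" kernel).
  auto first_ms = time_submission([&] {
    q.submit([&](sycl::handler& h) {
      h.parallel_for(sycl::range<2>{N, N}, [=](sycl::item<2>) { /* no work */ });
    });
  });

  // 2nd launch: the actual matrix multiplication C = A * B.
  auto second_ms = time_submission([&] {
    q.submit([&](sycl::handler& h) {
      sycl::accessor accA(A, h, sycl::read_only);
      sycl::accessor accB(B, h, sycl::read_only);
      sycl::accessor accC(C, h, sycl::write_only, sycl::no_init);
      h.parallel_for(sycl::range<2>{N, N}, [=](sycl::item<2> it) {
        float sum = 0.0f;
        for (size_t k = 0; k < N; ++k)
          sum += accA[it[0]][k] * accB[k][it[1]];
        accC[it[0]][it[1]] = sum;
      });
    });
  });

  std::cout << "1st launch: " << first_ms << " ms, 2nd launch: " << second_ms << " ms\n";
  return 0;
}
```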


The following questions are based on these observations.
QUESTION:
1. According to the DPC++ specification, is this overhead a reasonable (expected) result?
2. What causes this overhead in DPC++ processing?
3. Can the overhead be reduced by changing the implementation?
-----
If this behavior is expected and the actual computation time is small (140 ms or less),
then using DPC++ parallel processing is slower than not using it.
In my environment, because of this overhead, a plain C++ for loop is faster up to a matrix size of about 1024 x 1024!

Best Regards.

 

RahulV_intel
Moderator

Hi,


Could you tell us the time taken by the kernel if you run only the actual computation kernel (the 2nd one) in your environment?


If possible, run it at least 5 times and average all the readings.


Thanks,

Rahul


k_higashi
Beginner

Thank you for your answer.

 

If I execute only the actual computation kernel (the 2nd matrix calculation),
its time is roughly equal to the total time of the run that executes both kernels.

For example:

only the 2nd one: about 240 ms

if I execute it twice:
1st: about 140 ms
2nd: about 100 ms


>If possible, run it at least 5 times and average all the readings.

The run-to-run variation in processing time is only a few ms in my environment.

If this very long overhead is normal DPC++ behavior, I think it should be easy to reproduce in your environment.

 

Best regards.

RahulV_intel
Moderator

Hi,

 

There are multiple sources of overhead (JIT compilation overhead, kernel-creation overhead, etc.) that kick in during the first kernel launch. I'd suggest using the Intel VTune Profiler to see exactly where the overheads are coming from.

 

To gain performance from DPC++, the application needs to be large enough (in terms of compute time) to amortize these overheads. In short, there will be hardly any performance benefit if your application isn't large enough, because the overhead can then be significant compared to the actual compute time.

 

My suggestion would be to run the compute kernel first (the 2nd kernel), followed by the dummy kernel (the 1st kernel). You should then see that it isn't the empty kernel that takes the time; rather, it is the overhead associated with the first kernel launch.

 

Hope this helps.

 

Regards,

Rahul

 

k_higashi
Beginner
Thank you for your answer.

I understand that, by the nature of DPC++, there is overhead at the first kernel launch.
With this behavior in mind, I would like to ask additional questions about how to structure the program.

If I want to perform multiple operations on the Intel GPU with DPC++, I plan to program in the following steps.
----
1. After launching the application, create a queue for the selected device (GPU).
2. Run a kernel once with the queue created in step 1.
3. Use that single queue from step 1 to perform all the required operations.
 (In other words, different DPC++ parallel processes are executed many times.)
---
In this way, I think the overhead is incurred only once (in step 2), as sketched below.
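
A rough sketch of the structure I have in mind (hypothetical kernels and names; USM shared allocation is used here only for brevity, the real application may use buffers):

```cpp
#include <sycl/sycl.hpp>

// Step 1: one queue for the selected device, created once and reused everywhere.
// (A global is used here only to keep the sketch short.)
static sycl::queue g_q{sycl::gpu_selector_v};

// Step 2: a one-time warm-up submission that pays the first-launch overhead up front.
void warm_up() {
  g_q.parallel_for(sycl::range<1>{1}, [=](sycl::id<1>) { /* no work */ }).wait();
}

// Step 3: every later operation reuses the same queue, so the first-launch overhead
// is paid only once (in the warm-up).
void scale(float* data, size_t n, float factor) {
  g_q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { data[i] *= factor; }).wait();
}

void offset(float* data, size_t n, float delta) {
  g_q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { data[i] += delta; }).wait();
}

int main() {
  warm_up();                                         // step 2
  constexpr size_t n = 1024;
  float* data = sycl::malloc_shared<float>(n, g_q);  // USM shared allocation
  for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

  scale(data, n, 2.0f);                              // step 3: many different operations,
  offset(data, n, 0.5f);                             // all submitted to the same queue
  scale(data, n, 0.25f);

  sycl::free(data, g_q);
  return 0;
}
```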
 
Questions:
・Are there any problems or risks with this approach?
・If there are, please tell me the reasons.

Best regards.
RahulV_intel
Moderator

Hi,

 

That's right. If you use a single queue for multiple kernel invocations (all targeting the same queue), the overhead comes into effect only once (for the first kernel alone). It is also recommended best practice to associate only one queue with a particular device in order to get the best performance.

 

If you have multiple queues associated with a single device, there would be context-creation and kernel-creation/JIT overheads each time you create a new queue (even though you target the same device). This approach is not recommended.

 

In short, it is good practice to associate only one queue with a single device and to target that same queue for kernel offload throughout your application.

 

Hope this helps.

 

Regards,

Rahul

 

LaurentPlagne
Novice

I have observed the same behavior.

I think that at the first call, the kernel is compiled for the given device and argument types.

You can avoid this with ahead-of-time (AOT) compilation:

https://software.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-compiler-dev-guide-and-reference/top/compilation/ahead-of-time-compilation.html
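
For example, the documented AOT invocation looks roughly like this (a sketch only: the exact driver name, target triple, and -device value depend on your toolchain version and GPU, so please check the page above):

```
# Ahead-of-time compilation for an Intel GPU target.
# "icllp" is an example device name (Ice Lake LP / Iris Plus); verify the correct
# name for your hardware. The 2021.1 Beta toolchain used the dpcpp driver and a
# longer spir64_gen-unknown-unknown-sycldevice target triple.
icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend "-device icllp" multiply.cpp -o multiply
```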

k_higashi
Beginner
Thank you.

I didn't know about this compile option.

If the execution environment can be limited to known devices, this method can be used.
I will follow your advice.
RahulV_intel
Moderator

Hi,


AOT does not eliminate all of the overheads. Since you know the device you will be targeting, AOT is recommended: the target device is known at compile time, which lets the AOT compiler generate device-specific code rather than generic SPIR-V (as in the JIT case).


There won't be any JIT overhead with AOT compilation, but various other overheads may still be present. I'd suggest profiling your application with the Intel VTune Profiler for more details.


Could you let me know if I can close the thread from my end?


Thanks,

Rahul



k_higashi
Beginner

My questions have been answered.
Please close the thread.
Thank you.

RahulV_intel
Moderator

Thanks for the confirmation.


Intel will no longer monitor this thread. However, the thread will remain open for community discussion.

