Parallel processing execution time overhead problem

k_higashi · 08-16-2020

Hello.

I ran a simple matrix operation in parallel on the GPU using DPC++,
and I found the following problem.

・Executing a DPC++ parallel_for incurs a fixed overhead (offset) of about 140 ms.
・This overhead is independent of the matrix size.

Development Environment:

OS: Windows 10 Home (64bit)
CPU: Intel Core i7-1065G7 1.3GHz
GPU: Intel Iris Plus Graphics

Microsoft Visual Studio Professional 2019 Version 16.5.5
Intel oneAPI 2021.1 Beta08

To analyze this problem, I additionally conducted the following experiments.

I ran the parallel process (parallel_for) twice in the same program.
The source code used for the experiment is attached (TestDPC_TimeOverhedProblem.zip).

The experiment showed that the overhead is incurred on the first execution.

For example, when multiplying matrices with a matrix size of 256 x 256, the processing time is as follows.

1st execution: 142 [ms] (multiply.cpp lines 47-64)
2nd execution: 2 [ms] (multiply.cpp lines 78-108)

For comparison, the kernel in the first execution performs no computation.

The following are my questions based on the facts so far.
QUESTION:
1. Under the DPC++ specification, is the overhead described above expected?
2. What causes this overhead in DPC++ processing?
3. Can the overhead be reduced by changing how the code is implemented?
-----
If this behavior is expected and the original processing time is small (140 ms or less),
then using DPC++ parallel processing will be slower than not using it.
In my environment, because of this overhead, a plain C++ for loop is faster for a matrix size of about 1024 x 1024!

Best Regards.

RahulV_intel · 08-17-2020

Hi,

Could you tell me how long the kernel takes in your environment if you run only the actual computation kernel (the 2nd one)?

If possible, run it at least 5 times and average the readings.

Thanks,

Rahul
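
A minimal sketch of how such repeated timings could be collected, in plain C++ with no DPC++ dependency. The names `time_runs` and `steady_state_mean` are hypothetical helpers, not part of any library. The callable is assumed to block until the work has finished (for a DPC++ kernel that would mean waiting on the queue inside the callable), and the first sample is reported separately because it carries the one-time start-up cost.

```cpp
#include <cassert>
#include <chrono>
#include <numeric>
#include <vector>

// Hypothetical helper: calls `work` `runs` times and returns the
// wall-clock duration of each call in milliseconds.
template <typename Work>
std::vector<double> time_runs(Work work, int runs) {
    assert(runs > 0);
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();  // assumed to block until the work has completed
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}

// Mean of every sample except the first one, so that the one-time
// first-launch cost does not distort the steady-state figure.
double steady_state_mean(const std::vector<double>& ms) {
    assert(ms.size() > 1);
    const double sum = std::accumulate(ms.begin() + 1, ms.end(), 0.0);
    return sum / static_cast<double>(ms.size() - 1);
}
```

Running six iterations and comparing the first sample against `steady_state_mean` of the result separates the first-launch cost from the per-launch cost.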

k_higashi · 08-17-2020

Thank you for your answer.

If I execute only the actual computation kernel (the 2nd matrix calculation),
it takes about as long as the two executions combined.

For example,

only the 2nd one: about 240ms

If I execute it twice:
1st: about 140ms
2nd: about 100ms

>If possible, run it for 5 times at least and average all the readings.

The variation in processing time is several ms in my environment.

If this very long overhead time is a common behavior of DPC++, I think it will be easily reproduced in your environment.

Best regards.

RahulV_intel · 08-18-2020

Hi,

There are multiple sources of overhead (JIT compilation, kernel creation, etc.) that kick in during the first kernel launch. I'd suggest using the VTune Profiler to see exactly where the overheads come from.

To gain performance from DPC++, the application needs to be large enough (in terms of compute time) to amortize these overheads. In short, there will be hardly any performance benefit if your application isn't large enough, because the overhead can then be significant compared to the actual compute time.

My suggestion would be to run the compute kernel first (2nd kernel), followed by the dummy kernel (1st kernel). You should then see that it is not the empty kernel itself that takes the time; the time comes from the overheads associated with the first kernel launch.

Hope this helps.

Regards,

Rahul
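
The "large enough" argument above can be put into numbers with a small sketch. It assumes a simple cost model that is not stated in the thread: offloading costs a one-time overhead plus a fixed time per launch, while the CPU loop costs a fixed time per run. The function names are hypothetical, and any CPU time plugged in is an assumption, since the thread does not report one.

```cpp
#include <cassert>
#include <cmath>

// Assumed model: total offload time is the one-time first-launch
// overhead plus the per-launch kernel time for every launch.
double offload_total_ms(double overhead_ms, double kernel_ms, int launches) {
    assert(launches > 0);
    return overhead_ms + kernel_ms * launches;
}

// Smallest number of launches n for which
//   overhead_ms + n * kernel_ms < n * cpu_ms,
// or -1 if the kernel is not faster than the CPU loop per launch.
int break_even_launches(double overhead_ms, double kernel_ms, double cpu_ms) {
    assert(overhead_ms >= 0.0 && kernel_ms >= 0.0 && cpu_ms >= 0.0);
    if (cpu_ms <= kernel_ms) return -1;
    // Rearranged: n > overhead_ms / (cpu_ms - kernel_ms).
    const double bound = overhead_ms / (cpu_ms - kernel_ms);
    return static_cast<int>(std::floor(bound)) + 1;
}
```

With the 140 ms overhead and 2 ms kernel time reported earlier and a purely hypothetical CPU time of 30 ms per run, `break_even_launches(140.0, 2.0, 30.0)` returns 6: offloading wins from the sixth launch onward (152 ms against 180 ms).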

k_higashi · 08-19-2020

Thank you for your answer.

I understand that, by design, DPC++ incurs overhead at the first kernel launch.

With this behavior in mind, I have additional questions about how to structure my program.

If I want to run multiple operations on an Intel GPU (DPC++), I plan to program it in the following steps.

----
1. After launching the application, I create a queue for the selected device (GPU).

2. I run the kernel once with the queue created in step 1.

3. I use the single queue I created in step 1 to perform all of the required operations.
   (In other words, different DPC++ parallel processes are executed many times.)
----

This way, I think the overhead is incurred only once (in step 2 only).

Question:

・Are there any problems or risks in using this method?
・If there are any problems, please tell me the reason.

Best regards.

RahulV_intel · 08-19-2020

Hi,

That's right. If you use a single queue for multiple kernel invocations, the overhead is incurred only once (for the first kernel alone). It is also recommended best practice to have only one queue associated with a particular device, for performance reasons.

If you have multiple queues associated with a single device, you incur context_create and kernel_create/jitting overheads each time you create a new queue (even though the target device is the same). This approach is not recommended.

In short, it is good practice to associate only one queue with each device and to target that queue for kernel offload throughout your application.

Hope this helps.

Regards,

Rahul
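
The one-queue-per-device pattern discussed above could be sketched as a process-wide accessor. This is a generic stand-in written in standard C++, not code from the attached project: `shared_queue` is a hypothetical helper, and `Queue` is a placeholder for whatever queue type the application uses (with DPC++ this would be the SYCL queue type). The sketch default-constructs the queue; selecting the GPU explicitly would need a constructor argument, which is left out here.

```cpp
#include <functional>

// Hypothetical helper: returns the single process-wide Queue instance.
// The queue is constructed on the first call, and the optional warm_up
// callable runs exactly once, so one-time start-up costs are paid in a
// single, predictable place. Later calls ignore warm_up.
template <typename Queue>
Queue& shared_queue(const std::function<void(Queue&)>& warm_up = nullptr) {
    static Queue instance;  // initialized once (thread-safe since C++11)
    static const bool warmed = [&] {
        if (warm_up) warm_up(instance);  // e.g. submit a trivial kernel and wait
        return true;
    }();
    (void)warmed;  // only used for its one-time side effect
    return instance;
}
```

Calling `shared_queue<Queue>(warm_up)` once at application start-up and `shared_queue<Queue>()` everywhere else keeps every submission on the same queue object, which corresponds to steps 1 to 3 described earlier in the thread.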

k_higashi · 08-23-2020

Thank you.

I will use this method.

LaurentPlagne · 08-20-2020

I have observed the same behavior.

I think that on the first call, the kernel is compiled for the given device and argument types.

You can prevent this with ahead-of-time (AOT) compilation:

https://software.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-compiler-dev-guide-and-reference/top/compilation/ahead-of-time-compilation.html

k_higashi · 08-23-2020

Thank you.

I didn't know about this compile option.

This method can be used when the execution environment can be restricted.
I will follow your advice.

RahulV_intel · 08-24-2020

Hi,

AOT does not prevent all of the overheads. Since you know which device you will be targeting, AOT is recommended. The target device is known at compile time, which enables the AOT compiler to generate device-specific code rather than generic SPIR-V (as in the JIT case).

There wouldn't be any JIT overhead when using AOT compilation, but various other overheads could remain. I'd suggest profiling your application with the VTune Profiler for more details.

Could you let me know if I can close the thread from my end?

Thanks,

Rahul

k_higashi · 08-25-2020

My question has been answered.
Please close the thread.
Thank you.

RahulV_intel · 08-26-2020

Thanks for the confirmation.

Intel will no longer monitor this thread. However, the thread will remain open for community discussion.