I ran a simple matrix operation on the GPU using DPC++ parallelism, but I found the following problem:
・Executing a DPC++ parallel_for incurs an overhead (offset) of about 140 ms.
・This overhead is independent of the matrix size.
OS: Windows 10 Home (64bit)
CPU: Intel Core i7-1065G7 1.3GHz
GPU: Intel Iris Plus Graphics
Microsoft Visual Studio Professional 2019 Version 16.5.5
Intel oneAPI 2021.1 Beta08
To analyze this problem, I additionally conducted the following experiment:
I executed the parallel process (parallel_for) twice in the same code.
I am sharing the source code used for the experiment. (TestDPC_TimeOverhedProblem.zip)
As a result, I found that this overhead is incurred mainly on the first execution.
For example, when multiplying 256 x 256 matrices, the processing times are as follows.
1st execution: 142 ms (multiply.cpp lines 47-64)
2nd execution: 2 ms (multiply.cpp lines 78-108)
For comparison, the first execution performs no computation in the kernel (it is an empty kernel).
The following are questions based on the facts so far.
1. According to the DPC++ specification, is the above overhead an expected result?
2. What causes the overhead time in DPC++ processing?
3. Is it possible to reduce the overhead time by devising the implementation method?
If this behavior is real, then for workloads whose compute time is small (140 ms or less),
using DPC++ parallel processing will be slower than not using it.
In my environment, because of this overhead, a matrix of about 1024x1024 is still faster to compute with a plain C++ for loop!
Could you tell us the time taken by the kernel in your environment if you run only the actual computation kernel (the 2nd one) alone?
If possible, run it at least 5 times and average the readings.
Thank you for your answer.
If I execute only the actual computation kernel (the 2nd matrix calculation),
its time is roughly equal to the total time when both kernels are executed.
only the 2nd one: about 240 ms
if I execute both:
1st: about 140 ms
2nd: about 100 ms
>If possible, run it for 5 times at least and average all the readings.
The variation in processing time is only a few ms in my environment.
If this very long overhead is common DPC++ behavior, I think it will be easy to reproduce in your environment.
There are multiple reasons for the overheads (JIT compilation, kernel creation, etc.) that kick in during the first kernel launch. I'd suggest using the VTune Profiler to see exactly where the overheads are coming from.
To gain performance from DPC++, we need to ensure that the application is large enough (in terms of compute time) to amortize these overheads. In short, there will be hardly any performance benefit if your application isn't large enough, in which case the overhead can be quite significant compared to the actual compute time.
My suggestion would be to run the compute kernel first (the 2nd kernel), followed by the dummy kernel (the 1st kernel). You should then see that it isn't the empty kernel that takes the time; the cost comes from the overheads associated with the first kernel launch.
Hope this helps.
That's right. If you use a single queue for multiple kernel invocations (targeting the same queue), the overhead will come into effect only once (for the first kernel alone). It is also recommended best practice to associate only one queue with a particular device in order to gain a performance benefit.
If you have multiple queues associated with a single device, you pay the context-creation and kernel-creation/JIT overheads each time you create a new queue (even though the device you target is the same). This approach is not recommended.
In short, it is considered good practice to associate only one queue with a single device and target that same queue for kernel offload throughout your application.
Hope this helps.
I have observed the same behavior.
I think that on the first call, the kernel is JIT-compiled for the given device and argument types.
You can avoid this with AOT (ahead-of-time) compilation:
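For example, with the oneAPI DPC++ compiler an ahead-of-time build for an Intel Gen11 GPU (the Iris Plus in the i7-1065G7) looks roughly like the following. Treat this as an illustration: the exact `-device` string and flag spelling depend on your toolchain version, so check your compiler's documentation.

```sh
# AOT-compile the device code for a specific Intel GPU (Gen11 here)
# instead of shipping generic SPIR-V that is JIT-compiled at first launch.
dpcpp -fsycl-targets=spir64_gen-unknown-unknown-sycldevice \
      -Xs "-device gen11" multiply.cpp -o multiply
```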
AOT does not eliminate all of the overheads. Since you know which device you will be targeting, AOT is recommended: the target device is known at compile time, which enables the AOT compiler to generate device-specific code rather than generic SPIR-V (as in the JIT case).
With AOT compilation there is no JIT overhead, but various other overheads may still be present. I'd suggest profiling your application with the VTune Profiler for more details.
Could you let me know if I can close the thread from my end?