Can host threads execute kernels concurrently with the Intel SDK for OpenCL? I heard that kernels (commands) from different command queues will be executed concurrently on the device. Is that true? Also, is "Device Fission" now supported on the GPU with the Intel OpenCL driver? That might be another way to implement it.
I use: Intel Core i7, Intel HD Graphics 4600, Intel SDK for OpenCL.
THX,
Lingzhi
Lingzhi,
Kernels cannot be executed concurrently on the GPU device using current production drivers. The Device Fission feature is available on the OpenCL CPU device only: see https://software.intel.com/en-us/articles/opencl-device-fission-for-cpu-performance.
Robert
allanmac,
We are collecting requirements and use cases for concurrent kernel execution. Please let me know what yours are and I will forward them to our product team. They are hesitant to add that functionality at the moment due to a lack of demand and realistic use cases.
OK. Here's my use case:
I have an advanced pipeline of kernels that are designed to run concurrently. Inter-kernel dependencies are currently managed by the kernel-launching logic and kernel-completion callbacks, but at some point I may move this work onto the OpenCL event system if it further reduces system latency.
Some of the kernels are computationally intense. Others are not. All run for short durations (from microseconds to at most a few milliseconds).
I don't care about presenting enough work to the IGP for it to reach its peak clock speed since I always have the option to make that happen by queuing up more work for the IGP.
But I do care about latency... which is why I really want concurrent kernels.
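To make the latency argument concrete, here is a rough sketch in plain C (no OpenCL calls) that compares two idealized cases for a small pipeline: the device running one kernel at a time, and every kernel starting as soon as its predecessor finishes. The `kernel_t` struct, the function names, and the single-predecessor rule are all hypothetical simplifications. The model ignores launch overhead and contention for execution units, so the concurrent figure is only an upper bound on the benefit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one kernel in the pipeline. Kernels must be listed
 * in dependency order: a kernel may only depend on an earlier entry. */
typedef struct {
    double duration_us; /* run time in microseconds */
    int depends_on;     /* index of the predecessor kernel, or -1 for none */
} kernel_t;

/* Serialized device: kernels run one at a time, so the pipeline latency
 * is simply the sum of all durations. */
double serialized_latency(const kernel_t *k, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += k[i].duration_us;
    return total;
}

/* Idealized concurrent device: each kernel starts the moment its predecessor
 * finishes, with no limit on how many kernels overlap. The pipeline latency
 * is the latest finish time, i.e. the longest dependency chain.
 * finish[] must have room for n entries and receives each finish time. */
double concurrent_latency(const kernel_t *k, size_t n, double *finish)
{
    double latest = 0.0;
    for (size_t i = 0; i < n; ++i) {
        assert(k[i].depends_on < (int)i); /* enforce dependency order */
        double start = (k[i].depends_on >= 0) ? finish[k[i].depends_on] : 0.0;
        finish[i] = start + k[i].duration_us;
        if (finish[i] > latest)
            latest = finish[i];
    }
    return latest;
}
```

With made-up durations of 100, 2000, 50 and 300 microseconds, where the third kernel depends on the first and the fourth on the second, the serialized latency is 2450 microseconds and the idealized concurrent latency is 2300. The short chain finishes at 150 microseconds instead of waiting behind the 2000 microsecond kernel, and that is the kind of gain I am after.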
---
That being said, I understand why the smaller IGPs probably aren't going to benefit much from concurrent kernel execution.
But a double or triple-slice IGP seems like it would be a good environment for concurrent kernel execution. :)
allanmac,
In the short term, we have nested parallelism in OpenCL 2.0 (kernels launching other kernels), which should improve the latency situation. For more on nested parallelism, see my article: https://software.intel.com/en-us/articles/gpu-quicksort-in-opencl-20-using-nested-parallelism-and-work-group-scan-functions
You can also watch short videos on nested parallelism here:
- https://software.intel.com/en-us/videos/implementing-sierpi-ski-carpet-in-opencl-20
- https://software.intel.com/en-us/videos/gpu-quicksort-in-opencl-20
I will forward your input to our product team.
