OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

Strange behavior of dynamic parallelism

spiridonov__igor
Beginner
246 Views

Hello.

I'm running a kernel on an Intel GPU (Intel® HD Graphics 630, latest driver) with the following device settings.

=============device=============
DEVICE TYPE =  CL_DEVICE_TYPE_GPU
MAX_GROUP_SIZE = 256
LOCAL_MEM_SIZE = 65536
MAX_COMPUTE_UNITS = 24
MAX_WORK_ITEM_DIMENSIONS = 3
dim[0] = 256 dim[1] = 256 dim[2] = 256 

 

enqueue_kernel can fail due to lack of resources. That's fine, but I don't understand why it fails so early. It fails as soon as the number of global work items exceeds 256, in just one dimension. Why not 256 * 256 * 256?

The second problem is that I don't know how to handle the failure. I tried checking the returned value in a loop like while(0 != enqueue_kernel(.. but the program hangs.

Here is my experiment - https://github.com/OmegaDoom/enque_kernel_test

4 Replies
Ben_A_Intel
Employee

Hi Igor,

For Intel GPUs, at least with recent drivers, the default device queue size is 128KB.  The amount of space required in the device queue per enqueue is roughly 128 bytes, plus some extra space for values captured by the enqueued child kernel / block.  This means you have a maximum of about 1000 enqueues before running out of space, by default.  If you have many work items, each enqueuing a child kernel, it's relatively easy to run out of space, as you've seen.

If you require a larger number of in-flight enqueues, then you can create a larger device queue.  Intel GPUs with recent drivers support a maximum device queue size of 64MB, which should give you quite a bit of additional headroom.

spiridonov__igor
Beginner

Thank you Ben.

Now I can enqueue a lot of instances.

But what do we do if we run out of queue space anyway? Say we have a sorting algorithm over a huge data set. What can we do when enqueue_kernel returns an error? I tried a loop that retries while the returned value is non-zero, but the program hangs.

Ben_A_Intel
Employee

spiridonov, igor wrote:

But what do we do if we run out of queue space anyway? Say we have a sorting algorithm over a huge data set. What can we do when enqueue_kernel returns an error? I tried a loop that retries while the returned value is non-zero, but the program hangs.

This isn't an easy question to answer, unfortunately.

One key thing to realize is that in the Intel GPU device-enqueue implementation, the application kernels producing enqueues into the device queue share computational resources with the driver's "scheduler" kernel that consumes the enqueued kernels out of the device queue.  This means you may deadlock, or generate an out-of-queue error, if an application kernel enqueues too many kernels without giving the scheduler kernel a chance to execute and drain the queue.  This is likely why your retry loop hangs: spinning on enqueue_kernel never gives the scheduler a chance to free up space.

This is an imperfect rule, but in general I would try to size your device queue so that it can hold an entire "batch" of application enqueues, either by creating a relatively large device queue, by enqueuing a relatively small number of child kernels per batch, or some combination of the two.

Would you mind describing in more detail what you're hoping to accomplish with device enqueue?  Thanks!

 
spiridonov__igor
Beginner

Thank you a lot, Ben.

This information is very helpful. It would be nice to be able to calculate the maximum number of queue elements, but I don't see how to get the size of a single enqueue. A possible workaround might be to use several queues.

I don't have a particular goal, but I'm interested in sorting on the GPU, where data sets can be very big.
