I would like to port the following algorithm to OpenCL, and it seems device-side enqueues could help me improve performance.
I want to count how many times the following succeeds. At step i, I pick i numbers, sum them, and check whether the sum is bigger than a threshold. If it is, I move on to step i+1, and so on until the final step. For instance, at step 4 I pick 4 numbers, sum them, and compare against the threshold; if the sum is bigger, I repeat the process picking 5 numbers. If I reach the final step F, I count one success. I run this small piece of code N times. The real algorithm classifies the pixels of an image into two categories (pixels that pass all the steps and pixels that do not), where N is the width*height of the image.
Characteristics of the algorithm:
- every new step is more computationally expensive (I pick more numbers), but it is computed fewer times
- the computation time of each step can easily be calculated
- it is unknown in advance at which step the sum will fail to exceed the threshold
I have tried the following naïve algorithm in OpenCL:
As I need to run the algorithm N times, I launch a kernel with N work-items, and each thread is in charge of the whole process. The main problem is the unbalanced workload between threads: two threads of the same warp can end at different stages (one thread at stage 2, but another one at stage 10), so most of the threads sit idle waiting for the others to finish.
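The naive kernel might look like the following OpenCL C sketch. `pick_number()` and `thresholds` are hypothetical placeholders for however the real data is laid out; the early `break` is exactly where the divergence comes from:

```c
/* Sketch of the naive kernel: one work-item per pixel runs every step itself. */
float pick_number(__global const float *data, int gid, int step, int k); /* placeholder */

__kernel void classify_naive(__global const float *data,
                             __global const float *thresholds,
                             __global int *pass,
                             int final_step)
{
    int gid = get_global_id(0);
    int step = 1;
    while (step <= final_step) {
        float sum = 0.0f;
        for (int k = 0; k < step; ++k)
            sum += pick_number(data, gid, step, k);
        if (sum <= thresholds[step - 1])
            break;    /* this work-item is done, but its warp keeps running */
        ++step;
    }
    pass[gid] = (step > final_step);
}
```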
So far, I have different ideas:
- Launch from the host a kernel over N work-items that takes care of j steps (where j is much smaller than the total number of steps), compact the threads that should still be working, and launch them again from the host. Launch and compact until I get to the final step.
- Launch from the host a kernel over N work-items that takes care of j steps, and continue later on the CPU, which will be in charge of the last, computationally expensive steps.
- Launch from the host a kernel over N work-items, and each work-item will launch mini-kernels from the device, each in charge of only one step.
If I use nested parallelism, will I increase the occupancy of my threads? Which solution should have better performance? Will a kernel that launches only one mini-kernel need to wait for a kernel that launches many of them?
Thanks in advance
One more idea is to:
Launch from the host a kernel over N work-items that takes care of j steps (where j is much smaller than the total number of steps), compact the threads that should still be working, and launch them again from the DEVICE. Launch and compact until you reach the final step, with all the launching and compacting done on the GPU.
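Assuming your device and driver support OpenCL 2.0 device-side enqueue, the all-on-GPU variant might be sketched as below. `run_j_steps()`, `J`, and `FINAL_STEP` are hypothetical placeholders, and the compaction here uses a simple global atomic counter rather than a work-group scan:

```c
#define J 4
#define FINAL_STEP 12

int run_j_steps(int pixel, int first_step);   /* placeholder per-pixel work */

/* One-work-item controller: decides on the GPU whether to relaunch. */
void controller_step(__global int *active, __global int *next_active,
                     __global int *count, __global int *next_count,
                     int first_step, __global int *passed)
{
    int n = *count;
    if (n == 0 || first_step > FINAL_STEP) return;
    *next_count = 0;
    queue_t q = get_default_queue();

    /* launch the workers for the next j steps over the compacted survivors */
    enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(n), ^{
        int i = get_global_id(0);
        int pixel = active[i];
        if (run_j_steps(pixel, first_step)) {
            if (first_step + J > FINAL_STEP)
                atomic_inc(passed);                           /* done: count */
            else
                next_active[atomic_inc(next_count)] = pixel;  /* compact */
        }
    });

    /* re-enqueue the controller to run after the workers finish,
       with the active/next buffers swapped */
    enqueue_kernel(q, CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1), ^{
        controller_step(next_active, active, next_count, count,
                        first_step + J, passed);
    });
}

__kernel void controller(__global int *active, __global int *next_active,
                         __global int *count, __global int *next_count,
                         __global int *passed)
{
    controller_step(active, next_active, count, next_count, 1, passed);
}
```

The host enqueues `controller` once with an NDRange of size 1; every subsequent launch decision is made on the device.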
Here is my experience with device-side enqueue: the real time savings come from reducing the distance between kernel launches. For example, if you need to launch ~50 kernels, and every time one kernel finishes you need to decide whether to launch another, that decision can be made on the GPU, and you save quite a bit of time on the enqueue.
The important thing is: a kernel currently needs to fill the device; we are working on concurrent kernels, but they are not in the production drivers yet.
In your case you need to make sure that, as you launch new kernels, they fill the device, so my suggestion at the top could lead to better results. For decision making it is OK to launch a kernel with an NDRange of size 1 that will in turn launch other kernels. For an example of this, please see my GPU-Quicksort for OpenCL 2.0 article https://software.intel.com/en-us/articles/gpu-quicksort-in-opencl-20-using-nested-parallelism-and-work-group-scan-functions - the sample is at the bottom of the article.
Realistically, you will need to implement a number of different approaches and see how they stack up against each other. Let me know how it goes!