I am implementing local histograms over image patches in OpenCL. I was wondering whether there is a speed penalty if I launch a separate kernel for each histogram patch (subarray), instead of launching a single kernel that iterates over all image pixels, finds the current patch, and calculates its histogram. From a programming point of view it seems simpler to launch something like 64 kernels, each on a particular patch.
As with any claim of achieving better speed, the details matter a lot... but I'll offer some general comments:
In general, start with:
1) minimizing memory transfer overhead, and
2) exploiting embarrassingly parallel regions of code.
"a single kernel that will go through all image pixels"
Using a single kernel work-item to walk all (or many) elements of the problem domain suggests that #2 is not happening... and you could also be performing redundant compute and memory accesses (so #1 is not happening either).
For a problem like this, I recommend evaluating a hierarchical solution, i.e., spawn a kernel work-item for each local histogram, then sum the local histograms to get the final histogram. With this scheme, performance depends heavily on the number of pixels read to generate each local histogram.
Consider atomic built-in functions where necessary:
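A common pattern is to accumulate each work-group's histogram in `__local` memory with `atomic_inc`, then merge into the global result with one `atomic_add` per bin. The kernel below is a hedged sketch, not a tuned implementation; the kernel and argument names are illustrative, and it assumes 256 bins and an 8-bit image.

```c
// Sketch: per-work-group histogram in __local memory, merged atomically.
// Requires OpenCL 1.1+ (32-bit local/global atomics are core there).
__kernel void tile_histogram(__global const uchar *pixels,
                             __global uint *global_hist,
                             const uint num_pixels)
{
    __local uint lhist[256];
    const uint lid = get_local_id(0);
    const uint lsz = get_local_size(0);

    // Zero the local histogram cooperatively.
    for (uint b = lid; b < 256; b += lsz)
        lhist[b] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work-item strides over the pixels; local atomics are cheap
    // relative to hammering global memory with one atomic per pixel.
    for (uint i = get_global_id(0); i < num_pixels; i += get_global_size(0))
        atomic_inc(&lhist[pixels[i]]);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Merge: one global atomic per bin per work-group, not per pixel.
    for (uint b = lid; b < 256; b += lsz)
        atomic_add(&global_hist[b], lhist[b]);
}
```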
Consider trying subgroups:
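As one illustration of how subgroups can reduce atomic traffic, the sketch below has each work-item count matches for a single bin, combines the partial counts with `sub_group_reduce_add`, and lets only one lane per subgroup issue the global atomic. This assumes the `cl_khr_subgroups` extension (OpenCL 2.0+); the one-bin-per-launch shape is purely for illustration, not a practical histogram strategy on its own.

```c
// Sketch using cl_khr_subgroups: reduce within the subgroup first,
// so only one lane per subgroup touches global memory atomically.
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

__kernel void subgroup_bin_count(__global const uchar *pixels,
                                 __global uint *global_hist,
                                 const uint num_pixels,
                                 const uint bin)  // one bin, for illustration
{
    uint count = 0;
    for (uint i = get_global_id(0); i < num_pixels; i += get_global_size(0))
        count += (pixels[i] == bin);

    // Combine partial counts across the subgroup.
    uint sg_total = sub_group_reduce_add(count);
    if (get_sub_group_local_id() == 0)
        atomic_add(&global_hist[bin], sg_total);
}
```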
Lastly, consider using Intel VTune Amplifier XE to perform a GPU hotspots analysis on Intel systems. The basic hotspots analysis gives feedback on whether you are getting reasonable utilization from the Intel Processor Graphics target, and the summary page recommends the types of bottlenecks your application is experiencing (if any).
All that being said... again, performance depends on many details. The best approach is for developers to benchmark the competing schemes directly.