I'm working on writing a global reduction in OpenCL 2.0. I started with the implementation from CLOGS:
Essentially, the approach is a series of workgroup-wide reductions whose partial results are combined at the end. I thought I would try updating the implementation to use the OpenCL 2.0 built-in workgroup reduction, i.e. work_group_reduce_add().
I was surprised that the global reduction performs slower when the workgroup reductions are computed using the built-in. Specifically, I ran a test of 1000 reductions on randomly sized arrays (sizes in the range 1–100000). The random number generator is given the same seed, so the sizes are identical across runs.
Using the built-in reduction, the combined total kernel time is ~39 ms. Using the CLOGS approach, the total kernel time is ~32 ms. I was also surprised to see that the kernel using the built-in reduction used 1796 bytes of local memory, while the CLOGS approach used only 1028 bytes (as reported by the OpenCL Code Builder).
We're interested in learning why the built-in reduction appears to be slower and to use more resources than the simple CLOGS approach. Are there trade-offs we're not aware of? We see similar results on an AMD GPU.
I've attached the kernel file that implements the global reduction (with both the built-in and the CLOGS approach to workgroup-wide reductions).
The GPU I am using is:
Thanks in advance!
Thanks for bringing this to our attention and sorry for the delayed reply. We will definitely want to look into this further as it could mean there are opportunities for improvement in our implementation. For now, to be honest, we have some hunches but so far no obvious answer for the performance difference.
Are the reductions a critical bottleneck for your application? Or does the CLOGS approach meet your needs?
Thanks for the reply, Jeffrey! We'd definitely be interested in hearing more details if/when they become available.
Reductions are not our critical bottleneck and the CLOGS approach will work fine for now. We will definitely switch to the builtins if/when the performance catches up. I really like the idea of these builtins (much easier than trying to write efficient reductions/scans for different chips).
I have experienced similar issues on other platforms as well, so I wrote a small benchmark to evaluate workgroup/subgroup reductions.
For more information you may check:
Very interesting results Elias, thanks for sharing!
If I'm reading your results correctly, the only platform where the workgroup function outperforms the shared-memory implementation is the Intel CPU (and by a considerable margin, roughly 6x). On some of the AMD chips, it looks like the hybrid approach gives small performance benefits.
I'll try running this on our Intel HD5500 and Iris 6100 sometime this week and let you know the results.