Please take a look at my OpenCL 2.0 tutorial on the use of enqueue_kernel and work-group scan functions. It also has a very cool algorithm, GPU-Quicksort, implemented in both OpenCL 1.2 and 2.0.
Let me know what you think!
GPU-Quicksort in OpenCL 2.0 is a very fast sorting algorithm when run on the latest Intel processors with Intel Processor Graphics: it is about 15% faster than a parallel CPU Quicksort. At the same time, GPU-Quicksort is a great way to showcase device self-enqueue (enqueue_kernel) and the work-group scan functions in OpenCL 2.0.
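
For readers who have not used the work-group scan functions, here is a rough idea of why a scan is handy for a Quicksort-style partition. This is only a minimal CPU-side sketch in plain C, not code from the tutorial: exclusive_scan_add stands in for what the OpenCL 2.0 built-in work_group_scan_exclusive_add computes across a work-group, and partition_by_scan and MAX_ITEMS are hypothetical names made up for this illustration. Each element flags whether it is below the pivot, the exclusive scan of those flags gives every element its output slot, and the elements are then scattered without any two writing to the same place.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 256 /* hypothetical cap, playing the role of a work-group size */

/* CPU stand-in for the result of work_group_scan_exclusive_add:
   out[i] = in[0] + ... + in[i-1], with out[0] = 0.
   Returns the sum of all inputs. */
static unsigned exclusive_scan_add(const unsigned *in, unsigned *out, size_t n)
{
    unsigned running = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = running;
        running += in[i];
    }
    return running;
}

/* Hypothetical helper: copies src into dst so that all elements < pivot
   come first, followed by all elements >= pivot, keeping the relative
   order inside each group. Returns the number of elements < pivot.
   The caller must ensure n <= MAX_ITEMS. */
static size_t partition_by_scan(const unsigned *src, unsigned *dst,
                                size_t n, unsigned pivot)
{
    unsigned flags[MAX_ITEMS];
    unsigned offsets[MAX_ITEMS];

    assert(n <= MAX_ITEMS);

    /* Step 1: every "work-item" flags whether its element is below the pivot. */
    for (size_t i = 0; i < n; ++i)
        flags[i] = (src[i] < pivot) ? 1u : 0u;

    /* Step 2: the exclusive scan turns the flags into output offsets. */
    unsigned lt_count = exclusive_scan_add(flags, offsets, n);

    /* Step 3: scatter. offsets[i] counts the smaller elements before i,
       so (i - offsets[i]) counts the not-smaller elements before i. */
    for (size_t i = 0; i < n; ++i) {
        if (flags[i])
            dst[offsets[i]] = src[i];
        else
            dst[lt_count + (i - offsets[i])] = src[i];
    }
    return lt_count;
}
```

On a GPU the three loops disappear: each work-item handles its own index and the scan is a single built-in call, which is what makes the OpenCL 2.0 version shorter than a hand-written scan in OpenCL 1.2.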