Which implementation would be more effective: one large single work-item (WI) kernel, with communication through local memory inside the kernel, or several distributed kernels that communicate through channels? I see a lot of implementations using the distributed-kernel approach. Is writing the entire system as a single WI kernel challenging?
Merging everything into a single work-item kernel has the advantage of minimizing area utilization and creating the deepest possible pipeline, which maximizes absorption of external memory stalls. However, a single large kernel can complicate the circuit, resulting in a longer critical path and a lower operating frequency. Apart from that, in many cases it is easier for the programmer to split the code into multiple kernels connected by channels, which improves readability and eases debugging. Finally, in some cases channels can yield better performance: different parts of the application run in parallel as separate kernels launched in different queues, and the channels connecting them synchronize those kernels implicitly.
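To make the two options concrete, here is a minimal sketch of both structures. It assumes the Intel FPGA SDK for OpenCL and its channels extension, which the thread does not name explicitly. The kernel names (`fused`, `stage1`, `stage2`), the channel name `stage_ch`, the depth of 64, and the arithmetic in each stage are all hypothetical placeholders, not anything taken from the original posts.

```c
// Device code for the Intel FPGA OpenCL offline compiler (assumed toolchain);
// it will not build with a regular host C compiler.
#pragma OPENCL EXTENSION cl_intel_channels : enable

// Variant A: one single work-item kernel, both stages fused into one pipeline.
__attribute__((max_global_work_dim(0)))
__kernel void fused(__global const float *restrict in,
                    __global float *restrict out,
                    const int n)
{
    for (int i = 0; i < n; i++) {
        float x = in[i] * 2.0f;  // stage 1 (placeholder computation)
        out[i] = x + 1.0f;       // stage 2 (placeholder computation)
    }
}

// Variant B: the same work split into two kernels connected by a channel.
// The FIFO depth of 64 is an arbitrary illustrative value.
channel float stage_ch __attribute__((depth(64)));

__attribute__((max_global_work_dim(0)))
__kernel void stage1(__global const float *restrict in, const int n)
{
    for (int i = 0; i < n; i++) {
        // Blocking write: stalls while the channel FIFO is full.
        write_channel_intel(stage_ch, in[i] * 2.0f);
    }
}

__attribute__((max_global_work_dim(0)))
__kernel void stage2(__global float *restrict out, const int n)
{
    for (int i = 0; i < n; i++) {
        // Blocking read: stalls while the channel FIFO is empty.
        out[i] = read_channel_intel(stage_ch) + 1.0f;
    }
}
```

With Variant B, the host would need to launch `stage1` and `stage2` so that they can run concurrently, for example in two separate command queues. If both were enqueued in one in-order queue, `stage1` would stall once the channel filled and `stage2` would never start. Which variant reaches the higher operating frequency or the lower area depends on the actual design, so the compiler reports for both are worth comparing.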