Hello, I would appreciate it if anyone could answer my question. I have an ND-range configuration with 1 CU and no SIMD (I mean a single PE in the single CU). Work-items access only global memory (no local memory access). After some tests, I have reached the conclusion that for each CU, a large number of work-items is enough to get good performance, regardless of whether those work-items are organized into a small number of large work-groups or a large number of small work-groups. In other words, there is only one pipeline at the work-item level, and there is NO additional pipeline at the work-group level. Is that right?
Yes, every CU only has one pipeline, and every work-item, regardless of which work-group it belongs to, goes through that pipeline. With one CU, if you are not using SIMD or local memory, the size of your work-group will not affect performance.