A question about subgroup application

Scout · 12-16-2022

In the article "Box Blur Filter Using Intel Subgroup Extensions in OpenCL™" (https://www.intel.com/content/www/us/en/developer/articles/technical/box-blur-filter-using-intel-subgroup-extensions-in-opencl.html), the author gives an example of how to use subgroup shuffles to share data between work-items. In the section "OpenCL Application For Box Blur Filter Using Intel Subgroup Extensions", the author says: "The number of times the kernel is dispatched is less; the work item handles more workload as the kernel now computes for 16 pixels."

But in the pseudocode the article gives, the step between work-items is still 1 pixel. This means the first work-item calculates 16 pixels, and the second work-item, which starts one pixel later, also calculates 16 pixels.

So the first and second work-items do a lot of repeated work. Doesn't this waste a lot of FLOPS?
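A quick count makes the concern concrete. This is a minimal sketch under the reading described in the question: a hypothetical single image row of width 1024, where each work-item computes 16 consecutive output pixels and consecutive work-items start 1 pixel apart, compared against non-overlapping 16-pixel tiles. The function name `pixel_evaluations` and the row width are illustrative assumptions, and edge handling is ignored.

```python
def pixel_evaluations(width, pixels_per_item, step):
    """Total output-pixel computations for one row when work-item k computes
    pixels [k * step, k * step + pixels_per_item). Edge handling is ignored."""
    starts = range(0, width - pixels_per_item + 1, step)  # start pixel of each work-item
    return len(starts) * pixels_per_item


WIDTH = 1024  # hypothetical row width
overlapping = pixel_evaluations(WIDTH, 16, step=1)  # windows start 1 pixel apart
tiled = pixel_evaluations(WIDTH, 16, step=16)       # non-overlapping 16-pixel tiles
print(overlapping, tiled, overlapping / tiled)
```

Under that reading, a 1024-pixel row needs 16144 pixel evaluations instead of 1024, roughly 15.8 times the work, which is why the overlapping interpretation looks so wasteful.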

Am I understanding it correctly?

My understanding is that different work-items can share data through the subgroup extension, but a work-item should not do more work than in the naive implementation. Otherwise, there should be only one work-item per subgroup, and this "bigger" work-item does more work using registers instead of shared local memory to get higher efficiency.

Please correct me if I get this wrong. Thanks a lot!

Scout · 12-16-2022

Or perhaps subgroups are not meant to help work-items share data at all. Perhaps they actually mean that one work-item can load data into registers, instead of shared local memory, by using the subgroup block access functions. In that case, there would be only one "real" work-item per subgroup.

SeshaP_Intel · 12-20-2022

Hi,

Thank you for posting in Intel Communities.

Have you tried to run the code at your end?

Could you please provide the following details so that we can investigate the issue further from our end?

1. Hardware details: graphics card and driver version used.

2. Complete steps you have followed to reproduce the issue.

Thanks and Regards,

Pendyala Sesha Srinivas

SeshaP_Intel · 12-27-2022

Hi,

We haven't heard back from you. Could you please provide an update on your issue?

Thanks and Regards,

Pendyala Sesha Srinivas

SeshaP_Intel · 01-03-2023

Hi,

We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.

Thanks and Regards,

Pendyala Sesha Srinivas

Scout · 01-10-2023

Hi! Sorry for the late reply. I've figured it out. When using subgroup functions, each work-item does the same work as in the naive implementation, but the work-items in a subgroup can share data. So the workload of one work-item does not become bigger. Please close this thread. Thanks a lot!
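A minimal Python emulation of that conclusion, for a hypothetical 1-D horizontal box blur with a 16-wide subgroup. The helpers `shuffle_down` and `shuffle_up` only emulate the lane semantics of `intel_sub_group_shuffle_down` and `intel_sub_group_shuffle_up` from the cl_intel_subgroups extension; the three-block loading scheme, clamp-to-edge addressing, and all function names here are assumptions for illustration, not necessarily what the article's kernel does. Each lane (work-item) issues three loads regardless of the blur radius and produces exactly one output pixel, getting its neighbors' pixels from other lanes by shuffling.

```python
SUBGROUP_SIZE = 16  # assumed subgroup (SIMD) width


def shuffle_down(current, nxt, delta):
    """Emulates intel_sub_group_shuffle_down over a whole subgroup.
    Lane i receives lane (i + delta)'s value from `current`; past the last
    lane it wraps into `nxt`, the next block of data."""
    n = len(current)
    return [current[i + delta] if i + delta < n else nxt[i + delta - n]
            for i in range(n)]


def shuffle_up(previous, current, delta):
    """Emulates intel_sub_group_shuffle_up: lane i receives lane (i - delta)'s
    value from `current`, or from `previous` before the first lane."""
    n = len(current)
    return [current[i - delta] if i - delta >= 0 else previous[i - delta + n]
            for i in range(n)]


def load_block(row, start):
    """Each lane loads one pixel; indices are clamped to the row edges (assumption)."""
    last = len(row) - 1
    return [row[min(max(start + lane, 0), last)] for lane in range(SUBGROUP_SIZE)]


def box_blur_row_subgroup(row, radius):
    """1-D box blur in which every lane (work-item) computes exactly one output pixel."""
    assert len(row) % SUBGROUP_SIZE == 0 and 1 <= radius <= SUBGROUP_SIZE
    out = []
    for base in range(0, len(row), SUBGROUP_SIZE):  # one iteration = one subgroup
        prev = load_block(row, base - SUBGROUP_SIZE)
        cur = load_block(row, base)
        nxt = load_block(row, base + SUBGROUP_SIZE)
        acc = list(cur)  # each lane starts with its own pixel
        for d in range(1, radius + 1):
            left = shuffle_up(prev, cur, d)    # pixel at x - d, taken from another lane
            right = shuffle_down(cur, nxt, d)  # pixel at x + d, taken from another lane
            acc = [a + l + r for a, l, r in zip(acc, left, right)]
        out.extend(a / (2 * radius + 1) for a in acc)
    return out


def box_blur_row_naive(row, radius):
    """Reference: each output pixel reads its 2 * radius + 1 inputs directly."""
    last = len(row) - 1
    return [sum(row[min(max(x + k, 0), last)] for k in range(-radius, radius + 1))
            / (2 * radius + 1) for x in range(len(row))]
```

For an integer-valued row such as `[(7 * x) % 23 for x in range(64)]`, both functions return the same list. Each lane performs 2 × radius additions, exactly as a naive per-pixel kernel would; what changes is that, for a radius above 1, the neighboring pixels arrive through shuffles instead of 2 × radius + 1 separate loads.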