In the article "Box Blur Filter Using Intel Subgroup Extensions in OpenCL™": https://www.intel.com/content/www/us/en/developer/articles/technical/box-blur-filter-using-intel-subgroup-extensions-in-opencl.html
The author gives an example of how to use subgroup shuffle to share data between work-items. In the chapter "OpenCL Application For Box Blur Filter Using Intel Subgroup Extensions", the author says: "The number of times the kernel is dispatched is less; the work item handles more workload as the kernel now computes for 16 pixels."
But in the pseudocode the article gives, the step is still 1 pixel:
That means the first work-item calculates 16 pixels, as the picture shows:
The second work-item also calculates 16 pixels, as shown below:
So the first and the second work-item do a lot of repeated work. Does this waste a lot of FLOPS?
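To make the concern concrete, here is a small sketch of the overlap being described. It assumes a 1D simplification in which each work-item covers a contiguous window of 16 pixels and consecutive work-items start 1 pixel apart; `WINDOW`, `STEP` and `window` are illustrative names, not taken from the article.

```python
# Assumption: each work-item covers 16 contiguous pixels and
# consecutive work-items start 1 pixel apart (1D simplification).
WINDOW = 16
STEP = 1

def window(work_item_id):
    """Set of pixel indices covered by one work-item."""
    start = work_item_id * STEP
    return set(range(start, start + WINDOW))

# Pixels covered by both work-item 0 and work-item 1.
shared = window(0) & window(1)
print(len(shared), "of", WINDOW, "pixels are common to both work-items")
```

Under these assumptions, all but one pixel of each window is also covered by the neighbouring work-item, which is the repetition the question is about.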
Am I understanding it correctly?
From my understanding, different work-items can share data by using the subgroup extension, but a work-item should not do more work than in the naive implementation. Otherwise, there should be only one work-item in a subgroup: this "bigger" work-item does more work, using registers instead of shared local memory to get higher efficiency.
Please correct me if I get this wrong. Thanks a lot!
Or perhaps a subgroup is not meant to help work-items share data. Instead, it means that one work-item can load data into registers instead of shared local memory by using the subgroup block access commands. In that case, there is only one "real" work-item in a subgroup.
Hi,
Thank you for posting in Intel Communities.
Have you tried to run the code at your end?
Could you please provide the following details so that we can investigate the issue further on our end?
1. Hardware details: graphics card and driver version used.
2. Complete steps you have followed to reproduce the issue.
Thanks and Regards,
Pendyala Sesha Srinivas
Hi,
We haven't heard back from you. Could you please provide an update on your issue?
Thanks and Regards,
Pendyala Sesha Srinivas
Hi,
We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.
Thanks and Regards,
Pendyala Sesha Srinivas
Hi! Sorry for the late reply. I've figured it out. When using the subgroup functions, each work-item does the same work as in the naive implementation, but the work-items in a subgroup can share data. So the workload of a single work-item does not get bigger. Please close this thread. Thanks a lot!
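The conclusion above can be sketched with a small Python simulation. This is not the article's kernel: it assumes a 1D 3-tap box blur, a subgroup size of 8, and a simplified scheme in which each lane loads one pixel and only lanes with both neighbours inside the subgroup write a result. The `shuffle` helper is a hypothetical stand-in for a subgroup shuffle built-in, and all names here are illustrative.

```python
SUBGROUP_SIZE = 8  # assumed for illustration
RADIUS = 1         # 3-tap box blur, chosen for brevity

def blur_naive(row):
    """Each work-item reads its whole window from memory."""
    loads, out = 0, []
    for i in range(RADIUS, len(row) - RADIUS):
        taps = row[i - RADIUS:i + RADIUS + 1]
        loads += len(taps)
        out.append(sum(taps) / len(taps))
    return out, loads

def shuffle(registers, lane):
    """Hypothetical stand-in for a subgroup shuffle: read another lane's register."""
    return registers[lane]

def blur_subgroup(row):
    """Each lane loads one pixel; neighbours are fetched with shuffle."""
    loads, out = 0, []
    outputs_per_subgroup = SUBGROUP_SIZE - 2 * RADIUS
    for base in range(0, len(row) - 2 * RADIUS, outputs_per_subgroup):
        # One memory load per lane, kept in that lane's private register.
        registers = row[base:base + SUBGROUP_SIZE]
        loads += len(registers)
        # Lanes with a full neighbourhood inside the subgroup produce output.
        for lane in range(RADIUS, len(registers) - RADIUS):
            taps = [shuffle(registers, lane + d)
                    for d in range(-RADIUS, RADIUS + 1)]
            out.append(sum(taps) / len(taps))
    return out, loads

row = list(range(20))
naive_out, naive_loads = blur_naive(row)
sub_out, sub_loads = blur_subgroup(row)
print(naive_out == sub_out, naive_loads, sub_loads)
```

In this sketch every work-item still sums the same number of taps as in the naive version, so the arithmetic per work-item is unchanged. Only the number of memory loads drops, because neighbouring values come from other lanes' registers.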
