Dear all,
1. When I try converting CUDA to DPC++ with dpct on DevCloud, dpct reports a warning on cooperative groups.
According to the release notes (https://software.intel.com/content/www/us/en/develop/articles/release-notes-for-intel-dpcpp-compatibility-tool.html), the DPCT Beta 7 update added the cooperative groups API.
Is there anything I am missing here? The tool I am running identifies itself as the DPC++ Compatibility Tool.
2. Is a sub-group the equivalent of a warp in CUDA or a wavefront on AMD, just with a different size?
Sub-groups only support sizes of 8/16/32; is that consistent across all Intel GPUs?
Am I correct that I do not need any synchronization within a sub-group as long as all operations stay in the same sub-group?
Thanks,
mike
Tags: cooperative group
Hi,
1. CUDA cooperative groups are only partially supported by DPCT at the moment (not all cooperative groups APIs are supported). If you could send me the exact API call that is unsupported, or preferably a minimal reproducible sample that includes this call, that would be great.
2. In my opinion, the difference between sub-groups and warps/wavefronts is that the size of a sub-group can be chosen programmatically (8/16/32), whereas the size of a warp/wavefront is fixed.
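If you want to check which sub-group sizes a particular device actually supports, you can query the device. A small sketch of mine, using the standard SYCL device info query rather than anything DPCT-specific:

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;
  // list the sub-group sizes supported by the device behind this queue
  auto sizes = q.get_device().get_info<sycl::info::device::sub_group_sizes>();
  std::cout << "supported sub-group sizes:";
  for (auto s : sizes) std::cout << " " << s;
  std::cout << std::endl;
  return 0;
}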
The work-items within a sub-group can communicate and synchronize independently of work-items in other sub-groups, and sub-groups are therefore commonly mapped to SIMD hardware where it exists.
If all the work-items are doing the same operation, an explicit sync is not needed, in my opinion.
The function call for work-item synchronization within a sub-group is `void barrier() const`.
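For illustration, here is a minimal sketch of obtaining the sub-group inside a kernel and calling the barrier (the queue and pointer names are my own, not from your code):

#include <CL/sycl.hpp>

int main() {
  sycl::queue q;
  float *buf = sycl::malloc_shared<float>(32, q);
  q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(sycl::nd_range<1>(sycl::range<1>(32), sycl::range<1>(32)),
                     [=](sycl::nd_item<1> item) {
                       auto sg = item.get_sub_group();
                       buf[item.get_global_id(0)] = 1.0f;
                       // synchronizes only the work-items of this sub-group,
                       // independently of other sub-groups in the work-group
                       sg.barrier();
                     });
  }).wait();
  sycl::free(buf, q);
  return 0;
}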
Thanks,
Rahul
Hi @RahulV_intel,
Thanks for the quick reply.
I have put the subwarp_reduce code below; it fails to convert.
In the kernel, do I need to call a barrier to make sure the data has been loaded by every thread of the sub-group when using Intel sub-groups?
#include <iostream>
#include <cooperative_groups.h>

__global__ __launch_bounds__(32) void reduce(float *a) {
  auto subwarp = cooperative_groups::tiled_partition<16>(cooperative_groups::this_thread_block());
  auto local_data = a[threadIdx.x];
  // in an Intel sub-group, would a barrier be needed here to make sure the data is loaded?
#pragma unroll
  for (int bitmask = 1; bitmask < subwarp.size(); bitmask <<= 1) {
    const auto remote_data = subwarp.shfl_xor(local_data, bitmask);
    local_data = local_data + remote_data;
  }
  a[threadIdx.x] = local_data;
}

int main() {
  float *d_A;
  float data[32];
  for (int i = 0; i < 32; i++) {
    // 0~15: 0, 16~31: 1
    data[i] = i / 16;
  }
  cudaMalloc(&d_A, 32 * sizeof(float));
  cudaMemcpy(d_A, data, 32 * sizeof(float), cudaMemcpyHostToDevice);
  reduce<<<1, 32>>>(d_A);
  cudaMemcpy(data, d_A, 32 * sizeof(float), cudaMemcpyDeviceToHost);
  bool passed = true;
  for (int i = 0; i < 32; i++) {
    if (i < 16) {
      passed &= (data[i] == 0);
    } else {
      passed &= (data[i] == 16);
    }
  }
  cudaFree(d_A);
  std::cout << "subwarp reduction is " << (passed ? "passed" : "failed") << std::endl;
  return 0;
}
Thanks,
Mike
Hi,
As I said, an explicit barrier is not required when the work-items of a sub-group are performing the same operation, and it is not required at initialization time (i.e., when the data is first loaded).
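To illustrate the distinction, here is a sketch of my own (not DPCT output): a value exchanged through a sub-group shuffle needs no barrier, because the shuffle itself is a synchronized exchange, whereas communication through local memory does:

#include <CL/sycl.hpp>

int main() {
  sycl::queue q;
  float *a = sycl::malloc_shared<float>(32, q);
  for (int i = 0; i < 32; i++) a[i] = float(i);
  q.submit([&](sycl::handler &cgh) {
    // local (shared) scratch memory, one float per work-item
    sycl::accessor<float, 1, sycl::access::mode::read_write,
                   sycl::access::target::local>
        scratch(sycl::range<1>(32), cgh);
    cgh.parallel_for(sycl::nd_range<1>(sycl::range<1>(32), sycl::range<1>(32)),
                     [=](sycl::nd_item<1> item) {
                       auto sg = item.get_sub_group();
                       size_t lid = item.get_local_id(0);
                       // case 1: shuffle -- no explicit barrier needed
                       float neighbor = sg.shuffle_xor(a[lid], 1);
                       // case 2: exchange through memory -- a barrier is
                       // needed so the store is visible before another
                       // work-item of the sub-group reads it
                       scratch[lid] = neighbor;
                       sg.barrier();
                       a[lid] = scratch[lid ^ 1];
                     });
  }).wait();
  sycl::free(a, q);
  return 0;
}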
Thanks for providing the CUDA source file. Some cooperative groups APIs are not supported currently; I have made the concerned team aware of this.
Thanks,
Rahul
I did a manual implementation of this reduction code.
The following compiles with dpc++.
#include <CL/sycl.hpp>
#include <dpct/dpct.hpp>
#include <iostream>

template <int subgroup_size>
[[cl::intel_reqd_sub_group_size(subgroup_size)]] void reduce(float *a, sycl::nd_item<3> item_ct1) {
  auto subwarp = item_ct1.get_sub_group();
  auto local_data = a[item_ct1.get_local_id(2)];
#pragma unroll
  for (int bitmask = 1; bitmask < subgroup_size; bitmask <<= 1) {
    const auto remote_data = subwarp.shuffle_xor(local_data, bitmask);
    local_data = local_data + remote_data;
  }
  a[item_ct1.get_local_id(2)] = local_data;
}

int main() {
  dpct::device_ext &dev_ct1 = dpct::get_current_device();
  sycl::queue &q_ct1 = dev_ct1.default_queue();
  float *d_A;
  float data[32];
  for (int i = 0; i < 32; i++) {
    // 0~15: 0, 16~31: 1
    data[i] = i / 16;
  }
  d_A = sycl::malloc_device<float>(32, q_ct1);
  q_ct1.memcpy(d_A, data, 32 * sizeof(float)).wait();
  q_ct1.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(sycl::nd_range<3>(sycl::range<3>(1, 1, 32),
                                       sycl::range<3>(1, 1, 32)),
                     [=](sycl::nd_item<3> item_ct1) {
                       reduce<8>(d_A, item_ct1);
                     });
  });
  q_ct1.memcpy(data, d_A, 32 * sizeof(float)).wait();
  bool passed = true;
  for (int i = 0; i < 32; i++) {
    if (i < 16) {
      passed &= (data[i] == 0);
    } else {
      passed &= (data[i] == 8);
    }
  }
  sycl::free(d_A, q_ct1);
  std::cout << "subwarp reduction is " << (passed ? "passed" : "failed") << std::endl;
  return 0;
}
The only way I found to set the sub-group size is `[[cl::intel_reqd_sub_group_size(subgroup_size)]]`.
Putting it after the function name fails, and `intel::reqd_sub_group_size` from https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc does not work either.
Is there any other way to set the sub-group size? Also, it seems a kernel can only have a single sub-group size.
Thanks,
Mike
beta09 fixes the keyword issue: `intel::reqd_sub_group_size` is accepted in beta09.
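For reference, a minimal sketch of the beta09 spelling (assuming a beta09 or later compiler); it mirrors the reduce kernel above with only the attribute namespace changed:

#include <CL/sycl.hpp>

template <int subgroup_size>
[[intel::reqd_sub_group_size(subgroup_size)]] void reduce(float *a, sycl::nd_item<3> item) {
  auto sg = item.get_sub_group();
  float local_data = a[item.get_local_id(2)];
  for (int bitmask = 1; bitmask < subgroup_size; bitmask <<= 1)
    local_data += sg.shuffle_xor(local_data, bitmask);
  a[item.get_local_id(2)] = local_data;
}

int main() {
  sycl::queue q;
  float *d_A = sycl::malloc_shared<float>(32, q);
  for (int i = 0; i < 32; i++) d_A[i] = 1.0f;
  q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(sycl::nd_range<3>(sycl::range<3>(1, 1, 32),
                                       sycl::range<3>(1, 1, 32)),
                     [=](sycl::nd_item<3> item) { reduce<16>(d_A, item); });
  }).wait();
  sycl::free(d_A, q);
  return 0;
}

Note that each kernel still has exactly one required sub-group size, but, as far as I can tell, different kernels (or different instantiations of the template above) can request different sizes.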
Thank you for your confirmation. Closing this case.
This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community support only.
