Intel® oneAPI Base Toolkit
Support for the core tools and libraries within the base toolkit that are used to build and deploy high-performance data-centric applications.

dpct subgroup and cooperative group (DPCT1007)

yhmtsai
Beginner

Dear all,

1. When I try converting CUDA to DPC++ with dpct on DevCloud, dpct reports a warning about cooperative groups.

DPCT1007:0: Migration of this CUDA API is not supported by the Intel(R)
DPC++ Compatibility Tool.

According to the release notes (https://software.intel.com/content/www/us/en/develop/articles/release-notes-for-intel-dpcpp-compatibility-tool.html), the DPCT beta 7 update added cooperative group APIs.

Is there anything I am missing here?
The dpct version is:
Intel(R) DPC++ Compatibility Tool Version: 2021.1-beta08 codebase:(e0b12aa57579014d41e1cd86ecbaaee7de878ce8)

2. Is a sub-group equivalent to a warp in CUDA or a wavefront on AMD, just with a different size?
Sub-groups only support sizes of 8/16/32; is this consistent across all Intel GPUs?
Am I correct that no synchronization is needed within a sub-group as long as all operations stay in the same sub-group?


Thanks,
mike

RahulV_intel
Moderator

Hi,

1. CUDA cooperative groups are only partially supported by DPCT at the moment (not all cooperative-groups APIs are supported). If you could send me the exact API call that is unsupported, or preferably a minimal reproducible sample that includes it, that would be great.

2. In my opinion, the difference between sub-groups and warps/wavefronts is that the sub-group size can be chosen programmatically (8/16/32), whereas the warp/wavefront size is fixed.

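Whether those sizes are the same on every Intel GPU can be checked per device. As a minimal sketch (my illustration, not from the original reply), the standard sub_group_sizes device query lists what a given device supports:

#include <CL/sycl.hpp>
#include <iostream>

int main() {
    sycl::queue q;
    // The supported sub-group sizes are a per-device property and
    // may differ from one device to another.
    for (auto s : q.get_device().get_info<sycl::info::device::sub_group_sizes>())
        std::cout << "supported sub-group size: " << s << "\n";
}
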
The work-items within a sub-group can communicate and synchronize independently of work-items in other sub-groups, and sub-groups are therefore commonly mapped to SIMD hardware where it exists.

If all the work-items are doing the same operation, an explicit sync is not needed, in my opinion.

The function call for work-item synchronization within a sub-group is:

void barrier() const
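
For illustration, here is a minimal sketch (mine, not from the original reply) of calling it from inside a kernel, assuming the sub_group::barrier() member shown above is available:

#include <CL/sycl.hpp>

int main() {
    sycl::queue q;
    int *out = sycl::malloc_shared<int>(32, q);
    q.parallel_for(sycl::nd_range<1>(32, 32), [=](sycl::nd_item<1> item) {
        auto sg = item.get_sub_group();
        out[item.get_global_id(0)] = sg.get_local_id()[0];
        // synchronizes only the work-items of this sub-group,
        // not the whole work-group
        sg.barrier();
    }).wait();
    sycl::free(out, q);
}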

Thanks,

Rahul

yhmtsai
Beginner

Hi @RahulV_intel,

Thanks for the quick reply.

I have put the subwarp reduction code below; it fails to convert.
In the kernel, do I need to call a barrier to make sure the data is loaded in each thread of the sub-group when using Intel sub-groups?

#include <iostream>
#include <cooperative_groups.h>

__global__ __launch_bounds__(32) void reduce(float *a)  {
    auto subwarp = cooperative_groups::tiled_partition<16>(cooperative_groups::this_thread_block());
    auto local_data = a[threadIdx.x];
    // in intel subgroup, does it need to call barrier to make sure the data is already loaded?
    #pragma unroll
    for (int bitmask = 1; bitmask < subwarp.size(); bitmask <<= 1) {
        const auto remote_data = subwarp.shfl_xor(local_data, bitmask);
        local_data = local_data + remote_data;
    }
    a[threadIdx.x] = local_data;
}

int main() {
    float *d_A;
    float data[32];
    for (int i = 0; i < 32; i++) {
        // 0~15: 0, 16~31: 1
        data[i] = i/16;
    }
    cudaMalloc( &d_A, 32 * sizeof( float ) );
    cudaMemcpy(d_A, data, 32 * sizeof(float), cudaMemcpyHostToDevice);
    reduce<<<1, 32>>>(d_A);
    cudaMemcpy(data, d_A, 32*sizeof(float), cudaMemcpyDeviceToHost);
    bool passed = true;
    for (int i = 0; i < 32; i++) {
        if (i < 16) {
            passed &= (data[i] == 0);
        } else {
            passed &= (data[i] == 16);
        }
    }
    std::cout << "subwarp reduction is "<< (passed ? "passed" : "failed") << std::endl;
}

Thanks,
Mike

RahulV_intel
Moderator

Hi,


Like I said, an explicit barrier is not required if the work-items of a sub-group are performing a similar task. It is also not required at initialization time.


Thanks for providing the CUDA source file. Some cooperative-groups APIs are not supported currently. I have made this known to the concerned team.


Thanks,

Rahul


yhmtsai
Beginner

I did a manual implementation of this reduction code.
The following compiles with DPC++.

#include <CL/sycl.hpp>
#include <dpct/dpct.hpp>
#include <iostream>
template<int subgroup_size>
[[ cl::intel_reqd_sub_group_size(subgroup_size) ]] void reduce(float *a, sycl::nd_item<3> item_ct1) {
    auto subwarp = item_ct1.get_sub_group();
    auto local_data = a[item_ct1.get_local_id(2)];
    #pragma unroll
    for (int bitmask = 1; bitmask < subgroup_size; bitmask <<= 1) {
        const auto remote_data = subwarp.shuffle_xor(local_data, bitmask);
        local_data = local_data + remote_data;
    }
    a[item_ct1.get_local_id(2)] = local_data;
}
int main() {
    dpct::device_ext &dev_ct1 = dpct::get_current_device();
    sycl::queue &q_ct1 = dev_ct1.default_queue();
    float *d_A;
    float data[32];
    for (int i = 0; i < 32; i++) {
        // 0~15: 0, 16~31: 1
        data[i] = i/16;
    }
    d_A = sycl::malloc_device<float>(32, q_ct1);
    q_ct1.memcpy(d_A, data, 32 * sizeof(float)).wait();
    q_ct1.submit([&](sycl::handler &cgh) {
        cgh.parallel_for(sycl::nd_range<3>(sycl::range<3>(1, 1, 32),
                                           sycl::range<3>(1, 1, 32)),
                         [=](sycl::nd_item<3> item_ct1) {
                             reduce<8>(d_A, item_ct1);
                         });
    });
    q_ct1.memcpy(data, d_A, 32 * sizeof(float)).wait();
    bool passed = true;
    for (int i = 0; i < 32; i++) {
        if (i < 16) {
            passed &= (data[i] == 0);
        } else {
            passed &= (data[i] == 8);
        }
    }
    std::cout << "subwarp reduction is "<< (passed ? "passed" : "failed") << std::endl;
    return 0;
}

The only way I found to set the sub-group size is `[[ cl::intel_reqd_sub_group_size(subgroup_size) ]]`.
Putting it after the function name fails, and `intel::reqd_sub_group_size` from https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc does not work either.
Is there any other way to set the sub-group size? Also, it seems a kernel can only have one sub-group size.

Thanks,
Mike

yhmtsai
Beginner

Beta09 fixes the attribute issue: `intel::reqd_sub_group_size` is accepted in beta09.
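
For reference, a minimal sketch (my illustration, not from the thread) of the beta09 spelling applied directly to the kernel lambda, assuming a device that supports sub-group size 16:

#include <CL/sycl.hpp>

int main() {
    sycl::queue q;
    float *d_A = sycl::malloc_device<float>(32, q);
    q.parallel_for(sycl::nd_range<1>(32, 32),
                   [=](sycl::nd_item<1> item) [[intel::reqd_sub_group_size(16)]] {
                       // the attribute forces a sub-group size of 16 for this
                       // kernel; each work-item records its lane id
                       d_A[item.get_global_id(0)] = item.get_sub_group().get_local_id()[0];
                   }).wait();
    sycl::free(d_A, q);
}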

Subarnarek_G_Intel

Thank you for your confirmation. Closing this case.


Subarnarek_G_Intel

This issue has been resolved, and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

