yhmtsai
Beginner
600 Views

dpct subgroup and cooperative group (DPCT1007)

Dear all,

1. When I try converting CUDA to DPC++ with dpct on DevCloud, dpct reports a warning about cooperative groups.

DPCT1007:0: Migration of this CUDA API is not supported by the Intel(R)
DPC++ Compatibility Tool.

According to the release notes, the DPCT beta 7 update added cooperative groups API support (https://software.intel.com/content/www/us/en/develop/articles/release-notes-for-intel-dpcpp-compatib...)

Is there anything I am missing here?
The dpct version is:
Intel(R) DPC++ Compatibility Tool Version: 2021.1-beta08 codebase:(e0b12aa57579014d41e1cd86ecbaaee7de878ce8)

2. Is a subgroup equivalent to a warp in CUDA or a wavefront on AMD, just with a different size?
Subgroups only support sizes of 8/16/32. Is this consistent across all Intel GPUs?
Am I correct that no synchronization is needed within a subgroup as long as all operations stay in the same subgroup?


Thanks,
mike

7 Replies
RahulV_intel
Moderator
582 Views

Hi,

 

1. CUDA cooperative groups are only partially supported by DPCT at the moment (not all cooperative groups APIs are supported). If you could send me the exact API call that is unsupported, or preferably a minimal reproducible sample that includes the call, that would be great.

 

2. In my opinion, the difference between subgroups and warps/wavefronts is that the size of a subgroup can be chosen programmatically (8/16/32), whereas the size of a warp/wavefront is fixed.

 

The work-items within a sub-group can communicate and synchronize independently of work-items in other sub-groups, and sub-groups are therefore commonly mapped to SIMD hardware where it exists.

 

If all work-items are performing the same operation, explicit synchronization is not needed in my opinion.

 

The function call for work-item synchronization within a subgroup is:

void barrier() const

Thanks,

Rahul

yhmtsai
Beginner
574 Views

Hi @RahulV_intel ,

Thanks for the quick reply.

I have put the subwarp_reduce code below; dpct fails to convert it.
In the kernel, do I need to call barrier() to make sure the data is loaded by every thread of the subgroup when using Intel subgroups?

 

 

#include <iostream>
#include <cooperative_groups.h>

__global__ __launch_bounds__(32) void reduce(float *a)  {
    auto subwarp = cooperative_groups::tiled_partition<16>(cooperative_groups::this_thread_block());
    auto local_data = a[threadIdx.x];
    // in intel subgroup, does it need to call barrier to make sure the data is already loaded?
    #pragma unroll
    for (int bitmask = 1; bitmask < subwarp.size(); bitmask <<= 1) {
        const auto remote_data = subwarp.shfl_xor(local_data, bitmask);
        local_data = local_data + remote_data;
    }
    a[threadIdx.x] = local_data;
}

int main() {
    float *d_A;
    float data[32];
    for (int i = 0; i < 32; i++) {
        // 0~15: 0, 16~31: 1
        data[i] = i/16;
    }
    cudaMalloc( &d_A, 32 * sizeof( float ) );
    cudaMemcpy(d_A, data, 32 * sizeof(float), cudaMemcpyHostToDevice);
    reduce<<<1, 32>>>(d_A);
    cudaMemcpy(data, d_A, 32*sizeof(float), cudaMemcpyDeviceToHost);
    bool passed = true;
    for (int i = 0; i < 32; i++) {
        if (i < 16) {
            passed &= (data[i] == 0);
        } else {
            passed &= (data[i] == 16);
        }
    }
    std::cout << "subwarp reduction is "<< (passed ? "passed" : "failed") << std::endl;
}

Thanks,
Mike

RahulV_intel
Moderator
537 Views

Hi,


As I said, an explicit barrier is not required if the work-items of a sub-group are performing a similar task. It is also not required at initialization time.


Thanks for providing the CUDA source file. Some cooperative groups APIs are not supported currently. I have reported this to the concerned team.


Thanks,

Rahul


yhmtsai
Beginner
495 Views

I did a manual port of this reduction code.
The following compiles with dpc++.

 

#include <CL/sycl.hpp>
#include <dpct/dpct.hpp>
#include <iostream>
template<int subgroup_size>
[[ cl::intel_reqd_sub_group_size(subgroup_size) ]] void reduce(float *a, sycl::nd_item<3> item_ct1) {
    auto subwarp = item_ct1.get_sub_group();
    auto local_data = a[item_ct1.get_local_id(2)];
    #pragma unroll
    for (int bitmask = 1; bitmask < subgroup_size; bitmask <<= 1) {
        const auto remote_data = subwarp.shuffle_xor(local_data, bitmask);
        local_data = local_data + remote_data;
    }
    a[item_ct1.get_local_id(2)] = local_data;
}
int main() {
    dpct::device_ext &dev_ct1 = dpct::get_current_device();
    sycl::queue &q_ct1 = dev_ct1.default_queue();
    float *d_A;
    float data[32];
    for (int i = 0; i < 32; i++) {
        // 0~15: 0, 16~31: 1
        data[i] = i/16;
    }
    d_A = sycl::malloc_device<float>(32, q_ct1);
    q_ct1.memcpy(d_A, data, 32 * sizeof(float)).wait();
    q_ct1.submit([&](sycl::handler &cgh) {
        cgh.parallel_for(sycl::nd_range<3>(sycl::range<3>(1, 1, 32),
                                           sycl::range<3>(1, 1, 32)),
                         [=](sycl::nd_item<3> item_ct1) {
                             reduce<8>(d_A, item_ct1);
                         });
    });
    q_ct1.memcpy(data, d_A, 32 * sizeof(float)).wait();
    bool passed = true;
    for (int i = 0; i < 32; i++) {
        if (i < 16) {
            passed &= (data[i] == 0);
        } else {
            passed &= (data[i] == 8);
        }
    }
    std::cout << "subwarp reduction is "<< (passed ? "passed" : "failed") << std::endl;
    return 0;
}

 


The only way I found to set the subgroup size is `[[ cl::intel_reqd_sub_group_size(subgroup_size) ]]`.
Placing it after the function name fails, and `intel::reqd_sub_group_size` from https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc does not work either.
Is there another way to set the subgroup size? Also, it seems only one subgroup size is allowed per kernel.

Thanks,
Mike

yhmtsai
Beginner
440 Views

beta09 fixes the keyword issue:
`intel::reqd_sub_group_size` is accepted in beta09.

280 Views

Thank you for your confirmation. Closing this case.


280 Views

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only

