Migrating to SYCL
One-stop forum for getting assistance migrating your existing code to SYCL
48 Discussions

Question about max group size

ManuelCostanzo2
Novice
5,011 Views

Hi everyone! I have a question about migrating this CUDA kernel:

 

kernel<<< blocks, threads >>>(...);

 

DPCT migrates this kernel as:

 

cgh.parallel_for(
    sycl::nd_range<3>(sycl::range<3>(1, 1, blocks) *
                          sycl::range<3>(1, 1, threads),
                      sycl::range<3>(1, 1, threads)),
    [=](sycl::nd_item<3> item_ct1) {
        kernel(...);
    });
 
But DPCT alerts:

DPCT1049:55: The workgroup size passed to the SYCL kernel may
exceed the limit. To get the device limit, query
info::device::max_work_group_size. Adjust the workgroup size if
needed.
 
 
So, how can I ask for the maximum blocks and threads? Thank you so much!
0 Kudos
11 Replies
ShwethaS_Intel
Moderator
4,951 Views

Hi,

 

Thanks for reaching out to us.

 

>> So, how can I ask for the maximum blocks and threads ?

To check the supported max work group size, or any other info related to your device, you can run the 'clinfo' command in your terminal.

 

For more info on the DPCT alert, kindly refer to this link: https://www.intel.com/content/www/us/en/develop/documentation/intel-dpcpp-compatibility-tool-user-guide/top/diagnostics-reference/dpct1049.html#dpct1049_id-dpct1049

 

Regards,

Shwetha

 

 

 

ManuelCostanzo2
Novice
4,948 Views

Hey! Thank you. Using that, I now know the maximum group size. How can I get the maximum number of threads per group? I couldn't find that.

ShwethaS_Intel
Moderator
4,910 Views

Hi,


The "clinfo" command gives all the necessary info related to your device through the command line.

 

Max work group size = max no. of threads allowed in a CUDA block

Max work item size = max no. of threads allowed in a CUDA grid

(Max work item size / Max work group size) = max no. of blocks allowed in a CUDA grid at any given instance

 


The same info can be obtained from the DPC++ API using:

  1. "device.get_info<sycl::info::device::max_work_group_size>()" - the maximum number of work-items permitted in a work-group executing a kernel on a single compute unit.
  2. "device.get_info<sycl::info::device::max_work_item_sizes>()" - the maximum number of work-items permitted in each dimension of the work-group of the nd_range.

 

Thanks & Regards,

Shwetha


ShwethaS_Intel
Moderator
4,862 Views

Hi,


A gentle reminder to respond.


Regards,

Shwetha


ManuelCostanzo2
Novice
4,842 Views

Hey! Thank you so much for your help.

 

With max_work_group_size I'm getting 1024.

With max_work_item_sizes I'm getting (64, 1024, 1024).

 

So, my kernel looks like this:

cgh.parallel_for(
    sycl::nd_range<3>(sycl::range<3>(1, 1, blocks) *
                          sycl::range<3>(1, 1, threads),
                      sycl::range<3>(1, 1, threads)),
    [=](sycl::nd_item<3> item_ct1) {
        kernel(...);
    });
 
If I want to launch the maximum number of blocks, I can't just divide 1024/1024. What would be the idea? Sorry.
ShwethaS_Intel
Moderator
4,824 Views

Hi,

 

Small correction to our previous understanding.

 

Max work group size = maximum number of threads allowed per block.

Max work item size = maximum number of threads allowed in each dimension.

 

This implies that at any given moment, the total number of threads within a work group shouldn't exceed the max work group size.

 

To calculate the maximum number of blocks per grid:

Max no. of blocks = Max threads / threads requested by user

And the maximum number of threads can be calculated as:

Max threads = Max compute units * max work group size

 

For further details, please refer to the code below and the attached output snapshot for both a GPU and a CPU device.

 

 

#include <CL/sycl.hpp>
#include <iostream>
#include <string>

int main()
{
    sycl::queue q_ct1 = sycl::queue();

    auto device = q_ct1.get_device();

    // Query the device limits.
    auto max_work_group_size      = device.get_info<cl::sycl::info::device::max_work_group_size>();
    auto max_work_item_dimensions = device.get_info<cl::sycl::info::device::max_work_item_dimensions>();
    auto max_work_item_sizes      = device.get_info<cl::sycl::info::device::max_work_item_sizes>();
    auto max_compute_units        = device.get_info<cl::sycl::info::device::max_compute_units>();

    std::string d_name = device.get_info<cl::sycl::info::device::name>();

    std::cout << "Device: " << d_name << std::endl;
    std::cout << "Max work group size      = " << max_work_group_size << std::endl;
    std::cout << "Max work item dimensions = " << max_work_item_dimensions << std::endl;
    std::cout << "Max work item sizes      = " << max_work_item_sizes[0] << " "
              << max_work_item_sizes[1] << " " << max_work_item_sizes[2] << std::endl;
    std::cout << "Max compute units        = " << max_compute_units << std::endl;

    int requested_threads = 256;

    // Max threads = compute units * work-group size; max blocks follows by division.
    int max_threads = max_compute_units * max_work_group_size;
    int max_blocks  = max_threads / requested_threads;

    std::cout << std::endl;
    std::cout << "Max threads allowed per block = " << max_work_group_size << std::endl;
    std::cout << "Max blocks allowed per grid   = " << max_blocks << " (at a given instant, when "
              << requested_threads << " are requested per block)" << std::endl;

    return 0;
}

 

ShwethaS_Intel_0-1643879577207.png

 

 

Thanks & Regards,

Shwetha.

 

ManuelCostanzo2
Novice
4,803 Views

Hi @ShwethaS_Intel, thank you so much for your help.

 

The only doubt I have is about "max_compute_units". In GPUs, is this number related to what we are working on? From what I read, it seems to be the number of SMs in a GPU, so it's not directly related to threads or blocks, is that right? If so, how could we modify the code to work for both GPU and CPU?

 

Thank you again

ShwethaS_Intel
Moderator
4,769 Views

Hi @ManuelCostanzo2 ,

 

>> The only doubt I have is about "max_compute_units".

Yes, max_compute_units is equivalent to the number of SMs in a GPU, and it is required to calculate the maximum number of threads.

 

>> how could we modify the code to match for both GPU and CPU?

It's up to the discretion of the user: when launching threads, query the device info to verify the appropriate number of threads/blocks to launch, and then set the limit. This way the code can be adapted for both CPU and GPU.

 

Hope these details will help you to resolve your queries.

 

Thanks & Regards,

Shwetha.

 

 

ShwethaS_Intel
Moderator
4,723 Views

Hi,


Has the information provided helped?

If this resolves your issue, make sure to accept this as a solution. Thank you!


Regards,

Shwetha.


ShwethaS_Intel
Moderator
4,685 Views

Hi,

 

I have not heard back from you. This thread will no longer be monitored by Intel.

If you need further assistance, please post a new question.

 

Thanks & Regards,

Shwetha.





