Inquiry Regarding Inconsistent Results Despite [[intel::reqd_sub_group_size(32)]] Specification

-Light- · ‎05-21-2024

I recently migrated a CUDA project to SYCL and got different results in debug mode and release mode when running in Visual Studio. After investigating, I traced the difference to the value returned through "get_sub_group()".

Here's a snippet of code I used for testing:

std::cout << "device name : " << device.get_name() << std::endl;//device name: Intel(R) Arc(TM) A370M Graphics
std::cout << "Suppose Sub-group Sizes: ";
for (const auto& s : dev_ct1.get_info<sycl::info::device::sub_group_sizes>()) {
    std::cout << s << " ";
}
std::cout << std::endl;//Suppose Sub-group Sizes: 8 16 32

sycl::queue& q = dev_ct1.in_order_queue();
q.submit([&](sycl::handler& cgh) {
    sycl::stream out(1024 * 1024, 256, cgh);
    cgh.parallel_for(
        sycl::nd_range<3>(sycl::range<3>(1, 1, 32) *
                              sycl::range<3>(1, 1, 256),
                          sycl::range<3>(1, 1, 256)),
        [=](sycl::nd_item<3> item_ct1)
            [[intel::reqd_sub_group_size(32)]] {
            out << "Used Sub-group Sizes: " << item_ct1.get_sub_group().get_local_range() << sycl::endl;
        });
});

When running in debug mode (without code optimization), the output is 16. However, when running in release mode (code optimization level of O1 or O2), the output is 32.

So even though the kernel requests a sub-group size of 32 with [[intel::reqd_sub_group_size(32)]], the size actually used differs between debug and release builds.

Thank you for your help.

Sincerely

yzh_intel · ‎03-03-2025

Hi, just wondering if you're still seeing the issue? I tested on a Data Center GPU Max 1100, but couldn't reproduce it...