Hi,
I am going through Intel's official textbook, "Data Parallel C++":
https://www.apress.com/gp/book/9781484255735
I would like to tune the performance of an nd_range kernel by optimizing the work-group size. The two required parameters are (cf. Chapter 12):
preferred_work_group_size
preferred_work_group_size_multiple
But I could not figure out how to obtain them. To quote the book itself (page 366):
The kernel preferred_work_group_size_multiple query can be used to choose an efficient work-group size. Please refer to Chapter 12 for more information on how to query properties of a device
Unfortunately, Chapter 12 simply brushes over the aforementioned parameters without any explanation of how to use them. Querying properties of a device is not the issue here.
The most I could figure out is that we need to pass a kernel object to 'parallel_for'. Do we query the kernel before or after the command submission? Please pardon my ignorance, but I am very confused coming from a CUDA background.
I would be grateful if the support team could provide a simple example using the vector addition implementation to find 'preferred_work_group_size_multiple'.
Regards.
Hi,
Thanks for reaching out to us.
Regarding the usage of preferred_work_group_size_multiple:
On Linux, we can directly get the preferred_work_group_size_multiple value by using the clinfo command.
If you want to be more precise with the starting number used to choose multiples from,
query the kernel's preferred_work_group_size_multiple using the following code snippet:
const size_t preferred_multiple = kernel.get_work_group_info<sycl::info::kernel_work_group::preferred_work_group_size_multiple>(device);
>>can provide a simple example using the vector addition implementation to find 'preferred_work_group_size_multiple'
We are working on it and will get back to you soon.
Regards,
Vidya.
Dear Vidya,
Thanks for suggesting clinfo. It is indeed a very useful tool.
For the sake of reference, the following results were obtained from a gen9 queue:
[E-2176G]
Max work item dimensions 3
Max work item sizes 8192x8192x8192
Max work group size 8192
Preferred work group size multiple (kernel) 128
Max sub-groups per work group 2048
[UHD P630]
Max work item dimensions 3
Max work item sizes 256x256x256
Max work group size 256
Preferred work group size multiple (device) 32
Preferred work group size multiple (kernel) 32
Max sub-groups per work group 32
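As a rough illustration of how these numbers could feed into tuning, the sketch below lists candidate work-group sizes as multiples of the preferred multiple, up to the maximum work-group size. The function name candidate_work_group_sizes is hypothetical, the inputs are simply the clinfo values above (32 and 256 for the UHD P630), and the code is plain C++ that does not call SYCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: every multiple of `preferred_multiple` that does not
// exceed `max_work_group_size`, in increasing order.
std::vector<std::size_t> candidate_work_group_sizes(std::size_t preferred_multiple,
                                                    std::size_t max_work_group_size) {
    assert(preferred_multiple > 0);
    std::vector<std::size_t> sizes;
    for (std::size_t s = preferred_multiple; s <= max_work_group_size;
         s += preferred_multiple) {
        sizes.push_back(s);
    }
    return sizes;
}
```

For the UHD P630 values this gives 32, 64, and so on up to 256, each of which could then be timed individually.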
Regarding the code snippet, that was also the only thing I could find while combing the internet. Still, the context in which 'kernel' is defined was not shown. My gripe with C++ is that nothing is ever straightforward.
I look forward to the full example.
Thanks for your time.
Hi,
>> I look forward to the full example.
The following code shows how to query preferred_work_group_size_multiple for a kernel:
#include <CL/sycl.hpp>
#include <array>
#include <iostream>
using namespace cl::sycl;

int main()
{
    const size_t szKernelData = 1024;
    std::array<float, szKernelData> kernelData;  // not used by the query itself
    range<1> r(szKernelData);                    // likewise unused here
    queue q{gpu_selector()};

    // Build an OpenCL C kernel from source at run time.
    program p(q.get_context());
    p.build_with_source(R"CLC( kernel void sinf_test(global float* data) {
        data[get_global_id(0)] += 1 ;
    } )CLC", "-cl-std=CL1.2");

    // The kernel object is available once the program is built,
    // so the query can be made before any command is submitted.
    auto k = p.get_kernel("sinf_test");
    auto sz = k.get_work_group_info<info::kernel_work_group::preferred_work_group_size_multiple>(q.get_device());
    std::cout << sz << "\n";
    return 0;
}
Command Used:
dpcpp *.cpp && SYCL_BE=PI_OPENCL ./a.out
Regards,
Vidya.
Hi, Vidya
The code compiled and produced the same result as clinfo.
Now I understand that the kernel must be compiled before passing it to parallel_for.
In this sense, preferred_work_group_size_multiple is a constant carrying a similar meaning to a CUDA warp.
Could you confirm whether my understanding is correct?
I was confused because the book implied that preferred_work_group_size_multiple was a kernel-dependent variable.
After running the kernel once, the aforementioned value will change from 64 to a different value.
Thanks.
Hi,
>> After running the kernel once, the aforementioned value will change from 64 to a different value.
Could you please let us know what values you are getting with different runs?
>> In this sense, preferred_work_group_size_multiple is a constant carrying a similar meaning to a CUDA warp.
We will get back to you soon regarding this.
Thanks & Regards,
Vidya
Sorry, the sentence came out wrong. I meant to explain what the book is implying.
The result of running the code is always a constant. There are two separate values for preferred_work_group_size_multiple (UHD P630):
Preferred work group size multiple (device) 32
Preferred work group size multiple (kernel) 32
I just want to make sure that the 'kernel' one is a hardware-based constant regardless of the nature of the kernel.
This way, we won't need to do a runtime check for each kernel of a different size.
Sorry for causing confusion.
The preferred_work_group_size_multiple is a kernel query.
It returns a value of which the work-group size is preferred to be a multiple when executing a kernel on a particular device. This is a performance hint. The value must be less than or equal to that returned by info::kernel_device_specific::work_group_size.
The compiler uses certain heuristics, depending on the kernel, to determine this value, so the number reported by clinfo as "preferredworkgroupsizemultiple" might differ from this kernel query.
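To make the hint concrete, here is a minimal sketch of one possible policy: pick the largest multiple of the preferred multiple that fits within the kernel's maximum work-group size, then pad the global size so that it divides evenly by the work-group size, as an nd_range requires. The helper names are hypothetical, and the two inputs stand in for the results of the kernel queries. Whether the largest size is actually the fastest depends on the kernel and has to be measured.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: largest multiple of `preferred_multiple` that does not
// exceed `max_work_group_size` (both values assumed to come from kernel queries).
std::size_t pick_work_group_size(std::size_t preferred_multiple,
                                 std::size_t max_work_group_size) {
    assert(preferred_multiple > 0 && preferred_multiple <= max_work_group_size);
    return (max_work_group_size / preferred_multiple) * preferred_multiple;
}

// Hypothetical helper: round the element count `n` up to the next multiple of
// `work_group_size`, so the global range divides evenly into work-groups.
std::size_t round_up_global_size(std::size_t n, std::size_t work_group_size) {
    assert(work_group_size > 0);
    return ((n + work_group_size - 1) / work_group_size) * work_group_size;
}
```

With a multiple of 32 and a limit of 256, this selects 256. A problem of 1000 elements would then be padded to 1024, so the kernel needs a bounds check for the extra work-items.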
Hope this answers your question.
Thanks for your clarification.
Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.