Solved: Re: Example of performing kernel queries

Viet-Duc · ‎06-08-2021

Hi,

I am going through Intel's official textbook, "Data Parallel C++":

https://www.apress.com/gp/book/9781484255735

I would like to tune the performance of an nd_range kernel by optimizing the work-group size. The two relevant parameters are (cf. Chapter 12):

preferred_work_group_size
preferred_work_group_size_multiple

But I could not figure out how to obtain them. To quote the book itself (page 366):

The kernel preferred_work_group_size_multiple query can be used to choose an efficient work-group size. Please refer to Chapter 12 for more information on how to query properties of a device

Unfortunately, Chapter 12 simply brushes over the aforementioned parameters without any explanation of how to use them. Querying the properties of a device is not the issue here.

The most I could figure out is that we need to pass a kernel object to 'parallel_for'. Do we query the kernel before or after the command submission? Please pardon my ignorance, but I am very confused, coming from a CUDA background.

I would be grateful if the support team could provide a simple example using the vector addition implementation to find 'preferred_work_group_size_multiple'.

Regards.

VidyalathaB_Intel · ‎06-09-2021

Hi,

Thanks for reaching out to us.

Regarding the usage of preferred_work_group_size_multiple:

On Linux, we can get the preferred_work_group_size_multiple value directly with the clinfo command.

If you want to be more precise about the starting number from which multiples are chosen, query the kernel's preferred_work_group_size_multiple using the following code snippet:

const size_t preferred_multiple = kernel.get_work_group_info<sycl::info::kernel_work_group::preferred_work_group_size_multiple>(device);

>>can provide a simple example using the vector addition implementation to find 'preferred_work_group_size_multiple'

We are working on it, we will get back to you soon.

Regards,

Vidya.

Viet-Duc · ‎06-09-2021

Dear Vidya,

Thanks for suggesting clinfo. It is indeed a very useful tool.

For reference, the following results were obtained on the gen9 queue:

[E-2176G]

Max work item dimensions                        3
Max work item sizes                             8192x8192x8192
Max work group size                             8192
Preferred work group size multiple (kernel)     128
Max sub-groups per work group                   2048

[UHD P630]

Max work item dimensions                        3
Max work item sizes                             256x256x256
Max work group size                             256
Preferred work group size multiple (device)     32
Preferred work group size multiple (kernel)     32
Max sub-groups per work group                   32

Regarding the code snippet, that was also the only thing I could find while combing the internet. Still, it did not show the context in which 'kernel' is defined. My gripe with C++ is that nothing is ever straightforward.

I look forward to the full example.

Thanks for your time.

VidyalathaB_Intel · ‎06-14-2021

Hi,

>> I look forward to the full example.

Please find an example of querying preferred_work_group_size_multiple in the following code.

#include <CL/sycl.hpp>
#include <iostream>
#include <array>
using namespace cl::sycl;
int main()
{
        // Host data and range for the kernel; the query itself does not use them.
        const size_t szKernelData = 1024;
        std::array<float, szKernelData> kernelData;
        range<1> r(szKernelData);
        queue q{gpu_selector()};
        // Build an OpenCL C kernel from source so that a kernel object exists before any submission.
        program p(q.get_context());
        p.build_with_source(R"CLC( kernel void sinf_test(global float* data) {
                                data[get_global_id(0)] += 1 ;
                        } )CLC", "-cl-std=CL1.2");
        auto k = p.get_kernel("sinf_test");
        // Query the compiled kernel for its preferred work-group size multiple on this device.
        auto sz = k.get_work_group_info<info::kernel_work_group::preferred_work_group_size_multiple>(q.get_device());
        std::cout << sz << "\n";
        return 0;
}

Command Used:

dpcpp *.cpp && SYCL_BE=PI_OPENCL ./a.out
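To connect the queried value to an actual launch, here is a minimal sketch, in plain C++ with no SYCL dependency, of turning it into sizes for a 1-D nd_range. It assumes the work-group size is set to the queried multiple itself and that the global size is padded, since an nd_range requires the global size to be divisible by the local size. The names round_up, LaunchShape, and make_launch_shape are hypothetical helpers, not part of SYCL.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the nearest multiple of `multiple` (multiple must be > 0).
constexpr std::size_t round_up(std::size_t n, std::size_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// Hypothetical helper type: global and local sizes for a 1-D nd_range.
struct LaunchShape {
    std::size_t global;  // padded so that it is divisible by local
    std::size_t local;   // work-group size
};

// Assumption: use the queried multiple itself as the work-group size and
// pad the global size. The kernel must then skip indices >= n.
constexpr LaunchShape make_launch_shape(std::size_t n,
                                        std::size_t preferred_multiple) {
    return LaunchShape{round_up(n, preferred_multiple), preferred_multiple};
}
```

With the value 32 reported above for the UHD P630 and a problem size of 1000, this gives a global size of 1024 and a local size of 32; the kernel then has to guard against the 24 padding indices.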

Regards,

Vidya.

Viet-Duc · ‎06-15-2021

Hi, Vidya

The code compiled and produced the same result as clinfo.

Now I understand that the kernel must be compiled before passing it to parallel_for.

In this sense, preferred_work_group_size_multiple is a constant carrying a similar meaning to a CUDA warp.

Could you confirm whether my understanding is correct?

I was confused because the book had implied that preferred_work_group_size_multiple was a kernel-dependent variable.

After running the kernel once, the aforementioned value will change from 64 to a different value.

Thanks.

VidyalathaB_Intel · ‎06-18-2021

Hi,

>> After running the kernel once, the aforementioned value will change from 64 to a different value.

Could you please let us know what values you are getting across different runs?

>> In this sense, preferred_work_group_size_multiple is a constant carrying a similar meaning to a CUDA warp.

Regarding this, we will get back to you soon.

Thanks & Regards,

Vidya

Viet-Duc · ‎06-20-2021

Sorry, the sentence came out wrong. I meant to explain what the book is implying.

The result of running the code is always a constant. There are two separate values for preferred_work_group_size_multiple (UHD P630):

Preferred work group size multiple (device)     32
Preferred work group size multiple (kernel)     32

I just want to make sure that the 'kernel' one is a hardware-based constant, regardless of the nature of the kernel.

This way, we won't need to do a runtime check for each kernel of a different size.

Sorry for causing confusion.

Varsha_M_Intel · ‎07-09-2021

The preferred_work_group_size_multiple is a kernel query.

It returns a value of which the work-group size should preferably be a multiple when executing a kernel on a particular device. This is a performance hint. The value must be less than or equal to that returned by info::kernel_device_specific::work_group_size.

The compiler uses certain heuristics, depending on the kernel, to determine this value, so the number reported by clinfo as "Preferred work group size multiple" might differ from the result of this kernel query.

Hope this answers your question.
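Since the value is only a hint bounded by the kernel's maximum work-group size, one way to apply it is to round a desired work-group size down to a multiple of the queried value. A minimal sketch in plain C++ follows; clamp_work_group_size is a hypothetical helper, and the numbers in the note after it are the clinfo values posted earlier in this thread, not results of a kernel query.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Largest size that is <= min(desired, max_work_group_size) and a multiple
// of preferred_multiple (which must be > 0). If no non-zero multiple fits,
// the capped size is returned unchanged, because the multiple is a
// performance hint and not a hard requirement.
inline std::size_t clamp_work_group_size(std::size_t desired,
                                         std::size_t preferred_multiple,
                                         std::size_t max_work_group_size) {
    const std::size_t cap = std::min(desired, max_work_group_size);
    const std::size_t rounded = (cap / preferred_multiple) * preferred_multiple;
    return rounded != 0 ? rounded : cap;
}
```

For the UHD P630 figures (multiple 32, maximum 256), a desired size of 200 becomes 192 and a desired size of 1000 becomes 256. In real code both inputs would come from querying the compiled kernel on the target device.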

Viet-Duc · ‎07-12-2021

Thanks for your clarification.

JyotsnaK_Intel · ‎08-05-2021

Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.