Is there a specific oneAPI macro (or other compile-time construct, since macros seem to be interpreted at the dpcpp step) to target a given backend/platform (GPU, CPU) inside a JIT-compiled kernel?
something like
#ifdef ON_CPU
// some code that does not compile on the GPU (or is optimized for the CPU)
#elif defined(ON_GPU)
// an alternative for the GPU
#else
// neither target selected: throw, or #error
#endif
Hi,
Thanks for reaching out to us.
>>Is there any specific oneAPI macro..
Please find below the sample code and the steps we followed on our end to target specific devices.
vec_add.cpp:
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
using namespace sycl;

constexpr int N = 1024;

int main() {
    std::vector<int> A(N, N), B(N, N), C(N, 0);
    {
        range<1> R(N);
        buffer<int, 1> buffA(A.data(), R);
        buffer<int, 1> buffB(B.data(), R);
        buffer<int, 1> buffC(C.data(), R);
#ifdef CPU
        // Built with -DCPU: construct the queue only for the selected device,
        // so a missing GPU cannot make the CPU build throw.
        queue cpuQ(cpu_selector{});
        std::cout << "Running on: " << cpuQ.get_device().get_info<sycl::info::device::name>() << "\n";
        cpuQ.submit([&](handler &cgh) {
            auto acc_buffA = buffA.get_access<access::mode::read>(cgh);
            auto acc_buffB = buffB.get_access<access::mode::read>(cgh);
            auto acc_buffC = buffC.get_access<access::mode::write>(cgh);
            cgh.parallel_for(R, [=](id<1> i) {
                acc_buffC[i] = acc_buffA[i] + acc_buffB[i];
            });
        });
#elif defined(GPU)
        // Built with -DGPU: same kernel, submitted to a GPU queue.
        queue gpuQ(gpu_selector{});
        std::cout << "Running on: " << gpuQ.get_device().get_info<sycl::info::device::name>() << "\n";
        gpuQ.submit([&](handler &cgh) {
            auto acc_buffA = buffA.get_access<access::mode::read>(cgh);
            auto acc_buffB = buffB.get_access<access::mode::read>(cgh);
            auto acc_buffC = buffC.get_access<access::mode::write>(cgh);
            cgh.parallel_for(R, [=](id<1> i) {
                acc_buffC[i] = acc_buffA[i] + acc_buffB[i];
            });
        });
#endif
    }   // buffers go out of scope here; results are written back to C
    std::vector<int> vecValidate(N, 2 * N);
    std::cout << ((C == vecValidate) ? "Success\n" : "Failure\n");
    return 0;
}
Commands to execute:
dpcpp vec_add.cpp -DCPU -o cpuadd && ./cpuadd
dpcpp vec_add.cpp -DGPU -o gpuadd && ./gpuadd
We use the -D option to define a macro name while compiling the source code.
We can target a specific backend using the SYCL_DEVICE_FILTER environment variable.
Syntax:
export SYCL_DEVICE_FILTER=backend:device_type:device_num
Possible values of backend are:
host
level_zero
opencl
Possible values of device_type are:
host
cpu
gpu
acc
device_num is an integer that indexes the devices enumerated by the sycl-ls utility; sycl-ls lists all devices, with their indices, across all backends.
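For example (assuming sycl-ls lists an OpenCL CPU device and a Level Zero GPU, each at index 0):
export SYCL_DEVICE_FILTER=opencl:cpu:0       # expose only the OpenCL CPU device
export SYCL_DEVICE_FILTER=level_zero:gpu:0   # expose only the first Level Zero GPU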
Thanks & Regards,
Noorjahan
Hi,
Has the information provided above helped? If yes, could you please confirm that we can close this thread from our end?
Thanks & Regards,
Noorjahan.
I understand now that what I had in mind cannot be done;
I will do it another way.
Thanks for the help.
Please close the thread.
Hi,
>>I understood that what I had in mind cannot be done
Could you please elaborate on your issue so that we can try to help you in another way?
Thanks & Regards,
Noorjahan.
Say I have a kernel that I want to run on both GPU and CPU, but in one small part I wish to use an optimized routine that will NOT compile on the other target. I do not want to duplicate all the code (as in your example), and I do not want to split my kernel.
This will not work:
auto kernel = [=](auto i, ...) {
    //
    // a lot of code
    //
#ifdef ONGPU
    optimizedForGPU   // may not parse on the HOST
#else
    optimizedForCPU
#endif
    //
    // more code
    //
};
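The closest C++-level approximation seems to be hoisting the choice into a template parameter and selecting with if constexpr, so the branch not taken is parsed but never instantiated (a sketch; the sort routines are hypothetical):

#include <cstddef>
// Device-specific step factored behind a template parameter: inside a template,
// the discarded branch of "if constexpr" is not instantiated, so a dependent
// call to a routine that only exists for one target is never looked up.
template <bool OnGpu, typename T>
void sort_span(T *data, std::size_t n) {
    if constexpr (OnGpu)
        bucket_sort_single_block(data, n);   // hypothetical GPU-only routine
    else
        quick_sort_single_thread(data, n);   // hypothetical CPU-only routine
}
// The kernel body stays single-source; each queue gets its own instantiation:
//   cpuQ.submit(... sort_span<false>(ptr, n) ...);
//   gpuQ.submit(... sort_span<true>(ptr, n) ...);

This still compiles both instantiations ahead of time; it avoids source duplication, not the dual compilation itself.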
In any case, my questions were:
1) Can macros be used in SYCL kernels to drive JIT compilation?
My understanding is: NO, they are parsed by the host compiler ahead of time (as in your example).
2) Is there any other mechanism in SYCL to compile part of a kernel for a specific target?
My understanding is: NO, there is no "special" mechanism in SYCL to perform conditional compilation at JIT time.
Thanks again for the support and please close the thread.
Hi,
>>Thanks again for the support and please close the thread.
We have discussed this case with the concerned team, and they are interested in getting some more details. Could you please let us know whether there is a particular use case behind compiling part of a kernel for a specific target, so that we can check the possibilities internally and try to address your questions specifically?
Before we proceed with closing this thread as you requested: your use case would also make this thread a good reference for anybody looking at this topic with similar thoughts.
Thanks & Regards,
Noorjahan.
For instance, in a kernel that I run either on the GPU or on the CPU under CUDA/nvcc, at some point I need to perform a sort, and I use __CUDA_ARCH__ to select either a bucket sort for the GPU (single block) or a quicksort for the CPU (single thread).
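A minimal sketch of that CUDA pattern (the sort routines are hypothetical): nvcc compiles the same source twice, and __CUDA_ARCH__ is defined only during the device pass, so the preprocessor removes each branch before the other compiler ever parses it.

__host__ __device__ void sort_span(int *data, int n) {
#ifdef __CUDA_ARCH__
    bucket_sort_single_block(data, n);   // device pass: single-block bucket sort
#else
    quick_sort_single_thread(data, n);   // host pass: single-thread quicksort
#endif
}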
Hi,
Could you please provide us with a sample reproducer using __CUDA_ARCH__, so that we can understand it better?
Thanks & Regards,
Noorjahan.
I am surprised this kind of issue has not been raised or encountered before.
Anyhow, please find examples here:
CUDA
Alpaka
Hi,
We are working on your issue. We will get back to you soon.
Could you please confirm whether you are using the nvrtc CUDA runtime compilation library in your application?
Thanks & Regards,
Noorjahan.
Hi,
We have reported this issue to the development team; they are looking into it.
Thanks & Regards,
Noorjahan.