Solved: Calling a function in kernel scope?

grypp · ‎05-15-2020

Hi All,

For GPU targets, can I call a function in kernel scope? I have functions defined in different files, and I want to call them inside parallel_for.

Could you please show me an example?

Thanks in advance

AbhishekD_Intel · ‎05-18-2020

Hi,

Yes, you can call a function inside parallel_for() from a different file.

There are a couple of ways to do this. Examples of each are given below.

You can put the function definition in one file (kernel.cpp) and include that file in your main file (main.cpp), so the function is visible to the kernel, as shown below:

(main.cpp)

#include <CL/sycl.hpp>
#include <iostream>
#include "kernel.cpp"
#define numElements 10

using namespace std;
int main(void){

        size_t size = numElements * sizeof(float);
        // Allocate the host vectors
        float *A = (float *)malloc(size);
        float *B = (float *)malloc(size);
        float *C = (float *)malloc(size);

        for (int i = 0; i < numElements; ++i)
        {
                A[i] = i;//rand()/(float)RAND_MAX;
                B[i] = i;//rand()/(float)RAND_MAX;
        }

        cl::sycl::queue queue(cl::sycl::gpu_selector{});
        std::cout << "Running on " << queue.get_device().get_info<cl::sycl::info::device::name>() << "\n";

        cl::sycl::device dev = queue.get_device();
        cl::sycl::context ctx = queue.get_context();

        // Allocate the device vectors (USM device allocations)
        float *d_A = (float *)cl::sycl::malloc_device(size, dev, ctx);
        float *d_B = (float *)cl::sycl::malloc_device(size, dev, ctx);
        float *d_C = (float *)cl::sycl::malloc_device(size, dev, ctx);

        // Device allocations are not host-accessible, so copy through the queue
        queue.memcpy(d_A, A, size).wait();
        queue.memcpy(d_B, B, size).wait();


        {
          queue.submit(
            [&](cl::sycl::handler &cgh) {
              cgh.parallel_for<class vectorAdd_e83213>(
                cl::sycl::range<1>{numElements}, [=](cl::sycl::item<1> item_ct1) {
                  vectorAdd(d_A, d_B, d_C, numElements, item_ct1); //defined in other file
                });
            });
        }

        queue.wait();
        // Copy the result back to the host through the queue
        queue.memcpy(C, d_C, size).wait();
        for (int i = 0; i < numElements; i++)
                cout << A[i] << " " << B[i] << " " << C[i] << " " << endl;

        cout<<endl;


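        // Release the device allocations, then the host vectors
        cl::sycl::free(d_A, ctx);
        cl::sycl::free(d_B, ctx);
        cl::sycl::free(d_C, ctx);
        free(A);
        free(B);
        free(C);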
        return 0;

}

(kernel.cpp)

#include <CL/sycl.hpp>
// Device kernel
void vectorAdd(const float *A, const float *B, float *C, int numElement, cl::sycl::item<1> item_ct1)
{
        // One work-item handles one element
        int i = item_ct1.get_linear_id();
        if (i < numElement)
        {
                C[i] = A[i] + B[i];
        }
}
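
To check the result rather than only printing it, a small host-side comparison can be run on the copied-back data. The following is a minimal sketch, assuming the host arrays A, B and C from main.cpp; the name firstMismatch and the tolerance parameter are hypothetical and not part of the original example.

```cpp
#include <cmath>

// Hypothetical host-side helper (not part of the original example).
// Returns the index of the first element where C[i] differs from
// A[i] + B[i] by more than tol, or -1 if every element matches.
int firstMismatch(const float *A, const float *B, const float *C,
                  int numElement, float tol = 1e-6f)
{
        for (int i = 0; i < numElement; ++i)
        {
                if (std::fabs(C[i] - (A[i] + B[i])) > tol)
                        return i;
        }
        return -1;
}
```

In main.cpp this could be called after the result has been copied back to C, for example firstMismatch(A, B, C, numElements), with a return value of -1 meaning that the device result agrees with the host computation.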

Alternatively, you can define an OpenCL __kernel in a separate file (kernel.cl) and read the whole file into a buffer. Build a program from that source with the context of your device queue, then call get_kernel("function_name") on the program object. This is like calling an OpenCL kernel from inside a DPC++ program.

(main.cpp)

#include<CL/sycl.hpp>
#include<cstring>   // memset
#include<iostream>
#include<fstream>
#include<string>
#define N 10
using namespace cl::sycl;
char* readCLFile(std::string f, unsigned int* size) {
    std::ifstream ifs(f);
    std::filebuf* fbuf = ifs.rdbuf();
    *size = fbuf->pubseekoff(0, ifs.end, ifs.in);
    fbuf->pubseekpos(0, ifs.in);
    char* buf = new char[*size];
    memset(buf, 0, sizeof(char) * (*size));
    fbuf->sgetn(buf, (*size));
    ifs.close();
    return buf;
}

int main() {
    int a[N], b[N], c[N];
    unsigned int size;
    //auto R = range<1>(N);
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = i;
        c[i] = 0;
    }
        
    char* buf = readCLFile("kernel.cl", &size);
    queue q(gpu_selector{});
    std::cout << "Running on " << q.get_device().get_info<cl::sycl::info::device::name>() << "\n";

    auto ctx = q.get_context();
    program p(ctx);
    p.build_with_source(std::string(buf, size));

    kernel k = p.get_kernel("vec_add");

    buffer<int, 1> bufa(a, range<1>(N) );
    buffer<int, 1> bufb(b, range<1>(N) );
    buffer<int, 1> bufc(c, range<1>(N) );

    q.submit([&](handler& h) {
        auto acc_a = bufa.get_access<access::mode::read>(h);
        auto acc_b = bufb.get_access<access::mode::read>(h);
        auto acc_c = bufc.get_access<access::mode::read_write>(h);

        h.set_args(acc_a, acc_b , acc_c);

        h.parallel_for(range<1>(N), k);
        });
    // Creating a host accessor waits for the kernel to finish
    auto host_acc = bufc.get_access<access::mode::read>();
    for (int i = 0; i < N; i++)
        std::cout << host_acc[i] << "\n";

    return 0;
}

(kernel.cl)

__kernel void vec_add(__global int* ptr1, __global int* ptr2, __global int* ptr3)
{
    int index = get_global_id(0);

    ptr3[index] = ptr1[index] + ptr2[index];
}
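
The readCLFile helper above manages a raw char buffer by hand and never frees it. A minimal sketch of an alternative that returns the file contents as a std::string is given here; the name readTextFile is hypothetical, and throwing when the file cannot be opened is an assumption about the desired error handling, not something taken from the original example.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical replacement for readCLFile: reads the whole file into a
// std::string, so there is no manual buffer to size, zero, or delete.
std::string readTextFile(const std::string &path)
{
    std::ifstream ifs(path, std::ios::binary);
    if (!ifs)
        throw std::runtime_error("cannot open " + path);
    // Copy every byte from the stream until end of file
    return std::string(std::istreambuf_iterator<char>(ifs),
                       std::istreambuf_iterator<char>());
}
```

Since build_with_source is already called with a std::string in the example, the returned string could be passed to it directly, for example p.build_with_source(readTextFile("kernel.cl")).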

Hope this solves your problem.

Warm Regards,

Abhishek

grypp · ‎05-18-2020

Hi Abhishek,

Thanks. This is NOT what I am asking. Let me clarify my question. I want to call a CPU function which is in another file. I don't want to inline it. Also, I don't want to write a low-level OpenCL kernel for it.

I mean, if I remove #include "kernel.cpp" from the first example and compile the two files separately, would it work?

AbhishekD_Intel · ‎05-19-2020

Hi,

I understand what you want to do. You want to compile the function file and the main file separately, then link them so that the kernel in the main file can call the function defined in the function file. This is doable; please follow the steps below:

Create a header file that declares those functions, and add the SYCL_EXTERNAL attribute to each declaration, as in the sample below.

(kernel.h)

#pragma once
#include<CL/sycl.hpp>

extern SYCL_EXTERNAL void vectorAdd(const float *A, const float *B, float *C, int numElement, cl::sycl::item<1> item_ct1);

Include this header file (kernel.h) in your main file (main.cpp in our case) and also in the function file (kernel.cpp). Remove the #include "kernel.cpp" line from main.cpp, compile main.cpp and kernel.cpp separately, and link the two object files to generate the executable.

Do let us know if you face any problem while following the above steps.

Warm Regards,

Abhishek

grypp · ‎05-19-2020

That's exactly what I was asking. Awesome thanks!

Follow-up questions

Is it possible to use nested parallelism, that is, a "parallel_for" inside the "vectorAdd" function?
Do I need pragma once?

AbhishekD_Intel · ‎05-19-2020

Hi,

Thanks for the confirmation. Good to know that the solution helped.

Regarding your follow-up questions:

#pragma once is a common convention for header files. In a large application the same header can end up included more than once in a single source file, directly or through other headers, which can produce duplicate definitions. #pragma once tells the compiler to process the header only once per source file. It is good practice, but whether you use it is up to you.
For nested parallelism, I suggest you post a new thread, because that topic is quite different from this one. We will help you there.
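
To illustrate the #pragma once point, the sketch below simulates the same header body being pasted twice into one source file, using the classic #ifndef include guard, which has the same effect as #pragma once. The macro name KERNEL_H_GUARD and the VecParams struct are hypothetical and only there for illustration; kernel.h as posted contains just a function declaration, which may legally be repeated, so the guard starts to matter once a header gains type or inline function definitions.

```cpp
// First "inclusion" of the header body: the guard macro is not yet
// defined, so the contents are compiled.
#ifndef KERNEL_H_GUARD   // hypothetical guard name
#define KERNEL_H_GUARD
struct VecParams {
    int numElement;
};
#endif

// Second "inclusion" of the same body: the guard macro is already
// defined, so the preprocessor skips everything up to #endif.
// Without the guard, the second struct definition would be a
// redefinition error.
#ifndef KERNEL_H_GUARD
#define KERNEL_H_GUARD
struct VecParams {
    int numElement;
};
#endif
```

With #pragma once at the top of the header, the compiler performs the same skip without needing a unique macro name.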

Warm Regards,

Abhishek

GouthamK_Intel · ‎06-01-2020

Hi,

Glad to hear that the solution helped.

Could you please raise a new thread for the nested parallelism issue, with details about your use case, so that we can make a feature request to the concerned team?

We are closing this thread.

Regards

Goutham