Re: MultiKernel design with oneAPI

JRome28 · ‎06-21-2020

Hi, I have a multi-kernel design in OpenCL, where a single kernel.cl file contains several copies of the same kernel. This way we can replicate a small kernel in order to occupy more of the FPGA and speed up our application, since every copy ends up in the .aocx bitstream. On the host side I call each kernel at runtime with a different queue. I want to do the same with oneAPI. The problem is that there is no separate "kernel file" (.cl) that contains only the kernel function to replicate.

I have followed an example from the BaseKit code examples in oneAPI. In FPGATutorials there is a vector_add example that compiles the kernel in a source file (kernel.cpp) separate from host.cpp. In pseudocode, this kernel.cpp is:

//Inside kernel.cpp
 
void run_kernel(arguments){

    queue->submit([&](handler &h){

        h.single_task<class Kernel>([=](){
            //Kernel code
        });
    });
}

My question is that I do not know exactly how to replicate the kernel that I am executing inside the single_task. I want to replicate the kernel around 10 times in order to fully occupy the Intel Arria 10 FPGA on DevCloud.

I do not know from what level of code I have to replicate: the entire "run_kernel" function, the "h.single_task" or the "queue->submit" call.

I hope someone can shed some light on this question.

Thanks a lot!

KennyTan_Altera · ‎06-22-2020

Check this https://software.intel.com/content/www/us/en/develop/download/oneapi-fpga-optimization-guide.html

Page 82. thanks

JRome28 · ‎06-22-2020

Hi Kenny, thank you very much for your help and your quick answer. That link helps me a lot, because I had an older version of the oneAPI FPGA optimization guide and I didn't know there were updates! Thanks for sharing.

I have been reading the page you pointed me to and now I have a better idea, but I still do not know how to solve my issue. If you don't mind, I am going to extend my question a little with more pseudocode to help explain myself.

In OpenCL, the device file kernel.cl contains the following kernels. Once compiled, all of them are placed in the same bitstream file, kernel.aocx (the pseudocode also shows the OpenCL calls on the host side):

//Inside kernel.cl
 
__kernel void Kernel1(Arguments #1){
    //Some code
}
 
__kernel void Kernel2(Arguments #2){
    //Some code
}
 
__kernel void Kernel3(Arguments #3){
    //Some code
}
 
 
 
//In host side main.cpp
 
for (some iterations){
 
    cl_write_buffers(Arguments #1);
    setKernelArguments (Arguments #1);
    clEnqueueTask(Kernel1);
    cl_readbuffers(Arguments #1);
 
    cl_write_buffers(Arguments #2);
    setKernelArguments (Arguments #2);
    clEnqueueTask(Kernel2);
    cl_readbuffers(Arguments #2);
 
 
    cl_write_buffers(Arguments #3);
    setKernelArguments (Arguments #3);
    clEnqueueTask(Kernel3);
    cl_readbuffers(Arguments #3);
 
}

In this code, I load the .aocx once and, in each iteration, call the different kernels placed on the FPGA. This way I can call different kernels without changing the bitstream (so I do not pay the reprogramming delay in my execution time) and make the best use of the FPGA's resources.

I want to code the same in oneAPI. I have the host part (main.cpp) and one single kernel inside kernel.cpp. I call the kernel as a function from the host side as follows:

//Inside kernel.cpp
 
void run_kernel(arguments){

    queue->submit([&](handler &h){

        h.single_task<class Kernel>([=](){
            //Kernel code
        });
    });
}
 
 
//In main.cpp
 
 
for (some iterations){
 
    h.memcpy(data_device, data_host);
    run_kernel(arguments);
    h.memcpy(data_host, data_device);
}

The problem I see here is that I do not control when the .aocx bitstream is loaded onto the FPGA like I did in OpenCL, so I cannot be sure that, if I call a second or third kernel, the bitstream is loaded only once with the different kernels distributed across the FPGA fabric. I want to replicate my kernel as many times as will fit on the FPGA in order to parallelize and speed up my execution. So I want to avoid the mistake where calling different kernels causes the FPGA to be reprogrammed because each kernel ended up in a different bitstream.

I suppose that what I have to do is what I did with OpenCL: replicate my kernels inside kernel.cpp and just call them from the host side:

//Inside kernel.cpp
 
void run_kernel1(arguments){

    queue->submit([&](handler &h){

        h.single_task<class Kernel1>([=](){
            //Kernel code
        });
    });
}
 
void run_kernel2(arguments){

    queue->submit([&](handler &h){

        h.single_task<class Kernel2>([=](){
            //Kernel code
        });
    });
}
 
void run_kernel3(arguments){

    queue->submit([&](handler &h){

        h.single_task<class Kernel3>([=](){
            //Kernel code
        });
    });
}
 
 
 
 
//In main.cpp
 
 
for (some iterations){
 
    h.memcpy(data_device, data_host);
    run_kernel1(arguments);
    h.memcpy(data_host, data_device);
 
     h.memcpy(data_device, data_host);
    run_kernel2(arguments);
    h.memcpy(data_host, data_device);
 
     h.memcpy(data_device, data_host);
    run_kernel3(arguments);
    h.memcpy(data_host, data_device);
 
 
 
}

My question is whether doing this guarantees that the 3 kernels are placed on the FPGA at the same time and, if I call the three of them from the host, that they run in parallel without the FPGA's bitstream being changed for each kernel.

Thank you very much for your feedback!

KennyTan_Altera · ‎06-25-2020

There is some confusion here that needs to be cleared up.

The memcpy function is part of the handler class, which you can find in handler.hpp source code that ships with the tool. This function isn’t documented because we recommend using buffer + accessor types in the oneAPI Programming Guide to move data between host and device. So, the first piece of advice to the user is to stop using memcpy.
If the source code containing multiple kernels was built into a single binary, then there shouldn’t be any worries about each kernel being placed in a different bitstream. The fat binary that’s used to run the workload contains only one bitstream with all kernels that it was compiled with.

In order to create a replicated kernel in oneAPI, it’s best to create a templated function and call it with different template parameters from the main code. An example of this is attached. That’s preferable to duplicating the same kernel code. You can also pass different buffers to each templated function call to make sure each kernel operates on different data.
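
The attachment is not reproduced in this thread. As a rough illustration of the templated-function idea only, here is a minimal sketch in plain standard C++ (no SYCL headers, so it builds with any C++14 compiler). The names `run_kernel`, `launch_impl`, and `launch_replicas`, and the one-slice-per-replica split, are hypothetical choices for this sketch. In oneAPI the body of `run_kernel<Id>` would instead be a `queue.submit` containing a `single_task` whose kernel name type also depends on `Id`, and each call would receive its own buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One instantiation per replica Id. ASSUMPTION: in SYCL the body would be a
// queue.submit(...) with a single_task named by a type that depends on Id;
// here it is plain C++ so the sketch builds without a SYCL compiler.
template <std::size_t Id>
void run_kernel(std::vector<int>& data, std::size_t chunk) {
    // Each replica owns one contiguous slice, so replicas never share data.
    const std::size_t begin = std::min(Id * chunk, data.size());
    const std::size_t end = std::min(begin + chunk, data.size());
    assert(begin <= end);
    for (std::size_t i = begin; i < end; ++i) {
        data[i] = static_cast<int>(Id) + 1;  // placeholder for the kernel code
    }
}

// Calls run_kernel<0>, run_kernel<1>, ... without writing each call by hand.
template <std::size_t... Ids>
void launch_impl(std::vector<int>& data, std::index_sequence<Ids...>) {
    constexpr std::size_t n = sizeof...(Ids);
    const std::size_t chunk = (data.size() + n - 1) / n;  // ceiling division
    // Braced initializers are evaluated left to right, one call per Id.
    const int expand[] = {0, (run_kernel<Ids>(data, chunk), 0)...};
    (void)expand;
}

// Entry point: NumReplicas is the number of kernel copies to instantiate.
template <std::size_t NumReplicas>
void launch_replicas(std::vector<int>& data) {
    static_assert(NumReplicas > 0, "need at least one replica");
    launch_impl(data, std::make_index_sequence<NumReplicas>{});
}
```

Calling `launch_replicas<10>(data)` would instantiate ten distinct functions from the one template. The calls in this sketch run one after another; in a SYCL version each submit returns without waiting for the kernel to finish, so replicas that work on different buffers can overlap.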

asenjo · ‎06-26-2020

Thanks a lot, Kenny. But regarding "An example of this is attached", I cannot see where your example is. Thanks once again.

KennyTan_Altera · ‎06-26-2020

It seems there is a problem attaching files from my side. Please check your email for the attachment. Thanks.

JRome28 · ‎06-26-2020

Thank you very much for your help. Let me take a look at your file and I will write back.

I think this is more or less what I want to do but I need to analyse it to be sure.

Thanks!

asenjo · ‎06-27-2020

Got it! That's just what we needed. Thanks a lot!