Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™

MultiKernel design with oneAPI

JRome28
Novice

Hi, I have a multi-kernel design in OpenCL in which a single kernel.cl file contains several copies of the same kernel. Replicating a small kernel this way lets us occupy more of the FPGA and speed up the application, since every copy ends up in the same .aocx bitstream. On the host side, I launch each kernel at runtime through its own queue. I want to do the same with oneAPI. The problem is that there is no separate .cl "kernel file" containing only the kernel function to replicate.

I have followed one of the oneAPI Base Kit code examples. Under FPGATutorials there is a vector_add example that compiles the kernel in a separate source file (kernel.cpp) from host.cpp. In pseudocode, kernel.cpp looks like this:

 

// Inside kernel.cpp
void run_kernel(arguments) {
    queue->submit([&](handler &h) {
        h.single_task<class Kernel>([=]() {
            // Kernel code
        });
    });
}

 

My question is that I do not know exactly how to replicate the kernel that I am executing inside the single_task. I want to replicate it around 10 times in order to fully occupy the Intel Arria 10 FPGA on DevCloud.

 

I do not know at which level of the code I should replicate: the entire "run_kernel" function, the "h.single_task" call, or the "queue->submit" call.

 

I hope someone can shed some light on this question.

 

Thanks a lot!

 

7 Replies
Kenny_Tan
Moderator
JRome28
Novice

Hi Kenny, thank you very much for your help and your quick answer. That link helps a lot, because I had an older version of the oneAPI FPGA optimization guide and did not know there were updates. Thanks for sharing.

 

I have been reading the page you pointed me to and I have a better idea now, but I still do not know how to solve my issue. If you don't mind, I will extend my question with more pseudocode to explain myself better.

 

In OpenCL, the device file kernel.cl contains the following kernels. Once compiled, all of them are placed in the same bitstream file, kernel.aocx (the pseudocode also shows the OpenCL calls on the host side):

 

// Inside kernel.cl
void Kernel1(Arguments #1) {
    // Some code
}

void Kernel2(Arguments #2) {
    // Some code
}

void Kernel3(Arguments #3) {
    // Some code
}

// On the host side, main.cpp
for (some iterations) {
    cl_write_buffers(Arguments #1);
    setKernelArguments(Arguments #1);
    clEnqueueTask(Kernel1);
    cl_readbuffers(Arguments #1);

    cl_write_buffers(Arguments #2);
    setKernelArguments(Arguments #2);
    clEnqueueTask(Kernel2);
    cl_readbuffers(Arguments #2);

    cl_write_buffers(Arguments #3);
    setKernelArguments(Arguments #3);
    clEnqueueTask(Kernel3);
    cl_readbuffers(Arguments #3);
}

In this code, I load the .aocx once and call the different kernels placed on the FPGA in each iteration. This way I can call different kernels without changing the bitstream (and without paying the reprogramming delay in my execution time) and make the best use of the FPGA's resources.

 

 

 

I want to code the same in oneAPI. I have the host part (main.cpp) and a single kernel inside kernel.cpp. I call the kernel as a function from the host side as follows:

 

// Inside kernel.cpp
void run_kernel(arguments) {
    queue->submit([&](handler &h) {
        h.single_task<class Kernel>([=]() {
            // Kernel code
        });
    });
}

// In main.cpp
for (some iterations) {
    h.memcpy(data_device, data_host);
    run_kernel(arguments);
    h.memcpy(data_host, data_device);
}

The problem I see is that I do not control when the bitstream (.aocx) is loaded onto the FPGA the way I did in OpenCL. So if I call a second or third kernel, I have no guarantee that the bitstream is loaded only once with the different kernels laid out side by side on the FPGA fabric. I want to replicate my kernel as many times as will fit on the FPGA in order to parallelize and speed up execution. I want to avoid the situation where calling different kernels causes the FPGA to be reprogrammed because each kernel ended up in a different bitstream.

 

I suppose that what I have to do is the same as in OpenCL: replicate my kernels inside kernel.cpp and call them from the host side:

 

// Inside kernel.cpp
void run_kernel1(arguments) {
    queue->submit([&](handler &h) {
        h.single_task<class Kernel1>([=]() {
            // Kernel code
        });
    });
}

void run_kernel2(arguments) {
    queue->submit([&](handler &h) {
        h.single_task<class Kernel2>([=]() {
            // Kernel code
        });
    });
}

void run_kernel3(arguments) {
    queue->submit([&](handler &h) {
        h.single_task<class Kernel3>([=]() {
            // Kernel code
        });
    });
}

// In main.cpp
for (some iterations) {
    h.memcpy(data_device, data_host);
    run_kernel1(arguments);
    h.memcpy(data_host, data_device);

    h.memcpy(data_device, data_host);
    run_kernel2(arguments);
    h.memcpy(data_host, data_device);

    h.memcpy(data_device, data_host);
    run_kernel3(arguments);
    h.memcpy(data_host, data_device);
}

My question is whether doing this guarantees that the three kernels are placed on the FPGA at the same time and, if I call all three from the host, that they run in parallel without the FPGA's bitstream being changed for each kernel.

 

Thank you very much for your feedback!

Kenny_Tan
Moderator

There is some confusion here that needs to be cleared up.

  1. The memcpy function is part of the handler class, which you can find in handler.hpp source code that ships with the tool. This function isn’t documented because we recommend using buffer + accessor types in the oneAPI Programming Guide to move data between host and device. So, the first piece of advice to the user is to stop using memcpy.
  2. If the source code containing multiple kernels was built into a single binary, then there shouldn’t be any worries about each kernel being placed in a different bitstream. The fat binary that’s used to run the workload contains only one bitstream with all kernels that it was compiled with.

To create a replicated kernel in oneAPI, it’s best to write a templated function and call it with different parameters from the main code. An example of this is attached. That’s preferable to duplicating the same kernel code. You can also pass different buffers to each templated function call to make sure each kernel operates on different data.
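
The templated-function idea can be sketched in plain C++ as follows. This is a host-only illustration under stated assumptions, not the attached example (which is not reproduced in this thread): the tag type `KernelCopy<ID>` and the helpers `launch_all` and `launch_replicas` are hypothetical names, and the kernel body is a placeholder. In a real oneAPI build, the loop inside `run_kernel` would presumably move into a `queue::submit` call that uses `h.single_task<KernelCopy<ID>>(...)`, so that each ID gives the compiler a distinct kernel name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical kernel-name tag: KernelCopy<0>, KernelCopy<1>, ... are distinct
// types, which is what would let a SYCL compiler tell the copies apart.
template <std::size_t ID>
struct KernelCopy {};

// Stand-in for the thread's run_kernel(). Each ID instantiates a separate
// function; here the placeholder "kernel" simply doubles its own data slice.
template <std::size_t ID>
void run_kernel(std::vector<int>& data) {
    for (int& v : data) {
        v = v * 2;  // placeholder kernel body
    }
}

// Expands to run_kernel<0>(slices[0]), run_kernel<1>(slices[1]), ...
template <typename Slices, std::size_t... IDs>
void launch_all(Slices& slices, std::index_sequence<IDs...>) {
    assert(slices.size() == sizeof...(IDs));  // one data slice per copy
    int expand[] = {0, (run_kernel<IDs>(slices[IDs]), 0)...};
    (void)expand;  // the array exists only to drive the pack expansion
}

// Launches N copies of the kernel, one per slice, from a single source body.
template <std::size_t N>
void launch_replicas(std::array<std::vector<int>, N>& slices) {
    launch_all(slices, std::make_index_sequence<N>{});
}
```

Calling `launch_replicas(slices)` on a `std::array` of 10 vectors instantiates `run_kernel<0>` through `run_kernel<9>`, one per data slice, without copying the kernel source ten times. Giving each copy its own slice mirrors the advice above to pass different buffers to each templated call.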

asenjo
Innovator

Thanks a lot, Kenny. But regarding "An example of this is attached", I cannot see where your example is. Thanks once again.

Kenny_Tan
Moderator

There seems to be a problem attaching files from my side. Please check your email for the attachment. Thanks.

JRome28
Novice

Thank you very much for your help. Let me take a look at your file and I will write back.

 
I think this is more or less what I want to do but I need to analyse it to be sure. 
 
Thanks!
asenjo
Innovator

Got it! That's exactly what we needed. Thanks a lot!
