Re: oneAPI FPGA Concurrent Kernels

jbick819 · 05-18-2022

Hello,

I've been experimenting with ways to optimize my code by executing multiple "single_task" kernels in parallel. My main interest is writing a single kernel and then calling it multiple times with different parameters to customize execution (e.g. the boundaries to read from an array). However, in my current design, the kernels appear to execute sequentially. I have not added any properties to my SYCL queue that dictate in-order vs. out-of-order execution, so it should default to out-of-order if I'm not mistaken.

Is there a particular way to achieve this goal of creating a single kernel function and calling it multiple times that I am missing? To better illustrate what I am attempting to do, I'll add an example below.

void kernelCode(...) {
     q.submit([&](sycl::handler &h) {
          h.single_task([=]() {
               //work
          });
     });
}

void otherFunc(...) {
     kernelCode(...); //1st kernel
     kernelCode(...); //2nd kernel
     ... //several more times, all meant to execute at the same time
}

Any help that can be provided would be much appreciated!

aikeu · 05-20-2022

Hi jbick819,

You can refer to this document:

https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/optimize-your-design/throughput-1/host/multi-threaded-host-application.html

Thanks.

Regards,

Aik Eu

jbick819 · 05-23-2022

Hello,

Thank you for the response! I went ahead and tried using OpenMP to run the host code, with each thread calling one iteration of the kernel function, but I was still unable to get actual parallel execution, based on the timing information I collected.

Could this be caused by sharing a single queue between threads, or using USM and having each thread share those memory pointers? I'm currently testing implementations that make use of buffers and accessors that wrap around USM pointers, as well as trying to give each thread its own queue to submit the kernels to.

Any additional insight would be much appreciated!

aikeu · 05-25-2022

Hi jbick819,

Could you share how you wrote the code on your side to run the tasks in parallel?

Thanks.

Regards,

Aik Eu

jbick819 · 05-27-2022

Hello,

Sure thing. Rather than repeatedly calling the single kernel function as I described above, I essentially copy and paste the code several times and manually change the necessary parameters. It looks like this:

void workFunc(...) {
     //kernel 1
     q.submit([&](sycl::handler &h) {
          h.single_task([=]() {
               //work
          });
     });

     //kernel 2
     q.submit([&](sycl::handler &h) {
          h.single_task([=]() {
               //work
          });
     });

     //kernel 3
     q.submit([&](sycl::handler &h) {
          h.single_task([=]() {
               //work
          });
     });

     //wait for all tasks to complete
     q.wait();
}

When I do it this way, all kernels execute in parallel, based on the timing information I collected.

Thank you!

aikeu · 05-30-2022

Hi jbick819,

Good to know that the code now runs the tasks in parallel as intended.

I will close this thread for now.

Thanks.

Regards,

Aik Eu

jbick819 · 05-30-2022

Hello,

I was actually still hoping to get my program to execute in parallel using the method from my original post (05-18-2022): one kernel function called multiple times. That approach uses far fewer resources than my current one, where I copy and paste the q.submit repeatedly (05-27-2022 post). It would allow me to run many more kernels in parallel, as the current design maxes out resource utilization at only 5 kernels.

Is it possible to get this working with the single-function structure? Since OpenMP did not help, are there additional pragmas or strategies that I could use? Or is it simply not possible?

Thank you!

aikeu · 06-07-2022

Hi jbick819,

I got the following information from the team. Hopefully it gives you some options to work with in your code:

"

If you want to reuse the same kernel multiple times in a pipeline-parallel fashion, there are a few things you can try to allow the host to 'share' the kernel multiple times.

1. Invoke your kernel over a range using the parallel_for function. The downside of parallel_for is that the compiler will not pipeline any loops in your kernel unless you use the -Xsauto-pipeline flag. parallel_for also requires you to know ahead of time how many times you want to invoke the kernel.

2. You can ‘demote’ the repeated kernel calls into the kernel itself and allow the compiler to pipeline the ‘demoted’ loop. The loop can be a while(1) loop if you want it to run indefinitely, or it can be a regular bounded for-loop. Use USM reads as you suggested earlier. Then, instead of invoking the kernel multiple times, you launch it once and pass it data from your repeated subroutine calls.

"
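To make option 2 a bit more concrete, here is a rough host-side model in plain C++ (no SYCL toolchain needed to try it). It is only a sketch under assumptions: WorkItem, make_work_items and demoted_kernel_body are hypothetical names, and the per-range sum is just a placeholder for the real "work". In an actual design, the body of demoted_kernel_body would sit inside a single h.single_task lambda, and the pointers would be USM allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor for one former "kernel call": the half-open
// range [begin, end) of the input array that this call should read.
struct WorkItem {
    std::size_t begin;
    std::size_t end;
};

// Split n elements into `parts` contiguous ranges of near-equal size.
// These play the role of the per-call parameters from the original post.
std::vector<WorkItem> make_work_items(std::size_t n, std::size_t parts) {
    assert(parts > 0);
    std::vector<WorkItem> items;
    items.reserve(parts);
    for (std::size_t i = 0; i < parts; ++i) {
        items.push_back({i * n / parts, (i + 1) * n / parts});
    }
    return items;
}

// Stand-in for the body of ONE single_task kernel. The outer loop is the
// "demoted" loop: each iteration replaces one separate kernel submission.
// Assumption: the work is a simple sum over the range (placeholder only).
void demoted_kernel_body(const int* data, const WorkItem* items,
                         std::size_t num_items, long long* results) {
    for (std::size_t k = 0; k < num_items; ++k) {
        long long sum = 0;
        for (std::size_t i = items[k].begin; i < items[k].end; ++i) {
            sum += data[i];
        }
        results[k] = sum;
    }
}
```

With this structure there is only one hardware copy of the kernel, so the resource cost does not grow with the number of "calls"; the overlap comes from the compiler pipelining the outer loop rather than from replicated kernels. For option 1, the same body could be reused with the index k taken from the parallel_for work-item ID instead of the outer loop.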

Thanks.

Regards,

Aik Eu

aikeu · 06-09-2022

Hi jbick819,

I will close this thread if there are no further questions.

Thanks.

Regards,

Aik Eu