How to launch replicated Single Work-Item Kernels (using num_compute_units)

Altera_Forum · 06-02-2016

I am trying to understand how to replicate single work-item kernels (tasks) and especially how to call them.

The programming guide (https://www.altera.com/en_us/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf) says (e.g. on page 2-27, but also in other places) that you can specify __attribute__((max_global_work_dim(0))) to force a kernel to be compiled as a single work-item kernel, and __attribute__((num_compute_units(2))) to replicate the engine (two replicas in my example). Taking the fft1d code from the Altera OpenCL design examples, this would mean that I can instantiate two independent FFT engines on the FPGA, which seems straightforward enough.

However, what I don't understand is how to launch that kernel so that both replicas are used. The fft1d example performs 2000 FFTs, each of size 4096. The 2000 is an input parameter to the fft kernel, which then implements the loop. When the kernel is launched with clEnqueueTask(), to my understanding this creates only one work-item in one work-group and can therefore only run on one of the two FFT engines, right? So how do I have to launch the kernel so that both engines do half the work (1000 FFTs each)? I can't do it with clEnqueueTask(), because I can't specify how the work is distributed between the engines, and I probably (?) can't use clEnqueueNDRangeKernel() because it's not an NDRange kernel but a single work-item kernel (task)?

Any help is greatly appreciated!

JSchr20 · 04-30-2020

Sadly, this is just about exactly the question I was about to post. I do not know the answer, and I am sad to see that a question nearly four years old has no responses either. :-(

HRZ · 05-02-2020

You can only replicate autorun single work-item kernels. Autorun kernels have no interface to the host or to external memory, and they launch automatically as soon as the FPGA is configured with the associated bitstream. The only means of communicating with an autorun kernel is through on-chip channels. The typical way of using autorun kernels is that a non-autorun kernel reads data from memory, the data is then streamed through channels into an array of autorun kernels, and finally it is forwarded to another non-autorun kernel that writes it back to external memory. Refer to Sections 10.3 and 10.4 of the Programming Guide for more information:

https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807965224.html#ewa1456336930202
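
The pattern described above might look roughly like the following. This is only a minimal sketch, not the fft1d code. The kernel names (reader, engine, writer), the channel names, NUM_ENGINES, the per_engine argument, and the placeholder computation are all made up for illustration. It assumes the Intel-era channel extension; older Altera releases use cl_altera_channels and read_channel_altera()/write_channel_altera() instead.

```c
// Sketch only: all kernel, channel, and argument names are hypothetical.
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define NUM_ENGINES 2  // assumed replica count

// One input and one output channel per replica.
channel float to_engine[NUM_ENGINES];
channel float from_engine[NUM_ENGINES];

// Replicated autorun kernel: takes no arguments and starts on its own
// once the FPGA is configured. The host never enqueues it.
__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__attribute__((num_compute_units(NUM_ENGINES)))
__kernel void engine(void) {
  while (1) {
    // get_compute_id(0) identifies the replica, so each copy is wired
    // to its own pair of channels.
    float x = read_channel_intel(to_engine[get_compute_id(0)]);
    // Placeholder computation standing in for the real FFT engine.
    write_channel_intel(from_engine[get_compute_id(0)], 2.0f * x);
  }
}

// Non-autorun reader: launched from the host, splits the input in half
// and feeds one half to each replica.
__attribute__((max_global_work_dim(0)))
__kernel void reader(__global const float *restrict src, int per_engine) {
  for (int i = 0; i < per_engine; i++) {
    write_channel_intel(to_engine[0], src[i]);
    write_channel_intel(to_engine[1], src[per_engine + i]);
  }
}

// Non-autorun writer: launched from the host, collects the results of
// both replicas and stores them in external memory.
__attribute__((max_global_work_dim(0)))
__kernel void writer(__global float *restrict dst, int per_engine) {
  for (int i = 0; i < per_engine; i++) {
    dst[i] = read_channel_intel(from_engine[0]);
    dst[per_engine + i] = read_channel_intel(from_engine[1]);
  }
}
```

In this arrangement the host would only launch reader and writer with clEnqueueTask(), most likely on two separate command queues so that both run at the same time. The split of the work between the replicas is decided inside reader, not by the launch call.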