I've written a fairly simple kernel that multiplies multi-limb operands. I've written two versions: an ndrange version and a task version. Both work fine, but the performance of the task version is significantly better.
The design is quite small, and I calculated that it only uses maybe 10% of the PCIe bandwidth. I'd like, therefore, to put ~8 copies of this kernel in the FPGA instead of just one, so that I can have 8x the parallelism.
With the ndrange kernel, this appears dead easy: I just put the attribute num_compute_units(8) at the top of my kernel, and my clEnqueueNDRangeKernel just works like a charm, divvying the work up among the compute units the same way it does among different devices.
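For reference, the attribute usage is roughly this (the kernel name, arguments, and body here are simplified placeholders, not my real multi-limb code):

```c
// Hypothetical NDRange multiply kernel; the offline compiler replicates
// the datapath eight times and the runtime spreads work-groups across them.
__attribute__((num_compute_units(8)))
__kernel void mul_limbs(__global const ulong *restrict a,
                        __global const ulong *restrict b,
                        __global ulong *restrict result) {
    size_t gid = get_global_id(0);
    // placeholder for the real multi-limb multiplication
    result[gid] = a[gid] * b[gid];
}
```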
With the task kernel, however, only one of my compute units ever gets driven. All of my attempts to coax better behavior by adjusting the global or local work-group size, or the max/required work-group size attributes, only cause weird and undesirable behavior. The documentation suggests that any attempt to query work-item IDs or use attribute-driven SIMD vectorization will turn my kernel into an NDRange, and thus non-pipelined, kernel.
I want my kernel pipelined, as the performance is much better. I just want, in one FPGA, eight copies that get fed 1/8th of the data each.
What's the right way to do this?
(I'm using version 19.4 and 20.1.)
I think you are referring to external memory bandwidth rather than PCIe bandwidth. PCIe bandwidth is determined by the physical characteristics of the PCIe connection on your FPGA board and motherboard (number of lanes and PCIe version), and its effective throughput depends on factors such as the size of each data transfer and the efficiency of the PCIe driver; these are not really factors the programmer can control.
Assuming you are referring to external memory bandwidth, your problem has a simple solution: use loop unrolling to vectorize your single work-item kernel. Loop unrolling not only increases the amount of computation your kernel performs per cycle; it also lets the compiler coalesce consecutive memory accesses in your loop into larger accesses, which results in better utilization of the external memory bandwidth. Loop unrolling in single work-item kernels gives behavior similar to the SIMD attribute in NDRange kernels.
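As a sketch of what I mean (the unroll factor, kernel signature, and buffer layout here are assumptions for illustration, not your code):

```c
// Single work-item (task) kernel with a partially unrolled loop.
__kernel void mul_task(__global const ulong *restrict a,
                       __global const ulong *restrict b,
                       __global ulong *restrict result,
                       uint n) {
    // Eight iterations run per cycle; the consecutive accesses to
    // a[i], b[i], and result[i] can be coalesced by the offline
    // compiler into wider external-memory transactions.
    #pragma unroll 8
    for (uint i = 0; i < n; i++) {
        result[i] = a[i] * b[i];   // placeholder computation
    }
}
```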
Since this is the forum thread I started, I'll consolidate my 'progress' and followup question here. I followed your advice in your reply here: https://forums.intel.com/s/question/0D70P000006i6SySAI
I now have a version of my code working: a non-autorun kernel that fetches the data from memory, an autorun kernel that operates on it, and another non-autorun kernel that writes the results back to memory, all interconnected with Intel channels and accessed with blocking reads and writes.
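The structure is roughly the following (simplified; names and the placeholder computation are not my real code):

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel ulong in_ch;
channel ulong out_ch;

// Fetches operands from global memory and streams them in.
__kernel void reader(__global const ulong *restrict src, uint n) {
    for (uint i = 0; i < n; i++)
        write_channel_intel(in_ch, src[i]);       // blocking write
}

// Autorun worker: no arguments, launched automatically, loops forever.
__attribute__((autorun))
__attribute__((max_global_work_dim(0)))
__kernel void worker(void) {
    while (1) {
        ulong x = read_channel_intel(in_ch);      // blocking read
        write_channel_intel(out_ch, x * x);       // placeholder computation
    }
}

// Drains results back to global memory.
__kernel void writer(__global ulong *restrict dst, uint n) {
    for (uint i = 0; i < n; i++)
        dst[i] = read_channel_intel(out_ch);      // blocking read
}
```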
However, when I try to add a num_compute_units attribute to my kernels (and add in the necessary code to make use of the compute unit IDs), I get this error:
Platform: Intel(R) FPGA SDK for OpenCL(TM)
Using 1 device(s)
EmulatorDevice : Emulated Device
Using emulator, adding '_em' to output filename
Binary filename = my_autorun_em
Using AOCX: my_autorun_em.aocx
Launching for device 0 (4 elements)
Hey, I'm comp_id 0
I'm operating with NUM_LIMBS = 4 and NUM_COMPUTE_UNITS = 1.
about to read first operand
terminate called after throwing an instance of 'std::bad_alloc'
I'm guessing my other kernels aren't relevant since they haven't even been started yet. I've attached the code snippet of my autorun kernel I'm guessing is relevant. (Pardon the .c extension, amazingly, we can't upload .cl files here!) Is it obvious what I'm doing wrong?
I am using legacy emulation, by the way. When I try fast emulation, I get a seg fault.
Well, things have gotten weirder.
The error I posted above is coming from legacy emulation on my local compute server using aoc 20.1.
We don't have fast emulation working yet on our local server.
On the dev cloud, where we can use fast emulation, I uniformly get a seg fault, as described above. In addition, I noticed today that one of my kernels reports a compute ID of -1 a few print statements before the seg fault.
However, also on the dev cloud, if I use legacy emulation with num_compute_units=1, it works. My program runs and declares PASS on valid answers.
If I run on the dev cloud with num_compute_units=2, legacy emulation mostly works, but it hangs without completing. Fast emulation seg faults as before.
So anyhow. I'm going to hunt for more clues.
I am not sure if this is an artifact of your snippet, but it seems you are reading from and writing to channels with the same ID as the compute unit (albeit with different channel names). One would typically read from the previous compute unit and write to the following one; you should not read from and write to the same channel index in the same compute unit.

Another potential pitfall is channel ordering: the compiler will freely re-order channel operations, and if there is a cycle of channels in your design, you can run into a deadlock unless you enforce the channel ordering using memory fences, as described in Intel's documentation.

Finally, v20.1 seems quite problematic based on reports from you and other people on the forum; you might want to switch to v19.4 on your local machine and see whether you run into the same problems. If you can create a minimal example that reproduces the issue and post it here, it will be easier to spot potential issues in the code.
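To illustrate the read-from-previous / write-to-next pattern (channel and kernel names are invented, and the increment stands in for real work):

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define NUM_CU 4
// One more channel than compute units: link[0] is fed by a producer
// kernel and link[NUM_CU] is drained by a consumer kernel (not shown).
channel ulong link[NUM_CU + 1];

__attribute__((autorun))
__attribute__((max_global_work_dim(0)))
__attribute__((num_compute_units(NUM_CU)))
__kernel void stage(void) {
    int cid = get_compute_id(0);
    while (1) {
        ulong x = read_channel_intel(link[cid]);     // from previous stage
        write_channel_intel(link[cid + 1], x + 1);   // to next stage
    }
}
```

Each compute unit reads from `link[cid]` and writes to `link[cid + 1]`, so no compute unit both reads from and writes to the same channel index.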
We would recommend trying the code in v19.4, as mentioned in the previous answer.
Please refer to the programming guide for some more clarification regarding recommended practices.
Please adhere to the statements in section 5.4.4, "Restrictions in the Implementation of Intel FPGA SDK for OpenCL Channels Extension", such as the one below:
"Performance of a kernel that has multiple accesses (reads or writes) to the same channel might be poor."
Also, please refer to the following thread for some rules about multiple channel writes. For implementing fences, please also refer to the programming guide.
Please let us know once you have rewritten the code following these guidelines.
Thanks and Regards
I still don't have this code working, but at least I'm stuck at a "higher level" now. I now have only one autorun kernel with num_compute_units > 1; the NDRange kernels on either end that feed it data and gather results are singletons that distribute the appropriate slices of the incoming (now num_compute_units times as big) work items to the various channels, or gather them at the other end. This code now works for a few work items, but not for many. Memory and channel fences haven't helped, and simulation and hardware builds fail without useful messages. If I exhaust my debug avenues, or if I find a solution, I'll post here again.
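For reference, the feeder side of that layout looks roughly like this (simplified; names, widths, and indexing are placeholders for my real code):

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define NUM_CU 8
channel ulong feed_ch[NUM_CU];

// Singleton NDRange feeder: each (NUM_CU-times-as-big) work item is
// sliced across the per-compute-unit channels.
__kernel void distribute(__global const ulong *restrict src) {
    size_t gid = get_global_id(0);
    // The fully unrolled loop makes the channel index a compile-time
    // constant, which the channels extension requires.
    #pragma unroll
    for (int cu = 0; cu < NUM_CU; cu++)
        write_channel_intel(feed_ch[cu], src[gid * NUM_CU + cu]);
}
```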
Thank you, that did the trick! I noticed that your read and write queues were different in your code. I had everything in one queue; I thought enqueuing was non-blocking, and I had my triggering events set up so that everything should have been able to launch and run. That must not have been the case, though: some kernel enqueue was probably waiting on another in a way I didn't expect. I switched to two queues to separate the kernels on either side of my autorun kernel, and now I no longer hang once my FIFOs fill up. I still don't really understand why that was necessary, but... all's well that ends well, I guess. Thanks again!
Indeed, the enqueue operations are non-blocking from the point of view of the host, but each (in-order) queue executes only one operation on the device at a time, which means the queued kernels actually run sequentially. To execute multiple kernels in parallel on one device, you need a separate queue for each such kernel.
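As a sketch, the host-side change amounts to something like this (variable names are placeholders; error checking omitted):

```c
// Two in-order queues on the same context and device, so the reader
// and writer kernels can both be resident on the device at once.
cl_int err;
cl_command_queue feed_q  = clCreateCommandQueue(ctx, dev, 0, &err);
cl_command_queue drain_q = clCreateCommandQueue(ctx, dev, 0, &err);

// Launch the kernel that feeds the channels from one queue...
clEnqueueTask(feed_q, reader_kernel, 0, NULL, NULL);
// ...and the kernel that drains results from the other; with a single
// queue the second launch would wait for the first to finish, and the
// pipeline would stall once the channel FIFOs filled up.
clEnqueueTask(drain_q, writer_kernel, 0, NULL, NULL);

clFinish(feed_q);
clFinish(drain_q);
```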