I'm trying to develop a oneAPI application that uses a long-running persistent kernel, which runs independently of the host process, together with short-lived kernels that coordinate messages between the host and the persistent kernel. I hope this will form the basis of a network packet processing kernel. Until host-to-kernel pipes are implemented, I am using kernel-to-kernel pipes and kernel lifetimes to signal events to and from the host process. I have a timer kernel that runs an exact number of iterations to generate events at a deterministic time interval for use on the FPGA. Currently, all it does is signal an event back to the host, and the host prints the time it took to run.

The whole setup works as expected, except that when I introduce a global memory access inside the persistent kernel, I get random hangs, and it looks like some pipe messages are getting dropped.

I am a software engineer and don't have any experience with HDL or FPGA-specific constructs, but I have done some research to try to understand what might be going on. My best guess is that this has something to do with stallable instructions on the FPGA when accessing memory. The compiler is generating Burst-Coalesced LSUs, which are described as aggregating memory operations to improve access efficiency. In this case it looks like the pipes are dropping messages, but I don't fully understand the behavior and am hoping someone with more experience can explain it to me, since I'm unable to debug the design at the FPGA simulation level. I'm not sure if this is a bug or if my code is violating some assumptions, but I don't see anything obviously wrong.

I put the code up in a Git repo to reproduce the issue: https://github.com/AustinKnutsonSprint/oneapi-timer-kernel-hang To remove the hang, comment out line 101, which is the problematic global memory access. I have been running my tests on devcloud with an Arria 10 FPGA.
I may have stumbled across a solution, but I'm not sure if it's correct. Wrapping accesses to global memory from different concurrent kernels with atomic_fence removes the hang. I updated my example repo here: https://github.com/AustinKnutsonSprint/oneapi-timer-kernel-hang/tree/fences
Hopefully someone from Intel can confirm what the underlying problem is and whether this is an appropriate fix.
Kernel hangs with pipes pretty much always happen due to one of the following two reasons:
- The amount of data written to a pipe on one side is not equal to the amount read from it on the other side. This would result in a hang during software emulation, too.
- A cycle of pipes exists in the kernel; if the compiler reorders the pipe read/write operations, such a cycle can result in a kernel hang. This will not show up in software emulation.
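As a rough illustration of the second case, here is a minimal sketch in plain C++ (not oneAPI or OpenCL code) that models two kernels connected by a cycle of two pipes. Every name in it (`Op`, `Pipe`, `runs_to_completion`, and so on) is hypothetical and made up for this example, and the pipe capacity of 1 is an assumption. It only shows why swapping a read and a write in one kernel can turn a working design into one where both sides wait on each other forever.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class OpKind { Read, Write };

// One blocking pipe operation issued by a kernel (hypothetical model type).
struct Op {
    OpKind kind;
    std::size_t pipe;  // index into the pipe table
};

// A pipe is modeled only by its capacity and current fill level.
struct Pipe {
    std::size_t capacity;
    std::size_t count;
};

// Steps every kernel for as long as any of them can make progress.
// Returns true if all kernels finish, false if they end up stuck.
bool runs_to_completion(std::vector<Pipe> pipes,
                        const std::vector<std::vector<Op>>& kernels) {
    std::vector<std::size_t> pc(kernels.size(), 0);  // next op per kernel
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (std::size_t k = 0; k < kernels.size(); ++k) {
            if (pc[k] == kernels[k].size()) continue;  // kernel finished
            const Op& op = kernels[k][pc[k]];
            assert(op.pipe < pipes.size());
            Pipe& p = pipes[op.pipe];
            if (op.kind == OpKind::Write && p.count < p.capacity) {
                ++p.count;  // write succeeds, nothing is overwritten
                ++pc[k];
                progressed = true;
            } else if (op.kind == OpKind::Read && p.count > 0) {
                --p.count;  // read consumes one element
                ++pc[k];
                progressed = true;
            }
            // Otherwise the kernel stays blocked on this op.
        }
    }
    for (std::size_t k = 0; k < kernels.size(); ++k) {
        if (pc[k] != kernels[k].size()) return false;  // stuck forever
    }
    return true;
}

// Pipe 0 carries A -> B, pipe 1 carries B -> A (assumed capacity 1 each).
bool original_order_completes() {
    const std::vector<Pipe> pipes(2, Pipe{1, 0});
    const std::vector<Op> kernel_a = {{OpKind::Write, 0}, {OpKind::Read, 1}};
    const std::vector<Op> kernel_b = {{OpKind::Read, 0}, {OpKind::Write, 1}};
    return runs_to_completion(pipes, {kernel_a, kernel_b});
}

// Same design, but kernel A's read has been moved ahead of its write.
bool reordered_completes() {
    const std::vector<Pipe> pipes(2, Pipe{1, 0});
    const std::vector<Op> kernel_a = {{OpKind::Read, 1}, {OpKind::Write, 0}};
    const std::vector<Op> kernel_b = {{OpKind::Read, 0}, {OpKind::Write, 1}};
    return runs_to_completion(pipes, {kernel_a, kernel_b});
}
```

In this model `original_order_completes()` returns true, while `reordered_completes()` returns false because both kernels block on a read from an empty pipe. Real hardware scheduling is far more involved, so treat this as a way to reason about the ordering, not as a description of what the compiler actually did in your design.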
I am not familiar with oneAPI, but the backend compiler is supposedly the same as the OpenCL compiler. I would assume that, just like in OpenCL, there is some barrier pragma or similar mechanism that forces the ordering of pipe operations and prevents the compiler from reordering them. The first debugging step in your case would probably be to add such a barrier after every pipe read and write operation in your code, to see whether the hang is the result of operation reordering by the compiler.
P.S. Pipes will never "drop" data; that is why they can cause hangs.
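To make the P.S. concrete, here is a small sketch in plain C++ of a bounded FIFO that behaves the way a blocking pipe is described above: a write to a full pipe is refused (the writer would stall) rather than overwriting anything, and a read from an empty pipe cannot complete. `BoundedPipe` and the two helper functions are hypothetical names for this illustration only and are not part of any oneAPI or OpenCL API.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a hardware pipe with a fixed capacity.
class BoundedPipe {
public:
    explicit BoundedPipe(std::size_t capacity) : capacity_(capacity) {}

    // Returns false when the pipe is full; existing data is never overwritten.
    bool try_write(int value) {
        if (fifo_.size() == capacity_) return false;
        fifo_.push_back(value);
        return true;
    }

    // Returns false when the pipe is empty; there is nothing to hand out.
    bool try_read(int& out) {
        if (fifo_.empty()) return false;
        out = fifo_.front();
        fifo_.pop_front();
        return true;
    }

private:
    std::size_t capacity_;
    std::deque<int> fifo_;
};

// Counts how many of `attempts` writes would stall when nothing drains the pipe.
std::size_t count_stalled_writes(std::size_t capacity, int attempts) {
    BoundedPipe pipe(capacity);
    std::size_t stalled = 0;
    for (int i = 0; i < attempts; ++i) {
        if (!pipe.try_write(i)) ++stalled;
    }
    return stalled;
}

// The writer sends `written` values while the reader expects `expected`.
// Returns the number of reads that can never complete.
std::size_t count_starved_reads(int written, int expected) {
    BoundedPipe pipe(static_cast<std::size_t>(written));
    for (int i = 0; i < written; ++i) {
        const bool accepted = pipe.try_write(i);
        assert(accepted);  // capacity was sized to hold every write
        (void)accepted;
    }
    std::size_t starved = 0;
    int value = 0;
    for (int i = 0; i < expected; ++i) {
        if (!pipe.try_read(value)) ++starved;
    }
    return starved;
}
```

With a capacity of 2 and five write attempts, three writes stall and none of the stored data is lost. If the writer sends three values and the reader expects five, two reads never complete, which on real hardware shows up as a hang and not as dropped messages.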