How can we pipeline work between a CPU thread and an FPGA kernel?

Altera_Forum · ‎03-05-2018

I want to do pipelined processing between the host and FPGA.

Say kernel0 is implemented on the host and kernel1 is implemented on the FPGA.

Kernel0 generates, say, 1000 data items, and kernel1 processes them one by one.

The problem is that I want kernel1 to start processing before kernel0 has generated all 1000 items.

However, there seems to be no mechanism, such as a channel, for communication between the host thread and the FPGA kernel...

Cheng Liu

Altera_Forum · ‎03-05-2018

Do you want to live in the XXII century?

This could be simple with "smart" memory-read commands on both sides: a read completes only after the needed address has been written, or at the moment it is written.

The CPU and FPGA could have "spy" blocks on the global bus that observe all transactions on the common "memory".

If the written address is close to and greater than the awaited one, read from memory; if it is equal, take the data directly from the bus on the fly.

Your FPGA could give the CPU a region of addresses in which all written data is put into a FIFO.

None of this is OpenCL. However, Intel may add this capability as a channel input.

Altera_Forum · ‎03-06-2018

Yes, you are right. I see the problem.

Thank you

Cheng Liu

Altera_Forum · ‎03-06-2018

What you are describing is not an FPGA-specific problem; the same problem applies to all accelerators. Many people have worked on streaming/pipelining computation between a CPU and a GPU, and there should also be multiple examples of doing this on FPGAs (probably not with OpenCL, though). Usually this is done by double- or multi-buffering: the input is partitioned into multiple chunks, and the CPU processes one chunk and writes it to buffer A on the accelerator. While the accelerator is processing buffer A, the CPU processes the second chunk and writes it to buffer B on the accelerator. When buffer A is done, the accelerator switches to buffer B and the CPU switches from buffer B back to buffer A, and so on. You can use OpenCL events, or custom locks/flags, to synchronize the accelerator and the CPU in this case.
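To make the buffer hand-off concrete, here is a minimal sketch of the double-buffering scheme in plain Python. It is not OpenCL code: a second thread stands in for the accelerator, and semaphores stand in for the events or flags. The names `run_double_buffered`, `produce`, and `consume` are made up for this illustration. In a real OpenCL host program the same hand-off would presumably be expressed with the events returned by calls such as `clEnqueueWriteBuffer` and `clEnqueueNDRangeKernel`.

```python
import threading


def run_double_buffered(chunks, produce, consume):
    """Overlap a CPU stage (produce) with an accelerator stand-in (consume).

    Two buffers alternate: while the consumer works on one,
    the producer fills the other.
    """
    buffers = [None, None]
    # free[s] is available when buffer s may be overwritten by the producer.
    free = [threading.Semaphore(1), threading.Semaphore(1)]
    # full[s] is available when buffer s holds data the consumer has not read.
    full = [threading.Semaphore(0), threading.Semaphore(0)]
    results = []

    def accelerator():
        # Stand-in for the FPGA kernel: handles chunk i from buffer i % 2.
        for i in range(len(chunks)):
            slot = i % 2
            full[slot].acquire()          # wait until the producer filled it
            results.append(consume(buffers[slot]))
            free[slot].release()          # hand the buffer back

    worker = threading.Thread(target=accelerator)
    worker.start()

    # Host side: fill buffer A, then B, then A again, and so on.
    for i, chunk in enumerate(chunks):
        slot = i % 2
        free[slot].acquire()              # wait until the consumer is done
        buffers[slot] = produce(chunk)
        full[slot].release()              # signal that the data is ready

    worker.join()
    return results
```

For example, `run_double_buffered(range_list, produce, consume)` returns the consumed results in chunk order, and the consumer starts on the first chunk while the producer is still working on the second. Using more than two buffers is the same idea with `i % n` slots.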

Host channels, which allow you to stream data directly from the host to the FPGA, have also recently been added to Altera's compiler, but they are only available on Altera's reference board.

Altera_Forum · ‎03-06-2018

Thanks for the suggestions.

Yes, I see that host channels are described in the documentation; if I remember correctly, they are limited to one input stream and one output stream.

There are also io streams, which seem to be used for simulation. I may explore them later.

For now, I will try the double buffering scheme and see if this helps to improve my design.

Cheng Liu

Altera_Forum · ‎03-07-2018

I am very interested in using the direct host-to-kernel channel, but I can't find any documentation on how to implement it. Does anyone have any pointers?

Altera_Forum · ‎03-08-2018

Check Section 5.5.6 of the Intel FPGA SDK for OpenCL Programming Guide.

Altera_Forum · ‎03-08-2018

Many thanks, I completely missed this section.

Any idea how the latency/throughput compares with typical global memory accesses?

Altera_Forum · ‎03-08-2018

If you have access to ACM, check this paper from Altera on the implementation of host pipes:

https://dl.acm.org/citation.cfm?id=3078182