Solved: Which FPGA board should I choose if I want to manipulate SSD with FPGA and OpenCL? How many boards do I need to manipulate at least 10TB of SSD?

User1588414328403410 · ‎05-02-2020

Our team is working on simulating a piece of rat brain (approximately 5,000,000 neurons and 500,000,000,000 synapses), and we are planning to use FPGA and OpenCL to accelerate the simulation. However, storing the synapse weights is a big problem because it requires an extremely large amount of memory with fast reads and writes. Personally, I think using SSD would probably be a better idea than using DDR. So I am wondering whether it is possible to manipulate SSD with FPGA and OpenCL (allocating global memory in SSD), which FPGA board we should choose, and how many boards we need to purchase.

HRZ · ‎05-02-2020

Neither Intel nor Xilinx has any solution/board that would allow you to allocate OpenCL global memory in an SSD. There are a few Xilinx-based boards that support NVMe drives, but with those boards you will just get a bare-bones IP that only implements the NVMe protocol and is meant to be used with Hardware Description Languages (Verilog, VHDL, etc.).

A better solution for you will likely be out-of-core processing. Intel provides a host channel extension for OpenCL that allows data to be streamed directly from the host to the FPGA, bypassing the FPGA DDR memory; however, I do not know of any boards other than Intel's reference boards (a10gx and s10gx) that support this extension, and these reference boards are typically not very useful for production-level development due to their small on-board memory size. Moreover, this extension is only applicable if your computation can be fully described in a streaming manner. A more general solution would be to split your data on the SSD into multiple chunks that are smaller than the FPGA on-board memory size and implement double-buffering on the DDR memory: you move one chunk of data to the DDR memory, and while that chunk is being processed by the FPGA, you move another chunk to the DDR memory from the host. As soon as the first chunk is computed, you swap the buffers, and you continue doing this until the whole data set on the SSD is processed. This is not very difficult to implement using OpenCL. There is a very large body of work on out-of-core processing using GPUs that you can refer to for ideas.
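To illustrate the double-buffering idea, here is a minimal host-side sketch in plain Python rather than OpenCL. It only shows the control flow: the names `read_chunk`, `process_chunk`, and `process_out_of_core` are hypothetical, the file read stands in for the host-to-DDR transfer, and `process_chunk` is a placeholder for the FPGA kernel.

```python
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, offset, size):
    """Read one chunk from the file on the SSD.

    Stands in for the host -> FPGA DDR transfer (assumption).
    """
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def process_chunk(chunk):
    """Placeholder for the FPGA kernel; here it just sums the bytes."""
    return sum(chunk)


def process_out_of_core(path, total_size, chunk_size):
    """Process a file chunk by chunk, loading chunk i+1 while chunk i is processed."""
    results = []
    # A single background worker plays the role of the second buffer.
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = None  # future for the chunk that is loaded (or loading)
        for offset in range(0, total_size, chunk_size):
            upcoming = loader.submit(read_chunk, path, offset, chunk_size)
            if pending is not None:
                # Process the previous chunk while the next one is loading.
                results.append(process_chunk(pending.result()))
            pending = upcoming  # swap buffers
        if pending is not None:
            results.append(process_chunk(pending.result()))
    return results
```

At most two chunks are held at a time, so in a real implementation each chunk would need to fit in roughly half of the memory set aside for the buffers.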

User1588414328403410 · ‎05-03-2020

Thank you for replying! After reading your reply, I did some calculations, and it seems that even SSD is too slow for our simulation (almost 1 hour for 1 ms of simulation), so I have now decided to use DDR4 to store the synapse data, which hopefully could reduce the time to within 5 minutes. However, my question has now become whether it is possible to assign global memory in an external DDR4, and which Intel FPGA boards support the largest external DDR4.

HRZ · ‎05-03-2020

By default, OpenCL global memory is always stored in the FPGA external memory; however, you will still need to move the data from your host to the FPGA external memory through PCI-E, and depending on the characteristics of your application, this data transfer itself could become the performance bottleneck.
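A rough way to check whether the transfer or the computation dominates is to compare the two per chunk. The sketch below is only a back-of-the-envelope model: the function names are hypothetical, the bandwidth is whatever you measure on your own system, and it assumes transfers overlap perfectly with compute.

```python
def transfer_seconds(num_bytes, bandwidth_gb_per_s):
    """Time to move num_bytes at a sustained bandwidth given in GB/s (10**9 bytes/s)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


def pipelined_seconds(num_chunks, transfer_s, compute_s):
    """Estimated total time when each chunk's transfer overlaps the previous chunk's compute.

    The first transfer and the last compute cannot be hidden; every step in
    between costs whichever of the two is slower.
    """
    if num_chunks == 0:
        return 0.0
    return transfer_s + (num_chunks - 1) * max(transfer_s, compute_s) + compute_s


def is_transfer_bound(transfer_s, compute_s):
    """True when moving a chunk takes longer than processing it."""
    return transfer_s > compute_s
```

If `is_transfer_bound` holds for your measured numbers, a faster FPGA memory will not help much, because the time is spent getting the data onto the board.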

Either way, if your application is bound by memory bandwidth, then I would suggest boards with Stratix 10 MX FPGAs, which come with HBM memory. Bittware already has one such board with OpenCL support (Bittware 520N-MX), and Intel themselves will probably also release one such board sooner or later. Note that you only get 16 GB of memory in this case. If 16 GB is not enough for you, then both Intel (Intel D5005) and Bittware (Bittware 520C) also have Stratix 10 GX boards with OpenCL support and 4 banks of 8 GB DDR4 memory, totaling 32 GB, but of course the total memory bandwidth of these boards is much lower than that of the HBM boards. I am sure you can find boards with even more DDR memory, but they will likely not support OpenCL, or they will not allow all of the available memory to be used with OpenCL. If necessary, you can also buy multiple such boards and use them together in the same compute node.
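To get a feeling for how many boards it would take to hold all synapse weights in on-board memory at once, a quick estimate can be sketched as follows. The 4 bytes per weight is purely an assumption (a 32-bit value per synapse), the function name is hypothetical, and the estimate ignores any memory needed for neuron state or other buffers.

```python
import math

GIB = 2**30  # bytes per GiB; board memory sizes are treated as GiB (assumption)


def boards_needed(num_synapses, bytes_per_weight, board_mem_gib):
    """Minimum number of boards whose combined memory holds every weight."""
    total_bytes = num_synapses * bytes_per_weight
    return math.ceil(total_bytes / (board_mem_gib * GIB))


# 500,000,000,000 synapses from the question, 4 bytes per weight (assumption).
boards_32gb = boards_needed(500_000_000_000, 4, 32)  # 32 GB DDR4 boards
boards_16gb = boards_needed(500_000_000_000, 4, 16)  # 16 GB HBM boards
```

Under these assumptions the weights occupy about 2 TB, so keeping everything resident takes dozens of boards, which is why chunked out-of-core processing is worth considering even with the larger boards.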

User1588414328403410 · ‎05-06-2020

After discussing with the team leader, it seems we are going to use SSD to build a prototype, and then use Stratix 10 MX or Stratix 10 GX on a supercomputer. I am not very sure about Bittware because I know very little about it, although it seems to satisfy my requirements according to your description. Anyway, thank you for your help!

MEIYAN_L_Intel · ‎05-11-2020

Hi,

For more information on the devices, please refer to the links below:

Stratix 10 MX Development Kits

https://www.intel.com/content/www/us/en/programmable/products/boards_and_kits/dev-kits/altera/kit-s10-mx.html

Stratix 10 GX Development Kits

https://www.intel.com/content/www/us/en/programmable/products/boards_and_kits/dev-kits/altera/kit-s10-fpga.html

Thanks