I would like to implement a CNN using OpenCL and run it on an FPGA. The CNN implementation will contain 4 kernel functions:
layer1_kernel, which is supposed to allocate 1 MB of local memory
layer2_kernel, which is supposed to allocate 1 MB of local memory
layer3_kernel, which is supposed to allocate 1 MB of local memory
layer4_kernel, which is supposed to allocate 1 MB of local memory
The host program will call clEnqueueNDRangeKernel() for each kernel in sequence; that is, the previous kernel must finish before the next kernel is enqueued. What is the total RAM block resource that will be used: 4 MB or 1 MB? Thanks, Matt
If you put all the kernels in the same .cl file, then all of them will be implemented as one FPGA image/bitstream and the memory usage will be 4 MB, because the local memory of every kernel exists on the chip at the same time. However, if you put them into separate .cl files, compile them separately, and load them one by one on the FPGA, the memory usage will be 1 MB, but the FPGA will need to be reconfigured each time you call a new kernel.
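To make the 4 MB versus 1 MB arithmetic concrete, here is a minimal sketch in C. The function names are hypothetical and this only models the reasoning above (sum of all local buffers for one image, largest single buffer for separate images); it does not query the compiler's actual resource report, which may round allocations up to whole RAM blocks.

```c
#include <stddef.h>

/* Hypothetical helper: local memory needed when all kernels are compiled
   into ONE bitstream. Every kernel's local buffer exists on the chip at
   the same time, so the requirements add up. */
size_t local_mem_single_image(const size_t *per_kernel_bytes, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += per_kernel_bytes[i];
    }
    return total;
}

/* Hypothetical helper: peak local memory when each kernel is compiled
   into its OWN bitstream and only one bitstream is loaded at a time.
   Only the largest single requirement matters. */
size_t local_mem_separate_images(const size_t *per_kernel_bytes, size_t n)
{
    size_t peak = 0;
    for (size_t i = 0; i < n; ++i) {
        if (per_kernel_bytes[i] > peak) {
            peak = per_kernel_bytes[i];
        }
    }
    return peak;
}
```

With four kernels of 1 MB each, the first function gives 4 MB and the second gives 1 MB, matching the two cases described above.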
Thanks for your reply. "the FPGA will need to be reconfigured each time you call a new kernel." ==> Do you mean that I need to call the following functions again? getBoardBinaryFile(), clCreateCommandQueue(), clCreateKernel(), clEnqueueNDRangeKernel() Thanks, Matt
From my experiments, it is the clCreateProgramWithBinary() function that reconfigures the FPGA. You will have to call that function and every function that comes after it (clBuildProgram(), clCreateKernel(), clEnqueueNDRangeKernel(), etc.) every time you want to switch to a kernel that resides in another FPGA image. You do not need to create new queues, though; you can reuse the same queue. Please note that FPGA reconfiguration can take up to a few seconds. Unless your kernels have a long run time and you rarely switch between them, the reconfiguration overhead could actually be higher than the kernel run time, in which case you might as well put all your kernels in the same FPGA image so that you do not need to reconfigure the FPGA between kernel runs.
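As a rough way to judge that trade-off, here is a small cost model in C. It is only a sketch: the function names are hypothetical, and the kernel and reconfiguration times are numbers you would have to measure on your own board.

```c
/* Hypothetical cost model: total wall time in seconds for a run that
   launches kernels `launches` times and changes the FPGA image
   `switches` times. Both per-event times are assumed to be constant. */
double total_time_s(double kernel_time_s, unsigned launches,
                    double reconfig_time_s, unsigned switches)
{
    return kernel_time_s * launches + reconfig_time_s * switches;
}

/* Fraction of the total time spent reconfiguring the FPGA (0.0 to 1.0).
   Returns 0.0 when the total time is zero to avoid dividing by zero. */
double reconfig_fraction(double kernel_time_s, unsigned launches,
                         double reconfig_time_s, unsigned switches)
{
    double total = total_time_s(kernel_time_s, launches,
                                reconfig_time_s, switches);
    if (total <= 0.0) {
        return 0.0;
    }
    return (reconfig_time_s * switches) / total;
}
```

For example, assuming four launches of 10 ms each and three image switches of 2 s each (made-up figures), the run takes about 6.04 s and over 99% of it is reconfiguration. With a single image there are no switches and the same four launches take about 0.04 s.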
From my experience, you can build all the programs and create all the kernels up front as an initialization stage to get a handle to every kernel, calling clBuildProgram() and clCreateKernel() once for each binary, and then just enqueue the kernels as you need them. When you do this, the first binary that gets loaded will be the one programmed onto the board during the clCreateProgramWithBinary() call. After that, you can enqueue kernels from separate binaries as long as you have the kernel handles to them. If you enqueue kernel 1 from binary 1, it will simply run, since that binary is already loaded. Then, if you enqueue kernel 2 from binary 2, the runtime will automatically reprogram the board (provided the binary was built for the same FPGA) in order to run kernel 2. It only reprograms the board as necessary. Hopefully that saves some overhead by reducing the number of calls on the host. As noted before, switching from binary to binary still introduces a significant amount of overhead.
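To see how often the board would be reprogrammed under the behavior described above, here is a toy model in C. It is not OpenCL host code and makes no runtime calls; the function name and the integer binary indices are hypothetical, and it simply assumes the runtime reprograms only when the next kernel lives in a different binary than the one currently loaded.

```c
#include <stddef.h>

/* Toy model: count board reprograms for a sequence of kernel enqueues.
   binary_of_launch[i] is the index of the binary that contains the i-th
   enqueued kernel. `preloaded` is the binary programmed during
   initialization. A reprogram is counted only when the required binary
   differs from the one currently on the board. */
size_t count_reprograms(const int *binary_of_launch, size_t n, int preloaded)
{
    size_t count = 0;
    int loaded = preloaded;
    for (size_t i = 0; i < n; ++i) {
        if (binary_of_launch[i] != loaded) {
            loaded = binary_of_launch[i]; /* board now holds this binary */
            ++count;
        }
    }
    return count;
}
```

With four layers in four binaries and binary 0 preloaded, one pass through the layers (0, 1, 2, 3) costs 3 reprograms. Running two inputs one after the other (0, 1, 2, 3, 0, 1, 2, 3) costs 7, while running both inputs through each layer before moving on (0, 0, 1, 1, 2, 2, 3, 3) costs 3, so the order of enqueues matters if you keep the kernels in separate binaries.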