Is there any way to explicitly use BRAMs for the synthesized OpenCL kernel? I'm aware that __local memory will be implemented with BRAMs, but I was also interested in squeezing extra performance out of the generated design by using BRAMs instead of DRAM for some buffers. Given the work-group-only visibility of __local memory, I cannot make use of BRAMs for globally accessible data through __local buffers.
At the moment the answer is no, but we will take this into consideration, since I agree that being able to use block memory that is globally accessible would be a beneficial thing.
--- Quote Start --- At the moment the answer is no but we will take this into consideration since I agree being able to use block memory that is globally accessed is a beneficial thing. --- Quote End ---

I understand that complying with the OpenCL conformance tests came first on the list. However, one should also note that while OpenCL is a cross-platform programming model, it was not originally developed with FPGAs in mind; otherwise its specification would carry constructs that are 'hardware-aware'. AFAIK, BRAMs are used only when __local buffers are instantiated, right?

IMHO, I can think of two possible ways of exposing BRAMs. The first is to extend the concept of __local buffers within the AOCL scope only: for instance, one could define a global variable residing in __local memory, so that an initialization kernel could explicitly move data from DRAM to BRAM. A naive approach, I assume. Alternatively, a cl_mem object residing in DRAM could be mapped onto a cl_mem object residing in BRAM using the existing clEnqueueMapBuffer function, although that function relates the device with the host memory space, not device memory with device memory. Instead of a DMA transfer from device to host, some mechanism would fetch data from DRAM and place it accordingly in BRAM.

Either way, I think it might take quite an effort to fit explicit BRAM usage within the OpenCL 'boundaries'. How can we state how many blocks we need and want our kernels to access within OpenCL? Neither of the above solutions addresses that case.
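For reference, the existing clEnqueueMapBuffer maps a device buffer into the host address space only; a minimal host-side sketch of its standard use follows (error handling mostly omitted; `queue`, `buf`, and `size_bytes` are assumed to be a valid command queue, cl_mem buffer, and buffer size from the surrounding host code):

```c
/* Standard OpenCL 1.x host <-> device mapping with clEnqueueMapBuffer.
 * This is the host/device relation the post refers to; a device-to-device
 * DRAM -> BRAM mapping would need a new mechanism and is not covered here. */
cl_int err;
float *ptr = (float *)clEnqueueMapBuffer(queue, buf,
                                         CL_TRUE,               /* blocking */
                                         CL_MAP_READ | CL_MAP_WRITE,
                                         0, size_bytes,
                                         0, NULL, NULL, &err);
/* ... host reads/writes through ptr ... */
clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
```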
The closest you have today is __constant memory, where on-chip memory is used as a cache for read-only data; if you need write access as well, this is not appropriate. For read/writeable fast memory today you need to pair up __local and __global memory and perform scratch-pad copies explicitly in your code, which is the typical OpenCL way to use memory more efficiently. This has the limitation that only the work-group has visibility into the __local memory. Stay tuned for future releases; as the compiler evolves, the feature you are looking for may appear.
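The scratch-pad pattern described above looks roughly like this (a sketch only; the work-group size of 256 and the toy computation are assumptions for illustration):

```c
/* OpenCL kernel sketch: stage a tile of __global data into __local (BRAM),
 * operate on it, then write results back. Work-group size assumed to be 256. */
__kernel void scratchpad(__global const float *restrict in,
                         __global float *restrict out)
{
    __local float tile[256];          /* implemented in on-chip block RAM  */
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];              /* explicit global -> local copy     */
    barrier(CLK_LOCAL_MEM_FENCE);     /* make tile visible to the group    */

    /* Fast local accesses; note only this work-group can see 'tile'. */
    float acc = tile[lid] + tile[(lid + 1) % 256];
    out[gid] = acc;
}
```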
How is __constant implemented? I remember a design from earlier this year that read from a ROM-stored LUT; while that was good enough for 2005-era FPGAs, for current ones it's not. In the end we had to synthesize the ROMs to registers due to the negative impact on performance (we got a roughly 30% faster critical path). Is the __constant memory space synthesized the same way?

Regarding __local memory, the visibility scope of buffers declared with __local renders normal BRAM usage through __local buffers tremendously limited. I'm staying tuned 🙂
If you declare it at file scope then the contents will basically be stored in a ROM. File-scope __constant arrays need to be preinitialized so that the contents can be baked into the programming file.

If you declare it as a kernel argument then a read-only cache is implemented, so that contents read from SDRAM get cached by the __constant cache. If you are familiar with the instruction cache of a CPU, it's a lot like that. This __constant cache can be warmed up, because the contents only become evicted back out to SDRAM if the host overwrites those values between kernel invocations.

Here is an example of how this can be handy. Let's say you have a FIR filter and you will keep the coefficients constant, but you don't want to hardcode them at file scope. If you call the kernel multiple times across different NDRanges, you can move the coefficients down to the FPGA before invoking the kernel the first time, then avoid copying the coefficients for subsequent calls of the same kernel. Between these kernel launches the cache should remain warm and the coefficients ready to go for each kernel call after the first one. Of course, this assumes the __constant cache is large enough to hold the data you want cached; by default it is 16 kB, but you can override this with a compiler flag that passes in a new size (remember the value is in bytes, and I recommend powers of 2 for the cache size).
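Both flavors in code (a sketch; the coefficient values and the TAPS count are placeholders, and bounds handling at the end of the input is omitted):

```c
/* Flavor 1: file-scope __constant -- must be preinitialized so the values
 * can be baked into the programming file as a ROM. */
__constant float window[4] = {0.25f, 0.5f, 0.5f, 0.25f};

/* Flavor 2: __constant kernel argument -- reads from SDRAM go through the
 * on-chip __constant cache, which stays warm across kernel launches as long
 * as the host does not overwrite the buffer in between. */
#define TAPS 4
__kernel void fir(__global const float *restrict in,
                  __global float *restrict out,
                  __constant float *restrict coeffs)  /* cached coefficients */
{
    int gid = get_global_id(0);
    float acc = 0.0f;
    for (int t = 0; t < TAPS; t++)
        acc += coeffs[t] * in[gid + t];
    out[gid] = acc;
}
```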