Local memory in one work group

Altera_Forum · ‎12-08-2013

Say we define two __local arrays, A[1024] and B[1024], in the kernel function, and the data flow is DDR --> A --> B --> DDR. My question is: do they combine their read/write ports into one common local memory system, or does each array keep its own read/write ports? I am sure that a common local memory system would greatly degrade performance, and even more so when many __local variables are defined.
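For concreteness, the data flow described above might look like the sketch below. It assumes a single work-item kernel (so no barriers are needed), and the kernel name, the element type, and the "+ 1.0f" step between A and B are placeholders rather than anything from the actual design. The #ifndef block is only a host-side shim so the same text also compiles as plain C for a quick functional check.

```c
/* Host shims (assumption): let the kernel compile as plain C.
   An OpenCL compiler defines __OPENCL_VERSION__ and skips them. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#define __local
#endif

#define N 1024

/* Hypothetical single work-item kernel: DDR -> A -> B -> DDR. */
__kernel void copy_through_local(__global const float *src,
                                 __global float *dst)
{
    __local float A[N];
    __local float B[N];

    /* DDR -> A */
    for (int i = 0; i < N; i++)
        A[i] = src[i];

    /* A -> B (placeholder computation) */
    for (int i = 0; i < N; i++)
        B[i] = A[i] + 1.0f;

    /* B -> DDR */
    for (int i = 0; i < N; i++)
        dst[i] = B[i];
}
```

Here A and B are only ever accessed by name, never through a shared pointer, so nothing in the source itself forces them into one memory. Whether the compiler actually keeps them separate is exactly the question.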

According to the kernel log: " .. kernel number of local memory banks : 1 1 1 1 1 1 1 ", does it mean 7 banks exist in my design?

The memory utilization is much higher than the __local memory defined in the OpenCL code. Do the delay operations (with a wider datapath, e.g. 32) require a lot of Block RAMs when the OpenCL code has relatively complex logic? BTW, I have skimmed the *.v generated by the AOC, and a lot of FIFOs (with *.IMPL = "ram") are generated for delay operations. Does this mean we should avoid complex logic by splitting the design into multiple kernels?

Altera_Forum · ‎12-13-2013

There are a lot of answers to that question because it depends on the kernel. In general I have these recommendations:

1) Use the __restrict keyword any time it's safe to do so (i.e. when you don't need to worry about pointers aliasing the same data)

2) Make sure you don't use the same pointer to dereference multiple memory spaces (i.e. in this case don't have a pointer that has to access both A and B, use multiple pointers instead)

3) Use __private temporary storage when possible
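As a rough illustration of the three recommendations, here is a minimal sketch. It assumes a single work-item kernel. The kernel name and the arithmetic are made up for the example, and restrict is written in its C99 spelling. The #ifndef block is only a host-side shim so the text also compiles as plain C.

```c
/* Host shims (assumption): let the kernel compile as plain C.
   An OpenCL compiler defines __OPENCL_VERSION__ and skips them. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#define __local
#define __private
#endif

#define N 1024

/* 1) restrict: src and dst are promised never to alias each other. */
__kernel void scale_through_local(__global const float * restrict src,
                                  __global float * restrict dst)
{
    __local float A[N];
    __local float B[N];

    /* 2) A and B are always accessed by name. The pattern to avoid is a
          single pointer that may refer to either array, for example:
              __local float *p = use_a ? A : B;   // avoid
          because every access through p then has to reach both memories. */
    for (int i = 0; i < N; i++)
        A[i] = src[i];

    for (int i = 0; i < N; i++) {
        /* 3) Intermediate values live in a private temporary
              instead of going through another __local buffer. */
        __private float t = A[i];
        t = t * 2.0f + 1.0f;
        B[i] = t;
    }

    for (int i = 0; i < N; i++)
        dst[i] = B[i];
}
```

How much any of this helps depends on the kernel and the compiler version, so the compile log or report is the place to check what the local memory system ended up looking like.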

I can't really comment on whether it makes sense for the algorithm to be split into multiple kernels. In general I try to keep everything combined into a single kernel, since a single pipeline is normally more efficient than a bunch of independent pipelines in terms of both performance and area.

Altera_Forum · ‎12-18-2013

Nice recommendations. However, I do not think that keeping everything combined into a single kernel is a good idea, since the resource requirement increases exponentially and the clock frequency drops quickly when the control logic is complex.

Altera_Forum · ‎12-18-2013

It really depends on the kernel(s). Sometimes what you said is indeed true; other times it's not. Again, these are just general rules of thumb based on what I've seen across many designs.