I have a fundamental question about optimizing memory access patterns in OpenCL code. Best-practice guides say that memory accesses should be contiguous for best performance. On a GPU, we make sure that work-items in the same work-group or warp access contiguous indexes. That means data access should be sequential "spatially", since the parallelism exists only in the spatial dimension.
On an FPGA, there seem to be two opportunities. We can have spatially contiguous data access in NDRange kernels, and we may also have temporally contiguous data access in both NDRange and Single Work-item kernels. For example, if we have a loop and we unroll it, it is probably good to make sure that the data access pattern follows the iteration counter with no gaps between consecutive iterations. Now my question is: how is this handled when the kernel is compiled? Which dimension has the greater opportunity for being parallelized?
On FPGAs, there is no fixed warp and there is no thread-level parallelism either. Unless you use SIMD or loop unrolling, it doesn't make much of a difference whether your memory accesses are contiguous or not: only one access per access port is performed per cycle, so the memory bandwidth will be underutilized either way.
By default, the compiler creates one access port to external memory for every access that exists in your kernel. The size of this port is equal to the size of the datatype used for the access, rounded up to the nearest power of two. Now, when you use SIMD in NDRange kernels or loop unrolling in Single Work-item kernels, apart from widening the pipeline, the compiler will also coalesce all the accesses that are consecutive into one larger access port; non-consecutive accesses will instead result in as many ports per access as the SIMD/unroll factor, each the size of the datatype. This is done at compile time. If you check the "System viewer" section of the area report, you can see that the ports to memory get wider when SIMD or unrolling is used over consecutive accesses.
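To make the difference concrete, here is a rough sketch of two Single Work-item kernels that differ only in their read pattern. The kernel names, the `int` datatype and the unroll factor of 4 are arbitrary choices for illustration, and the small `#ifndef __OPENCL_VERSION__` shim at the top is just an assumption I added so the same source can also be built and checked as plain C on the host. Going by the description above, I would expect the first kernel's four unrolled reads to be merged into a single 128-bit port, and the second kernel's reads to stay as four separate 32-bit ports, but the area report is the place to confirm what your compiler version actually does.

```c
/* Host-only shim (an assumption of this sketch, not part of OpenCL): lets the
   same source build as plain C so the kernels can be called directly.
   An OpenCL compiler defines __OPENCL_VERSION__ and skips this part. */
#ifndef __OPENCL_VERSION__
#include <assert.h>
#define __kernel
#define __global
#endif

/* Consecutive reads: iteration i reads in[i], so the four unrolled
   iterations touch in[i] .. in[i+3], which are adjacent in memory. */
__kernel void copy_consecutive(__global const int *restrict in,
                               __global int *restrict out,
                               int n)
{
    #pragma unroll 4
    for (int i = 0; i < n; i++)
        out[i] = in[i];
}

/* Strided reads: iteration i reads in[i * stride]. For stride > 1 the four
   unrolled reads are not adjacent, so they cannot be merged into one wide
   access. The writes to out[i] are still consecutive. */
__kernel void gather_strided(__global const int *restrict in,
                             __global int *restrict out,
                             int n, int stride)
{
    #pragma unroll 4
    for (int i = 0; i < n; i++)
        out[i] = in[i * stride];
}
```

Both kernels compute correct results either way; the only thing that changes is how many ports to external memory are generated and how wide they are.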
Needless to say, the best memory performance is achieved with a few very wide coalesced accesses rather than many narrow non-coalesced ones, since the latter create a large amount of contention on the memory bus and significantly reduce memory access efficiency.
P.S. You should probably avoid using both SIMD and unrolling in an NDRange kernel over external memory accesses, because it is not usually the case that the accesses are consecutive in both the SIMD direction and the unrolling direction. SIMD is applied on the first dimension for NDRange kernels; hence, in 2D and 3D NDRange kernels you should make sure your accesses are consecutive over the first dimension so that they can be coalesced.
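As a small illustration of the first-dimension point, the following plain C sketch compares two ways of turning a 2D work-item ID into a buffer index. The function names are hypothetical; in a real kernel `x` would come from `get_global_id(0)` and `y` from `get_global_id(1)`. Only the first layout gives neighbouring work-items along dimension 0 neighbouring addresses, which is what SIMD over the first dimension needs in order to produce consecutive accesses.

```c
#include <assert.h>
#include <stddef.h>

/* Dimension 0 moves fastest: work-items x and x+1 (same y) map to
   adjacent elements, so accesses along dimension 0 are consecutive. */
size_t index_dim0_fastest(size_t x, size_t y, size_t width)
{
    assert(x < width);
    return y * width + x;
}

/* Dimension 1 moves fastest: work-items x and x+1 (same y) end up
   `height` elements apart, so accesses along dimension 0 are strided. */
size_t index_dim1_fastest(size_t x, size_t y, size_t height)
{
    assert(y < height);
    return x * height + y;
}
```

With the first form, `buf[index_dim0_fastest(x, y, width)]` has unit stride along dimension 0; with the second form the stride along dimension 0 equals `height`.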