I have tried increasing the number of work-items per work-group, with the number of CUs set to 1, using the attribute __attribute__((reqd_work_group_size(BLOCKDIM,BLOCKDIM,1))) in the matrix multiplication example. In the first case I set BLOCKDIM = 16; in the second case I set BLOCKDIM = 32. I get almost the same resource usage and latency in both cases, and the number of DSP units is also the same. I only observe a proportional increase in resource usage when I increase the number of compute units, the SIMD factor, or loop unrolling. How do work-items get mapped onto the FPGA fabric?
I have explained this multiple times in different threads, so I will just post some of the existing discussions:
https://www.alteraforum.com/forum/showthread.php?t=56121
https://www.alteraforum.com/forum/showthread.php?t=55522
https://www.alteraforum.com/forum/showthread.php?t=57567
tl;dr: There is no thread-level parallelism in NDRange kernels unless you use SIMD or compute unit replication. Threads are just pipelined, and increasing the work-group size will hardly affect performance or area usage.
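The tl;dr can be illustrated with a toy cycle estimate in plain C. This is only a rough sketch, not a description of how the compiler actually schedules work: it assumes an idealized pipeline that accepts one work-item per clock cycle per SIMD lane per compute unit, and it ignores memory stalls and barriers. The function name estimate_cycles and the pipeline_depth parameter are made up for illustration.

```c
/* Toy model (assumption, not the compiler's real behaviour):
 * one work-item enters each SIMD lane of each compute unit per cycle,
 * with no memory stalls and no barrier overhead.
 * n is assumed to be a multiple of blockdim. */
unsigned long estimate_cycles(unsigned long n,        /* matrix is n x n */
                              unsigned long blockdim, /* work-group is blockdim x blockdim */
                              unsigned long simd,     /* SIMD factor */
                              unsigned long cu,       /* number of compute units */
                              unsigned long pipeline_depth)
{
    /* A larger BLOCKDIM means fewer work-groups... */
    unsigned long groups = (n / blockdim) * (n / blockdim);
    /* ...but the total number of work-items is still n * n. */
    unsigned long items = groups * blockdim * blockdim;
    /* Only SIMD and CU replication add parallel lanes. */
    unsigned long lanes = simd * cu;
    /* Pipeline fill time plus one issue slot per work-item per lane. */
    return pipeline_depth + (items + lanes - 1) / lanes;
}
```

Under these assumptions, estimate_cycles(1024, 16, 1, 1, 100) and estimate_cycles(1024, 32, 1, 1, 100) return the same value, while doubling simd or cu roughly halves the result. That matches the observation in the question: BLOCKDIM alone changes neither the amount of hardware nor the estimated run time.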
What about BRAM blocks when my thread uses private memory? Would the private memory allocated for one work-item get reused? What happens when I use attributes like SIMD and CU? For example, in the GEMM example, doubling the SIMD attribute doubles the BRAM usage. I can understand BRAM usage doubling when the CU attribute is increased, but why would the SIMD attribute increase BRAM usage? Is it because of local memory replication? Is there any attribute to set a limit on the replication of local memory?
Private memory is generally implemented using registers and does not use Block RAMs. Large local memory buffers are implemented using Block RAMs. CU increases Block RAM usage by the CU factor since it replicates the whole pipeline. The effect of SIMD is not straightforward. If accesses to your local buffer are coalesced in the presence of SIMD, which means the number of ports to that buffer does not change, then the replication factor stays the same and Block RAM utilization hardly changes. If, however, such accesses are not consecutive and cannot be coalesced, then using SIMD will increase the number of ports by the SIMD factor and can significantly increase the replication factor. The Block RAM replication factor depends on the number of barriers in the kernel, the number of accesses to the buffer, and the number of work-groups the compiler decides to run simultaneously. The latter cannot be directly controlled by the user. Some attributes are provided to control banking and the number of ports for local memory buffers. You can find the details in "Intel FPGA SDK for OpenCL Best Practices Guide, Section 7.5 - Optimizing Accesses to Local Memory by Controlling the Memory Replication Factor".
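The coalescing rule above can be sketched as a small C model. This is a simplification for illustration only: the real compiler decides coalescing statically from the index expressions in the kernel, not from run-time addresses, and the final replication factor also depends on the barriers and work-group concurrency mentioned above. Both function names are hypothetical.

```c
/* Returns 1 if the local-memory addresses touched by the SIMD lanes are
 * consecutive, i.e. each lane accesses the element right after the one
 * accessed by the previous lane. */
int lanes_are_consecutive(const unsigned *addr, unsigned simd)
{
    for (unsigned i = 1; i < simd; ++i) {
        if (addr[i] != addr[i - 1] + 1)
            return 0;
    }
    return 1;
}

/* Toy port count for a single access site (assumption): a coalesced
 * access stays one (wider) port, otherwise every lane needs its own port. */
unsigned ports_for_access(const unsigned *addr, unsigned simd)
{
    return lanes_are_consecutive(addr, simd) ? 1u : simd;
}
```

For example, with a SIMD factor of 4, lanes reading addresses 8, 9, 10, 11 count as one port in this model, while lanes reading 0, 16, 32, 48 count as four. The second pattern is the one that drives up the replication factor.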
--- Quote Start ---
Number of work groups running simultaneously depends on the compiler? Is it not the number of compute units?
--- Quote End ---
There are two different levels of work-group concurrency:
- Work-groups are pipelined one after another in the same compute unit. Each region between two barriers can be occupied by a different work-group to maximize pipeline efficiency. The number of work-groups that can be in flight in the same compute unit is decided by the compiler and is not controllable by the user. This is what I was referring to above.
- Work-groups can also run in parallel in different compute units. Altera recommends having at least three times more work-groups than compute units so that, coupled with work-group pipelining, circuit occupancy is maximized.
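The three-times guideline can be checked on the host side before launching a kernel. The following plain C sketch uses hypothetical helper names and reads "three times more" as "at least 3x the number of compute units", which is an interpretation rather than something stated precisely above.

```c
/* Number of work-groups along one NDRange dimension.
 * Assumes global_size is a multiple of local_size. */
unsigned long work_groups_1d(unsigned long global_size,
                             unsigned long local_size)
{
    return global_size / local_size;
}

/* Returns 1 if there are at least 3x as many work-groups as compute
 * units (the guideline, as interpreted here), 0 otherwise. */
int meets_occupancy_guideline(unsigned long num_work_groups,
                              unsigned long num_compute_units)
{
    return num_work_groups >= 3UL * num_compute_units;
}
```

For a 1024 x 1024 NDRange with BLOCKDIM = 32, work_groups_1d(1024, 32) gives 32 per dimension, so 32 * 32 = 1024 work-groups in total, which comfortably satisfies the guideline for a handful of compute units. A launch with only 8 work-groups on 4 compute units would not.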