Reducing memory replication

Altera_Forum · ‎09-15-2017

Hi, I'm working on an OpenCL kernel that uses a 2MB dataset. Currently I read the entire 2MB into on-chip memory, perform the operations (random reads/writes), and then write the 2MB result back to global memory.

I've had no problems doing this as a single-work-item kernel, but when I attempt to parallelize the kernel by adding a for-loop with a #pragma unroll 1, I get a massive blowup in local-memory usage from the tools:

+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               + Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ;   39%                     ;
; ALUTs                                  ;   21%                     ;
; Dedicated logic registers              ;   19%                     ;
; Memory blocks                          ;  697%                     ;
; DSP blocks                             ;    5%                     ;
+----------------------------------------+---------------------------;

Private memory: Potentially inefficient configuration
Requested size: 2097152 bytes
Implemented size: 33554432 bytes
Number of banks: 2 (banked on lowest dimension)
Bank width: 1024 bits
Bank depth: 8192 words
Total replication: 16 - Replicated 16 times to create private copies for simultaneous execution of 16 threads in the loop containing accesses to the array.
Running memory at 2x clock to support more concurrent ports
Additional information: Requested size 2097152 bytes, implemented size 33554432 bytes, replicated 16 times total, stallable, 4 reads and 3 writes.
- Reduce the number of write accesses or fix banking to make this memory system stall-free. Banking may be improved by using compile-time known indexing on lowest array dimension.
- Replicated 16 times to create private copies for simultaneous execution of 16 threads in the loop containing accesses to the array.
- Banked on lowest dimension into 2 separate banks.
- See best practices guide: local memory (https://www.altera.com/documentation/mwh1391807516407.html#chn1469549457114) for more information.
Private memory implemented in on-chip block RAM.

Any ideas on how to stop this replication?
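
For anyone reading the report above, the sizes it lists are consistent with each other, and a little arithmetic shows where the implemented size comes from. The sketch below is plain C, not kernel code, and only re-derives the numbers printed in the report; the constant and function names are made up for illustration, and it assumes the bank depth is counted in words of the stated bank width.

```c
#include <assert.h>
#include <stdint.h>

/* Figures copied from the compiler report in the post above. */
#define REQUESTED_BYTES   2097152ULL  /* "Requested size"    */
#define IMPLEMENTED_BYTES 33554432ULL /* "Implemented size"  */
#define REPLICATION       16ULL       /* "Total replication" */
#define NUM_BANKS         2ULL        /* "Number of banks"   */
#define BANK_WIDTH_BITS   1024ULL     /* "Bank width"        */
#define BANK_DEPTH_WORDS  8192ULL     /* "Bank depth"        */

/* Bytes held by ONE copy of the banked memory:
 * banks x (width in bytes) x depth.
 * Assumption: one "word" of depth is one bank-width-wide entry. */
static uint64_t bytes_per_copy(void)
{
    return NUM_BANKS * (BANK_WIDTH_BITS / 8ULL) * BANK_DEPTH_WORDS;
}

/* Bytes across all replicated copies. */
static uint64_t implemented_bytes(void)
{
    return bytes_per_copy() * REPLICATION;
}

/* Checks that the report's figures agree with each other. */
static void check_report(void)
{
    /* One copy is exactly the requested 2 MB array. */
    assert(bytes_per_copy() == REQUESTED_BYTES);
    /* 16 copies give the 32 MB the report calls "implemented". */
    assert(implemented_bytes() == IMPLEMENTED_BYTES);
}
```

Calling check_report() passes with these values: one copy of the array is the full 2097152 bytes, and the 16x replication alone accounts for the 33554432-byte implemented size. So the banking is not what inflates the memory here; the per-iteration private copies are.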

Altera_Forum · ‎09-15-2017

Can you post the code that is implementing the local memory?

Altera_Forum · ‎09-16-2017

Local memory buffers need to be replicated according to the number of accesses to them so that those accesses can proceed in parallel. With unrolling, unless the accesses can be coalesced, you further increase the number of accesses to the buffer and hence the replication factor. To stop the replication, you should avoid unrolling the loop. Loop unrolling without replicating the local buffer will not result in any performance improvement.

For more info, check "Intel FPGA SDK for OpenCL Best Practices Guide, 1.8.5 Optimize Accesses to Local Memory by Controlling the Memory Replication Factor" and "Intel FPGA SDK for OpenCL Programming Guide, 2.2 Kernel Attributes for Configuring Local Memory System".