Valued Contributor III

Reducing memory replication

Hi, I'm working on an OpenCL kernel that uses a 2MB dataset. Currently I read the entire 2MB into on-chip memory, perform the operations (random reads/writes), and then write the 2MB result back to global memory.


I've had no problems doing this as a single-work-item kernel, but when I attempt to parallelize the kernel by adding a for-loop with a #pragma unroll 1, I get a massive blowup in local-memory usage reported by the tools:



; Estimated Resource Usage Summary ;
; Resource + Usage ;
; Logic utilization ; 39% ;
; ALUTs ; 21% ;
; Dedicated logic registers ; 19% ;
; Memory blocks ; 697% ;
; DSP blocks ; 5% ;




  • Private memory: Potentially inefficient configuration
  • Requested size: 2097152 bytes
  • Implemented size: 33554432 bytes
  • Number of banks: 2 (banked on lowest dimension)
  • Bank width: 1024 bits
  • Bank depth: 8192 words
  • Total replication: 16 - Replicated 16 times to create private copies for simultaneous execution of 16 threads in the loop containing accesses to the array.
  • Running memory at 2x clock to support more concurrent ports
  • Additional information: Requested size 2097152 bytes, implemented size 33554432 bytes, replicated 16 times total, stallable, 4 reads and 3 writes. - Reduce the number of write accesses or fix banking to make this memory system stall-free. Banking may be improved by using compile-time known indexing on lowest array dimension. - Replicated 16 times to create private copies for simultaneous execution of 16 threads in the loop containing accesses to the array. - Banked on lowest dimension into 2 separate banks. - See best practices guide: local memory for more information.
  • Private memory implemented in on-chip block RAM.
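
For reference, the sizes in the report are consistent with each other: one copy of the array is banks × bank width × bank depth, and the implemented size is that figure multiplied by the replication factor. Below is a minimal sketch of that arithmetic in plain C. The helper names `copy_bytes` and `implemented_bytes` are made up for illustration and are not part of any SDK. It does not attempt to reproduce the 697% memory-block estimate, which depends on the target device.

```c
#include <stdint.h>

/* Size in bytes of ONE copy of the memory system:
 * number of banks x bank width (bits converted to bytes) x bank depth (words).
 * Hypothetical helper, for illustration only. */
static uint64_t copy_bytes(uint64_t banks, uint64_t bank_width_bits,
                           uint64_t bank_depth_words) {
    return banks * (bank_width_bits / 8) * bank_depth_words;
}

/* Total on-chip bytes once the compiler replicates that copy. */
static uint64_t implemented_bytes(uint64_t one_copy_bytes,
                                  uint64_t replication) {
    return one_copy_bytes * replication;
}
```

With the values from the report, `copy_bytes(2, 1024, 8192)` gives 2097152 bytes (the requested 2MB), and `implemented_bytes(2097152, 16)` gives 33554432 bytes, which matches the implemented size. So the whole blowup comes from the 16x replication rather than from the banking.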



Any ideas on how to stop this replication?
2 Replies
Valued Contributor III

Can you post the code that is implementing the local memory?

Valued Contributor III

Local memory buffers are replicated according to the number of accesses to them, so that those accesses can proceed in parallel. Unrolling a loop, unless the resulting accesses can be coalesced, further increases the number of accesses to the buffer and hence the replication factor. To stop the replication, you should avoid unrolling the loop; unrolling without replicating the local buffer will not give any performance improvement. For more info, check "Intel FPGA SDK for OpenCL Best Practices Guide, 1.8.5 Optimize Accesses to Local Memory by Controlling the Memory Replication Factor" and "Intel FPGA SDK for OpenCL Programming Guide, 2.2 Kernel Attributes for Configuring Local Memory System".
