How many clock cycles to transfer between global and local memory?

WWP00 · ‎07-23-2019

For example, suppose we have a local memory array:

float localMem[10];

And a much larger global memory array. Would we copy like this:

int memStart = 50;
for (int i = 0; i < 10; ++i)
    localMem[i] = globalMem[memStart + i];

Or should we use pragma unroll for this copy, to avoid making the loop take one clock cycle per copy? Or is there some other recommended way to move array data between local and global memory?

Does this transfer take 10 clock cycles, or less than that?

HRZ · ‎07-24-2019

Latency of global memory accesses is not fixed and depends on many factors including but not limited to access size, access pattern, LSU (load-store unit) type and possible stalls. The compiler usually allocates 100 to 200 pipeline stages for global memory reads to maximize stall absorption. However, the latency could be even higher if there are multiple ports to memory, resulting in contention on the memory bus. You can check the latency of the different parts of your design in the "System Viewer" tab of the HTML report.

In your code example, you will only use a small percentage of the external memory bandwidth, and the loop should be unrolled to allow compile-time coalescing and improve memory performance. You can take a look at the following thread for more information on the math behind the memory throughput and choosing the best vector/unroll size:

https://forums.intel.com/s/question/0D50P00003yyTckSAE/loopunrolling-and-memory-access-performance?language=en_US
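
For illustration, here is a minimal sketch of what the unrolled copy from the question could look like inside a kernel. The kernel name copy_block, the result argument, and the BLOCK constant are hypothetical names chosen for this example, and the write-back loop exists only so the copy has an observable effect. The on-chip buffer is called buf because `local` is an address-space keyword in OpenCL C and cannot be used as a variable name. The macro guard at the top is only there so the same source can also be compiled as plain C for a quick functional check; a plain C compiler will typically ignore the unroll pragma.

```c
/* Let the same source compile as plain C for a host-side functional check. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

#define BLOCK 10  /* size of the on-chip buffer, taken from the question */

/* Hypothetical kernel: copies BLOCK consecutive floats starting at
 * globalMem[memStart] into an on-chip buffer, then writes them to result. */
__kernel void copy_block(__global const float *restrict globalMem,
                         __global float *restrict result,
                         int memStart)
{
    float buf[BLOCK];  /* on-chip buffer */

    /* Full unroll: all BLOCK reads are consecutive and visible at compile
     * time, which gives the compiler the chance to coalesce them into a
     * wider access instead of issuing one narrow read per iteration. */
    #pragma unroll
    for (int i = 0; i < BLOCK; ++i)
        buf[i] = globalMem[memStart + i];

    /* Write-back so the copy is observable (illustration only). */
    #pragma unroll
    for (int i = 0; i < BLOCK; ++i)
        result[i] = buf[i];
}
```

Whether the reads are actually coalesced, and how many pipeline stages the resulting load unit gets, is best confirmed in the "System Viewer" tab of the HTML report rather than assumed from the source code.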