
loop-unrolling and memory access performance

Altera_Forum
Honored Contributor II
2,461 Views

Hello, :) 

I would appreciate it if anyone could suggest something on my question:

I have an OpenCL task kernel (1 CU, 1 PE) that contains a main for loop. Iterations are independent. In every iteration there is a burst global memory access (through another inner for loop), which is the bottleneck of the design. Because the size of the memory access varies per iteration, I could not manage an NDRange-with-SIMD solution. Instead, keeping the OCL design as a task, I unrolled the main loop by 4, hoping to get 4 parallel memory accesses. The hardware size increased considerably (almost 4 times), but the runtime did not improve.

 

I used the profiling tool and saw that the occupancy (activity) factor of the memory access instruction is divided among 4 instructions after unrolling, which in my opinion means the 4 memory accesses run serially, not in parallel. Does anybody have an opinion on this? Could it be because the memory ports were not replicated?

thanks
7 Replies
Altera_Forum
Honored Contributor II
823 Views

To improve memory performance in task kernels, you must unroll your memory access loop on the dimension in which the accesses are consecutive. Take the following example: 

 

__global float *input;
__local float data[N][M];

for (i = 0; i < N; i++) {
    #pragma unroll 4
    for (j = 0; j < M; j++) {
        data[i][j] = input[i * M + j];   // consecutive in the unrolled dimension j
    }
}

 

In this case, you will get one coalesced port to off-chip memory with a width of 128 bits. However, if the accesses are as follows:

 

__global float *input;
__local float data[N][M];

for (i = 0; i < N; i++) {
    #pragma unroll 4
    for (j = 0; j < M; j++) {
        data[i][j] = input[j * N + i];   // strided in the unrolled dimension j
    }
}

 

You will get four non-coalesced 32-bit ports to external memory. In this case, your external memory performance will hardly improve since you now have 4 accesses competing with each other to acquire the memory bus. 

 

To maximize memory performance, you should minimize the number of accesses, but maximize the size of the accesses.
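
As an illustration of the same idea, an equivalent way to get one wide access is to use an OpenCL vector type instead of unrolling. A minimal sketch (not from the original post; the kernel name and the assumption that the data is 16-byte aligned and a multiple of 4 floats are mine):

// Hedged sketch: one float4 load/store replaces four float accesses,
// giving a single 128-bit port instead of four 32-bit ones.
__kernel void copy_wide(__global const float4 *restrict input,
                        __global float4 *restrict output,
                        unsigned n4)              // n4 = total floats / 4
{
    for (unsigned i = 0; i < n4; i++)
        output[i] = input[i];                     // one 128-bit access per iteration
}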
Altera_Forum
Honored Contributor II
823 Views

thanks for your reply, 

Actually, my inner loop has a large and consecutive memory access, and as reported by the optimizer it is pipelined well with II=1.

I put #pragma unroll 4 on the outer loop (not the inner one as you did), hoping to have 4 parallel accesses through 4 memory ports, because the outer loop body is independent across iterations (no read after write). The area increased by about 4x (both logic and BRAMs, which I think are used as a cache for global memory), so I guess 4 memory ports were replicated. But the performance does not change.

Do you have any guess? Mine is that the memory accesses are somehow done serially, not in parallel.

 

#pragma unroll 4
for (unsigned i = 0; i < 4000000; i++) {
    acc = 0.0;
    si  = start_index[i];
    ei  = end_index[i];
    for (unsigned j = si; j < ei; ++j)   // pipelined with II=1
        acc += value[j];                 // target memory access
    value_next[i] = acc;
}
Altera_Forum
Honored Contributor II
823 Views

There is no point in unrolling the outer loop. You have many global memory accesses as it is, and each access requires its own port to memory. Unrolling the outer loop results in 4 times more ports, and you will have 16 memory accesses competing with each other to acquire the memory bus, resulting in extremely poor memory performance. Assuming that "acc" and "value" are floating-point, you should first use the shift register optimization for floating-point accumulation as outlined in Intel's documents, and then unroll the inner loop to be able to achieve higher performance. 
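
For reference, the shift register pattern referred to above looks roughly like the following, applied to the inner accumulation loop of the code above. This is a minimal sketch adapted from the pattern in Intel's best practices guide; SR_DEPTH is a placeholder name and should be at least the latency of the floating-point adder reported by the compiler.

// Hedged sketch of the shift-register optimization for floating-point accumulation.
#define SR_DEPTH 8                          // assumption: >= FP adder latency

float shift_reg[SR_DEPTH + 1];
#pragma unroll
for (unsigned k = 0; k < SR_DEPTH + 1; ++k)
    shift_reg[k] = 0.0f;                    // clear all partial sums

for (unsigned j = si; j < ei; ++j) {
    // push the new element into the tail; the loop-carried dependence is
    // now SR_DEPTH iterations long instead of 1, so the loop can reach II=1
    shift_reg[SR_DEPTH] = shift_reg[0] + value[j];
    #pragma unroll
    for (unsigned k = 0; k < SR_DEPTH; ++k)
        shift_reg[k] = shift_reg[k + 1];
}

// reduce the partial sums into the final accumulator
float acc = 0.0f;
#pragma unroll
for (unsigned k = 0; k < SR_DEPTH; ++k)
    acc += shift_reg[k];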

 

Most FPGA boards only have two memory banks, which means at each clock cycle, you can perform a maximum of two "parallel" memory accesses. You should actually avoid "parallel" memory accesses as much as you can and instead, try to have few but large coalesced accesses.
Altera_Forum
Honored Contributor II
823 Views

Actually, my inner loop has a large enough burst read. In the profiler, I see that the cache hit rate is almost 1 and the memory access efficiency is 100%. At another level, I want to have multiple parallel accesses to improve the bandwidth.

1- First let me ask: if there are multiple accesses to the same global memory variable in different parts of the code, like

line 100: x = glmem[i];

line 200: y = glmem[j];

does this lead to port replication, or are the accesses done serially through a single port to the glmem variable? According to your statement, the ports are replicated, right?

 

2- If so, is it logical to manually unroll the loop, taking care about where the ports get replicated? That is, unroll only the parts I need and leave the rest rolled. For example, instead of:

#pragma unroll 4
for (unsigned i = 0; i < 4000000; i++) {
    // i, j, acc, s, e are local; the rest are global.
    acc = 0.0;
    s   = start_index[i];
    e   = end_index[i];
    for (unsigned j = s; j < e; ++j)
        acc += value[j];     // target memory access
    value_next[i] = acc;
}

 

I do this: 

 

for (unsigned i = 0; i < 4000000; i += 4) {

    // kept rolled
    for (unsigned j = 0; j < 4; j++) {
        acc[j] = 0.0;
        s[j]   = start_index[i + j];
        e[j]   = end_index[i + j];
    }

    // unrolled by hand; I want to improve the performance of reading the value[] variable.
    for (unsigned j = s[0]; j < e[0]; ++j)   // a large burst access
        acc[0] += value[j];

    for (unsigned j = s[1]; j < e[1]; ++j)   // a large burst access
        acc[1] += value[j];

    for (unsigned j = s[2]; j < e[2]; ++j)   // a large burst access
        acc[2] += value[j];

    for (unsigned j = s[3]; j < e[3]; ++j)   // a large burst access
        acc[3] += value[j];

    // kept rolled
    for (unsigned j = 0; j < 4; j++) {
        value_next[i + j] = acc[j];
    }
}

Thanks a lot for your help :)
Altera_Forum
Honored Contributor II
823 Views

1- Every single access to global memory in any part of the code will have its own port to external memory. Such ports are never shared between different accesses, unless they can be coalesced at "compile time". 

 

2- None of those accesses will be "large burst accesses"; they will all be 32-bit accesses. You might think that since your accesses are consecutive and in a loop, they might be coalesced into a larger access at runtime, but this will not happen, since the memory interface does not perform runtime coalescing. The only way to achieve higher memory performance is to perform compile-time coalescing by unrolling the loop over the consecutive accesses.

 

The "System viewer" in the area report shows all the ports, their size and type. If you unroll the outer loop, you will get four 32-bit accesses, while if you unroll the inner loop, you will get one 128-bit access. The latter will give you far much better memory performance (but an II of 16 due to the floating-point accumulation which can be fixed using shift register inference).
Altera_Forum
Honored Contributor II
823 Views

hello,  

Thank you very much for following this thread. I am really learning things that I could not find in the Altera manuals. I went and checked the manuals again.

 

- About the floating-point accumulator: the compiler doesn't complain like it does for the examples in the "best practices guide", so I think it is implemented as a single-cycle accumulator, as specifically explained in programming guide section 1.6.12 (single-cycle FP accumulator; I use an Arria 10 device).

 

Let's divide my question into two parts.

 

1- In a task, I want to get the maximum memory bandwidth from a single port for a specific global variable array.

I guess that in the ideal case (burst, cache, etc., but without coalescing), if I get one 32-bit word of that variable per cycle, I am done, right?

(Unfortunately, the indices of consecutive accesses are random - dynamic indexing - so I think I cannot have coalesced accesses. To not confuse you, I will open a new thread about the code.)

 

2- In the same task (no NDRange), at another level, I want parallel access to memory through more than one port (for that specific variable) to saturate the whole global memory bandwidth. I want to somehow implement multiple compute units, but inside one task. Does it make sense to unroll the outer loop for this purpose? Do you have any other suggestion? Let's assume coalesced accesses will not work, due to the random memory access.

 

thanks
Altera_Forum
Honored Contributor II
823 Views

 

--- Quote Start ---  

- About the floating-point accumulator: the compiler doesn't complain like it does for the examples in the "best practices guide", so I think it is implemented as a single-cycle accumulator, as specifically explained in programming guide section 1.6.12 (single-cycle FP accumulator; I use an Arria 10 device).

--- Quote End ---  

 

Yes, if you are targeting Arria 10, you can take advantage of single-cycle floating-point accumulation, and in fact your code is constructed in a way that takes advantage of this feature. I think I compiled your code against Stratix V, hence the II of 16 when unrolling the inner loop. If you get an II of one after unrolling the inner loop, then that is good and you can go ahead with unrolling it without needing to use the shift register optimization.

 

 

--- Quote Start ---  

1- In a task, I want to get the maximum memory bandwidth from a single port for a specific global variable array.

I guess that in the ideal case (burst, cache, etc., but without coalescing), if I get one 32-bit word of that variable per cycle, I am done, right?

(Unfortunately, the indices of consecutive accesses are random - dynamic indexing - so I think I cannot have coalesced accesses. To not confuse you, I will open a new thread about the code.)

--- Quote End ---  

 

Actually, no, you need a much larger access. I am not sure which Arria 10 board you are using, but the one I use has two banks of DDR4 memory running at 2133 MHz. Each bank is connected to the FPGA through a 64-bit bus (72-bit with ECC). Also, the memory controller on the FPGA runs at 266 MHz. Hence, if you want to saturate the memory bandwidth using one access, considering the operating frequency difference between the memory controller and the memory itself, you need an access size equal to:

 

(memory frequency / controller frequency) * number of banks * width of memory port = (2133 / 266) * 2 * 64 ≈ 1024 bits = 32 floats --> unroll factor of 32 on the memory access loop

 

Of course, this is with the assumption that you only have one access in your kernel. If you have more accesses, then you should divide your bandwidth between the different accesses and adjust the unrolling factors accordingly. You can also consider "sacrificing" your less important accesses by using a lower unroll factor for them. Doing one 32-bit access per cycle will only give you 1/32 of the memory bandwidth. Having multiple 32-bit accesses per cycle will give you some limited performance improvement, but you will never reach the peak or anywhere close to it due to contention on the memory bus.
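
To make the arithmetic concrete with the loop from the earlier posts, the idea would be something like the sketch below. The factor of 32 assumes the two-bank DDR4-2133 board described above and a single dominant access; variable names follow the code posted earlier.

// Hedged sketch: unroll the access loop by 32 so that, ideally, one
// 32 x 32-bit = 1024-bit coalesced access is issued per cycle, matching the
// estimate of (2133 / 266) * 2 * 64 bits per memory-controller cycle.
#pragma unroll 32
for (unsigned j = s; j < e; ++j)
    acc += value[j];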

 

 

 

--- Quote Start ---  

2- In the same task (no NDRange), at another level, I want parallel access to memory through more than one port (for that specific variable) to saturate the whole global memory bandwidth. I want to somehow implement multiple compute units, but inside one task. Does it make sense to unroll the outer loop for this purpose? Do you have any other suggestion? Let's assume coalesced accesses will not work, due to the random memory access.

--- Quote End ---  

 

 

Yes, unrolling the outer loop will create a multiple-compute-unit-like design; however, you will likely not get much of a performance improvement. The memory controller on the FPGA is extremely inefficient for random memory accesses, and due to the lack of a proper cache hierarchy, there is very little you can do when it comes to random accesses. More or less the same issue also exists on GPUs; however, GPUs have a far more advanced memory controller and far higher memory bandwidth, and they also have a cache hierarchy.

 

If you are interested in understanding the memory performance, you can try porting the following repository for FPGAs: 

 

https://github.com/uob-hpc/babelstream 

 

Then you can play around with different SIMD factors for NDRange kernels and unroll factors for single work-item kernels to see how the memory performance varies. I ported an older version of this repository for the same purpose a while back, which you can use (though there is no documentation):

 

https://github.com/zohourih/gpu-stream