Altera_Forum
Honored Contributor I
986 Views

Why does a single load/store consume so many RAMs?

Hi, 

When I compile my OpenCL design with aocl, the report shows that a single load consumes 13 RAMs and a single store consumes 16 RAMs. 

 

e.g.

__kernel void top_kernel(__global volatile int * restrict a,
                         __global volatile int * restrict b,
                         __global volatile int * restrict c) {
    int i;
    for (i = 0; i < 10000; ++i)
        c[i] = a[i] + b[i]; // two loads + one store: 13x2 + 16 RAMs
}
6 Replies
Altera_Forum
Honored Contributor I
41 Views

The RAMs are used as FIFO buffers to minimize the negative effect of stalls caused by off-chip memory accesses which can have variable latency. There are also some extra RAMs used per off-chip memory access as a private cache, but since your kernel arguments have been defined as volatile, that cache will not be used.

Altera_Forum
Honored Contributor I
41 Views

 

--- Quote Start ---  

The RAMs are used as FIFO buffers to minimize the negative effect of stalls caused by off-chip memory accesses which can have variable latency. There are also some extra RAMs used per off-chip memory access as a private cache, but since your kernel arguments have been defined as volatile, that cache will not be used. 

--- Quote End ---  

 

 

Thanks for your reply. 

 

Can we disable these buffers or otherwise reduce the RAM usage?
Altera_Forum
Honored Contributor I
41 Views

Other than disabling the cache using volatile, there is nothing else that can be done. Is there a specific reason you want to do this? Removing those FIFOs will have a very large negative impact on performance.

Altera_Forum
Honored Contributor I
41 Views

 

--- Quote Start ---  

Other than disabling the cache using volatile, there is nothing else that can be done. Is there a specific reason you want to do this? Removing those FIFOs will have a very large negative impact on performance. 

--- Quote End ---  

 

I had also seen that using volatile disables the cache.  

 

I want to apply coarse-grained parallelism to external memory accesses, e.g.:

#pragma unroll
for (i = 0; i < 64; ++i)
    for (j = 0; j < 1000; ++j)
        ... = a[i * 1000 + j];

 

This consumes a lot of RAMs (64 * 16 of them), while an Arria 10 only has about 2,700 RAM blocks in total.  

That leaves little room for further optimization because of the lack of RAMs. 

 

Thanks.
Altera_Forum
Honored Contributor I
41 Views

Considering the very low memory bandwidth on current FPGA boards and the very high overhead of contention for off-chip memory accesses, you should actually avoid having parallel memory accesses and instead, unroll your memory accesses in a way that they will be coalesced into bigger ones, to minimize the number of ports to external memory. These ports, as you have noticed, waste a lot of space on the FPGA. 

 

In your code example, you are unrolling the i loop, while the memory accesses are not contiguous over the i dimension and hence, you get 64 memory ports. This, apart from very high area usage, will lower your memory bandwidth to near-zero due to constant contention between all those ports. However, if you partially unroll the j loop 64 times, since the accesses are contiguous, you will get a few large coalesced ports with very low area overhead, and you will get very close to theoretical memory bandwidth.
Altera_Forum
Honored Contributor I
41 Views

 

--- Quote Start ---  

Considering the very low memory bandwidth on current FPGA boards and the very high overhead of contention for off-chip memory accesses, you should actually avoid having parallel memory accesses and instead, unroll your memory accesses in a way that they will be coalesced into bigger ones, to minimize the number of ports to external memory. These ports, as you have noticed, waste a lot of space on the FPGA. 

 

In your code example, you are unrolling the i loop, while the memory accesses are not contiguous over the i dimension and hence, you get 64 memory ports. This, apart from very high area usage, will lower your memory bandwidth to near-zero due to constant contention between all those ports. However, if you partially unroll the j loop 64 times, since the accesses are contiguous, you will get a few large coalesced ports with very low area overhead, and you will get very close to theoretical memory bandwidth. 

--- Quote End ---  

 

 

I totally agree with your suggestion. I have tried it, and it works. 

 

Thanks.