Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
Announcements
FPGA community forums and blogs on community.intel.com are migrating to the new Altera Community and are read-only. For urgent support needs during this transition, please visit the FPGA Design Resources page or contact an Altera Authorized Distributor.
17267 Discussions

Why does one single load/store consume much RAMs?

Altera_Forum
Honored Contributor II
1,976 Views

Hi, 

when I design OpenCL and run aocl, it reports that one single load consume 13 RAMs and one single store consume 16 RAMs. 

 

e.g  

__kernel void top_kernel(__global restrict volatile int *a, __global restrict volatile int *b, __global restrict volatile int *c) { 

int i; 

for (i = 0; i < 10000; ++i) 

c = a + b[i]; // it will consume 13x2 + 16 RAMs 

}
0 Kudos
6 Replies
Altera_Forum
Honored Contributor II
1,031 Views

The RAMs are used as FIFO buffers to minimize the negative effect of stalls caused by off-chip memory accesses which can have variable latency. There are also some extra RAMs used per off-chip memory access as a private cache, but since your kernel arguments have been defined as volatile, that cache will not be used.

0 Kudos
Altera_Forum
Honored Contributor II
1,031 Views

 

--- Quote Start ---  

The RAMs are used as FIFO buffers to minimize the negative effect of stalls caused by off-chip memory accesses which can have variable latency. There are also some extra RAMs used per off-chip memory access as a private cache, but since your kernel arguments have been defined as volatile, that cache will not be used. 

--- Quote End ---  

 

 

Thanks for your reply. 

 

Can we disable or decrease the RAM usage?
0 Kudos
Altera_Forum
Honored Contributor II
1,031 Views

Other than disabling the cache using volatile, there is nothing else that can be done. Is there a specific reason you want to do this? Removing those FIFOs will have a very large negative impact on performance.

0 Kudos
Altera_Forum
Honored Contributor II
1,031 Views

 

--- Quote Start ---  

Other than disabling the cache using volatile, there is nothing else that can be done. Is there a specific reason you want to do this? Removing those FIFOs will have a very large negative impact on performance. 

--- Quote End ---  

 

I also saw that using volatile can disable the cache.  

 

I want to apply coarse grained parallel on external memory access, e.g. 

# pragma unroll 

for (i = 0; i < 64; ++i) 

for (j = 0; j < 1000; ++j) 

... = a[i * 1000 + j]; 

 

It will consume a lot of RAMs (64 * 16 RAMs) while total of RAMs is about 2700 RAMs in arria10.  

After that, we have little optimization space because of lack of RAMs. 

 

Thanks.
0 Kudos
Altera_Forum
Honored Contributor II
1,031 Views

Considering the very low memory bandwidth on current FPGA boards and the very high overhead of contention for off-chip memory accesses, you should actually avoid having parallel memory accesses and instead, unroll your memory accesses in a way that they will be coalesced into bigger ones, to minimize the number of ports to external memory. These ports, as you have noticed, waste a lot of space on the FPGA. 

 

In your code example, you are unrolling the i loop, while the memory accesses are not contiguous over the i dimension and hence, you get 64 memory ports. This, apart from very high area usage, will lower your memory bandwidth to near-zero due to constant contention between all those ports. However, if you partially unroll the j loop 64 times, since the accesses are contiguous, you will get a few large coalesced ports with very low area overhead, and you will get very close to theoretical memory bandwidth.
0 Kudos
Altera_Forum
Honored Contributor II
1,031 Views

 

--- Quote Start ---  

Considering the very low memory bandwidth on current FPGA boards and the very high overhead of contention for off-chip memory accesses, you should actually avoid having parallel memory accesses and instead, unroll your memory accesses in a way that they will be coalesced into bigger ones, to minimize the number of ports to external memory. These ports, as you have noticed, waste a lot of space on the FPGA. 

 

In your code example, you are unrolling the i loop, while the memory accesses are not contiguous over the i dimension and hence, you get 64 memory ports. This, apart from very high area usage, will lower your memory bandwidth to near-zero due to constant contention between all those ports. However, if you partially unroll the j loop 64 times, since the accesses are contiguous, you will get a few large coalesced ports with very low area overhead, and you will get very close to theoretical memory bandwidth. 

--- Quote End ---  

 

 

I totally agree with your suggestion. I have tried and it works. 

 

Thanks.
0 Kudos
Reply