Difference between RAM blocks and Memory Bits?

Altera_Forum · ‎02-13-2018

I recently ran into abnormally high RAM block usage.

My code looks like this:

#define A_PARALLEL 16

#define B_PARALLEL 32

typedef struct{

float kk[A_PARALLEL];

} A_parallel;

typedef struct{

A_parallel ff[B_PARALLEL];

} B_parallel;

__kernel void foo(__global const B_parallel *W, int count){

B_parallel temp;

for(int i=0; i<count ;i++){

temp = W[idx+i];

}

}

In acl_quartus_report:

ALUTs: 169508

Registers: 286,128

Logic utilization: 131,657 / 427,200 ( 31 % )

I/O pins: 277 / 826 ( 34 % )

DSP blocks: 680 / 1,518 ( 45 % )

Memory bits: 31,849,144 / 55,562,240 ( 57 % )

RAM blocks: 2,458 / 2,713 ( 91 % )

report.html says that most of the RAM block usage comes from temp.

My question is: what is the difference between RAM blocks and memory bits?

Also, my temp only holds 512 floats, so why does it use so much RAM?

I tried to force it into registers with __attribute__((register)), but that did not work. How can I solve this?

Altera_Forum · ‎02-13-2018

In report.html it also says:

Load uses a Burst-coalesced cached LSU. Load with a private 512 kilobit cache. Cache is not shared with any other load. It is flushed on kernel start. Use Dynamic Profiler to verify cache effectiveness. Other kernels should not be updating the data in global memory while this kernel is using it. Cache is created when memory access pattern is data-dependent or appears to be repetitive. Simplify access pattern or mark pointer as 'volatile' to disable generation of this cache.

What does that mean? I tried marking my temp as

volatile B_parallel temp;

and it gives a lot of errors during the "Linking with IP library" step.

Altera_Forum · ‎02-13-2018

I saw that someone else had the same question as mine; the problem might be the cached LSU.

However, after adding volatile as in

__kernel void foo(__global volatile const B_parallel *W, int count)

I still can't disable the cached LSU...

If I load the data into local memory first, what is the difference between the cached LSU and local memory?

Altera_Forum · ‎02-13-2018

An Altera/Intel FPGA has multiple Block RAMs (M20Ks), each with a size of 20 Kbits and 2 ports. "RAM blocks" shows the number of M20K blocks with at least one occupied port. "Memory bits" shows the total number of bits among all the M20K blocks that are occupied by valid data.

512 floats is 16 Kbits (512 * 32 bits), which is far too big to be implemented using registers, and would require one M20K per access.
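The relation between the two report lines can be checked with a little arithmetic. The sketch below is plain C and assumes that one M20K holds 20,480 bits (20 * 1,024); that assumption agrees with the device total in the report above, since 2,713 * 20,480 = 55,562,240. The function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed capacity of one M20K block: 20 Kbits = 20 * 1024 bits. */
#define M20K_BITS 20480u

/* Total bit capacity of `blocks` M20K blocks. */
uint64_t m20k_capacity_bits(uint64_t blocks) {
    return blocks * M20K_BITS;
}

/* Percentage (rounded down) of the capacity of the occupied blocks
   that actually holds data. */
unsigned fill_percent(uint64_t used_bits, uint64_t used_blocks) {
    return (unsigned)(used_bits * 100u / m20k_capacity_bits(used_blocks));
}

/* Size in bits of an array of `count` 32-bit floats. */
uint64_t float_array_bits(uint64_t count) {
    return count * 32u;
}
```

With the numbers from the report, the 2,458 occupied blocks could hold 50,339,840 bits, but only 31,849,144 bits are in use, which is about 63 % of that capacity. A block counts as used even when it is only partly filled, so the RAM block percentage (91 %) can be much higher than the memory bit percentage (57 %).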

External memory accesses are always coupled with a private on-chip cache which could potentially prevent redundant accesses from going to external memory, and improve performance. Mixing "volatile" and "constant" does not make sense; something that is constant cannot be volatile. Caches for non-constant buffers can be disabled by using volatile. For constant buffers, a different cache is used. Check "Intel FPGA SDK for OpenCL Best Practices Guide, Section 7.3.1" for more info about how you can control the size of the constant cache.

If you are manually loading data to local memory, you do not need the cache anymore, and you can disable it.

Altera_Forum · ‎02-13-2018

If you want to disable the cache on global memory LSU you need to mark the "pointer" as volatile and not the data type B_parallel, so instead of

__kernel void foo(__global volatile const B_parallel *W, int count)

you're looking for:

__kernel void foo(__global const B_parallel * volatile W, int count)

Also note that const and __constant are different.
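For readers unsure what the two spellings mean in the language itself, here is a small sketch in standard C11. It is not OpenCL and says nothing about which form the offline compiler honors when deciding on the LSU cache; it only shows that the position of volatile relative to the * changes what is qualified. The macro and function names are hypothetical.

```c
/* Standard C11 only; how the OpenCL offline compiler reacts to each
   form is tool-specific and not modeled here. */

/* Evaluates to 1 if the pointee type is volatile-qualified, 0 if not.
   `(p) + 0` yields an rvalue, so qualifiers on the pointer variable
   itself are dropped before the type is matched. */
#define POINTEE_IS_VOLATILE(p) \
    _Generic((p) + 0, const volatile float *: 1, const float *: 0)

static const float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};

/* Form A: `volatile` before the `*` qualifies the data pointed to. */
int form_a_pointee_is_volatile(void) {
    volatile const float *w = data;
    return POINTEE_IS_VOLATILE(w);
}

/* Form B: `volatile` after the `*` qualifies the pointer variable itself. */
int form_b_pointee_is_volatile(void) {
    const float * volatile w = data;
    return POINTEE_IS_VOLATILE(w);
}
```

Form A reports a volatile pointee and form B does not, so the two kernel signatures above are genuinely different declarations, and it is worth checking report.html after each change to see which one removed the cache in your tool version.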

Altera_Forum · ‎02-14-2018

Thanks HRZ and fand

I tried declaring volatile in every position and found that in Quartus 17.0, volatile has no effect if a non-volatile variable receives the value read through the volatile pointer.

In Quartus 17.1 this does not happen.

Does not work:

__kernel void foo(__global volatile const B_parallel *W, int count){

B_parallel p = W[idx];

}

Works:

__kernel void foo(__global volatile const B_parallel *W, int count){

volatile B_parallel p = W[idx];

}

I would also like to understand the relation between RAM blocks and memory bits.

In my code:

#define A_PARALLEL 16

#define B_PARALLEL 16

typedef struct{

float kk[A_PARALLEL];

} A_parallel;

typedef struct{

A_parallel ff[B_PARALLEL];

} B_parallel;

__kernel void fooA(__global const B_parallel *W, int count){

}

__kernel void fooB(__global const B_parallel *W, int count){

B_parallel temp[256];

}

fooB uses 2,102,272 more memory bits than fooA, which is reasonable because 256 B_parallel structs equal 256*16*16*32=2,097,152 bits.

However, fooB uses 208 more RAM blocks than fooA, and 208 RAM blocks hold 208*20,480=4,259,840 bits. How did this happen? (report.html says it implements 256 RAM blocks, but the actual difference in my comparison is 208 RAM blocks.)

I also want to know why report.html says my local memory has 1 read and 1 write.

Every time I use a B_parallel, I read 16*16=256 floats from local memory. Do I need 256 banks so that I can read 256 floats from local memory in parallel?

report.html:

Private memory: Optimal

Requested size: 262144 bytes

Implemented size: 262144 bytes

Number of banks: 1

Bank width: 8192 bits

Bank depth: 256 words

Total replication: 1

Additional information: Requested size 262144 bytes, implemented size 262144 bytes, stall-free, 1 read and 1 write.

Also, are there any bugs in the emulator when local memory is declared too large?

For example, if I declare more than 5000 floats of local memory, the program crashes when run with the emulator. How can I solve this problem?

Altera_Forum · ‎02-14-2018

For my kernels using volatile I would normally have something arranged like this:

__kernel void foo(__global const B_parallel * volatile W, int count){

B_parallel p = W[idx];

}

This works for me. I'm not exactly sure why you need your variable to be volatile as well, unless it is a pointer to a struct, which doesn't appear to be the case.

The number of memory blocks used will probably always be higher than the ideal number calculated from the bits, since each block has a limit on how many bits it holds. How much higher is difficult to predict, as the compiler decides what it thinks is optimal.
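One way to see where the extra blocks can come from is to count blocks by width rather than by bits. The sketch below is plain C and rests on an assumption that is not stated in this thread: that the widest M20K configuration is 512 words of 40 bits (512 words of 32 bits without the extra bits). It is a rough estimate, not the compiler's or Quartus's actual mapping, and the function name is made up.

```c
/* Rough estimate of M20K blocks needed for one memory bank.
   Assumption (not from the thread): a block can be configured as
   block_depth words of block_width bits, e.g. 512 x 40 or 512 x 32. */
unsigned blocks_for_bank(unsigned bank_width_bits, unsigned bank_depth_words,
                         unsigned block_width, unsigned block_depth) {
    /* Blocks placed side by side to reach the bank width (rounded up). */
    unsigned across = (bank_width_bits + block_width - 1u) / block_width;
    /* Blocks stacked to reach the bank depth (rounded up). */
    unsigned down = (bank_depth_words + block_depth - 1u) / block_depth;
    return across * down;
}
```

For the bank in the report above (8192 bits wide, 256 words deep), the 40-bit configuration gives 205 blocks and the 32-bit configuration gives 256. The first is close to the 208 blocks measured and the second equals the estimate in report.html, which suggests (but does not prove) that the bank is limited by width: each block is only half full in depth, so roughly twice the ideal number of bits is reserved.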

The "1 read and 1 write" is the number of accesses made to local memory at a time. If you want more accesses, you will need to unroll the loop or add more accesses to local memory in the code. The best practices guide recommends limiting it to four accesses for optimal performance. Having more accesses will likely create a more complex memory structure and most likely cause replication, which consumes a large number of memory blocks so that the memory can be accessed widely in parallel. In addition to the memory replication, performance will likely suffer as well. The M20Ks can run at twice the speed of the FPGA clock, which allows the memory to be double pumped and support twice as many accesses while keeping up with the FPGA clock.
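As a back-of-the-envelope illustration of why many parallel reads get expensive, here is a deliberately simplified model in plain C. It assumes that every copy of the memory must see all writes, that each M20K has 2 ports (4 when double pumped), and that the remaining ports serve reads. This is an assumption for illustration only, not the compiler's actual algorithm, and the function name is hypothetical.

```c
/* Simplified replication model (an assumption, not the real compiler logic).
   Returns the estimated number of copies of the memory, or 0 if the writes
   alone already use up every port and the model does not apply. */
unsigned replication_factor(unsigned reads, unsigned writes, int double_pumped) {
    unsigned ports = double_pumped ? 4u : 2u; /* 2 physical ports per M20K */
    if (reads == 0u) {
        return 1u;                            /* a single copy is enough */
    }
    if (writes >= ports) {
        return 0u;                            /* no port left for reads */
    }
    unsigned read_ports = ports - writes;     /* ports left over per copy */
    return (reads + read_ports - 1u) / read_ports; /* round up */
}
```

Under this model, 1 read and 1 write need a single copy, which matches "Total replication: 1" in the report above. Reading 256 floats at separate addresses in the same cycle with 1 write would need about 86 copies even with double pumping, which is far beyond what the device can provide.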

For banking, I usually stick with the default, as it handles what I need pretty well. In some cases tweaking the banking can provide some extra performance, but that takes some experimentation. I haven't tried banking for 256 parallel accesses, but I'd imagine there is a point where the memory replication starts to hurt performance.

The emulator has several limitations, but it does seem to have "unlimited" memory, or at least as much as your host is willing to provide, since it runs on the CPU instead of the FPGA. That said, I have run into a number of bugs when the size was too large. I'm not sure whether that limit depends on your development environment, but in any case a buffer of that size would not fit on the board and isn't practical for an OpenCL FPGA design anyway. There is a limit on how much local memory you can effectively use; if you need more space, you will have to move the data into global memory.

I hope that helps.