Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Unrolling and used RAMs

Honored Contributor II


I'm trying to understand the relationship between channels, unrolling, and used RAM blocks (M20K). 



For this purpose, I've created this simple program composed of three kernels: 

- the first injects data into the channel 

- the second accumulates it 

- the third receives the result of the accumulation and stores it into global memory 



__kernel void generator_float_vector(int N) {
    int outer_loop_limit = (int)(N / U); // we cannot have double write
    for (int i = 0; i < outer_loop_limit; i++) {
        #pragma unroll
        for (int j = 0; j < U; j++)
            write_channel_intel(channel_float_vector, (float)(1.0));
    }
}

__kernel void consumer(int N) {
    int outer_loop_limit = (int)(N / U);
    float acc_o = 0;
    float x;
    for (int i = 0; i < outer_loop_limit; i++) {
        float acc = 0;
        #pragma unroll
        for (int j = 0; j < U; j++)
            x = read_channel_intel(channel_float_vector);
        #pragma unroll
        for (int j = 0; j < U; j++)
            acc += x;
        acc_o += acc;
    }
    write_channel_intel(channel_float_sink, acc_o);
}

__kernel void sink_single(__global float *restrict out) {
    float r = read_channel_intel(channel_float_sink);
    *out = r;
}



The first and second kernels exploit unrolling (to speed up the computation). The unroll factor is given by the constant U. 

In the second kernel, I made the read from the channel explicit just for readability. 
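(For comparison, the two unrolled loops in consumer could be fused into a single one. Note that in the posted version only the last value read survives in x, so the two forms only compute the same sum because the generator writes a constant 1.0. Sketch only, with the channel declarations omitted as in the original snippet:)

```
        float acc = 0;
        #pragma unroll
        for (int j = 0; j < U; j++)
            acc += read_channel_intel(channel_float_vector);
        acc_o += acc;
```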



Now, by varying U, I obtain different values in the report for the number of used RAM blocks (M20K). 

The code is compiled with Quartus v18.0 for an Arria 10 board. 



In particular: 

U=4:  RAM=16 (16 used by the sink kernel) 
U=8:  RAM=17 (1 consumer kernel, 16 sink kernel) 
U=16: RAM=21 (5 consumer, 16 sink) 
U=32: RAM=38 (22 consumer, 16 sink) 
U=64: RAM=70 (54 consumer, 16 sink) 



I believe that the 16 RAMs used by the sink kernel are due to device RAM interface. 

What I cannot understand is the amount of RAMs used by the consumer kernel: 


  • from the programming guide, the compiler should try to use private memory (registers) if the data used is less than 64 bytes. 


This should cover the case U=16 (a float being 4 bytes, so 16 floats = 64 bytes), but it doesn't seem so. 


  • starting from U=16, the number of RAMs used increases with U, which should be somehow related to the unrolling 




Any suggestions on how to read these numbers? 

5 Replies
Honored Contributor II

Your code snippet is missing the channel definitions. I defined the channels manually and ran a test on the kernel. From what I can see in the area report, the Block RAMs used in the consumer kernel are used to keep the "state" of the variables in the kernel. There is no direct relationship between the Block RAM consumption for keeping variable states and the unroll factor. These Block RAMs are used to allow correct pipelining and the amount depends on pipeline length, pipeline complexity, number and scope of variables in the loop(s), target operating frequency and probably other stuff. The 16 Block RAMs used in the sink kernel are also used as buffers between the kernel and external memory interface.

Honored Contributor II

Channel definition: 


channel float channel_float_sink;
channel float channel_float_vector;


The direct correlation between the unroll factor and the number of used RAM blocks seems strange to me: as I increase U, the block usage increases, which doesn't seem to be due only to pipelining. 

My first guess was that, since M20Ks have only two ports, the more you unroll, the more blocks you need to support concurrent operations.
Honored Contributor II

In the case of your code, none of the buffers/variables in your code are being implemented as Block RAMs. If they were, the compiler would explicitly report the number of reads and writes from and to each buffer and the replication factor required to support these accesses and the number of Block RAMs used for each buffer. However, the report explicitly says the Block RAMs are used to implement the "state" of the kernel. If you check the "System viewer", you will see that by increasing the unroll factor, the latency of the pipeline keeps increasing in the consumer kernel, which means the pipeline is getting longer and more Block RAMs will be required to keep the state of the variables.

Honored Contributor II

So, the additional RAMs (and the additional latency) are essentially due to the accumulation. The more you accumulate (by increasing U), the more latency you pay and the more blocks you use. 

I've also noticed that with --fp-relaxed, both values (latency and blocks) are lower than in the previous case.
Honored Contributor II

Yes, with --fp-relaxed, the compiler creates a balanced tree for the reduction instead of a long chain of additions and hence, the pipeline length and latency will be lower.