How to improve memory bandwidth

Altera_Forum · ‎06-28-2017

Hi, I want to do empty kernels.

First block read data from memory, send this data to channel.

Second block receive data from channel and write this data to memory.

I execute this code on Arria10 DeVKit

But information of bandwidth in profile is just ~3000 Mb/sec, but max bandwidth is 12800 Mb/sec

How can I improve this result ?


# pragma OPENCL_EXTENSION cl_altera_channels : enable
# define THREAD_SIZE 1# define VEC_SIZE 128
typedef struct{
      float data;
} lane_data;
channel lane_data data_ch;
__kernel
void memRead(
                     int data_dim1,
                     int data_dim2,
                     int data_dim3,
                     __global float *restrict bottom
)
{
lane_data buff;
int size = data_dim1*data_dim2*data_dim3;
for(int ll=0; ll<size/THREAD_SIZE/VEC_SIZE; ll++){
   # pragma unroll
    for(int i=0; i<THREAD_SIZE; i++){
       # pragma unroll
        for(int j=0; j<VEC_SIZE; j++){
            buff.data = bottom;
        }
        write_channel_altera(data_ch, buff);
    }
}
}
__kernel
void memWrite(
                     int data_dim1,
                     int data_dim2,
                     int data_dim3,
                     __global float *restrict result
)
{
lane_data buff;
int size = data_dim1*data_dim2*data_dim3;
for(int ll=0; ll<size/THREAD_SIZE/VEC_SIZE; ll++){
   # pragma unroll
    for(int i=0; i<THREAD_SIZE; i++){
        buff = read_channel_altera(data_ch);
       # pragma unroll
        for(int j=0; j<VEC_SIZE; j++){
             result = buff.data;
        }
    }
}
}

thank you

Altera_Forum · ‎06-29-2017

If you take a look at the report, the reason is pretty obvious. The compiler is being stupid and splitting your read access into 8x 512-bit simple accesses and also 3x 32-bit prefetching accesses (no idea what the hell this is), instead of inferring a single 4096-bit coalesced access like the write one. Because of this, you have 12 ports going to memory instead of 2. It goes without saying that this configurations results in a huge amount of contention on the memory bus and significantly reduces your memory performance.

If you add the volatile tag to your input (__global volatile float *restrict bottom), you will also get one single 4096-bit access for the read which will likely allow you to achieve close to peak performance.

Needless to say, since the devkit only has one memory bank, you should be able to achieve full bandwidth with a total access size of 512 bits (read + write), so a vector size of 8 or 16 should be enough in your case.

Altera_Forum · ‎09-14-2017

Thank you, HRZ

But if I use volatile (__global volatile float *restrict bottom). I compiler always can't to generate aocx file on the final stage of compilation.

I get next error

Error: Specified licence does not contain information required to run the Quartus Prime software.

Error: Quartus Prime Compiler Database Interface was unsuccessful

If I delete this tag compilation is become successful.

Altera_Forum · ‎09-14-2017

That is very strange, I have never seen such thing before. I recommend opening a service request with Altera and asking about your license issue; I am afraid they are the only ones who can help you with license issues.