Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Altera_Forum
Honored Contributor I
1,104 Views

How to improve memory bandwidth

Hi, I want to do empty kernels. 

First block read data from memory, send this data to channel. 

Second block receive data from channel and write this data to memory. 

 

I execute this code on Arria10 DeVKit 

But information of bandwidth in profile is just ~3000 Mb/sec, but max bandwidth is 12800 Mb/sec 

 

How can I improve this result ?  

 

# pragma OPENCL_EXTENSION cl_altera_channels : enable # define THREAD_SIZE 1# define VEC_SIZE 128 typedef struct{ float data; } lane_data; channel lane_data data_ch; __kernel void memRead( int data_dim1, int data_dim2, int data_dim3, __global float *restrict bottom ) { lane_data buff; int size = data_dim1*data_dim2*data_dim3; for(int ll=0; ll<size/THREAD_SIZE/VEC_SIZE; ll++){ # pragma unroll for(int i=0; i<THREAD_SIZE; i++){ # pragma unroll for(int j=0; j<VEC_SIZE; j++){ buff.data = bottom; } write_channel_altera(data_ch, buff); } } } __kernel void memWrite( int data_dim1, int data_dim2, int data_dim3, __global float *restrict result ) { lane_data buff; int size = data_dim1*data_dim2*data_dim3; for(int ll=0; ll<size/THREAD_SIZE/VEC_SIZE; ll++){ # pragma unroll for(int i=0; i<THREAD_SIZE; i++){ buff = read_channel_altera(data_ch); # pragma unroll for(int j=0; j<VEC_SIZE; j++){ result = buff.data; } } } }  

 

thank you
0 Kudos
3 Replies
Altera_Forum
Honored Contributor I
72 Views

If you take a look at the report, the reason is pretty obvious. The compiler is being stupid and splitting your read access into 8x 512-bit simple accesses and also 3x 32-bit prefetching accesses (no idea what the hell this is), instead of inferring a single 4096-bit coalesced access like the write one. Because of this, you have 12 ports going to memory instead of 2. It goes without saying that this configurations results in a huge amount of contention on the memory bus and significantly reduces your memory performance. 

 

If you add the volatile tag to your input (__global volatile float *restrict bottom), you will also get one single 4096-bit access for the read which will likely allow you to achieve close to peak performance. 

 

Needless to say, since the devkit only has one memory bank, you should be able to achieve full bandwidth with a total access size of 512 bits (read + write), so a vector size of 8 or 16 should be enough in your case.
Altera_Forum
Honored Contributor I
72 Views

Thank you, HRZ 

But if I use volatile (__global volatile float *restrict bottom). I compiler always can't to generate aocx file on the final stage of compilation. 

 

I get next error 

 

Error: Specified licence does not contain information required to run the Quartus Prime software. 

Error: Quartus Prime Compiler Database Interface was unsuccessful 

 

If I delete this tag compilation is become successful.
Altera_Forum
Honored Contributor I
72 Views

That is very strange, I have never seen such thing before. I recommend opening a service request with Altera and asking about your license issue; I am afraid they are the only ones who can help you with license issues.

Reply