mvemp
Novice

What happens to global memory bandwidth when multiple OpenCL kernels read and write to DRAM simultaneously?


Hello,

 

I have a basic question about how the OpenCL compiler handles global memory accesses across different OpenCL kernels.

 

For example:

__kernel void input1(__global int *r1) {
    // ...
}

__kernel void input2(__global char *r2) {
    // ...
}

__kernel void output(__global short *r3) {
    // ...
}

 

When I launch the above three kernels in parallel, is the memory bandwidth shared equally, or does it depend on the number of memory accesses? Does it depend on the size of the data type?

 

In general, what is the maximum number of read and write ports from the FPGA to global memory?

 


5 Replies
HRZ
Valued Contributor II

Memory bandwidth used by each access most definitely depends on the size and frequency of each access. However, when you have multiple accesses in different kernels (rather than multiple accesses in the same kernel), the accesses are more likely to collide and result in wasted bandwidth since the compiler does not optimize accesses across kernels.

 

I don't think there is any limit to the number of ports from the FPGA to external memory. But of course the memory bandwidth is limited and if it gets saturated, having more ports will result in more contention on the memory bus and lower performance.
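To illustrate the cross-kernel point above, here is a hypothetical sketch (kernel and argument names are made up). In the fused version the compiler sees both global reads in one pipeline and can schedule their bursts together; split across two kernels, the same reads are scheduled independently and can collide on the memory bus:

```c
// Hypothetical example: two separate kernels whose global reads are
// scheduled independently and may contend on the shared memory bus.
__kernel void read_a(__global const int *restrict a, __global int *restrict out) {
    size_t i = get_global_id(0);
    out[i] = a[i];
}

__kernel void read_b(__global const int *restrict b, __global int *restrict out) {
    size_t i = get_global_id(0);
    out[i] += b[i];
}

// Fused alternative: both reads live in one kernel, so the compiler can
// schedule them within a single pipeline instead of across two.
__kernel void read_ab(__global const int *restrict a,
                      __global const int *restrict b,
                      __global int *restrict out) {
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
}
```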

mvemp
Novice

When I compile the code below, I don't get the maximum memory bandwidth. I am using an Arria 10 GX, which supports up to 19.2 GB/s.

 

#define USE_ROM
#pragma OPENCL EXTENSION cl_intel_channels : enable

typedef char DPTYPE;

#define VEC_SIZE 512

typedef struct {
    DPTYPE data[VEC_SIZE];
} lane_data;

channel lane_data data_ch __attribute__((depth(0)));

__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void FetchData(__global lane_data *restrict bottom)
{
    lane_data data_vec;
    for (unsigned int win_itm_xyz = 0; win_itm_xyz < 13 * 13 * 4096 / VEC_SIZE; win_itm_xyz++) {
        data_vec = bottom[win_itm_xyz];
        write_channel_intel(data_ch, data_vec);
    }
}

__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void WriteBack(__global lane_data *restrict top)
{
    lane_data output;
    for (uint dd = 0; dd < 13 * 13 * 4096 / VEC_SIZE; dd++) {
        output = read_channel_intel(data_ch);
        top[dd] = output;
    }
}

 

 

For the line "data_vec = bottom[win_itm_xyz];", the profiler shows 27.9% stall, 1.8% occupancy, and ~3000 MB/s.

For the line "top[dd] = output;", the profiler shows 51.25% stall, 5.8% occupancy, and ~10000 MB/s.

 

How can I further improve my memory bandwidth in this case?

 

 

 

 

 

HRZ
Valued Contributor II

Your vector size is too large. Since your board only has one memory bank, depending on the kernel operating frequency, the memory bandwidth will be saturated with a vector size of 32 or 64. You should, however, be getting close to the peak memory throughput of the device as it is. Why do you say you are not getting the maximum bandwidth? The numbers reported by the profiler are far from reliable. To see whether you are getting the maximum memory bandwidth or not, measure your kernel run time (with profiling disabled) and divide the total amount of data transferred between the FPGA and its memory in your code by the run time, to get the kernel throughput.

 

The following is a thread in which I previously discussed memory throughput:

 

https://forums.intel.com/s/question/0D50P00003yyTckSAE/loopunrolling-and-memory-access-performance?l...

mvemp
Novice

Hello HRZ,

 

My kernel run time is around 0.104 ms with the profiling option removed.

 

That means my memory bandwidth is (2 x 13 x 13 x 4096 bytes) / 0.104 ms = 13.31 GB/s (read + write combined). But I think there is still some room for improvement. Is this because of memory arbitration between read and write transfers?

 

Do you think the Arria 10 GX dev kit can run with two DDR banks? The spec page mentions

"2GB DDR4 SDRAM, 2GB DDR3 SDRAM, and RLDRAM3 (16 Meg x 36) daughtercards"

(https://www.intel.com/content/www/us/en/programmable/products/boards_and_kits/dev-kits/altera/kit-a1...)

 

 

 

 

 

HRZ
Valued Contributor II

First you have to pay attention to the fact that you are calculating the throughput of your kernel in GiB/s, but comparing it with the theoretical peak throughput of the board in GB/s. The peak throughput of the board in GiB/s is around 17.9 GiB/s. The board indeed has two memory banks; however, only the DDR4 bank is supported in the OpenCL BSP. Unless you are willing to modify the BSP yourself to add support for the DDR3 bank, you are not going to be able to use it with OpenCL.

 

I have a set of recommendations that might help you get closer to the peak throughput:

 

1- Your kernel run time is too short to allow accurate timing measurement. Chances are, a big portion of the time you are measuring is the kernel launch overhead. I recommend increasing your input size so that kernel run time is at least a few hundred milliseconds.

 

2- Make sure you are only timing the kernel execution, and the functions used to set the kernel arguments or transfer data between the host and the device are outside of the timing region.

 

3- Try reducing your vector size to 32 or 64 to avoid extra contention on the memory bus.

 

4- Try merging your two kernels into one or increasing your channel depth to avoid possible pipeline stalls caused by the channels.
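As a sketch of recommendation 4 (assuming the FetchData/WriteBack pair from earlier in the thread and the same lane_data/VEC_SIZE definitions), the two kernels could be fused into a single task kernel so that no channel is needed at all:

```c
// Hypothetical fused version of FetchData + WriteBack: a single task
// kernel copies directly from 'bottom' to 'top', removing the channel
// and any pipeline stalls it could introduce.
__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void CopyData(__global lane_data *restrict bottom,
              __global lane_data *restrict top)
{
    for (unsigned int i = 0; i < 13 * 13 * 4096 / VEC_SIZE; i++) {
        top[i] = bottom[i];
    }
}
```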

