
Extreme Performance Drop while using local memory

Altera_Forum
Honored Contributor II

I have two simple kernels that both do the same thing; the only difference is that one of them reads its data from global memory, while the other first copies the data into local memory and reads it from there. Here are my two kernels:

 

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void Test11(__global float *data, __global float *rands, int index, int rand_max) {
    float2 temp;
    int gid = get_global_id(0);

    temp = data[gid];              // scalar is broadcast into both vector lanes

    #pragma unroll
    for (int i = 1; i < 500; i++) {
        temp = rands[i] * temp;    // one global read of rands per iteration
    }

    data[gid] = temp.s0;
}

and, 

 

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void test11(__global float *data, __global float *rands, int index, int rand_max) {
    float2 temp;

    __local float localrands[500];
    int depth = 500;

    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int localworksize = get_local_size(0);
    int workitemcopyportion = depth / localworksize;

    // each work-item copies its own portion of rands into local memory
    event_t event = async_work_group_copy(localrands,
        &rands[lid * workitemcopyportion],
        (depth - lid * workitemcopyportion < workitemcopyportion)
            ? (depth - lid * workitemcopyportion)
            : workitemcopyportion,
        0);
    wait_group_events(1, &event);

    temp = data[gid];

    #pragma unroll
    for (int i = 1; i < 500; i++) {
        temp = localrands[i] * temp;   // read from local memory instead of global
    }

    data[gid] = temp.s0;
}


As for the OpenCL launch parameters, I set the total number of work-items to 1048576 and the work-group size to 128. Looking at the results, I see that the kernel which reads directly from global memory achieves around 227 GFLOPS, while the one which first copies the data to local memory and then reads it from there achieves only around 23 GFLOPS.
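
In the host code, the launch boils down to something like the following (a minimal sketch; the launch_test11 helper name and the queue/kernel handles are placeholders for my actual setup):

#include <CL/cl.h>

/* Hypothetical helper showing the launch configuration described above:
   1048576 total work-items split into work-groups of 128. */
cl_int launch_test11(cl_command_queue queue, cl_kernel kernel)
{
    size_t global_size = 1048576;  /* total number of work-items */
    size_t local_size  = 128;      /* work-group size */
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global_size, &local_size,
                                  0, NULL, NULL);
}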

 

Now, doing the same thing on a GPU, the kernel that utilizes local memory achieves around 2.5x higher performance, but on the FPGA I see a severe performance drop. Can anyone help me understand what is going wrong in this design?

 

Thanks,
Altera_Forum
Honored Contributor II

Fully unrolling a loop with 500 iterations does not make sense, since there is simply not enough off-chip memory bandwidth to support that many memory accesses. A quick look at the report shows that the unrolled loop creates 32 read ports, and if you profile the kernel you will probably see a huge amount of stalling on off-chip accesses. Since your kernel performs two off-chip memory reads and one write per cycle, an unroll factor between 8 and 16 should fully utilize the memory bandwidth.

There is also no need to copy anything to local memory in this case, since the accesses to the "rands" input are consecutive and can be coalesced at compile time. Your bottleneck here is off-chip memory bandwidth; copying data from global to local memory does nothing except lower your performance by breaking the compile-time access coalescing, increasing global memory contention, and wasting numerous cycles on the barrier.
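
For example, something along these lines should perform much better (a rough sketch based on your first kernel; the kernel name is just a placeholder, and you should tune the unroll factor using the report and the profiler):

// Partial unroll: 8 copies of the loop body per cycle instead of 500,
// reading "rands" directly from global memory so that the consecutive
// accesses can be coalesced at compile time.
__kernel void test11_partial(__global float *data, __global float *rands, int index, int rand_max) {
    int gid = get_global_id(0);
    float2 temp = data[gid];

    #pragma unroll 8
    for (int i = 1; i < 500; i++) {
        temp = rands[i] * temp;
    }

    data[gid] = temp.s0;
}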

Altera_Forum
Honored Contributor II

Thanks very much for the reply, HRZ.
