Memory replication OpenCL 16.1

Altera_Forum · ‎02-02-2017

I'm stumped by the needless replication of __local memory in a trivial OpenCL kernel. The local array has only one read and one write. The kernel only ever runs as a single work-group, but I'm not sure what other directive would tell the compiler that and prevent the needless triplication of local memory.

Here's my simple kernel.

__attribute__((reqd_work_group_size(16384,1,1)))
__attribute__((max_work_group_size(16384)))
__kernel void test(__global int* restrict x, __global int* restrict y) {
    __local int x_l[16384];
    int i = get_global_id(0);
    x_l[i] = x[i];
    barrier(CLK_LOCAL_MEM_FENCE);
    y[i] = x_l[i] * x_l[i];
}

Here's a snippet of the aocl report.html

test.cl:4 (x_l):
- Local memory: Good but replicated. Requested size 65536 bytes (rounded up to nearest power of 2), implemented size 196608 bytes, replicated 3 times total, stall-free, 1 read and 1 write. Additional information:
  - Replicated 3 times to efficiently support multiple simultaneous workgroups. This replication resulted in 4 times increase in actual block RAM usage. Reducing the number of barriers or increasing max_work_group_size may help reduce this replication factor.

Altera_Forum · ‎02-02-2017

AOC does not understand the concept of "single work-group" kernels (not to be confused with "single work-item" kernels), even if the kernel uses only get_local_id() and nothing else. By default, it always assumes that a certain number of work-groups will run in parallel in each compute unit, so you always get some extra replication factor for local memory instances (usually 3, but I have seen up to 45!!!) that you might not need. From the Best Practices Guide v16.0, page 1-30 (https://www.altera.com/en_us/pdfs/literature/hb/opencl-sdk/archives/ug-aocl-best-practices-guide-16.0.pdf); this passage was removed in 16.1:

--- Quote Start ---

Number of simultaneous work-groups is the maximum number of work-groups that the kernel can process at the same time. To increase throughput, the kernel might execute threads from different work-groups simultaneously (that is, that kernel does not wait to fully complete one work-group before starting another work-group). If a kernel can process multiple simultaneous work-groups and has local memory, the size of the local memory must increase to store data from each simultaneous work-group. This local memory replication might increase the usage of block RAM.

Currently, you do not have the ability to modify the number of simultaneous work-groups directly.

--- Quote End ---

I don't think there is any work-around for this (at least not a public one) but you could probably open a service ticket directly with Altera and ask them.
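
For what it's worth, the sizes in the report can be reproduced with simple arithmetic. The sketch below is plain host-side C. The helper names are made up for illustration. The model (requested size rounded up to a power of two, then multiplied by the replication factor) is an assumption that happens to match the numbers in this report, not a documented description of how AOC sizes memories.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next power of two (for n >= 1). */
static size_t round_up_pow2(size_t n) {
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

/* "Requested size" as the report words it: array bytes, rounded up. */
static size_t requested_bytes(size_t elems, size_t elem_size) {
    return round_up_pow2(elems * elem_size);
}

/* Assumed model: one full copy of the array per replica. */
static size_t implemented_bytes(size_t elems, size_t elem_size,
                                size_t replication) {
    assert(replication >= 1);
    return requested_bytes(elems, elem_size) * replication;
}
```

With 16384 elements of 4 bytes each, requested_bytes gives 65536, and a replication factor of 3 gives 196608, the same two figures as in the report. The report's "4 times increase" is not reproduced by this arithmetic.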

Altera_Forum · ‎02-03-2017

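The barrier is unnecessary here, since each work-item reads back only the element it wrote itself. Removing it (as suggested by the report's hint about reducing the number of barriers) might fix your problem.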
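
A minimal sketch of the kernel with the barrier dropped is below. The kernel name test_no_barrier, the WG_ATTRS macro and the whole #else branch are assumptions added for illustration. The #else branch is a hypothetical plain-C shim (not part of any SDK) that lets the body be sanity-checked on a host one emulated work-item at a time. Whether this actually lowers the replication factor is not confirmed anywhere in this thread, so check report.html after recompiling.

```c
#ifdef __OPENCL_VERSION__
/* Building as a real OpenCL kernel: keep the attributes from the original post. */
#define WG_ATTRS __attribute__((reqd_work_group_size(16384, 1, 1))) \
                 __attribute__((max_work_group_size(16384)))
#else
/* Hypothetical plain-C shim (an assumption, for host-side checking only). */
#include <stddef.h>
#define WG_ATTRS
#define __kernel
#define __global
#define __local
static size_t host_item_id;  /* set by the caller before each emulated call */
static size_t get_global_id(unsigned int dim) { (void)dim; return host_item_id; }
#endif

WG_ATTRS
__kernel void test_no_barrier(__global int* restrict x, __global int* restrict y) {
    __local int x_l[16384];
    size_t i = get_global_id(0);
    x_l[i] = x[i];
    /* No barrier: this work-item reads back only the element it just wrote. */
    y[i] = x_l[i] * x_l[i];
}
```

This is only safe because no work-item reads an element written by another one. If the real kernel has cross-work-item access to x_l, the barrier has to stay. For the kernel exactly as posted, a private int would do the same job and avoid local memory altogether.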