
Scaling up vector add example

Altera_Forum
Honored Contributor II

I'm trying to scale up the vector add example to use more FPGA resources.  

Kernel is unmodified: 

 

__kernel void vector_add(__global const float * restrict x,
                         __global const float * restrict y,
                         __global float * restrict z) {
    // get index of the work item
    int index = get_global_id(0);
    // add the vector elements
    z[index] = x[index] + y[index];
}

 

I tried increasing the work group size:  

__attribute__((reqd_work_group_size(1024,1,1))) 

 

However, aoc reports the same device utilization regardless of the size I use.  

The optimization guide implies that by specifying a work group size, the compiler will build the hardware for that work group size, which would scale up the design. Is that true?

 

Alternatively, I can vectorize or increase the number of compute units to scale up the design, but vectorization (num_simd_work_items) is limited to 16, and compute units seem to come with a lot of overhead.
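
For reference, this is roughly what the attribute combination looks like on the kernel (the SIMD and compute-unit factors here are just illustrative values, not something from my actual build):

// Illustrative sketch; the factors 16 and 2 are examples only.
__attribute__((num_simd_work_items(16)))        // must evenly divide the required work group size
__attribute__((num_compute_units(2)))           // replicates the entire kernel pipeline
__attribute__((reqd_work_group_size(1024,1,1)))
__kernel void vector_add(__global const float * restrict x,
                         __global const float * restrict y,
                         __global float * restrict z) {
    int index = get_global_id(0);
    z[index] = x[index] + y[index];
}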

 

So:  

What does the reqd_work_group_size attribute do exactly?  

What's the best way to scale up a simple NDRange kernel like this?
Altera_Forum
Honored Contributor II

So the work group size specifies how many work-items each work group handles. It's essentially a way to partition the work-items you need to process, not necessarily a way to change the kernel's overall hardware. By partitioning your work-items into work groups, the work-items within a group can communicate with one another through local memory that is shared between them.
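
As a minimal sketch of that mechanism (the kernel and the within-group reversal are an example I made up, assuming the work group size is at most 1024):

__kernel void reverse_in_group(__global const float * restrict in,
                               __global float * restrict out) {
    __local float tile[1024];             // shared by all work-items in this work group
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    tile[lid] = in[gid];                  // each work-item writes its own slot
    barrier(CLK_LOCAL_MEM_FENCE);         // wait for the whole group to finish writing
    // read a slot written by a different work-item in the same group
    out[gid] = tile[get_local_size(0) - 1 - lid];
}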

 

The ways to speed it up that I am aware of are the ones you mentioned: increasing the number of compute units or specifying the number of SIMD work items. The thing you have to realize is that the data all comes from global memory and is accessed element by element. Depending on the application, a kernel can be either compute bound or memory bound. Since your kernel is a simple vector add, it easily becomes memory bound: the computation is trivial, so the hardware can compute results faster than it can fetch the operands.
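
A rough back-of-the-envelope way to see this (the bandwidth figure is made up for the example): each output element needs two 4-byte loads and one 4-byte store, i.e. 12 bytes of global memory traffic per addition. At, say, 25 GB/s of memory bandwidth, that caps the kernel at about 2 billion additions per second, no matter how wide you make the datapath.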

 

EDIT: One thing I also want to add is that you can try experimenting with loop unrolling. Loop unrolling (as long as the iterations are data-independent) essentially creates multiple instances of the computation inside the for loop. However, realize that this can impact memory access efficiency, since each of those computations requires its own load and store from global memory.
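
A minimal sketch of what that looks like (the unroll factor of 8 is arbitrary, and the host would then launch N/8 work-items instead of N):

#define UNROLL 8   // illustrative factor
__kernel void vector_add_unrolled(__global const float * restrict x,
                                  __global const float * restrict y,
                                  __global float * restrict z) {
    int base = get_global_id(0) * UNROLL;
    #pragma unroll
    for (int i = 0; i < UNROLL; i++) {
        // iterations are independent, so aoc can replicate the adder
        // and coalesce the consecutive loads/stores into wider accesses
        z[base + i] = x[base + i] + y[base + i];
    }
}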