Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
Example showing multiple compute units providing speedup

Altera_Forum
Honored Contributor II

Hi,

Is there an example of a kernel that performs better with multiple compute units? I have experimented with various kernels, the simplest being a floating-point multiplication over an array of double-precision floats. I have not found any kernel whose execution time actually improves with __attribute__((num_compute_units(N))). My profiler shows that global memory bandwidth drops severely after applying this OpenCL attribute, even though my access patterns are not complex. Does Altera have a working example of a kernel that benefits from multiple compute units? Has anyone gotten it to work before? I would appreciate any feedback or examples that show a performance benefit from __attribute__((num_compute_units(N))).
This is an example of what I tried to run on AOCL. This code achieves 2,200 MB/s of global memory bandwidth with a single compute unit; with two compute units, the bandwidth drops to 480 MB/s. Why is there such a vast difference?

#define CU 2  /* number of compute units; e.g. pass -DCU=2 when compiling */

__attribute__((num_compute_units(CU)))
__kernel void vector_add(__global double * restrict x)
{
    // get the index of the work-item
    int id = get_global_id(0);
    x[id] = x[id] * x[id];
}

I have tried the SIMD option, which does improve performance, but I also want to test the performance of multiple compute units, which thus far has been unsuccessful.

Would appreciate a simple, straightforward working example to build on and investigate.
3 Replies
Altera_Forum
Honored Contributor II

@nadram and forum admins: Have you received any example of multiple compute units? I am most interested in how to allocate input data in global memory for the individual compute units. I did some basic kernel simulations in ModelSim, and it looks like all compute units try to read the same base address on the global DDR bus. That would imply they can only work on the same input data. Of course, my simulation setup may be wrong. It would be good to have such an example, because quite a few questions here about the use of num_compute_units have gone unanswered for a long time.

Altera_Forum
Honored Contributor II

num_compute_units is one of the most basic attributes provided by Altera's compiler, and examples of its usage are available in Altera's OpenCL documents. When num_compute_units is used with NDRange kernels, the compiler internally replicates the kernel pipeline the number of times specified by the user, so that multiple work-groups can run in parallel. Altera recommends having at least three times as many work-groups as compute-unit replicas in order to fully utilize the circuit. It goes without saying that all of the replicas are exactly the same: they access the same memory buffers and perform the exact same operations; they simply allow the run-time scheduler to schedule more threads in parallel, drawn from different work-groups. All of this happens automatically, without user intervention.
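As a concrete sketch of the replication described above (the replica count and the work-group size here are illustrative assumptions, not values from this thread):

```c
// Hypothetical sketch: the offline compiler instantiates NUM_CUS copies
// of this pipeline; the hardware scheduler then dispatches whole
// work-groups from the NDRange onto whichever replica is free.
#define NUM_CUS 2

__attribute__((num_compute_units(NUM_CUS)))
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void square(__global double * restrict x)
{
    int id = get_global_id(0);   // global work-item index
    x[id] = x[id] * x[id];
}
```

Nothing changes on the host side: a single clEnqueueNDRangeKernel call with, say, a global size of 1,048,576 and a local size of 256 produces 4,096 work-groups for the scheduler to spread across the replicas.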

 

If you want multiple kernels that perform different operations or access different memory buffers, you have to define them as separate kernels. 

 

P.S. Regarding the original post: I have used num_compute_units numerous times, and as long as the on-board memory bandwidth is not saturated, it certainly leads to a performance improvement. The key is to have many work-groups; a single-work-group kernel (i.e. one with no local_id in the kernel) will not benefit at all from num_compute_units, which is probably why the original poster could not achieve any performance improvement.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---

Is there an example of a kernel that performs better with the use of multiple compute units? [...] Would appreciate a simple and straightforward working example to build/investigate on.

--- Quote End ---

 

 

It is important that you size the work-groups properly when applying the compute-unit attribute in order to get an increase in performance. The Altera SDK for OpenCL Best Practices Guide (https://www.altera.com/en_us/pdfs/literature/hb/opencl-sdk/aocl-best-practices-guide.pdf) discusses at length how to use the compute-unit attribute. That said, the improvement you get with SIMD is generally larger than what you achieve by replicating compute units. One key reason is the increase in memory contention with multiple compute units. On the other hand, as the manual notes, the SIMD attribute also allows the offline compiler to coalesce memory accesses, which could be a key reason for its better performance.
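For comparison, a minimal sketch of the SIMD route mentioned above (the SIMD width of 16 and the work-group size are illustrative assumptions, not values from this thread):

```c
// Hypothetical sketch: num_simd_work_items vectorizes the datapath, so
// the 16 work-items' consecutive loads/stores can be coalesced into
// wider contiguous memory transactions. Compute-unit replication, by
// contrast, issues independent access streams that contend for the
// same global memory.
__attribute__((num_simd_work_items(16)))
__attribute__((reqd_work_group_size(256, 1, 1)))  // must be divisible by the SIMD width
__kernel void vector_square_simd(__global double * restrict x)
{
    int id = get_global_id(0);
    x[id] = x[id] * x[id];
}
```

Note that num_simd_work_items requires reqd_work_group_size to be specified, and the work-group size must be evenly divisible by the SIMD width.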