Replication of single work item kernel to increase the performance

Altera_Forum · 03-29-2018

Hi Everyone,

I have a single work-item kernel that uses only a small share of the board's resources. I now want to replicate the same kernel 2 or 4 times on the board to improve performance.

The kernel has a for loop that runs 2400 times. I want to divide the loop across two or four compute units (CUs), so that each CU handles 1200 or 600 iterations.

NOTE: I can't use an NDRange kernel because of dependencies in my loop.

I have explored the following options from the Intel programming guide.

1. num_compute_units: I increased the number of compute units from 1 to 2 for the single work-item kernel. Resource usage increased, but there was no improvement in performance.

Later I found a forum post stating that "a single work-group kernel (i.e. no local_id in the kernel) will not at all benefit from num_compute_units, which is probably the reason why the original poster could not achieve any performance improvement."

Link: https://www.alteraforum.com/forum/showthread.php?t=51783&highlight=num_compute_units

2. However, the Altera programming guide says "You can replicate your single work-item OpenCL kernel by including the num_compute_units(X,Y,Z) kernel attribute"

3. The other option would be to use get_compute_id(), but it requires the kernel to be an autorun kernel, and I would need to use channels for that, which would again increase resource utilization. The programming guide says "to create compute units that are slightly different from one another but share a lot of common code, call the get_compute_id()"

In my case, all compute units would be identical.

Can anyone please help me improve the performance of my single work-item kernel, by replication or any other means?

Thanks

Altera_Forum · 03-29-2018

There are two versions of "num_compute_units()". One is num_compute_units(X) (one-dimensional), which is for NDRange kernels. The other is num_compute_units(X, Y, Z) (three-dimensional), which is for single work-item autorun kernels and needs to be coupled with get_compute_id(). The former is the one referenced in the forum thread you mentioned (note that I was talking about single work-group, not single work-item, kernels in that thread), and the latter is the one mentioned in Altera's documents. There is no way to automatically replicate non-autorun single work-item kernels.

However, in your case, why don't you just partially unroll your loop? Loop unrolling is the most area-efficient way of improving performance.
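For anyone unfamiliar with partial unrolling, here is a minimal sketch in plain C of what an unroll factor of 4 does to a 2400-iteration loop. In the Intel FPGA OpenCL compiler you would normally request this with `#pragma unroll 4` above the loop rather than writing it out by hand. The loop body `body()`, the factor `UNROLL`, and the function names are made up for illustration, and the body is assumed to have no dependence between iterations.

```c
#include <assert.h>
#include <stddef.h>

#define N 2400      /* trip count quoted in the thread */
#define UNROLL 4    /* hypothetical unroll factor, chosen for illustration */

/* Hypothetical loop body: each output depends only on its own input. */
static int body(int x) { return 2 * x + 1; }

/* Reference version: one element per loop trip, 2400 trips. */
void run_rolled(const int *in, int *out) {
    for (size_t i = 0; i < N; ++i)
        out[i] = body(in[i]);
}

/* Hand-unrolled by 4: 600 trips, four independent results per trip.
 * On an FPGA the four statements can become four parallel datapaths. */
void run_unrolled(const int *in, int *out) {
    assert(N % UNROLL == 0);  /* this sketch has no remainder loop */
    for (size_t i = 0; i < N; i += UNROLL) {
        out[i + 0] = body(in[i + 0]);
        out[i + 1] = body(in[i + 1]);
        out[i + 2] = body(in[i + 2]);
        out[i + 3] = body(in[i + 3]);
    }
}
```

Both functions produce the same output array. The unrolled one simply does four times the work per trip, which is why it costs extra area only for the loop body rather than for a whole extra copy of the kernel.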

Altera_Forum · 04-05-2018

Hi HRZ,

Thanks for the clarification, it was really helpful.

I can't use partial unrolling because I have dependencies in my loops.
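To illustrate the kind of dependency that rules out a simple split of the 2400 iterations, here is a sketch in plain C. It assumes a hypothetical loop-carried update `step()`, which is not the actual kernel from this thread; the constant and all function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define N 2400  /* trip count quoted in the thread */

/* Hypothetical loop-carried update: the new state needs the previous state.
 * Unsigned arithmetic wraps around, so there is no undefined behaviour. */
static uint32_t step(uint32_t state, uint32_t i) {
    return state * 65537u + i;
}

/* Runs iterations [begin, end) starting from the given state. */
uint32_t run_range(uint32_t state, uint32_t begin, uint32_t end) {
    for (uint32_t i = begin; i < end; ++i)
        state = step(state, i);
    return state;
}

/* Whole loop on one unit. */
uint32_t run_serial(uint32_t seed) {
    return run_range(seed, 0, N);
}

/* Naive split: the second half starts from the seed, as an independent
 * compute unit would, so it never sees the state left by the first half. */
uint32_t run_split_naive(uint32_t seed) {
    (void)run_range(seed, 0, N / 2);
    return run_range(seed, N / 2, N);
}

/* Chained split: the second half waits for the first half's final state.
 * The result is correct, but the two halves cannot overlap in time. */
uint32_t run_split_chained(uint32_t seed) {
    uint32_t mid = run_range(seed, 0, N / 2);
    return run_range(mid, N / 2, N);
}
```

With this update, `run_split_chained` matches `run_serial`, while `run_split_naive` does not. The chained version is the only correct split, and it runs the halves one after the other, so two compute units would give no speedup for a loop of this shape.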