Hi Everyone,I have single Work item kernel which consumes less resource on board. Now I want to replicate the same kernel 2 or 4 times on to the board to improve the performance. I have a for loop which runs for 2400 times, now I want to divide the loop into two/four compute units, so that each CU can do a loop of 1200/600 iterations. NOTE: I can't use NDRange kernel for dependencies in my loop. I have explored the following options from Intel programming guide. 1. num_compute_units: Have increased the compute units from 1 to 2 for single work item kernel, resource got increased but there was no improvement in perfromance. Later in forums it was mentioned that "a single work-group kernel (i.e. no local_id in the kernel) will not at all benefit from num_compute_units, which is probably the reason why the original poster could not achieve any performance improvement." Link: https://www.alteraforum.com/forum/showthread.php?t=51783&highlight=num_compute_units 2. But in the Altera programming guide it says "You can replicate your single work-item OpenCL kernel by including the num_compute_units(X,Y,Z) kernel attribute" 3. The other option would be to use get_compute_unit, but it requires the kernel to be a autorun and I need to use channels for that which would again increase the resource utilization. The programming guide says "to create compute units that are slightly different from one another but share a lot of common code, call the get_compute_id()" In my case all compute units would remain the same and would not differ. Can any one please help me on how I can improve my single Work Item kernel performance by replication or any other ways. Thanks
There are two version of "num_compute_units()"; one is num_compute_units(X) (one dimensional) which is for ndrange kernels, the other one is num_compute_units(X, Y, Z) (three-dimensional) which is for single work-item autorun kernels that need to be coupled with get_compute_id(). The former is the one that is referenced in the forum thread you mentioned (note that I was talking about single work-group and not single work-item kernels in that thread) and the latter is the one that is mentioned in Altera's documents. There is no way to automatically replicate non-autorun single work-item kernels.However, in your case, why don't you just partially unroll your loop? Loop unrolling is the most area-efficient way of improving performance.