Honored Contributor I

Replication of single work item kernel to increase the performance

Hi Everyone, 

I have a single work-item kernel that uses only a small share of the board's resources. I now want to replicate the same kernel 2 or 4 times on the board to improve performance.

I have a for loop that runs 2400 times, and I want to divide it across two or four compute units so that each CU runs 1200 or 600 iterations.

NOTE: I can't use an NDRange kernel because of dependencies in my loop.


I have explored the following options from the Intel programming guide.

1. num_compute_units: I increased the number of compute units from 1 to 2 for the single work-item kernel. Resource usage increased, but there was no improvement in performance.

Later, I found a forum thread which mentioned that "a single work-group kernel (i.e. no local_id in the kernel) will not at all benefit from num_compute_units, which is probably the reason why the original poster could not achieve any performance improvement."



2. But the Altera programming guide says "You can replicate your single work-item OpenCL kernel by including the num_compute_units(X,Y,Z) kernel attribute".


3. The other option would be to use get_compute_id(), but it requires the kernel to be an autorun kernel, and I would need to use channels for that, which would again increase resource utilization. The programming guide says "to create compute units that are slightly different from one another but share a lot of common code, call the get_compute_id()".

In my case, all compute units would be identical.




Can anyone please help me improve the performance of my single work-item kernel, by replication or any other means?





2 Replies
Honored Contributor I

There are two versions of num_compute_units(): num_compute_units(X) (one-dimensional), which is for NDRange kernels, and num_compute_units(X, Y, Z) (three-dimensional), which is for single work-item autorun kernels and needs to be coupled with get_compute_id(). The former is the one referenced in the forum thread you mentioned (note that I was talking about single work-group kernels, not single work-item kernels, in that thread), and the latter is the one mentioned in Altera's documents. There is no way to automatically replicate non-autorun single work-item kernels.
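For illustration only, here is a rough plain-C sketch of how each replica could work out its own slice of a loop from its compute ID. The helper chunk_for_cu and the type chunk_t are hypothetical names, not part of any SDK; cu_id stands in for the value get_compute_id(0) would return inside an autorun kernel, and num_cu for the X in num_compute_units(X, 1, 1). It also assumes the slices can be processed independently, which may not hold if iterations depend on each other.

```c
/* Hypothetical helper, not an SDK function: the half-open iteration
 * range [begin, end) that one replica would process. */
typedef struct {
    int begin;
    int end;
} chunk_t;

/* cu_id  : stand-in for get_compute_id(0) in an autorun kernel (assumption)
 * num_cu : stand-in for X in num_compute_units(X, 1, 1) (assumption) */
static chunk_t chunk_for_cu(int total_iters, int num_cu, int cu_id)
{
    int base  = total_iters / num_cu;  /* iterations every replica gets */
    int extra = total_iters % num_cu;  /* leftovers go to the lowest IDs */
    chunk_t c;

    c.begin = cu_id * base + (cu_id < extra ? cu_id : extra);
    c.end   = c.begin + base + (cu_id < extra ? 1 : 0);
    return c;
}
```

With 2400 iterations and two replicas, this gives [0, 1200) for ID 0 and [1200, 2400) for ID 1.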


However, in your case, why don't you just partially unroll your loop? Loop unrolling is the most area-efficient way of improving performance.
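As a minimal sketch of what partial unrolling looks like, the loop below is written as a plain C function so it can be compiled on a host; in an OpenCL kernel the same #pragma unroll line would sit directly above the loop. The function name scale_offset and its parameters are made up for illustration, and the example assumes the iterations do not depend on each other.

```c
/* Illustrative only: out[i] = a * in[i] + b for n elements.
 * No iteration reads a value written by another iteration. */
void scale_offset(const int *in, int *out, int n, int a, int b)
{
    /* Ask the compiler to replicate the loop body 4 times per trip.
     * Host compilers that do not know this pragma simply ignore it. */
    #pragma unroll 4
    for (int i = 0; i < n; i++) {
        out[i] = a * in[i] + b;
    }
}
```

The unroll factor is a tuning knob: a larger factor processes more iterations per clock cycle at the cost of more area, and it pays off only if memory bandwidth and loop dependencies allow it.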
Honored Contributor I

Hi HRZ, 

Thanks for the clarification, it was really helpful.  

I can't use partial unrolling because I have dependencies in my loops.