Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Replication of single work item kernel to increase the performance

Honored Contributor II

Hi Everyone, 

I have a single work-item kernel that uses few resources on the board. I now want to replicate the same kernel 2 or 4 times on the board to improve performance.

I have a for loop that runs 2400 times, and I want to divide it across two or four compute units, so that each CU runs 1200 or 600 iterations.

NOTE: I can't use an NDRange kernel because of dependencies in my loop.


I have explored the following options from the Intel programming guide.

1. num_compute_units: I increased the number of compute units from 1 to 2 for the single work-item kernel. Resource usage increased, but there was no improvement in performance.

Later, a forum post mentioned that "a single work-group kernel (i.e. no local_id in the kernel) will not at all benefit from num_compute_units, which is probably the reason why the original poster could not achieve any performance improvement."



2. But the Altera programming guide says "You can replicate your single work-item OpenCL kernel by including the num_compute_units(X,Y,Z) kernel attribute"


3. The other option would be to use get_compute_id(), but it requires the kernel to be an autorun kernel, and I would need to use channels for that, which would again increase resource utilization. The programming guide says "to create compute units that are slightly different from one another but share a lot of common code, call the get_compute_id()"

In my case all compute units would be identical.




Can anyone please help me improve the performance of my single work-item kernel, by replication or by any other means?





Honored Contributor II

There are two versions of "num_compute_units()": one is num_compute_units(X) (one-dimensional), which is for NDRange kernels; the other is num_compute_units(X, Y, Z) (three-dimensional), which is for single work-item autorun kernels that need to be coupled with get_compute_id(). The former is the one referenced in the forum thread you mentioned (note that I was talking about single work-group and not single work-item kernels in that thread), and the latter is the one mentioned in Altera's documents. There is no way to automatically replicate non-autorun single work-item kernels.
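To make the replication idea concrete, here is a minimal sketch in plain C (a host-side model, not OpenCL kernel code) of how each replica could derive its own share of the 2400 iterations from its compute-unit index. The names `slice_t` and `slice_for_unit` are hypothetical; in an autorun kernel the index would presumably come from get_compute_id(0). Splitting the loop this way is only valid if no dependency crosses a slice boundary.

```c
#include <assert.h>

/* Half-open iteration range [begin, end) owned by one replica. */
typedef struct {
    int begin;
    int end;
} slice_t;

/* Hypothetical helper: split total_iters as evenly as possible across
 * num_units replicas and return the range for replica unit_id.
 * In a real autorun kernel, unit_id would come from get_compute_id(0). */
slice_t slice_for_unit(int total_iters, int num_units, int unit_id)
{
    assert(total_iters >= 0 && num_units > 0);
    assert(unit_id >= 0 && unit_id < num_units);

    int base = total_iters / num_units;  /* iterations every replica gets */
    int rem  = total_iters % num_units;  /* first `rem` replicas get one extra */
    int extra_before = unit_id < rem ? unit_id : rem;

    slice_t s;
    s.begin = unit_id * base + extra_before;
    s.end   = s.begin + base + (unit_id < rem ? 1 : 0);
    return s;
}
```

With 2400 iterations, two replicas get [0, 1200) and [1200, 2400), and four replicas get 600 iterations each, matching the split described in the question.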


However, in your case, why don't you just partially unroll your loop? Loop unrolling is the most area-efficient way of improving performance.
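As an illustration of partial unrolling, the sketch below shows in plain C what unrolling by a factor of 4 amounts to. In an OpenCL kernel this is normally requested with `#pragma unroll 4` in front of the loop rather than written by hand. The loop body here is hypothetical and has no loop-carried dependency, so it does not reflect the original poster's actual loop.

```c
#include <assert.h>

#define UNROLL 4  /* assumed unroll factor, chosen for illustration */

/* Reference loop: one element per iteration. */
void transform_rolled(const int *in, int *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = 3 * in[i] + 1;
}

/* Same loop partially unrolled by UNROLL: four independent element
 * updates per trip, so the trip count drops from n to n / UNROLL.
 * Assumes n is a multiple of UNROLL (2400 is). */
void transform_unrolled(const int *in, int *out, int n)
{
    assert(n % UNROLL == 0);
    for (int i = 0; i < n; i += UNROLL) {
        out[i]     = 3 * in[i]     + 1;
        out[i + 1] = 3 * in[i + 1] + 1;
        out[i + 2] = 3 * in[i + 2] + 1;
        out[i + 3] = 3 * in[i + 3] + 1;
    }
}
```

On an FPGA the four statements in the unrolled body can be built as parallel hardware, which is where the speedup comes from. If each iteration depends on the result of the previous one, the unrolled statements form a chain instead and the gain may be much smaller.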
Honored Contributor II

Hi HRZ, 

Thanks for the clarification, it was really helpful.

I can't use partial unrolling because I have dependencies in my loops.