I am trying to implement an application as efficiently as possible as a single work-item kernel, and I found that it is essentially the same as sparse matrix-vector multiplication (SMVM). The application has a double for loop: the outer loop iterates rowCount times, and the inner loop iterates #ofNonzeroElementsInRow times. With this structure, however, the compiler cannot pipeline the loop nest because of "Out-of-Order Loop Iterations". The compiler report is below:
The kernel is compiled for single work-item execution.
+ Loop "Block1" (file compute_pagerank_single.cl line 34)
| NOT pipelined due to:
| Loop exit condition unresolvable at iteration initiation.
| Simplify loop exit condition to fix this problem.
| See "Unable to Resolve Loop Exit Condition at Iteration Initiation" section of the Best Practices Guide for more information.
| Not pipelining this loop will most likely lead to poor performance.
|-+ Loop "Block2" (file compute_pagerank_single.cl line 43)
Pipelined well. Successive iterations are launched every cycle.
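For reference, the loop nest described above can be sketched in plain C roughly as follows. This is only an illustration, not the actual code in compute_pagerank_single.cl: it assumes CSR storage, and the names `smvm_csr`, `row_ptr`, `col_idx`, `val`, `x`, and `y` are hypothetical.

```c
#include <stddef.h>

/* Sketch only: CSR layout and every name here are assumptions made for
   illustration; this is not the kernel from compute_pagerank_single.cl. */
void smvm_csr(size_t row_count,
              const size_t *row_ptr,  /* row_count + 1 offsets into col_idx/val */
              const size_t *col_idx,  /* column index of each nonzero */
              const float *val,       /* value of each nonzero */
              const float *x,         /* dense input vector */
              float *y)               /* dense output vector, row_count entries */
{
    /* Outer loop: one iteration per row (rowCount iterations). */
    for (size_t row = 0; row < row_count; ++row) {
        float sum = 0.0f;
        /* Inner loop: trip count equals the number of nonzeros in this row,
           so it is only known at run time and differs from row to row. */
        for (size_t k = row_ptr[row]; k < row_ptr[row + 1]; ++k) {
            sum += val[k] * x[col_idx[k]];
        }
        y[row] = sum;
    }
}
```

The inner bound `row_ptr[row + 1]` changes with every outer iteration, which is the data-dependent trip count the question is about.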
I searched the forum for similar problems and found this question. It suggests making the number of iterations constant by iterating over all the elements and guarding the work with a condition. However, this causes a huge performance loss because of the empty cycles.
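One possible reading of that workaround, sketched in plain C: the inner loop always runs a fixed number of times and a condition skips the slots that hold no nonzero. The bound `MAX_NNZ_PER_ROW`, the CSR layout, and all names are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Hypothetical compile-time upper bound on nonzeros per row (an assumption). */
#define MAX_NNZ_PER_ROW 4

/* Sketch only: constant inner trip count with a guard, as one reading of the
   suggested workaround. Rows must not hold more than MAX_NNZ_PER_ROW nonzeros. */
void smvm_padded(size_t row_count,
                 const size_t *row_ptr,  /* row_count + 1 offsets (CSR, assumed) */
                 const size_t *col_idx,
                 const float *val,
                 const float *x,
                 float *y)
{
    for (size_t row = 0; row < row_count; ++row) {
        size_t start = row_ptr[row];
        size_t nnz = row_ptr[row + 1] - start;  /* real nonzeros in this row */
        float sum = 0.0f;
        /* The trip count is now a constant, independent of the data. */
        for (size_t j = 0; j < MAX_NNZ_PER_ROW; ++j) {
            if (j < nnz) {  /* slots with j >= nnz do no useful work */
                sum += val[start + j] * x[col_idx[start + j]];
            }
        }
        y[row] = sum;
    }
}
```

In this sketch every row costs `MAX_NNZ_PER_ROW` inner iterations regardless of how sparse it is, so the wasted iterations total `row_count * MAX_NNZ_PER_ROW` minus the number of nonzeros. Those are the empty cycles mentioned above.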
I would expect that, even if some applications are hard to pipeline, a well-known one like SMVM would have a known most-efficient implementation.
I couldn't find any pointers on this problem, or any implementation of SMVM, on the internet. My question is: is there a "most efficient" implementation of this application? Or can a loop structure with a variable number of iterations be completely pipelined with some trick?