Intel® Quartus® Prime Software

Fully pipelined sparse matrix-vector multiplication (SMVM)


Dear all,


I am trying to implement an application as efficiently as possible as a single work-item kernel, and I found that it has the same structure as SMVM. It has a doubly nested for loop: the outer loop iterates rowCount times, and the inner loop iterates #ofNonzeroElementsInRow times. With this structure, however, the compiler cannot pipeline the outer loop because of "Out-of-Order Loop Iterations"; the loop report is below:

The kernel is compiled for single work-item execution.

Loop Report:

 + Loop "Block1" (file line 34)
 | NOT pipelined due to:
 |   Loop exit condition unresolvable at iteration initiation.
 |   Simplify loop exit condition to fix this problem.
 |   See "Unable to Resolve Loop Exit Condition at Iteration Initiation" section of the Best Practices Guide for more information.
 |   Not pipelining this loop will most likely lead to poor performance.
 |
 |-+ Loop "Block2" (file line 43)
     Pipelined well. Successive iterations are launched every cycle.
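For readers unfamiliar with this structure, the loop nest described above can be sketched in plain C. This is only an illustration, not the actual kernel: it assumes the matrix is stored in CSR form, and the names spmv_csr, row_ptr, col_idx, values, x and y are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for a sparse matrix A, assuming CSR storage (an assumption,
 * the original kernel is not shown). row_ptr has row_count + 1 entries;
 * row r owns the nonzeros values[row_ptr[r]] .. values[row_ptr[r + 1] - 1]. */
void spmv_csr(size_t row_count,
              const size_t *row_ptr,
              const size_t *col_idx,
              const float *values,
              const float *x,
              float *y)
{
    /* Outer loop: one iteration per row (rowCount iterations). */
    for (size_t r = 0; r < row_count; ++r) {
        assert(row_ptr[r] <= row_ptr[r + 1]); /* CSR offsets must not decrease */
        float sum = 0.0f;
        /* Inner loop: the trip count is the number of nonzeros in row r,
         * so it changes from one outer iteration to the next. */
        for (size_t k = row_ptr[r]; k < row_ptr[r + 1]; ++k) {
            sum += values[k] * x[col_idx[k]];
        }
        y[r] = sum;
    }
}
```

The inner bound row_ptr[r + 1] is only known once the outer iteration has started, which is the data-dependent trip count the question is about.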

I searched the forum for similar problems and found this question. The suggestion there is to visit all the elements, guarded by a condition, in every iteration of the outer loop so that the number of iterations is constant. However, this causes a huge performance loss because of the empty cycles.
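The constant-trip-count workaround and its cost can be illustrated with a small plain C model. This is a sketch under assumptions, not the code from the linked thread: it assumes CSR storage and a known upper bound max_row_nnz on the nonzeros per row, all names are hypothetical, and it only counts guarded-out iterations; it says nothing about how the offline compiler would schedule the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y = A * x with a fixed inner trip count of max_row_nnz.
 * Iterations beyond the real row length are guarded out by a condition.
 * Returns the number of such idle iterations (the "empty cycles"). */
size_t spmv_csr_fixed_trip(size_t row_count,
                           size_t max_row_nnz, /* assumed upper bound per row */
                           const size_t *row_ptr,
                           const size_t *col_idx,
                           const float *values,
                           const float *x,
                           float *y)
{
    size_t idle = 0;
    for (size_t r = 0; r < row_count; ++r) {
        size_t start = row_ptr[r];
        size_t len = row_ptr[r + 1] - start;
        assert(len <= max_row_nnz); /* the bound must cover every row */
        float sum = 0.0f;
        /* Every row runs exactly max_row_nnz inner iterations. */
        for (size_t j = 0; j < max_row_nnz; ++j) {
            if (j < len) {
                sum += values[start + j] * x[col_idx[start + j]];
            } else {
                ++idle; /* slot does no useful work */
            }
        }
        y[r] = sum;
    }
    return idle;
}
```

For a matrix whose longest row is much longer than the average row, the returned idle count approaches row_count * max_row_nnz, which is the performance loss described above.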


I would have thought that, even if some applications are hard to pipeline, a well-known one like SMVM would already have a most-efficient implementation.


I couldn't find any pointers to this problem or to an implementation of SMVM on the internet. My question is: is there a "most-efficient" implementation of this application? Or can a loop structure with a variable number of iterations be completely pipelined with some trick?



Thank you in advance,

Kaan Akyol
