Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
Using a value from the previous iteration

KAkyo
Beginner
423 Views

Hello,

 

I am trying to implement an application in OpenCL as a single work-item kernel. Below is a code snippet; the line numbers in the report have been changed to match the snippet.

unsigned dvid = 0;
unsigned end_dvid = endIndices[dvid + 1];
for(unsigned ej = start_of_e_chunk; ej < end_of_e_chunk; ej++)
{
    ovid = ovid_of_edge[ej];
    if (ej == end_dvid - 1) {
        isSelected[dvid] = 0;
        dvid++;
        end_dvid = endIndices[dvid + 1];
    }
}

When I compile this, in the report:

The kernel is compiled for single work-item execution.

Loop Report:

 + Loop "Block1" (file device_single.cl line 37)
   Pipelined with successive iterations launched every 448 cycles due to:

       Data dependency on variable end_dvid (file device_single.cl line 2)
       Largest Critical Path Contributor:
           99%: Load Operation (file device.cl line 9)

I understand that when I use a value (dvid) from the previous iteration, that iteration must finish before the value can be used in the next one. Here, dvid is incremented by 1 and then used to read from endIndices.

 

My question is: is there any way to use that value from the previous iteration and still get an initiation interval (II) of 1?

1 Solution
HRZ
Valued Contributor II
129 Views

If you move the load on line 9 outside of the if condition, your II will be reduced, at the cost of higher memory traffic since the load will then happen on every iteration. However, since the load address depends on dvid, and dvid is incremented conditionally, the II cannot be improved much further this way.

Another thing you can try to reduce the II further is to split your input into multiple chunks: load one chunk into a local buffer, perform all the computation on that chunk using local variables, then write the results for the whole chunk back to global memory. Whether this is possible depends on your application, but in the best case you might be able to reduce the II to ~10 this way.

For such applications, NDRange will probably work better, since the work-item scheduler can potentially achieve a lower average II at run time than the fixed II of the single work-item equivalent. At the end of the day, though, your performance will be limited by the random global memory accesses and will be quite poor on FPGAs.
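
To make the first suggestion concrete, here is a minimal sketch of the loop with the load hoisted out of the if condition. It is written as plain C so it can be compiled and checked on a host; in the actual kernel the pointers would carry the __global qualifier. The function name mark_hoisted and its parameter list are assumptions made for illustration, and the ovid load from the original snippet is left out because it does not take part in the dependency. No II figure is claimed for this version, since that depends on the compiler and board.

```c
/* Host-side C model of the kernel loop (hypothetical function name).
 * endIndices[v + 1] is read as the exclusive end edge index of vertex v,
 * as in the original snippet. */
unsigned mark_hoisted(const unsigned *endIndices, unsigned *isSelected,
                      unsigned start_of_e_chunk, unsigned end_of_e_chunk)
{
    unsigned dvid = 0;

    for (unsigned ej = start_of_e_chunk; ej < end_of_e_chunk; ej++) {
        /* The load is now unconditional: it is issued on every iteration
         * instead of only inside the if, which costs extra memory traffic. */
        unsigned end_dvid = endIndices[dvid + 1];

        if (ej == end_dvid - 1) {
            isSelected[dvid] = 0;
            dvid++;   /* the load address still depends on this increment */
        }
    }
    return dvid;      /* number of vertices whose last edge was reached */
}
```

The result is the same as in the original loop, because end_dvid always equals endIndices[dvid + 1] for the current dvid in both versions. The dependency on dvid is still there, which is why this alone is not expected to bring the II down to 1.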

KAkyo
Beginner
129 Views

Thank you,

Using a local buffer and reading the input chunk by chunk improved the II much further.
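
For anyone finding this thread later, the chunked approach could look roughly like the sketch below. This is not the code actually used above, only a plain C model of the idea that can be compiled on a host. The names mark_chunked, CHUNK, end_buf and clear_buf are hypothetical, as is the num_vertices parameter. In a real kernel the two small arrays would be private or __local arrays, which the compiler would be expected to place in on-chip memory, and the global pointers would be __global.

```c
#define CHUNK 4u   /* hypothetical chunk size; tune for the target device */

/* Host-side C model of the chunked loop (hypothetical function name). */
unsigned mark_chunked(const unsigned *endIndices, unsigned *isSelected,
                      unsigned num_vertices,
                      unsigned start_of_e_chunk, unsigned end_of_e_chunk)
{
    unsigned end_buf[CHUNK];         /* local copy of vertex end indices */
    unsigned char clear_buf[CHUNK];  /* local record of which flags to clear */
    unsigned ej = start_of_e_chunk;
    unsigned base = 0;               /* first vertex of the current chunk */

    while (base < num_vertices && ej < end_of_e_chunk) {
        unsigned left = num_vertices - base;
        unsigned n = left < CHUNK ? left : CHUNK;

        /* 1. Load one chunk of end indices with sequential global reads. */
        for (unsigned i = 0; i < n; i++) {
            end_buf[i] = endIndices[base + i + 1];
            clear_buf[i] = 0;
        }

        /* 2. Compute on the chunk; only local data is read in this loop. */
        unsigned local = 0;
        while (local < n && ej < end_of_e_chunk) {
            if (ej == end_buf[local] - 1) {
                clear_buf[local] = 1;
                local++;
            }
            ej++;
        }

        /* 3. Write the results for the whole chunk back to global memory. */
        for (unsigned i = 0; i < n; i++) {
            if (clear_buf[i])
                isSelected[base + i] = 0;
        }

        base += local;   /* equals n unless the edge range ran out first */
    }
    return base;         /* number of vertices whose last edge was reached */
}
```

The loop-carried dependency on the vertex index is still present in step 2, but the value it feeds is read from the local buffer rather than from global memory, which is the part the report identified as the largest critical path contributor.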
