Solved: Re: Using a value from the previous iteration

KAkyo · ‎05-09-2019

Hello,

I am trying to implement an application in OpenCL as a single work-item kernel. Below is a code snippet; the line numbers in the report have been changed to match the snippet.

unsigned dvid = 0;
unsigned end_dvid = endIndices[dvid + 1];
for(unsigned ej = start_of_e_chunk; ej < end_of_e_chunk; ej++) 
{
	ovid = ovid_of_edge[ej];
	if (ej == end_dvid - 1) {			
		isSelected[dvid] = 0;
		dvid++;
		end_dvid = endIndices[dvid + 1];
	}		
}

When I compile this, the report shows:

The kernel is compiled for single work-item execution.
 
Loop Report:
 
 + Loop "Block1" (file device_single.cl line 37)
   Pipelined with successive iterations launched every 448 cycles due to: 
        
       Data dependency on variable end_dvid  (file device_single.cl line 2)
       Largest Critical Path Contributor:
           99%: Load Operation  (file device.cl line 9)

I understand that when I use a value (dvid) from the previous iteration, that iteration must finish before the value can be used in the next one. Here, dvid is incremented by 1 and then used to read data from endIndices.

My question: is there any way to use the value from the previous iteration and still get an initiation interval (II) of 1?

HRZ · ‎05-10-2019

If you move the load on line 9 (end_dvid = endIndices[dvid + 1]) outside of the if condition, your II will drop, at the cost of higher memory traffic since the load will then happen every iteration. However, since the load address depends on dvid and dvid is incremented conditionally, the II cannot be improved much further this way.

Another thing you can try is to split your input into multiple chunks, load one chunk into a local variable, perform all the computation for that chunk on local variables, then write the results for the whole chunk back to global memory. Whether this is possible depends on your application; in the best case, you might be able to reduce the II to ~10.

For such applications, NDRange will probably work better, since the work-item scheduler can potentially achieve a lower average II at run time than the fixed II of the single work-item equivalent. At the end of the day, though, your performance will be limited by the random global memory accesses and will be quite poor on FPGAs.
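As a rough illustration of the first suggestion, here is a minimal sketch of the loop with the endIndices load hoisted out of the branch. The kernel name, the argument types, and the KERNEL/GLOBAL macros are assumptions added for illustration (the macros only let the same source build as plain C for host-side checking), and the per-edge ovid load from the snippet is omitted. What II the compiler actually reports for this form is not something the sketch can show.

```c
/* Hypothetical macros so the same source also builds as plain C on the host. */
#ifdef __OPENCL_VERSION__
#define KERNEL __kernel
#define GLOBAL __global
#else
#define KERNEL
#define GLOBAL
#endif

/* Argument types are assumed; the original post does not show them. */
KERNEL void clear_selected(GLOBAL const unsigned *endIndices,
                           GLOBAL unsigned *isSelected,
                           unsigned start_of_e_chunk,
                           unsigned end_of_e_chunk)
{
    unsigned dvid = 0;

    for (unsigned ej = start_of_e_chunk; ej < end_of_e_chunk; ej++) {
        /* Load moved out of the if: it is now issued on every iteration. */
        unsigned end_dvid = endIndices[dvid + 1];

        if (ej == end_dvid - 1) {
            isSelected[dvid] = 0;
            dvid++; /* still conditional, so the next address depends on it */
        }
    }
}
```

The loop produces the same isSelected values as the original snippet, since end_dvid always equals endIndices[dvid + 1] at the top of an iteration in both versions. The address of the load still depends on the conditionally updated dvid, which is why the dependency does not disappear entirely.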

KAkyo · ‎05-13-2019

Thank you,

Using a local buffer and reading the input chunk by chunk reduced the II considerably.
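For readers following along, a chunk-by-chunk version could look roughly like the sketch below. It is not the code actually used in this thread: the kernel name, the CHUNK size, the num_dvid argument, the argument types, and the KERNEL/GLOBAL macros (which only let the source build as plain C on the host) are all assumptions, and it expects endIndices to hold num_dvid + 1 entries. The buffers are plain private arrays, which an FPGA compiler would typically place in on-chip memory.

```c
/* Hypothetical macros so the same source also builds as plain C on the host. */
#ifdef __OPENCL_VERSION__
#define KERNEL __kernel
#define GLOBAL __global
#else
#define KERNEL
#define GLOBAL
#endif

#define CHUNK 64u /* assumed chunk size; tune for the target device */

KERNEL void clear_selected_chunked(GLOBAL const unsigned *endIndices,
                                   GLOBAL unsigned *isSelected,
                                   unsigned num_dvid, /* assumed vertex count */
                                   unsigned start_of_e_chunk,
                                   unsigned end_of_e_chunk)
{
    unsigned local_end[CHUNK];   /* on-chip copy of endIndices for this chunk */
    unsigned local_clear[CHUNK]; /* 1 where isSelected must be cleared */
    unsigned ej = start_of_e_chunk;

    for (unsigned base = 0; base < num_dvid && ej < end_of_e_chunk; base += CHUNK) {
        unsigned n = (num_dvid - base < CHUNK) ? (num_dvid - base) : CHUNK;

        /* 1. Load one chunk of boundaries from global memory, sequentially. */
        for (unsigned i = 0; i < n; i++) {
            local_end[i] = endIndices[base + i + 1];
            local_clear[i] = 0;
        }

        /* 2. Walk the edges using only the local copy. */
        unsigned i = 0;
        while (i < n && ej < end_of_e_chunk) {
            if (ej == local_end[i] - 1) {
                local_clear[i] = 1;
                i++;
            }
            ej++;
        }

        /* 3. Write the chunk's results back to global memory. */
        for (unsigned k = 0; k < n; k++) {
            if (local_clear[k])
                isSelected[base + k] = 0;
        }
    }
}
```

The dependency on the running index is still there in step 2, but it now goes through an on-chip array instead of a global memory load, which is the part of the critical path the report blamed.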