Community
Altera_Forum
Honored Contributor I
784 Views

Does an initiation interval of 1 mean there is no need for unrolling?

I have a basic question. For single work-item kernels, if I have a single loop and can achieve an initiation interval (II) of one, is it still beneficial to unroll the loop? My assumption is that with II=1 you advance one step forward at each clock cycle, which already achieves something very close to loop unrolling. 

 

To verify this observation, I created some synthetic kernels. Their schema is as follows: 

 

#define LL 16384
#define UL 128

__kernel void WGSXMAPIXLL16384OPS8ST(const __global float * restrict GIn,
                                     __global float * restrict GOut,
                                     const float M, const float N,
                                     const float P)
{
    int Mi = M;
    float tempOutTotal = 0;
    #pragma unroll UL
    for (int lcdd = 0; lcdd < LL; lcdd++)
    {
        float tempOut = N + lcdd;
        float temp1 = 1.0;
        //float temp2 = 1.0;
        //float temp3 = 1.0;
        //float temp4 = 1.0;
        float temp = 0.0;

        temp1 += temp1 * tempOut;
        temp1 += temp1 * tempOut;
        ... // Total number of FMA ops
        temp1 += temp1 * tempOut;

        GOut[0] = temp1;
        //tempOutTotal += tempOut;
    }
}

LL takes the values 16384, 32768, 65536, 131072, 262144, 524288, and 1048576. The number of FMA operations is set to 8 or 16, and the unroll factor (UL) is set to 128 and 64 for the kernels with 8 and 16 FMAs, respectively. 

 

Now, when I compile and run these kernels I get the performance numbers below in GFLOPS (computed as the total number of operations divided by the time it takes to finish the kernel):

LL        OPS=8   OPS=16

16384     14.42   28.27

32768     19.11   51.41

65536     32.86   59.32

131072    36.92   78.40

262144    39.24   98.85

524288    47.32   99.74

1048576   51.26   111.2

 

Based on the numbers above, I don't understand why having II=1 is not enough and we still need to unroll the code further. Also, why does increasing LL increase performance? Is the initialization overhead of the FPGA so high that we need a long kernel execution to hide it? My assumption was that, unlike a GPU, an FPGA can start filling the pipeline very quickly and should not introduce so much overhead. 

 

Can anyone help me understand the above observations? 

 

Thanks,
6 Replies
Altera_Forum
Honored Contributor I

II = 1 means you can go one step forward at each clock cycle, 

while unrolling increases the amount of processing done at each step. 

 

So, your total performance is (steps/clock) * (processing/step) = processing/clock.
Altera_Forum
Honored Contributor I

Regarding II and unrolling: you should only use unrolling when you have an II of one. An II of one means one loop iteration is processed per clock cycle. If you then unroll the loop by a factor of X, X iterations will be processed per cycle, effectively increasing your performance by a factor of X (as long as memory bandwidth is not saturated). In cases where II is above one (e.g. due to loop-carried dependencies), unrolling the loop by a factor of X will also increase II by a factor of X, cancelling out the performance improvement from loop unrolling. 

 

Regarding the performance improvement with a larger loop trip count: this is possible if your loop trip count is small compared to the pipeline depth, in which case the pipeline latency will dominate the run time. However, when the trip count is large relative to the pipeline depth, increasing it further should not affect performance. In your case, based on Altera's report, the pipeline latency is around 60 clocks, which means even your minimum trip count of 16384 is large enough to hide the pipeline latency. I believe the improvement you are seeing with a higher trip count is likely a timing or FLOP/s calculation artifact rather than a real performance effect. E.g. if you are taking host-to-device transfer time into account, then the higher your trip count is, the longer your kernel run time will be and the lower the relative overhead of the host-to-device transfer will become; hence, you will get higher measured performance. Also, if your kernel run time is too short, your timing function could be reporting the run time incorrectly, lowering the measured performance when the trip count is small. Make sure your kernels run for a minimum of a few hundred milliseconds. 

 

You can find a general performance model for OpenCL on FPGAs here: 

 

https://dl.acm.org/citation.cfm?id=3014951 

 

The paper explains how II and loop unrolling affect run time. Note, though, that its assumption that pipeline depth increases linearly with the loop unroll factor is not accurate; pipeline depth does increase with loop unrolling, but not as fast as the unroll factor.
Altera_Forum
Honored Contributor I

Thanks HRZ for the great response, 

 

Here I have some questions: 

 

1) I fully understand the effect of unrolling, but I want to know more about its implementation. When you unroll a loop, does the compiler replicate the loop pipeline spatially so that it can run X iterations in parallel? That is what comes to my mind right now. 

 

2) You said the latency of my for-loop is around 60 clocks, but when I unroll the loop, the trip count does not stay at 16384; it becomes 16384/X. Don't you think that for X equal to 128, the resulting trip count is still too small to hide the pipeline latency? 

 

Thanks, 

Saman
Altera_Forum
Honored Contributor I

1) I am not exactly sure how loop unrolling is actually implemented in Intel's compiler; however, I would guess the pipeline is "widened", or, as you say, replicated spatially, so that multiple iterations can be computed in parallel. 

 

2) You are actually correct. I did take that into account, but I didn't actually divide 16384 by 128; I assumed it would leave a few thousand iterations, which would be enough to hide the pipeline latency. Now that I have done the calculation and see that the unrolled loop has only 128 iterations, I believe it is safe to say that is not enough to fully hide the pipeline latency, and that could be why you see performance improve with a higher loop trip count.
Altera_Forum
Honored Contributor I

Thanks for the great response, 

 

I also have one more question. Suppose we have II > 1 in the for-loop of a single work-item kernel. Now let's say there is still some parallelism opportunity in the code and I would like to move to an NDRange implementation. As far as I know, in NDRange mode multiple threads are scheduled into the loop body instead of loop iterations, and the II is not fixed at compile time but determined at run time. Is there any chance that the NDRange version can deliver a better II, and as a result better performance?
Altera_Forum
Honored Contributor I

Yes, that is definitely possible. The II in a single work-item kernel is set for the worst case scenario. If you cannot reduce the II to one, implementing your code as an NDRange kernel could achieve a better average II in the end, since the II is adjusted at run-time by the scheduler. This is more or less the only situation where an NDRange implementation is preferred over single work-item.