Achieving parallel execution of loop on FPGA

Mickleman · ‎06-07-2021

I am having difficulty persuading the compiler to execute an outer loop in parallel in an FPGA kernel. I have constructed a simple example to illustrate the issue.

Here is a simple array initialisation loop with a loop carried dependency:

const int ITEM_LENGTH = 1000;
uint16_t a[ITEM_LENGTH];

for (int i = 0; i < ITEM_LENGTH; i++)
if ( i == 0 )
a[i] = j;
else
a[i] = a[i-1] + i;

The compiler correctly schedules this with an II of 6 (at 240MHz on Arria 10).

To improve throughput 10 of these are processed together by adding an outer loop:

const int ITEM_LENGTH = 1000;
const int GROUP_SIZE = 10;
uint16_t a[GROUP_SIZE][ITEM_LENGTH];

for (int j = 0; j < GROUP_SIZE; j++)
for (int i = 0; i < ITEM_LENGTH; i++)
    if ( i == 0 )
      a[j][i] = j;
    else
      a[j][i] = a[j][i-1] + i;

The compiler spots that the loops can be inverted to give an II of 1 on the inner loop thus improving throughput by a factor of 6.

Now I want to process 2 of these in parallel so an outer loop is introduced for which there are no depencencies between the two iterations of the loop so expect them to be executed in parallel by duplicating the logic.

const int ITEM_LENGTH = 1000;
const int GROUP_SIZE = 10;
uint16_t a[2][GROUP_SIZE][ITEM_LENGTH];

for (int block = 0; block < 2; block++)
for (int j = 0; j < GROUP_SIZE; j++)
    for (int i = 0; i < ITEM_LENGTH; i++)
      if ( i == 0 )
        a[block][j][i] = j;
      else
        a[block][j][i] = a[block][j][i-1] + i;

But this causes the inner loop to revert to an II of 6 and the outer loop to be: 'Serial exe: Memory dependency' citing all 4 combinations of the two assignment statements as the problem.

An attempt to explicitly declare no dependencies inverts the two innermost loops manually and adds an ivdep:

const int ITEM_LENGTH = 1000;
const int GROUP_SIZE = 10;
uint16_t a[2][GROUP_SIZE][ITEM_LENGTH];

for (int block = 0; block < 2; block++)
[[intel::ivdep]]
for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++)
{
int i = k / GROUP_SIZE;
int j = k - i * GROUP_SIZE;

    if ( i == 0 )
      a[block][j][i] = j;
    else
      a[block][j][i] = a[2][j][i-1] + i;

The inner loop is now scheduled with II of 1 (trusting the ivdep) but the outer loop still does not execute in parallel for exactly the same reason as before.

So, given that there really isn't any actual dependency between iterations of the outer loop, how do I persuade the compiler of this so that I achieve parallel execution of the loop iterations?

MGRAV · ‎06-07-2021

Hi,

Did you try with "#pragma unroll 2" ?

I am working on something similar and I would say that something like that should almost work

const int ITEM_LENGTH = 1000;
const int GROUP_SIZE = 10;
uint16_t a[2][GROUP_SIZE][ITEM_LENGTH];

#pragma unroll 2
[[intel::ivdep]]
for (int block = 0; block < 2; block++)

  [[intel::ivdep]]
  for (int j = 0; j < GROUP_SIZE; j++)
    for (int i = 0; i < ITEM_LENGTH; i++)
      if ( i == 0 )
        a[block][j][i] = j;
      else
        a[block][j][i] = a[block][j][i-1] + i;

good luck

HRZ · ‎06-07-2021

Solution mentioned by @MGRAV is the right way to run loop iterations in "parallel". However, the difference you are experiencing here is likely caused by the resource used to implement i[], rather than the loop-carried dependency. If i[] is implemented using registers (which can be quite inefficient in case ITEM_LENGTH is too large), then you should achieve an II of 1 in all of the cases you mentioned above. You can force implementation using registers using the respective pragma mentioned in Intel's documentation and see what happens. If, however, i[] is implemented using Block RAMs, then you will end up with a load/store dependency since it is not possible to perform single-cycle read/write from/to Block RAMs and that is where the II=6 comes from (latency of one Block RAM read + write). There is nothing you can do to improve the II in this case, since your the load/store dependency is real and any attempt to force the compiler to ignore it using ivdep will result in incorrect output.

P.S. You should probably move "if ( i == 0 ) a[i] = j; " outside of the "i" loop and start "i" from 1. This will allow you to completely eliminate the branch from the code and save some area.

Mickleman · ‎06-07-2021

Many thanks for replying HRZ.

Your insights are very helpful. However, this is a very abstracted example of the problem. In the actual design the main inner loop is complex with a critical dependency of over 80 clocks which cannot easily be reduced. The data itself is too large to be implemented as registers.

The long dependency is resolved (as in the example) by operating on many data vectors concurrently and inverting the consequent pair of nested loops. The ivdep is needed because although there is a data dependency on the array as a whole the loop inversion ensures that LD/ST of any one element of the array is separated by at least the natural II (GROUP_SIZE=10 is greater than II of 6 in the example).

I am now trying to run several of these inverted loops in parallel on distinct data sets. In the example this is represented by the third array dimension indexed by block. The first iteration of the loop only accesses array elements with first index equal to 0 (a[0][j][i]), the second iteration only accesses array elements with first index 1 (a[1][j]i]).

Following MGRAV's suggestion on the unroll pragma, it unfortunately does not result in parallel execution because the compiler is insisting there is a memory dependency preventing it. I will be confirming that this is actually the case with an actual run as soon as the FPGA runtime nodes are available again. I will post the results here.

Any further insights would be most welcome.

Mickleman · ‎06-08-2021

Hi again MGRAV and HRZ

I have further developed the example to avoid the false memory dependency and manually unrolled the loop. This allows the compiler to automatically fuse the 2 two loops thus achieving the desired concurrency. BUT even though both loops carry the ivdep the resulting fused loop nevertheless has an II of 6 (as if the ivdeps had been ignored). Here is the code:

const int ITEM_LENGTH = 10000;
const int GROUP_SIZE = 10;
uint16_t a[GROUP_SIZE][ITEM_LENGTH];
uint16_t b[GROUP_SIZE][ITEM_LENGTH];

[[intel::ivdep]]
for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++)
{
int i = k / GROUP_SIZE;
int j = k - i * GROUP_SIZE;

if ( i == 0 )
a[j][i] = j;
else
a[j][i] = a[j][i-1] + i;
}

[[intel::ivdep]]
for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++)
{
int i = k / GROUP_SIZE;
int j = k - i * GROUP_SIZE;

if ( i == 0 )
b[j][i] = j;
else
b[j][i] = b[j][i-1] + i;
}

I'm at a loss. Why can't the fused loop respect the ivdep?

MGRAV · ‎06-08-2021

Hi @Mickleman,

I am not sure but I assume that is the way you get you i and j out of the division and the modulo.

I imagine you rewrite as follow (that do basically the same, without branching)

uint16_t* c=(uint16_t*)a

bool test=(k<GROUP_SIZE) ;

int i=k / GROUP_SIZE;

c[k]= (c[k-1]+i)*(!test)+(test)* (k-i*GROUP_SIZE);

you can get that the compiler don't see the opportunity over the GROUP_SIZE parallelization.

I think to get it automatically you should permute you array, a[j][i] ==> a[i][j] so the dependency inside look like c[k-GROUP_SIZE].

I don't know if I am clear in what I mean

Mickleman · ‎06-08-2021

Thanks @MGRAV

I'm not sure I quite understand what you are suggesting. If you could give me a version of the example changed as you propose I will try it out and feed back the results.

Regards

MGRAV · ‎06-08-2021

maybe something like this, and even I don't know if the compiler can figure out

const int ITEM_LENGTH = 10000;
const int GROUP_SIZE = 10;
uint16_t a[ITEM_LENGTH][GROUP_SIZE];
uint16_t b[ITEM_LENGTH][GROUP_SIZE];

[[intel::ivdep]]
for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++)
{
int i = k / GROUP_SIZE;
int j = k - i * GROUP_SIZE;

if ( i == 0 )
a[i][j] = j;
else
a[i][j] = a[i-1][j] + i;
}

[[intel::ivdep]]
for (int k = 0; k < GROUP_SIZE * ITEM_LENGTH; k++)
{
int i = k / GROUP_SIZE;
int j = k - i * GROUP_SIZE;

if ( i == 0 )
b[i][j] = j;
else
b[i][j] = b[i-1][j] + i;
}

But what the problem with the two loops if it work ?

HRZ · ‎06-08-2021

Can you post the snippet of the report that mentions why the II has been increased to 6? I am not convinced your issue is because of the loop-carried dependency. I believe your issue is because of load/store dependency caused by latency of accessing Block RAM-based buffers which cannot be ignored with ivdep.

You will probably also be better off recovering "i" and "j" in the fused loop as follows, rather than by using "div":

i++;

if (i == ITEM_LENGTH)

{

i = 0;

j++;

}

Note that there is no need to reset "j" here since the loop will exist after ITEM_LENGTH * GROUP_SIZE iterations anyway.

Mickleman · ‎06-09-2021

Hi @HRZ

Thank you for your time helping me to find a solution to this problem - it is very much appreciated.

The reason for the II of 6 is not a mystery - it is as you say a (false) LD/ST memory dependency. If I compile the last example with just one of the loops the II is scheduled at 1 and the code gives the correct answer. So the compiler is able to respect the ivdep in the face of the LD/ST dependency. So why can't it do the same on the fused loop?

On the subject of the i,j recovery you are right this could be done differently (as it is in the full design) but it is not relevant to the issue being explored here.

Mickleman · ‎06-07-2021

Many thanks for replying MGRAV.

I have tried with the unroll pragma but as far as I can see it does not execute in parallel because of the memory dependency on the array - even though this is not a real dependency because each iteration accesses a distinct separate part of the array. I have only been able to confirm this by looking at the Graph Viewer as none of the FPGA runtime nodes is working at the moment.

AnilErinch_A_Intel · ‎08-25-2021

Hi ,

Can you try to unroll the internal loop and use IVDEP on the outer loop of the manually inverted code snippet and see the effect it has on the performance.

Please refer to the line number 55 of the ivdep example given below.

https://github.com/oneapi-src/oneAPIsamples/blob/master/DirectProgramming/DPC%2B%2BFPGA/Tutorials/Features/loop_ivdep/src/loop_ivdep.cpp

Thanks and Regards

Anil