Hi All,
I am using oneAPI to implement an application on an Arria 10 GX acceleration card for my research. The design has a long kernel pipeline, and the input and output memory locations need to be swapped on each iteration. Initially the read and write loops were separate kernels, but that way I could not synchronise the memory reads and writes across multiple iterations, so I merged the read and write into one nested loop as follows.
[[intel::max_concurrency(1)]]
for (int itr = 0; itr < 2 * n_iter; itr++) {
    // Swap the read and write buffers on every iteration
    accessor ptrR1 = (itr & 1) == 0 ? in1 : out1;
    accessor ptrW1 = (itr & 1) == 1 ? in1 : out1;
    auto input_ptr = ptrR1.get_pointer();
    auto output_ptr = ptrW1.get_pointer();

    [[intel::initiation_interval(1)]]
    [[intel::ivdep]]
    [[intel::max_concurrency(0)]]
    for (int i = 0; i < total_itr; i++) {
        // Feed the compute pipeline and drain its results in the same loop
        struct dPath16 vec1 = ptrR1[i];
        pipeS::PipeAt<idx1>::write(vec1);
        struct dPath16 vecW1 = pipeS::PipeAt<idx2>::read();
        ptrW1[i] = vecW1;
    }
}
This works, but I am getting reduced performance: around 8 times less bandwidth than expected. The same inner loop without the pipes, just copying data from the read location to the write location, gives the expected performance. Any suggestion or advice to fix the performance issue is appreciated.
Many Thanks,
Vasan
Hey Vasan,
Can you share the reports for both versions of your code? (a report-only compile is enough, no Quartus compile needed)
That would be helpful for identifying why you are seeing a throughput drop.
If there is no difference in the reports, it may be that your pipe operations are blocking (trying to read from an empty pipe or to write to a full pipe).
Yohann
Hi Yohann,
Thanks for the reply.
I did a few experiments, and it seems the loop above is mapped to a stall-enabled cluster.
There is a latency between the read data going through the processing kernel and coming back on the pipe for the memory write, and during this latency the entire cluster stalls on each iteration.
When I modify the processing kernel so that it just pops the data and pushes random data to the write pipe, I get the expected performance.
Is there a way to make the memory-read cluster and the memory-write cluster stall-free, as described in the following?
https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/introduction-to-fpga-design-concepts/scheduling/clustering-the-datapath.html
Kind regards,
Vasan
From what I can see in this code snippet, your inner loop reads and writes DDR and also performs blocking pipe operations; it therefore needs to be in a stall-enabled cluster, since both the DDR and the pipes may stall your kernel.
If you want a stall-free compute loop, you'll need to remove both the DDR accesses and the pipe operations.
If I understand your issue correctly, what stalls your compute kernel is the memory accesses rather than the pipe operations?
In that case you may want to copy the relevant data into local memory, have your compute kernel work on that local memory and write its results to another local memory, and then copy the results local memory back to DDR.
Hi Yuguen,
Thanks for the reply.
I agree that the DDR read should be stall-enabled and that the DDR write should also be stall-enabled.
But I want independent DDR read and write clusters for the inner loop, since there is no dependency between them.
It seems that the DDR read, pipe write, pipe read, and DDR write all end up in the same stall-enabled cluster.
Is there any way to put the DDR read and pipe write in one stall-enabled cluster and the pipe read and DDR write in another? The DDR read shouldn't stall the DDR write.
Basically, what I want is this:
a chunk of data needs to be read (its size could be larger than on-chip memory), processed (there is a kernel pipeline), and written back, and all of this should happen in parallel. On the next iteration, the read and write locations should be swapped.
The code above tries to implement the read and write-back of the results in an iterative loop that swaps the memory locations.
Implementing the memory read and memory write as separate kernels doesn't allow the swapping, because in non-USM designs a buffer should only be used in one kernel; when I tried that, the design hangs.
Is there any way to implement this? Any suggestions or advice are highly appreciated.
Many Thanks,
Vasan
If I understand your problem correctly, I would loop over:
1/ a loop reading a part of DDR and storing the data into two local memories, one for each of your accessors
2/ computing on these local memories
3/ a loop writing the two local memories back to DDR.
If you have enough private copies of the local memories, the compiler will schedule 1/, 2/, and 3/ in parallel: while 2/ is computing, another part of the DDR is being read and the previously computed local memory is being written back to DDR.
With these local memories, you should never stall because of DDR (assuming your kernel is compute bound).
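Something like this minimal sketch, assuming one kernel that tiles over DDR in fixed-size chunks (the kernel name, CHUNK, the placeholder compute, and the private_copies value are only illustrative, and total is assumed to be a multiple of CHUNK):
#include <sycl/sycl.hpp>

constexpr int CHUNK = 1024;  // placeholder tile size

void run_tiled(sycl::queue &q, sycl::buffer<float, 1> &inBuf,
               sycl::buffer<float, 1> &outBuf, int total) {
  q.submit([&](sycl::handler &h) {
    sycl::accessor in{inBuf, h, sycl::read_only};
    sycl::accessor out{outBuf, h, sycl::write_only, sycl::no_init};
    h.single_task<class TiledKernel>([=]() [[intel::kernel_args_restrict]] {
      for (int base = 0; base < total; base += CHUNK) {
        // Private copies let the compiler overlap the load, compute and
        // store phases of successive chunks (on-chip double buffering).
        [[intel::private_copies(2)]] float localIn[CHUNK];
        [[intel::private_copies(2)]] float localOut[CHUNK];
        for (int i = 0; i < CHUNK; i++)    // 1/ load a chunk from DDR
          localIn[i] = in[base + i];
        for (int i = 0; i < CHUNK; i++)    // 2/ compute on local memory only
          localOut[i] = localIn[i] * 2.0f; //    (placeholder compute)
        for (int i = 0; i < CHUNK; i++)    // 3/ store the chunk back to DDR
          out[base + i] = localOut[i];
      }
    });
  });
}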
Hi Yuguen,
Thanks for the advice.
I will have to work out how to transfer data between kernels through local memory, since there is a big compute-kernel pipeline consisting of tens of kernels. I will try your suggestion!
Many Thanks,
Vasan
Hey Vasan,
Transferring data between kernels without going through DDR is usually done through pipes.
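For example, here is a minimal sketch of two kernels connected by an inter-kernel pipe (the kernel and pipe names, element type, and depth are just placeholders):
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

// A pipe is identified by a unique type; the third template parameter
// is its guaranteed minimum capacity.
using K2KPipe = sycl::ext::intel::pipe<class K2KPipeID, int, 8>;

void launch(sycl::queue &q, sycl::buffer<int, 1> &inBuf,
            sycl::buffer<int, 1> &outBuf, int n) {
  q.submit([&](sycl::handler &h) {
    sycl::accessor in{inBuf, h, sycl::read_only};
    h.single_task<class Producer>([=] {
      for (int i = 0; i < n; i++)
        K2KPipe::write(in[i]);     // producer pushes into the pipe
    });
  });
  q.submit([&](sycl::handler &h) {
    sycl::accessor out{outBuf, h, sycl::write_only, sycl::no_init};
    h.single_task<class Consumer>([=] {
      for (int i = 0; i < n; i++)
        out[i] = K2KPipe::read();  // consumer pops from the pipe
    });
  });
}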
Yohann
Hi Yohann,
Yes, using pipes for kernel-to-kernel communication makes life easier.
It seems that tweaking the code with a little extra global memory helps to avoid stalling on the pipe reads and writes.
template <int idx1, int idx2>
int g_read_write(const rAcc &ptrR1, const wAcc &ptrW1, int total_itr, int delay) {
    [[intel::ivdep]]
    [[intel::initiation_interval(1)]]
    for (int i = 0; i < total_itr + delay; i++) {
        // The DDR read runs 'delay' elements ahead of the DDR write
        struct dPath16 vec1 = ptrR1[i + delay];
        if (i < total_itr) {
            pipeS::PipeAt<idx1>::write(vec1);
        }
        // The pipe read only starts once 'delay' elements are in flight,
        // so the first 'delay' output elements are padding
        struct dPath16 vecW1;
        if (i >= delay) {
            vecW1 = pipeS::PipeAt<idx2>::read();
        }
        ptrW1[i] = vecW1;
    }
    return 0;
}
By choosing a sufficient pipe depth and delay value, we can avoid stalling on the pipe reads and writes. It costs delay*sizeof(dPath16) bytes of additional global memory at the beginning of the buffer.
This function can then be called inside the iterative loop.
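For reference, a rough sketch of how the pipes could be declared so that their capacity covers the delay (kDelay, the pipe names, and the dPath16 definition here are placeholders, shown with plain inter-kernel pipes rather than the PipeAt<> array above):
#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

constexpr int kDelay = 64;        // placeholder; should match the 'delay' argument

struct dPath16 { int data[16]; }; // placeholder definition

// The third template parameter is the guaranteed minimum pipe capacity,
// so up to kDelay elements can be in flight before a write would block.
using ReadPipe  = sycl::ext::intel::pipe<class ReadPipeID,  dPath16, kDelay>;
using WritePipe = sycl::ext::intel::pipe<class WritePipeID, dPath16, kDelay>;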
Many Thanks,
Vasan
Hi @kkvasan,
Good day, just checking in to see if there are any further doubts regarding this matter.
Hope your doubts have been clarified.
Best Wishes
BB
Hi @BoonBengT_Intel ,
Have a good day too!
Yes, my doubts have been clarified and I am now able to implement the target design.
Kind Regards,
Vasan
Hi @kkvasan,
Great! Good to know that you are able to proceed as needed. With no further clarification needed on this thread, it will be transitioned to community support for further help, and we will no longer monitor it here.
Thank you for the questions and as always pleasure having you here.
Best Wishes
BB