Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™

oneAPI: Iterative read/write with swapped memory locations

kkvasan
Beginner

Hi All,

I am using oneAPI to implement an application on an Arria 10 GX acceleration card for my research work. There is a long kernel pipeline, and the input and output memory locations should be swapped on each iteration. Initially the read and write loops were separate kernels, but that way I couldn't synchronise the memory reads and writes across multiple iterations. Hence I merged the read and write into one nested loop as follows.

 

        [[intel::max_concurrency(1)]]
        for (int itr = 0; itr < 2 * n_iter; itr++) {
          // Swap the read and write buffers on every iteration.
          accessor ptrR1 = (itr & 1) == 0 ? in1 : out1;
          accessor ptrW1 = (itr & 1) == 1 ? in1 : out1;

          auto input_ptr = ptrR1.get_pointer();
          auto output_ptr = ptrW1.get_pointer();

          [[intel::initiation_interval(1)]]
          [[intel::ivdep]]
          [[intel::max_concurrency(0)]]
          for (int i = 0; i < total_itr; i++) {
            // Feed the compute-kernel pipeline ...
            auto vec1 = ptrR1[i];
            pipeS::PipeAt<idx1>::write(vec1);

            // ... and drain its results back to DDR in the same iteration.
            auto vecW1 = pipeS::PipeAt<idx2>::read();
            ptrW1[i] = vecW1;
          }
        }

 

 

This works, but I am getting reduced performance: around 8 times less bandwidth than expected. The same inner loop without the pipes, just copying the data to the write location, gives the expected performance. Any suggestions or advice to fix the performance issue would be appreciated.


Many Thanks,
Vasan

yuguen
Employee

Hey Vasan,

 

Can you share the reports for both versions of your code? (No full Quartus compile needed, just the report run.)

That would be helpful for identifying why you are seeing a throughput drop.

 

If there is no difference in the reports, it may be that your pipe operations are blocked (trying to read from an empty pipe or to write to a full pipe).

 

Yohann 

 

 

kkvasan
Beginner

Hi Yohann, 

Thanks for the reply. 
I did a few experiments; it seems the loop above is mapped to a stall-enabled cluster.
There is a latency between the read data going through the processing kernel and returning to the pipe for the memory write.
During this latency the entire cluster stalls on each iteration.
When I modify the processing kernel so that it just pops the data and pushes random data to the write pipe,
I get the expected performance.
Is there a way to make the memory-read cluster and the memory-write cluster stall-free, as described in the following?


https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/introduction-to-fpga-design-concepts/scheduling/clustering-the-datapath.html 

Kind regards, 
Vasan 

yuguen
Employee

From what I can see in this code snippet, your inner loop reads and writes to DDR and does blocking pipe operations; therefore it needs to be in a stall-enabled cluster, as both the DDR accesses and the pipes may stall your kernel.

If you want to have a stall-free compute loop, you'll want to remove both the DDR accesses and the pipe operations. 

If I understand your issue correctly, what stalls your compute kernel is the memory accesses and not the pipe operations?

In that case you may want to copy the relevant data you want to compute on into a local memory, have your compute kernel operate on that local memory and produce its results into another local memory. The results local memory can then be copied back to DDR.

kkvasan
Beginner

Hi Yuguen, 


Thanks for the reply. 
I agree that the DDR read should be stall-enabled and the DDR write should also be stall-enabled.
But I want independent DDR read and write clusters for the inner loop, since there is no dependency.
It seems the DDR read, pipe write, pipe read, and DDR write all end up in the same stall-enabled cluster.

Is there any way to make the DDR read and pipe write one stall-enabled cluster, and the pipe read and DDR write another stall-enabled cluster? The DDR read shouldn't stall the DDR write.

Basically, what I want is this: a chunk of data needs to be read (its size could be larger than on-chip memory), processed (there is a kernel pipeline), and written back, and all of these should happen in parallel. On the next iteration, the read and write locations should be swapped.

The code above tries to implement the read and the write-back of results in an iterative loop that swaps the memory locations.
Implementing the memory read and the memory write in separate kernels doesn't allow swapping, since a buffer should only be used in one kernel in non-USM designs; when I tried this, the design hangs.

Is there any way to implement this? Any suggestions or advice are highly appreciated.

Many Thanks,
Vasan

yuguen
Employee

If I understand your problem correctly, I would loop over the following three steps:

1/ a loop reading a part of DDR and storing the data into two local memories, one for each of your accessors

2/ computing on these local memories

3/ a loop writing the two local memories back to DDR.

 

If you have enough private copies of the local memories, the compiler will schedule 1/, 2/, and 3/ in parallel.

So while you are computing 2/, another part of the DDR is being read and the previously computed local memory is being written back to DDR.

With these local memories, you should never stall because of DDR (assuming your kernel is compute-bound).
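
For illustration, a minimal sketch of that structure inside a single_task kernel might look like the following; CHUNK, n_chunks, Compute(), the in1/out1 accessors, and the private_copies count are placeholders, not taken from the code above:

// Sketch only: CHUNK, n_chunks, Compute() and the accessors are illustrative.
for (int c = 0; c < n_chunks; c++) {
  // Private copies of the local memories let the compiler overlap the
  // read, compute, and write phases of successive chunks.
  [[intel::private_copies(2)]] float bufIn[CHUNK];
  [[intel::private_copies(2)]] float bufOut[CHUNK];

  // 1/ read one chunk from DDR into on-chip memory
  for (int i = 0; i < CHUNK; i++)
    bufIn[i] = in1[c * CHUNK + i];

  // 2/ compute on local memory only (no DDR accesses, no pipes here)
  for (int i = 0; i < CHUNK; i++)
    bufOut[i] = Compute(bufIn[i]);

  // 3/ write the chunk of results back to DDR
  for (int i = 0; i < CHUNK; i++)
    out1[c * CHUNK + i] = bufOut[i];
}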

 

 

kkvasan
Beginner

Hi Yuguen, 

Thanks for the advice. 
I have to figure out how to transfer data between kernels through local memory, as there is a big compute-kernel pipeline consisting of tens of kernels. I will try your suggestion!

Many Thanks, 
Vasan

yuguen
Employee

Hey Vasan,

 

Transferring data between kernels without going through DDR is usually done through pipes.
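
For reference, a kernel-to-kernel pipe in oneAPI is declared and used roughly like this; the pipe ID, element type, depth, and kernel names below are only illustrative:

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

// Illustrative pipe: the ID class, element type, and depth are placeholders.
using K2KPipe = sycl::ext::intel::pipe<class K2KPipeID, float, 64>;

void run(sycl::queue &q, int n) {
  // Producer kernel pushes values into the pipe ...
  q.single_task<class Producer>([=] {
    for (int i = 0; i < n; i++)
      K2KPipe::write(static_cast<float>(i));
  });
  // ... and the consumer kernel pops them, with no DDR traffic in between.
  q.single_task<class Consumer>([=] {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
      acc += K2KPipe::read();
    (void)acc; // placeholder use of the data
  });
}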

 

Yohann

kkvasan
Beginner

Hi Yohann,

Yes, using pipes for kernel-to-kernel communication makes life easier.
It seems that tweaking the code with a little bit of extra global memory helps to avoid stalling on pipe reads and writes.

 

template <int idx1, int idx2>
int g_read_write(const rAcc &ptrR1, const wAcc &ptrW1, int total_itr, int delay) {

  [[intel::ivdep]]
  [[intel::initiation_interval(1)]]
  for (int i = 0; i < total_itr + delay; i++) {
    // The DDR read runs 'delay' elements ahead of the DDR write, so a
    // momentary stall on one pipe does not immediately stall the other side.
    struct dPath16 vec1 = ptrR1[i + delay];
    if (i < total_itr) {
      pipeS::PipeAt<idx1>::write(vec1);
    }

    // The write side only starts consuming results once 'delay' iterations
    // of slack have built up; the first 'delay' elements of ptrW1 are padding.
    struct dPath16 vecW1;
    if (i >= delay) {
      vecW1 = pipeS::PipeAt<idx2>::read();
    }
    ptrW1[i] = vecW1;
  }
  return 0;
}

 

 

By choosing a sufficient pipe depth and delay value, we can avoid stalling due to pipe reads and writes. It costs delay*sizeof(dPath16) bytes of additional global memory at the beginning of the buffer.
This function can be called inside the iterative loop.
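
For context, a sketch of how this function could be driven from the buffer-swapping outer loop of the first post is shown below; DELAY and the pipe indices 0/1 are placeholders, and in1/out1 are assumed to be accessors of a common type accepted by g_read_write:

[[intel::max_concurrency(1)]]
for (int itr = 0; itr < 2 * n_iter; itr++) {
  // Swap the read and write buffers on every iteration.
  accessor ptrR1 = (itr & 1) == 0 ? in1 : out1;
  accessor ptrW1 = (itr & 1) == 1 ? in1 : out1;

  // DELAY elements of slack decouple the DDR reads that feed the
  // pipeline from the DDR writes that drain it.
  g_read_write<0, 1>(ptrR1, ptrW1, total_itr, DELAY);
}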

Many Thanks,
Vasan

BoonBengT_Intel
Moderator

Hi @kkvasan,

 

Good day, just checking in to see if there are any further doubts regarding this matter.
Hope your doubts have been clarified.

Best Wishes
BB

kkvasan
Beginner

Hi @BoonBengT_Intel , 

Have a good day too!
Yes, I got my doubts clarified and now I am able to implement the target design. 

Kind Regards, 
Vasan

BoonBengT_Intel
Moderator

Hi @kkvasan,

 

Great! Good to know that you are able to proceed as needed. With no further clarification needed on this thread, it will be transitioned to community support for further help, and we will no longer monitor it.
Thank you for the questions and, as always, it is a pleasure having you here.

Best Wishes
BB
