Application Acceleration With FPGAs
Programmable Acceleration Cards (PACs), DCP, DLA, Software Stack, and Reference Designs

Data level parallelism on FPGA with kernel replication using oneAPI

asenjo
Innovator

Hi,

I'm playing with kernel replication on FPGA using oneAPI. There is a tutorial on kernel replication here, but it exploits pipeline parallelism, whereas I want to exploit data-level parallelism. Say I have two identical kernels, but one processes half of the data and the other processes the other half. Here is my attempt with a vector add example, but I would like to ask for advice in case there is a better way to do the same:

 

using IntVector = std::vector<int, oneapi::tbb::cache_aligned_allocator<int>>;

// Forward declaration of the kernel name template used in single_task below.
template <bool HBM_enabled, int Replica, int unroll_factor> class VAdd;

template <bool HBM_enabled, int Replica, int NumRep, int unroll_factor>
sycl::event VectorAdd(queue &q, const IntVector &a_vector, const IntVector &b_vector, IntVector &sum_parallel) {
  // Create the range object for the vectors managed by the buffer.
  size_t num_items{a_vector.size()};

  size_t begin = Replica * num_items / NumRep;
  size_t end   = (Replica + 1) * num_items / NumRep;
  // Create buffers that hold the data shared between the host and the devices.
  // The buffer destructor is responsible to copy the data back to host when it
  // goes out of scope.
  buffer a_buf{a_vector.begin()+begin, a_vector.begin()+end};
  buffer b_buf{b_vector.begin()+begin, b_vector.begin()+end};
  buffer sum_buf{sum_parallel.begin()+begin, sum_parallel.begin()+end};
  sum_buf.set_final_data(sum_parallel.begin()+begin);
  sum_buf.set_write_back();
  // Submit a command group to the queue by a lambda function that contains the
  // data access permission and device computation (kernel).
  auto e = q.submit([&](handler &h) {

    if constexpr (HBM_enabled){ 
      ext::oneapi::accessor_property_list PL0{ext::intel::buffer_location<Replica*3>};
      ext::oneapi::accessor_property_list PL1{ext::intel::buffer_location<Replica*3+1>};
      ext::oneapi::accessor_property_list PL2{no_init,ext::intel::buffer_location<Replica*3+2>};
      accessor a{a_buf, h, read_only, PL0};
      accessor b{b_buf, h, read_only, PL1};
      accessor sum{sum_buf, h, write_only, PL2};
      h.single_task<VAdd<HBM_enabled,Replica,unroll_factor>>([=]() [[intel::kernel_args_restrict]]{
      #pragma unroll unroll_factor
        for (size_t i = 0; i < end-begin; i++)
           sum[i] = a[i] + b[i]; });
    }
    else{
      accessor a{a_buf, h, read_only};
      accessor b{b_buf, h, read_only};
      // The sum_accessor is used to store (with write permission) the sum data.
      accessor sum{sum_buf, h, write_only, no_init};
      h.single_task<VAdd<HBM_enabled,Replica,unroll_factor>>([=]() [[intel::kernel_args_restrict]]{
      #pragma unroll unroll_factor
        for (size_t i = 0; i < end-begin; i++)
           sum[i] = a[i] + b[i]; });
    }

  });

  return e;
}

Then I can create two replicas in the main() function with:

    auto e0 = VectorAdd<true,0,2,4>(q, a, b, sum_parallel);
    auto e1 = VectorAdd<true,1,2,4>(q, a, b, sum_parallel);
    q.wait();

where "q" is the FPGA queue and I'm computing sum_parallel = a + b (all of type IntVector).

The issue I found is that this comment:

// The buffer destructor is responsible to copy the data back to host when it
// goes out of scope.

is true if I initialize the output buffer with the whole vector:

buffer sum_buf{sum_parallel};

but not if I initialize each kernel replica's buffers with a region/block of the original vector of ints:

buffer sum_buf{sum_parallel.begin()+begin, sum_parallel.begin()+end};

In the latter case, the original output vector, sum_parallel, is not updated after the kernels finish (every position still holds 0). I found that adding these two lines solves the problem:

  sum_buf.set_final_data(sum_parallel.begin()+begin);
  sum_buf.set_write_back();

and I imagine they are required because the runtime cannot tell whether the sum_buf blocks overlap in the original vector, so it conservatively disables the copy-back on buffer destruction. However, I couldn't find any mention of this in the oneAPI guides, the DPC++ book, or the samples, so I wanted to share it with you in case you know a better solution or have any advice.

 

Thank you very much in advance,

 

Rafa.

 

 

6 Replies
aikeu
Employee

Hi asenjo,


I will try to find out more about your question and will get back to you if there is any suggestion.


Thanks.

Regards,

Aik Eu


asenjo
Innovator

Thanks @aikeu !

 

On a related note, I'm now having an issue with this strategy of partitioning the buffers for each FPGA IP/kernel. In my example I'm using:

using IntVector = std::vector<int,oneapi::tbb::cache_aligned_allocator<int>>; 
IntVector sum_parallel(vector_size);

which, as far as I know, aligns my vectors to 128-byte boundaries. Indeed, everything works fine if I use a buffer that refers to the whole IntVector:

buffer sum_buf{sum_parallel};

However, when partitioning sum_parallel into one buffer per kernel (each with its corresponding block of data):

buffer sum_buf{sum_parallel.begin()+begin, sum_parallel.begin()+end};

I get this warning at runtime:

** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 8388608 bytes from device to host because of lack of alignment                                     
** host ptr (0xd5198d0) and/or dev offset (0x3e0000000) is not aligned to 64 bytes

 I then found how to pass an allocator to the buffer constructor:

buffer sum_buf{sum_parallel.begin()+begin, sum_parallel.begin()+end, oneapi::tbb::cache_aligned_allocator<int>{}};

But I still get the not-aligned warning and the corresponding slowdown when moving data to/from the FPGA. I'm fairly sure that each per-kernel "sub-buffer" is the unaligned one, because I get 4 warnings when invoking 4 kernels and 16 warnings when I create 16 FPGA kernels. What am I doing wrong?

 

Thanks once again.

Rafa.

aikeu
Employee

Hi asenjo,


Sorry for the late reply; I consulted one of my team members about your question. Based on your code, the buffers go out of scope at the end of the VectorAdd() function, so the kernels get serialized instead of running in parallel.


The main() would look something like this:

buffer a_buf1{a_vector.begin()+begin1, a_vector.begin()+end1};
buffer b_buf1{b_vector.begin()+begin1, b_vector.begin()+end1};
buffer sum_buf1{sum_parallel.begin()+begin1, sum_parallel.begin()+end1};

buffer a_buf2{a_vector.begin()+begin2, a_vector.begin()+end2};
buffer b_buf2{b_vector.begin()+begin2, b_vector.begin()+end2};
buffer sum_buf2{sum_parallel.begin()+begin2, sum_parallel.begin()+end2};

auto e0 = VectorAdd<true,0,2,4>(q, a_buf1, b_buf1, sum_buf1);
auto e1 = VectorAdd<true,1,2,4>(q, a_buf2, b_buf2, sum_buf2);
q.wait();


Another option is to use sub buffers:

https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sub.section.memmodel.app:~:text=A%20buffer%20created%20from%20a%20range%20of%20an%20existing%20buffer%20is%20called%20a


Another option is to use USM, but then the user is responsible for copying data back and forth themselves:

https://www.intel.com/content/www/us/en/developer/articles/code-sample/vector-add.html


Thanks.

Regards,

Aik Eu


aikeu
Employee

Hi asenjo,


Do you have any further questions regarding the previous comment?


Thanks.

Regards,

Aik Eu


asenjo
Innovator

Thanks Aik, that solves the serialization problem, and the alignment issue has also disappeared. Still, I prefer the previous interface for the VectorAdd function (taking the IntVectors instead of the buffers and computing the block bounds internally), so I'll investigate whether I can pass the IntVectors, construct the buffers inside the function, and move them out to the caller so that they are not destroyed inside.

Thanks,

Rafa.

aikeu
Employee

Hi asenjo,


Good to know that the serialization problem and the alignment issue have been resolved. I will close this thread for now; do raise another thread if there is any new question from your side.


Thanks.

Regards,

Aik Eu



