Hello,
I have C++ code that I am trying to port to DPC++. There are many function calls inside the main function.
My first approach was to create the buffers inside each of those functions, but then the buffers were destroyed as soon as each function returned. The problem is that one particular function may be called about 4000 times, and I would prefer to keep its buffers alive for the whole run. So instead I created the buffers inside the main function and passed them as arguments to each function. The problem with this approach is that I need to synchronize the host and the device after each function call (I don't get the correct answer otherwise). host_accessor() sometimes helps, but most of the time it doesn't synchronize. I also know I can't use q.wait() in this design. Could you please point me to the best synchronization method in DPC++?
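Roughly, my current design looks like this (a simplified sketch, not my actual code; all names are illustrative):

#include <CL/sycl.hpp>
#include <vector>
using namespace sycl;

// One of many functions called from main; the buffer is created outside.
void step(queue &q, buffer<float, 1> &buf) {
    q.submit([&](handler &h) {
        accessor a{buf, h, read_write};
        h.parallel_for(buf.get_range(), [=](id<1> i) { a[i] += 1.0f; });
    });
}

int main() {
    queue q;
    std::vector<float> data(1024, 0.0f);
    buffer<float, 1> buf{data.data(), range<1>{data.size()}};
    for (int call = 0; call < 4000; ++call) {
        step(q, buf);                       // same buffer reused across all calls
        host_accessor host{buf, read_only}; // my attempt to sync after each call
        // ... host-side logic that reads host[...] ...
    }
    return 0;
}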
Thank you in advance.
Leila
Hi,
Thanks for reaching out to us.
Since you have mentioned that you have C++ code you are trying to port to DPC++, we suggest you try the USM (Unified Shared Memory) model instead of the buffer/accessor model.
USM provides a familiar pointer-based C++ interface, so much of your existing code can continue to work without modification.
Keep in mind that kernel launches are asynchronous, so you need to call the q.wait() method to synchronize with the host.
USM provides three different allocation types for your input data (see the sketch after this list):
- Device: data lives in device-attached memory and is not directly accessible from the host; explicit copy operations are required to move data between host and device.
- Host: data lives in host memory and can be accessed directly on the host; device accesses go over the bus.
- Shared: data is accessible from both the host and the device and may migrate between them.
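For illustration, here is a minimal sketch showing how each allocation type is created and how an explicit copy works (sizes and names are just examples):

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
    queue q;
    constexpr size_t N = 1000;

    // Device allocation: only dereferenceable on the device; move data explicitly.
    int *dev = malloc_device<int>(N, q);

    // Host allocation: lives in host memory; the device reads it over the bus.
    int *hst = malloc_host<int>(N, q);

    // Shared allocation: accessible from both; may migrate automatically.
    int *shr = malloc_shared<int>(N, q);

    for (size_t i = 0; i < N; ++i) hst[i] = static_cast<int>(i);

    // Explicit copy host -> device, then wait for the copy to finish.
    q.memcpy(dev, hst, N * sizeof(int)).wait();

    free(dev, q);
    free(hst, q);
    free(shr, q);
    return 0;
}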
For more details, please refer to the Data Parallel C++ book by James Reinders (page 150).
Please find below a code snippet showing the shared allocation type:
#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

int main()
{
    constexpr size_t N = 1000;
    queue q;

    // Shared allocations are accessible from both host and device.
    auto A = malloc_shared<int>(N, q);
    auto B = malloc_shared<int>(N, q);
    auto C = malloc_shared<int>(N, q);

    // Initialize the inputs directly on the host.
    for (size_t i = 0; i < N; i++) {
        A[i] = static_cast<int>(i);
        B[i] = static_cast<int>(2 * i);
    }

    q.submit([&](handler &h) {
        h.parallel_for(range<1>{N}, [=](id<1> ID) {
            auto i = ID[0];
            C[i] = A[i] + B[i];
        });
    });

    // Kernel execution is asynchronous: wait before reading C on the host.
    q.wait();
    std::cout << C[1] << std::endl;

    free(A, q);
    free(B, q);
    free(C, q);
    return 0;
}
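The snippet above can be compiled with the oneAPI DPC++ compiler, e.g. dpcpp test.cpp (the file name here is just an example).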
If you face any issues while trying the USM implementation in your code,
please provide us with a sample reproducer so that we can try it from our end.
Thanks & Regards,
Noorjahan.
Hello Noorjahan,
Thank you for your quick response.
I never considered USM for my code because I assumed it would not give good performance.
Since I have already spent a lot of time on the buffer model, I would like to be sure about the USM model before getting started.
What if I allocate memory on the host and copy it to the device when needed? In your experience, is that a better approach than using shared memory from the beginning?
Sorry, I can't provide a code snippet for this question since I am only thinking about the design at this point.
Thanks,
Leila
Hi again,
After looking at the chapter you pointed me to, I understand your point now. Please ignore my previous comment.
Thank you for your assistance!
Leila
Hi,
>>What if I allocate memory on the host and copy it to the device when need be? Based on your experience, is this a better approach or using shared memory from the beginning?
Either allocation type can be used; it depends on the use case. The host allocation method works, but it costs more time when there is a lot of data movement between host and device. In such cases it is better to use shared allocations, or to allocate on the device and copy explicitly, as sketched below.
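Since you mentioned a function that may be called 4000 times, here is a rough sketch of the explicit-copy approach for that scenario (an in-order queue keeps successive kernels ordered; all names and sizes are illustrative):

#include <CL/sycl.hpp>
#include <vector>
using namespace sycl;

// The hot function: operates entirely on device memory, no copies per call.
void compute(queue &q, int *dev, size_t n) {
    q.parallel_for(range<1>{n}, [=](id<1> i) { dev[i] += 1; });
}

int main() {
    // An in-order queue serializes submissions, so successive calls
    // to compute() on the same data do not race with each other.
    queue q{property::queue::in_order{}};
    constexpr size_t N = 1000;
    std::vector<int> host(N, 0);

    int *dev = malloc_device<int>(N, q);
    q.memcpy(dev, host.data(), N * sizeof(int)).wait(); // one-time upload

    for (int call = 0; call < 4000; ++call)
        compute(q, dev, N); // data stays resident on the device

    // Copy back only when the host actually needs the results.
    q.memcpy(host.data(), dev, N * sizeof(int)).wait();

    free(dev, q);
    return 0;
}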
>>After looking at the chapter you pointed me to, I understand your point now.
Glad to know that you have figured it out.
Thanks for accepting this as a solution.
As this issue has been resolved, we will no longer respond to this thread.
If you require any additional assistance from Intel, please start a new thread.
Any further interaction in this thread will be considered community only.
Thanks & Regards,
Noorjahan.
