Intel® oneAPI Data Parallel C++
Support for Intel® oneAPI DPC++ Compiler, Intel® oneAPI DPC++ Library, Intel ICX Compiler, Intel® DPC++ Compatibility Tool, and GDB*

Sync with buffers

leilag
Novice

Hello, 

 

I have C++ code that I am trying to port to DPC++. There are many function calls inside the main function.

My first approach was to create buffers inside each of those functions. In that scenario, the buffers were destroyed after each function finished executing. The problem with this approach is that one specific function may be called 4000 times, and I would prefer to keep the buffers alive for the whole run. That's why I decided to create the buffers inside the main function and pass them as arguments to each function. The problem with this approach is that I need to sync the host and the device after each function call (otherwise I don't get the correct answer). host_accessor() is sometimes helpful, but most of the time it doesn't sync, and I know I can't use q.wait() in this design either. Could you please guide me to the best synchronization method in DPC++?

 

Thank you in advance.

Leila

1 Solution
NoorjahanSk_Intel
Moderator

Hi,

Thanks for reaching out to us.

Since you mentioned >> I have a C++ code which I am trying to port to DPC++, we suggest you try the USM (Unified Shared Memory) model instead of the buffer-accessor model.

USM provides a familiar pointer-based C++ interface, so existing pointer-based C++ code can continue to work with little modification.

Kernel launches are asynchronous, so you need to call the q.wait() method (or use event-based synchronization) to synchronize with the host.

USM provides three allocation types for input data:

Device: Data resides in device-attached memory and is not directly accessible from the host; explicit copy operations (e.g. queue::memcpy) are needed to move data between host and device.

Host: Data resides in host memory and is directly accessible from the host; device accesses travel over the bus (e.g. PCIe).

Shared: Data is accessible from both host and device and can migrate between them automatically.

For more details, please refer to the Data Parallel C++ book by James Reinders et al. (page 150).

Please find below a code snippet showing the shared allocation method:

 

 

 

#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

int main()
{
    constexpr size_t N = 1000;
    queue q;

    // Shared allocations are accessible from both host and device.
    auto A = (int*)malloc_shared(N * sizeof(int), q);
    auto B = (int*)malloc_shared(N * sizeof(int), q);
    auto C = (int*)malloc_shared(N * sizeof(int), q);

    // Initialize the inputs directly on the host.
    for (size_t i = 0; i < N; i++) {
        A[i] = i;
        B[i] = 2 * i;
    }

    q.submit([&](handler& h) {
        auto R = range<1>{N};
        h.parallel_for(R, [=](id<1> ID) {
            auto i = ID[0];
            C[i] = A[i] + B[i];
        });
    });

    // The kernel launch is asynchronous: wait before reading C on the host.
    q.wait();
    std::cout << C[1] << std::endl;

    free(A, q);
    free(B, q);
    free(C, q);
    return 0;
}

 

 

 

If you face any issues while trying the USM implementation in your code, please provide a sample reproducer so that we can try it from our end.

Thanks & Regards,

Noorjahan.

 

leilag
Novice

Hello Noorjahan,

 

Thank you for your quick response.

I never considered USM for my code because I assumed it would not give good performance.

Since I have already spent a lot of time on the buffer model, I would like to be sure about the USM model before getting started.

 

What if I allocate memory on the host and copy it to the device when needed? In your experience, is that a better approach than using shared memory from the beginning?

 

Sorry, I can't provide a code snippet for this question, since I am only thinking about the design right now.

 

Thanks,

Leila

 

leilag
Novice

Hi again,

 

After looking at the chapter you pointed me to, I understand your point now. Please ignore my previous comment. 

 

Thank you for your assistance!

Leila

NoorjahanSk_Intel
Moderator

Hi,

>>What if I allocate memory on the host and copy it to the device when need be? Based on your experience, is this a better approach or using shared memory from the beginning?


It depends on the use case; either allocation method can be used. Host allocation works, but it costs more time when there is a lot of data movement between host and device. Shared allocation is the better choice in such cases.


>>After looking at the chapter you pointed me to, I understand your point now.

Glad to know that you have figured it out.


Thanks for accepting this as a solution.

As this issue has been resolved, we will no longer respond to this thread.

If you require any additional assistance from Intel, please start a new thread.

Any further interaction in this thread will be considered community only.


Thanks & Regards 

Noorjahan

