Intel® oneAPI DPC++/C++ Compiler
Talk to fellow users of Intel® oneAPI DPC++/C++ Compiler and companion tools like Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and Intel® Distribution for GDB*

USM allocation and access

Pascuzzi__Vincent
3,321 Views

Hi,

I am using `malloc_device` to allocate device-side (Iris iGPU) memory through USM. An example of this is,

 int hostArray[42];
 int* deviceArray = (int*)malloc_device(42 * sizeof(int), dev, ctx_);
 for (int i = 0; i < 42; i++) hostArray = 42;
 queue_.submit([&](cl::sycl::handler& h) {
   // copy hostArray to deviceArray
   h.memcpy(deviceArray, &hostArray[0], 42 * sizeof(int));
 });
 queue_.wait();

Now, what’s odd to me is that I can access `deviceArray` outside a kernel, like this:

std::cout << "deviceArray[10] = " << deviceArray[10] << std::endl;

Because of “unified addressing” [1], and since all USM allocations are done “on the host” [2], I assumed that I was outputting the `hostArray` memory address. Fine. But then I modified the value of `deviceArray[10]` inside a kernel:

 queue_.submit([&](cl::sycl::handler& h) {
   h.parallel_for<class foo>(cl::sycl::range<1>{42}, [=](cl::sycl::id<1> ID) {
     int i = ID[0];
     deviceArray++;
   });
 });
 queue_.wait();

and I was still able to access `deviceArray` outside a kernel in the same way as above, and modified value was printed. I can understand this working fine if I was using the “host device” but can’t wrap my head around accessing “GPU device” memory outside a kernel, especially if it has been modified inside a kernel. A complete MWE is at the bottom of this email. I suggest running on an Iris node.

Can someone explain this to me? Why/how is it possible to access memory allocated with `malloc_device` outside a kernel? If it’s because it’s allocated by the host — and is duplicated on the host — why when modifications are made inside a kernel does it affect the host memory? `malloc_shared` — wherein data is migrated back and forth between the host and device — would be a different story.


Thanks,
Vince

[1] https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
[2] https://www.colfax-intl.com/downloads/oneAPI_module04_DPCplusplusFundamentals2of2.pdf (slide 23)


// Compile with: 

[dpcpp | clang++] -fsycl -o mwe_malloc_device mwe_malloc_device.cc
//
// mwe_malloc_device.cc
//

#include <CL/sycl.hpp>
#include <iostream>

#ifdef USE_PI_CUDA
class CUDASelector : public cl::sycl::device_selector {
 public:
  int operator()(const cl::sycl::device& Device) const override {
    using namespace cl::sycl::info;

    const std::string DeviceName = Device.get_info<device::name>();
    const std::string DeviceVendor = Device.get_info<device::vendor>();
    const std::string DeviceDriver =
        Device.get_info<cl::sycl::info::device::driver_version>();

    if (Device.is_gpu() && (DeviceVendor.find("NVIDIA") != std::string::npos) &&
        (DeviceDriver.find("CUDA") != std::string::npos)) {
      return 1;
    };
    return -1;
  }
};
#endif

int main() {
  // Catch asynchronous exceptions
  auto exception_handler = [](cl::sycl::exception_list exceptions) {
    for (std::exception_ptr const& e : exceptions) {
      try {
        std::rethrow_exception(e);
      } catch (cl::sycl::exception const& e) {
        std::cout << "Caught asynchronous SYCL exception during generation:\n"
                  << e.what() << std::endl;
      }
    }
  };
  // Initialize device, queue and context
  cl::sycl::device dev;
#ifdef USE_PI_CUDA
  CUDASelector cuda_selector;
  try {
    dev = cl::sycl::device(cuda_selector);
  } catch (...) {
  }
#elif USE_SYCL_CPU
  dev = cl::sycl::device(cl::sycl::cpu_selector());
#elif USE_SYCL_GPU
  dev = cl::sycl::device(cl::sycl::gpu_selector());
#else
  dev = cl::sycl::device(cl::sycl::default_selector());
#endif
  cl::sycl::queue queue = cl::sycl::queue(dev, exception_handler);
  cl::sycl::context ctx = queue.get_context();
  // Name of the device to run on
  std::string dev_name =
      queue.get_device().get_info<cl::sycl::info::device::name>();
  std::cout << "Using device \"" << dev_name << "\"" << std::endl;

  // Ensure device can handle USM device allocations.
  if (!queue.get_device()
           .get_info<cl::sycl::info::device::usm_device_allocations>()) {
    std::cout << "ERROR :: device \"" << dev_name
              << "\" does not support usm_device_allocations!" << std::endl;
    return 1;
  }
  int hostArray[42];
  int* deviceArray = (int*)malloc_device(42 * sizeof(int), dev, ctx);
  for (int i = 0; i < 42; i++) hostArray = 42;
  queue
      .submit([&](cl::sycl::handler& h) {
        // copy hostArray to deviceArray
        h.memcpy(deviceArray, &hostArray[0], 42 * sizeof(int));
      })
      .wait();

  std::cout << "[Before mod] deviceArray[10] = " << deviceArray[10]
            << std::endl;

  queue.submit([&](cl::sycl::handler& h) {
    h.parallel_for<class foo>(
        cl::sycl::range<1>{42},
        // lambda-capture so we get the actual device memory
        [=](cl::sycl::id<1> ID) {
          int i = ID[0];
          dev_arr++;
        });
  });
  queue.wait();

  std::cout << "[After mod] deviceArray[10] = " << deviceArray[10] << std::endl;

  return 0;
}
0 Kudos
6 Replies
Pascuzzi__Vincent
3,321 Views

I was just informed by an Intel colleague that the reason this works is because I'm running on an Intel integrated GPU, and these do not have their own dedicated memory; the memory is shared with the host and sits on the same silicon.

0 Kudos
RahulV_intel
Moderator
3,321 Views

Hi Vincent,

Integrated GPU(iGPU) shares memory with the host and that could be one of the reasons for such behavior. However, this shouldn't be the case on a discrete GPU.

 

--Rahul

0 Kudos
Pascuzzi__Vincent
3,321 Views

Vaidya, Rahul (Intel) wrote:

Integrated GPU(iGPU) shares memory with the host and that could be one of the reasons for such behavior.

Indeed, this is what I figured -- makes sense (and gives me piece of mind!).

Vaidya, Rahul (Intel) wrote:

However, this shouldn't be the case on a discrete GPU.

Right again!

I can build and run the same code, modulo the `cout`s, using the Intel llvm CUDA support:

[dpcpp | clang++] -O2 -fsycl -std=c++17 -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -Wno-unknown-cuda-version -o mwe_malloc_device mwe_malloc_device.cc 

I get a segfault when trying to access the device memory outside a kernel — e.g. using `cout` here. If I replace the `cout` calls with `cl::sycl::stream`s, it works as expected on the CUDA device:

…
 // Outputs 42
 queue
     .submit([&](cl::sycl::handler& cgh) {
       cl::sycl::stream out(1024, 256, cgh);
       cgh.single_task<class print1>(
           [=] { out << deviceArray[10] << cl::sycl::endl; });
     })
     .wait_and_throw();

 queue.submit([&](cl::sycl::handler& h) {
   h.parallel_for<class foo>(
       cl::sycl::range<1>{42},
       // lambda-capture so we get the actual device memory
       [=](cl::sycl::id<1> ID) {
         int i = ID[0];
         deviceArray++;
       });
 });
 queue.wait();

 // Outputs 43
 queue
     .submit([&](cl::sycl::handler& cgh) {
       cl::sycl::stream out(1024, 256, cgh);
       cgh.single_task<class print2>(
           [=] { out << deviceArray[10] << cl::sycl::endl; });
     })
     .wait_and_throw();

return 0;

So indeed USM works fine using a discrete CUDA device in this simple example. Looking forward to getting my hands on an Intel Xe device!

 

Cheers,

                Vince

0 Kudos
RahulV_intel
Moderator
3,321 Views

Hi,

That's great! Thanks for the confirmation.

Let us know if we can close this thread.

 

--Rahul

 

 

0 Kudos
Pascuzzi__Vincent
3,321 Views

Vaidya, Rahul (Intel) wrote:

Let us know if we can close this thread.

Please do.

 

Cheers,

Vince

0 Kudos
RahulV_intel
Moderator
3,321 Views

Thanks for the update, Vince. We will go ahead and close this thread. Feel free to post a new thread if you have any further queries.

 

--Rahul

0 Kudos
Reply