USM allocation and access

Pascuzzi__Vincent · ‎05-07-2020

Hi,

I am using `malloc_device` to allocate device-side (Iris iGPU) memory through USM. An example of this is,

 int hostArray[42];
 int* deviceArray = (int*)malloc_device(42 * sizeof(int), dev, ctx_);
 for (int i = 0; i < 42; i++) hostArray = 42;
 queue_.submit([&](cl::sycl::handler& h) {
   // copy hostArray to deviceArray
   h.memcpy(deviceArray, &hostArray[0], 42 * sizeof(int));
 });
 queue_.wait();

Now, what’s odd to me is that I can access `deviceArray` outside a kernel, like this:

std::cout << "deviceArray[10] = " << deviceArray[10] << std::endl;

Because of “unified addressing” [1], and since all USM allocations are done “on the host” [2], I assumed that I was outputting the `hostArray` memory address. Fine. But then I modified the value of `deviceArray[10]` inside a kernel:

 queue_.submit([&](cl::sycl::handler& h) {
   h.parallel_for<class foo>(cl::sycl::range<1>{42}, [=](cl::sycl::id<1> ID) {
     int i = ID[0];
     deviceArray++;
   });
 });
 queue_.wait();

and I was still able to access `deviceArray` outside a kernel in the same way as above, and modified value was printed. I can understand this working fine if I was using the “host device” but can’t wrap my head around accessing “GPU device” memory outside a kernel, especially if it has been modified inside a kernel. A complete MWE is at the bottom of this email. I suggest running on an Iris node.

Can someone explain this to me? Why/how is it possible to access memory allocated with `malloc_device` outside a kernel? If it’s because it’s allocated by the host — and is duplicated on the host — why when modifications are made inside a kernel does it affect the host memory? `malloc_shared` — wherein data is migrated back and forth between the host and device — would be a different story.

Thanks,
Vince

[1] https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
[2] https://www.colfax-intl.com/downloads/oneAPI_module04_DPCplusplusFundamentals2of2.pdf (slide 23)

// Compile with:

[dpcpp | clang++] -fsycl -o mwe_malloc_device mwe_malloc_device.cc

//
// mwe_malloc_device.cc
//

#include <CL/sycl.hpp>
#include <iostream>

#ifdef USE_PI_CUDA
class CUDASelector : public cl::sycl::device_selector {
 public:
  int operator()(const cl::sycl::device& Device) const override {
    using namespace cl::sycl::info;

    const std::string DeviceName = Device.get_info<device::name>();
    const std::string DeviceVendor = Device.get_info<device::vendor>();
    const std::string DeviceDriver =
        Device.get_info<cl::sycl::info::device::driver_version>();

    if (Device.is_gpu() && (DeviceVendor.find("NVIDIA") != std::string::npos) &&
        (DeviceDriver.find("CUDA") != std::string::npos)) {
      return 1;
    };
    return -1;
  }
};
#endif

int main() {
  // Catch asynchronous exceptions
  auto exception_handler = [](cl::sycl::exception_list exceptions) {
    for (std::exception_ptr const& e : exceptions) {
      try {
        std::rethrow_exception(e);
      } catch (cl::sycl::exception const& e) {
        std::cout << "Caught asynchronous SYCL exception during generation:\n"
                  << e.what() << std::endl;
      }
    }
  };
  // Initialize device, queue and context
  cl::sycl::device dev;
#ifdef USE_PI_CUDA
  CUDASelector cuda_selector;
  try {
    dev = cl::sycl::device(cuda_selector);
  } catch (...) {
  }
#elif USE_SYCL_CPU
  dev = cl::sycl::device(cl::sycl::cpu_selector());
#elif USE_SYCL_GPU
  dev = cl::sycl::device(cl::sycl::gpu_selector());
#else
  dev = cl::sycl::device(cl::sycl::default_selector());
#endif
  cl::sycl::queue queue = cl::sycl::queue(dev, exception_handler);
  cl::sycl::context ctx = queue.get_context();
  // Name of the device to run on
  std::string dev_name =
      queue.get_device().get_info<cl::sycl::info::device::name>();
  std::cout << "Using device \"" << dev_name << "\"" << std::endl;

  // Ensure device can handle USM device allocations.
  if (!queue.get_device()
           .get_info<cl::sycl::info::device::usm_device_allocations>()) {
    std::cout << "ERROR :: device \"" << dev_name
              << "\" does not support usm_device_allocations!" << std::endl;
    return 1;
  }
  int hostArray[42];
  int* deviceArray = (int*)malloc_device(42 * sizeof(int), dev, ctx);
  for (int i = 0; i < 42; i++) hostArray = 42;
  queue
      .submit([&](cl::sycl::handler& h) {
        // copy hostArray to deviceArray
        h.memcpy(deviceArray, &hostArray[0], 42 * sizeof(int));
      })
      .wait();

  std::cout << "[Before mod] deviceArray[10] = " << deviceArray[10]
            << std::endl;

  queue.submit([&](cl::sycl::handler& h) {
    h.parallel_for<class foo>(
        cl::sycl::range<1>{42},
        // lambda-capture so we get the actual device memory
        [=](cl::sycl::id<1> ID) {
          int i = ID[0];
          dev_arr++;
        });
  });
  queue.wait();

  std::cout << "[After mod] deviceArray[10] = " << deviceArray[10] << std::endl;

  return 0;
}

Pascuzzi__Vincent · ‎05-07-2020

I was just informed by an Intel colleague that the reason this works is because I'm running on an Intel integrated GPU, and these do not have their own dedicated memory; the memory is shared with the host and sits on the same silicon.

RahulV_intel · ‎05-07-2020

Hi Vincent,

Integrated GPU(iGPU) shares memory with the host and that could be one of the reasons for such behavior. However, this shouldn't be the case on a discrete GPU.

--Rahul

Pascuzzi__Vincent · ‎05-08-2020

Vaidya, Rahul (Intel) wrote:
Integrated GPU(iGPU) shares memory with the host and that could be one of the reasons for such behavior.

Indeed, this is what I figured -- makes sense (and gives me piece of mind!).

Vaidya, Rahul (Intel) wrote:
However, this shouldn't be the case on a discrete GPU.

Right again!

I can build and run the same code, modulo the `cout`s, using the Intel llvm CUDA support:

[dpcpp | clang++] -O2 -fsycl -std=c++17 -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -Wno-unknown-cuda-version -o mwe_malloc_device mwe_malloc_device.cc

I get a segfault when trying to access the device memory outside a kernel — e.g. using `cout` here. If I replace the `cout` calls with `cl::sycl::stream`s, it works as expected on the CUDA device:

…
 // Outputs 42
 queue
     .submit([&](cl::sycl::handler& cgh) {
       cl::sycl::stream out(1024, 256, cgh);
       cgh.single_task<class print1>(
           [=] { out << deviceArray[10] << cl::sycl::endl; });
     })
     .wait_and_throw();

 queue.submit([&](cl::sycl::handler& h) {
   h.parallel_for<class foo>(
       cl::sycl::range<1>{42},
       // lambda-capture so we get the actual device memory
       [=](cl::sycl::id<1> ID) {
         int i = ID[0];
         deviceArray++;
       });
 });
 queue.wait();

 // Outputs 43
 queue
     .submit([&](cl::sycl::handler& cgh) {
       cl::sycl::stream out(1024, 256, cgh);
       cgh.single_task<class print2>(
           [=] { out << deviceArray[10] << cl::sycl::endl; });
     })
     .wait_and_throw();

return 0;

So indeed USM works fine using a discrete CUDA device in this simple example. Looking forward to getting my hands on an Intel Xe device!

Cheers,

Vince

RahulV_intel · ‎05-11-2020

Hi,

That's great! Thanks for the confirmation.

Let us know if we can close this thread.

--Rahul

Pascuzzi__Vincent · ‎05-12-2020

Vaidya, Rahul (Intel) wrote:
Let us know if we can close this thread.

Please do.

Cheers,

Vince

RahulV_intel · ‎05-12-2020

Thanks for the update, Vince. We will go ahead and close this thread. Feel free to post a new thread if you have any further queries.

--Rahul