Hi,
I am using `malloc_device` to allocate device-side memory (on an Iris iGPU) through USM. For example:
```cpp
int hostArray[42];
int* deviceArray = (int*)malloc_device(42 * sizeof(int), dev, ctx_);
for (int i = 0; i < 42; i++) hostArray[i] = 42;
queue_.submit([&](cl::sycl::handler& h) {
  // copy hostArray to deviceArray
  h.memcpy(deviceArray, &hostArray[0], 42 * sizeof(int));
});
queue_.wait();
```
Now, what’s odd to me is that I can access `deviceArray` outside a kernel, like this:
```cpp
std::cout << "deviceArray[10] = " << deviceArray[10] << std::endl;
```
Because of “unified addressing” [1], and since all USM allocations are done “on the host” [2], I assumed that I was outputting the `hostArray` memory address. Fine. But then I modified the value of `deviceArray[10]` inside a kernel:
```cpp
queue_.submit([&](cl::sycl::handler& h) {
  h.parallel_for<class foo>(cl::sycl::range<1>{42}, [=](cl::sycl::id<1> ID) {
    int i = ID[0];
    deviceArray[i]++;
  });
});
queue_.wait();
```
and I was still able to access `deviceArray` outside a kernel in the same way as above, and the modified value was printed. I could understand this working if I were using the “host device”, but I can’t wrap my head around accessing “GPU device” memory outside a kernel, especially after it has been modified inside a kernel. A complete MWE is at the bottom of this post; I suggest running it on an Iris node.
Can someone explain this to me? Why/how is it possible to access memory allocated with `malloc_device` outside a kernel? If it’s because the allocation is made by the host, and the data is duplicated on the host, why do modifications made inside a kernel affect the host copy? `malloc_shared`, where data is migrated back and forth between host and device, would be a different story.
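For contrast, here is a minimal `malloc_shared` sketch of the same loop (reusing the `dev`, `ctx`, and `queue` names from the MWE below; the `sharedArray` name and `bar` kernel are just for illustration). With shared USM, dereferencing the pointer on the host is defined behavior because the runtime migrates the data:

```cpp
// Shared USM: accessible from both host and device; the runtime migrates
// the data as needed, so no explicit memcpy is required.
int* sharedArray = (int*)cl::sycl::malloc_shared(42 * sizeof(int), dev, ctx);
for (int i = 0; i < 42; i++) sharedArray[i] = 42;  // host write: OK

queue.submit([&](cl::sycl::handler& h) {
  h.parallel_for<class bar>(cl::sycl::range<1>{42},
                            [=](cl::sycl::id<1> ID) { sharedArray[ID[0]]++; });
});
queue.wait();

// host read: OK, prints 43
std::cout << "sharedArray[10] = " << sharedArray[10] << std::endl;
cl::sycl::free(sharedArray, ctx);
```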
Thanks,
Vince
[1] https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
[2] https://www.colfax-intl.com/downloads/oneAPI_module04_DPCplusplusFundamentals2of2.pdf (slide 23)
Compile with:
```
[dpcpp | clang++] -fsycl -o mwe_malloc_device mwe_malloc_device.cc
```
```cpp
//
// mwe_malloc_device.cc
//
#include <CL/sycl.hpp>

#include <iostream>

#ifdef USE_PI_CUDA
class CUDASelector : public cl::sycl::device_selector {
 public:
  int operator()(const cl::sycl::device& Device) const override {
    using namespace cl::sycl::info;
    const std::string DeviceName = Device.get_info<device::name>();
    const std::string DeviceVendor = Device.get_info<device::vendor>();
    const std::string DeviceDriver =
        Device.get_info<cl::sycl::info::device::driver_version>();
    if (Device.is_gpu() && (DeviceVendor.find("NVIDIA") != std::string::npos) &&
        (DeviceDriver.find("CUDA") != std::string::npos)) {
      return 1;
    };
    return -1;
  }
};
#endif

int main() {
  // Catch asynchronous exceptions
  auto exception_handler = [](cl::sycl::exception_list exceptions) {
    for (std::exception_ptr const& e : exceptions) {
      try {
        std::rethrow_exception(e);
      } catch (cl::sycl::exception const& e) {
        std::cout << "Caught asynchronous SYCL exception during generation:\n"
                  << e.what() << std::endl;
      }
    }
  };

  // Initialize device, queue and context
  cl::sycl::device dev;
#ifdef USE_PI_CUDA
  CUDASelector cuda_selector;
  try {
    dev = cl::sycl::device(cuda_selector);
  } catch (...) {
  }
#elif USE_SYCL_CPU
  dev = cl::sycl::device(cl::sycl::cpu_selector());
#elif USE_SYCL_GPU
  dev = cl::sycl::device(cl::sycl::gpu_selector());
#else
  dev = cl::sycl::device(cl::sycl::default_selector());
#endif
  cl::sycl::queue queue = cl::sycl::queue(dev, exception_handler);
  cl::sycl::context ctx = queue.get_context();

  // Name of the device to run on
  std::string dev_name =
      queue.get_device().get_info<cl::sycl::info::device::name>();
  std::cout << "Using device \"" << dev_name << "\"" << std::endl;

  // Ensure device can handle USM device allocations.
  if (!queue.get_device()
           .get_info<cl::sycl::info::device::usm_device_allocations>()) {
    std::cout << "ERROR :: device \"" << dev_name
              << "\" does not support usm_device_allocations!" << std::endl;
    return 1;
  }

  int hostArray[42];
  int* deviceArray = (int*)malloc_device(42 * sizeof(int), dev, ctx);
  for (int i = 0; i < 42; i++) hostArray[i] = 42;
  queue
      .submit([&](cl::sycl::handler& h) {
        // copy hostArray to deviceArray
        h.memcpy(deviceArray, &hostArray[0], 42 * sizeof(int));
      })
      .wait();

  std::cout << "[Before mod] deviceArray[10] = " << deviceArray[10]
            << std::endl;

  queue.submit([&](cl::sycl::handler& h) {
    h.parallel_for<class foo>(
        cl::sycl::range<1>{42},
        // lambda-capture so we get the actual device memory
        [=](cl::sycl::id<1> ID) {
          int i = ID[0];
          deviceArray[i]++;
        });
  });
  queue.wait();

  std::cout << "[After mod] deviceArray[10] = " << deviceArray[10]
            << std::endl;

  return 0;
}
```
---
I was just informed by an Intel colleague that the reason this works is that I'm running on an Intel integrated GPU: these do not have their own dedicated memory; the memory is shared with the host and sits on the same silicon.
---
Hi Vincent,
An integrated GPU (iGPU) shares memory with the host, which could explain this behavior. However, this shouldn't be the case on a discrete GPU.
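If you want to check this at runtime, the SYCL 1.2.1 `host_unified_memory` device descriptor reports whether a device shares physical memory with the host; a minimal sketch (note that even when it returns true, dereferencing a device USM pointer on the host is still not something the USM extension guarantees):

```cpp
#include <CL/sycl.hpp>

#include <iostream>

int main() {
  cl::sycl::device dev{cl::sycl::default_selector()};
  // True when the device shares physical memory with the host
  // (typical for an Iris iGPU); false on a discrete GPU.
  bool unified = dev.get_info<cl::sycl::info::device::host_unified_memory>();
  std::cout << "host_unified_memory: " << std::boolalpha << unified
            << std::endl;
  return 0;
}
```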
--Rahul
---
> Vaidya, Rahul (Intel) wrote: An integrated GPU (iGPU) shares memory with the host, which could explain this behavior.
Indeed, this is what I figured; it makes sense (and gives me peace of mind!).
> Vaidya, Rahul (Intel) wrote: However, this shouldn't be the case on a discrete GPU.
Right again!
I can build and run the same code, modulo the `cout`s, using the Intel llvm CUDA support:
```
[dpcpp | clang++] -O2 -fsycl -std=c++17 -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -Wno-unknown-cuda-version -o mwe_malloc_device mwe_malloc_device.cc
```
I get a segfault when trying to access the device memory outside a kernel (e.g. via the `cout` calls). If I replace the `cout`s with `cl::sycl::stream`s, it works as expected on the CUDA device:
```cpp
// …

// Outputs 42
queue
    .submit([&](cl::sycl::handler& cgh) {
      cl::sycl::stream out(1024, 256, cgh);
      cgh.single_task<class print1>(
          [=] { out << deviceArray[10] << cl::sycl::endl; });
    })
    .wait_and_throw();

queue.submit([&](cl::sycl::handler& h) {
  h.parallel_for<class foo>(
      cl::sycl::range<1>{42},
      // lambda-capture so we get the actual device memory
      [=](cl::sycl::id<1> ID) {
        int i = ID[0];
        deviceArray[i]++;
      });
});
queue.wait();

// Outputs 43
queue
    .submit([&](cl::sycl::handler& cgh) {
      cl::sycl::stream out(1024, 256, cgh);
      cgh.single_task<class print2>(
          [=] { out << deviceArray[10] << cl::sycl::endl; });
    })
    .wait_and_throw();

return 0;
```
So indeed USM works fine using a discrete CUDA device in this simple example. Looking forward to getting my hands on an Intel Xe device!
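For portability, the well-defined way to inspect device USM from the host, on integrated and discrete GPUs alike, is to copy it back first; a minimal sketch reusing `queue` and `deviceArray` from above (the `hostCopy` buffer is just an illustration name):

```cpp
// Copy the device buffer back into host-owned memory before printing;
// this is well-defined regardless of where the allocation physically lives.
int hostCopy[42];
queue
    .submit([&](cl::sycl::handler& h) {
      h.memcpy(&hostCopy[0], deviceArray, 42 * sizeof(int));
    })
    .wait();
std::cout << "deviceArray[10] (via host copy) = " << hostCopy[10] << std::endl;
```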
Cheers,
Vince
---
Hi,
That's great! Thanks for the confirmation.
Let us know if we can close this thread.
--Rahul
---
> Vaidya, Rahul (Intel) wrote: Let us know if we can close this thread.
Please do.
Cheers,
Vince
---
Thanks for the update, Vince. We will go ahead and close this thread. Feel free to post a new thread if you have any further queries.
--Rahul