Hi,
I am using `malloc_device` to allocate device-side (Iris iGPU) memory through USM. An example of this is:
int hostArray[42];
int* deviceArray = (int*)malloc_device(42 * sizeof(int), dev, ctx_);
for (int i = 0; i < 42; i++) hostArray[i] = 42;
queue_.submit([&](cl::sycl::handler& h) {
// copy hostArray to deviceArray
h.memcpy(deviceArray, &hostArray[0], 42 * sizeof(int));
});
queue_.wait();
Now, what’s odd to me is that I can access `deviceArray` outside a kernel, like this:
std::cout << "deviceArray[10] = " << deviceArray[10] << std::endl;
Because of “unified addressing” [1], and since all USM allocations are done “on the host” [2], I assumed that I was outputting the `hostArray` memory address. Fine. But then I modified the value of `deviceArray[10]` inside a kernel:
queue_.submit([&](cl::sycl::handler& h) {
h.parallel_for<class foo>(cl::sycl::range<1>{42}, [=](cl::sycl::id<1> ID) {
int i = ID[0];
deviceArray[i]++;
});
});
queue_.wait();
and I was still able to access `deviceArray` outside a kernel in the same way as above, and the modified value was printed. I could understand this working if I were using the “host device”, but I can’t wrap my head around accessing “GPU device” memory outside a kernel, especially after it has been modified inside a kernel. A complete MWE is at the bottom of this post; I suggest running it on an Iris node.
Can someone explain this to me? Why/how is it possible to access memory allocated with `malloc_device` outside a kernel? If it’s because the allocation is made by the host (and duplicated on the host), why do modifications made inside a kernel affect the host memory? `malloc_shared`, where data is migrated back and forth between the host and device, would be a different story.
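For contrast, here is a minimal `malloc_shared` sketch of the same pattern (hypothetical, not part of the MWE below), where dereferencing the pointer on the host is explicitly well-defined because the runtime migrates the allocation back and forth:
// Hypothetical malloc_shared variant: the runtime migrates this
// allocation between host and device, so host-side dereferencing
// after the kernel is well-defined.
int* sharedArray = (int*)malloc_shared(42 * sizeof(int), dev, ctx_);
for (int i = 0; i < 42; i++) sharedArray[i] = 42;
queue_.submit([&](cl::sycl::handler& h) {
  h.parallel_for<class bar>(cl::sycl::range<1>{42},
                            [=](cl::sycl::id<1> ID) { sharedArray[ID[0]]++; });
});
queue_.wait();
std::cout << "sharedArray[10] = " << sharedArray[10] << std::endl;  // prints 43
cl::sycl::free(sharedArray, ctx_);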
Thanks,
Vince
[1] https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc
[2] https://www.colfax-intl.com/downloads/oneAPI_module04_DPCplusplusFundamentals2of2.pdf (slide 23)
// Compile with:
[dpcpp | clang++] -fsycl -o mwe_malloc_device mwe_malloc_device.cc
//
// mwe_malloc_device.cc
//
#include <CL/sycl.hpp>
#include <iostream>
#ifdef USE_PI_CUDA
class CUDASelector : public cl::sycl::device_selector {
public:
int operator()(const cl::sycl::device& Device) const override {
using namespace cl::sycl::info;
const std::string DeviceName = Device.get_info<device::name>();
const std::string DeviceVendor = Device.get_info<device::vendor>();
const std::string DeviceDriver =
Device.get_info<cl::sycl::info::device::driver_version>();
if (Device.is_gpu() && (DeviceVendor.find("NVIDIA") != std::string::npos) &&
(DeviceDriver.find("CUDA") != std::string::npos)) {
return 1;
}
return -1;
}
};
#endif
int main() {
// Catch asynchronous exceptions
auto exception_handler = [](cl::sycl::exception_list exceptions) {
for (std::exception_ptr const& e : exceptions) {
try {
std::rethrow_exception(e);
} catch (cl::sycl::exception const& e) {
std::cout << "Caught asynchronous SYCL exception during generation:\n"
<< e.what() << std::endl;
}
}
};
// Initialize device, queue and context
cl::sycl::device dev;
#ifdef USE_PI_CUDA
CUDASelector cuda_selector;
try {
dev = cl::sycl::device(cuda_selector);
} catch (...) {
}
#elif USE_SYCL_CPU
dev = cl::sycl::device(cl::sycl::cpu_selector());
#elif USE_SYCL_GPU
dev = cl::sycl::device(cl::sycl::gpu_selector());
#else
dev = cl::sycl::device(cl::sycl::default_selector());
#endif
cl::sycl::queue queue = cl::sycl::queue(dev, exception_handler);
cl::sycl::context ctx = queue.get_context();
// Name of the device to run on
std::string dev_name =
queue.get_device().get_info<cl::sycl::info::device::name>();
std::cout << "Using device \"" << dev_name << "\"" << std::endl;
// Ensure device can handle USM device allocations.
if (!queue.get_device()
.get_info<cl::sycl::info::device::usm_device_allocations>()) {
std::cout << "ERROR :: device \"" << dev_name
<< "\" does not support usm_device_allocations!" << std::endl;
return 1;
}
int hostArray[42];
int* deviceArray = (int*)malloc_device(42 * sizeof(int), dev, ctx);
for (int i = 0; i < 42; i++) hostArray[i] = 42;
queue
.submit([&](cl::sycl::handler& h) {
// copy hostArray to deviceArray
h.memcpy(deviceArray, &hostArray[0], 42 * sizeof(int));
})
.wait();
std::cout << "[Before mod] deviceArray[10] = " << deviceArray[10]
<< std::endl;
queue.submit([&](cl::sycl::handler& h) {
h.parallel_for<class foo>(
cl::sycl::range<1>{42},
// lambda-capture so we get the actual device memory
[=](cl::sycl::id<1> ID) {
int i = ID[0];
deviceArray[i]++;
});
});
queue.wait();
std::cout << "[After mod] deviceArray[10] = " << deviceArray[10] << std::endl;
return 0;
}
I was just informed by an Intel colleague that the reason this works is that I'm running on an Intel integrated GPU, and these do not have their own dedicated memory; the memory is shared with the host and sits on the same silicon.
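For anyone curious, here is a minimal sketch for confirming what the runtime thinks the pointer is, assuming the `get_pointer_type` query from the USM extension [1] is available in your compiler build:
// Query the USM allocation kind of deviceArray (names from the MWE above).
// Even on an iGPU, where the host can dereference the pointer, this
// should still report a device allocation.
cl::sycl::usm::alloc kind = cl::sycl::get_pointer_type(deviceArray, ctx);
std::cout << "device allocation? "
          << (kind == cl::sycl::usm::alloc::device) << std::endl;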
Hi Vincent,
An integrated GPU (iGPU) shares memory with the host, and that could be one of the reasons for such behavior. However, this shouldn't be the case on a discrete GPU.
--Rahul
Vaidya, Rahul (Intel) wrote: An integrated GPU (iGPU) shares memory with the host, and that could be one of the reasons for such behavior.
Indeed, this is what I figured; it makes sense (and gives me peace of mind!).
Vaidya, Rahul (Intel) wrote: However, this shouldn't be the case on a discrete GPU.
Right again!
I can build and run the same code, modulo the `cout`s, using the Intel llvm CUDA support:
[dpcpp | clang++] -O2 -fsycl -std=c++17 -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -Wno-unknown-cuda-version -o mwe_malloc_device mwe_malloc_device.cc
I get a segfault when trying to access the device memory outside a kernel, i.e. via the `cout` calls above. If I replace the `cout`s with `cl::sycl::stream`s, it works as expected on the CUDA device:
…
// Outputs 42
queue
.submit([&](cl::sycl::handler& cgh) {
cl::sycl::stream out(1024, 256, cgh);
cgh.single_task<class print1>(
[=] { out << deviceArray[10] << cl::sycl::endl; });
})
.wait_and_throw();
queue.submit([&](cl::sycl::handler& h) {
h.parallel_for<class foo>(
cl::sycl::range<1>{42},
// lambda-capture so we get the actual device memory
[=](cl::sycl::id<1> ID) {
int i = ID[0];
deviceArray[i]++;
});
});
queue.wait();
// Outputs 43
queue
.submit([&](cl::sycl::handler& cgh) {
cl::sycl::stream out(1024, 256, cgh);
cgh.single_task<class print2>(
[=] { out << deviceArray[10] << cl::sycl::endl; });
})
.wait_and_throw();
return 0;
So indeed USM works fine using a discrete CUDA device in this simple example. Looking forward to getting my hands on an Intel Xe device!
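For completeness, a portable way to print the values on a discrete device, without `cl::sycl::stream`, is to copy the allocation back into host memory first. A minimal sketch, reusing the names from the MWE:
// Copy the device allocation into a host buffer before printing, so the
// host never dereferences the device pointer directly.
int hostCopy[42];
queue
    .submit([&](cl::sycl::handler& h) {
      h.memcpy(&hostCopy[0], deviceArray, 42 * sizeof(int));
    })
    .wait();
std::cout << "deviceArray[10] = " << hostCopy[10] << std::endl;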
Cheers,
Vince
Hi,
That's great! Thanks for the confirmation.
Let us know if we can close this thread.
--Rahul
Vaidya, Rahul (Intel) wrote: Let us know if we can close this thread.
Please do.
Cheers,
Vince
Thanks for the update, Vince. We will go ahead and close this thread. Feel free to post a new thread if you have any further queries.
--Rahul