I am writing an application that makes use of multiple GPUs. I currently have four Intel Arc A750 cards in a rig, but the problem is reproducible on my dev desktop, which has only one GPU.
I haven't found a convenient way to select the GPUs. This is how I do it:
I call platform::get_platforms()
Select "Intel(R) oneAPI Unified Runtime over Level-Zero" platform
Then call this_platform.get_devices(info::device_type::gpu)
And use the devices when instantiating the queues:
queue q(compute_device, handle_async_error);
constexpr unsigned int BLOCK_SIZE = 16 * 1024 * 1024 * 4 * 2;
char* gpu_data1 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data2 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data3 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data4 = malloc_shared<char>(BLOCK_SIZE, q);
event memEvent2 = q.prefetch(gpu_data1, BLOCK_SIZE); // only gpu_data1
q.submit([&](handler& h) {
    h.depends_on(memEvent2);
    h.parallel_for(range{ len }, [=](id<1> i) {
        // gpu_data1 available here
    });
});
gpu_data2, gpu_data3, and gpu_data4 are still accessible from the host when using gpu_selector_v, but if I choose a device via platform.get_devices(), all of the shared memory (gpu_data1-4) becomes unavailable from the host. It seems the prefetch prefetched too much: the memory pages are protected (in use by the GPU). I have verified that the device belongs to the same platform, "Intel(R) oneAPI Unified Runtime over Level-Zero".
I don't understand why there is a difference between gpu_selector_v and get_devices() when they choose the same device. Either I don't understand prefetch, or there might be a bug.
I worked around it using malloc_host and malloc_device, and manually copying the data:
event memEvent2 = q.memcpy(GpuData, HostData, BLOCK_SIZE);
h.depends_on(memEvent2);
Works like a charm.
I am using DPC++ 2025 on Windows.
Can you share a test code which can reproduce the change in prefetch behavior when using gpu_selector_v vs get_devices?
Hmmm, now I am confused.
I still get a crash, but I see no difference between gpu_selector_v and get_devices().
It seems the crash only happens under the debugger (Visual Studio 2022, DPC++ 2025); if I run it from the oneAPI command prompt, it seems to work.
Both mem_advise and prefetch seem to crash.
The first call to prefetch on buffer 1 invalidates buffer 1 on the host side (correct), but it also invalidates buffer 2 on the host side (wrong), so I cannot populate it.
Below is just a toy program: I send over an array of numbers, and it squares each value and compares the result to 2500.
In my real program I have a loop that prepares data in buffers, submits some work, then prepares new data to submit, so I use several buffers. When I submit a job, I want to prefetch the data for that particular job.
I worked around the crash by using malloc_host, malloc_device, and q.memcpy.
That variant also works under the debugger.
//==============================================================
// Vector Add is the equivalent of a Hello, World! sample for data parallel
// programs. Building and running the sample verifies that your development
// environment is setup correctly and demonstrates the use of the core features
// of SYCL. This sample runs on both CPU and GPU (or FPGA). When run, it
// computes on both the CPU and offload device, then compares results. If the
// code executes on both CPU and offload device, the device name and a success
// message are displayed. And, your development environment is setup correctly!
//
// For comprehensive instructions regarding SYCL Programming, go to
// https://software.intel.com/en-us/oneapi-programming-guide and search based on
// relevant terms noted in the comments.
//
// SYCL material used in the code sample:
// • A one dimensional array of data.
// • A device queue, buffer, accessor, and kernel.
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>
#include <string>
#include <array>

using namespace sycl;

// Create an exception handler for asynchronous SYCL exceptions
static auto exception_handler = [](sycl::exception_list e_list) {
  for (std::exception_ptr const& e : e_list) {
    try {
      std::rethrow_exception(e);
    }
    catch (std::exception const& e) {
#if _DEBUG
      std::cout << "Failure" << std::endl;
#endif
      std::terminate();
    }
  }
};

int main(int argc, char* argv[]) {
  constexpr int CONST_QUADRATIC_NUMBER = 2500;
  int outer_result = -1;
  try {
    queue q(default_selector_v, exception_handler);
    std::cout << "Running on device: "
              << q.get_device().get_info<info::device::name>() << "\n";

    constexpr int CONST_BLOCK_SIZE = 1000;
    auto shared_mem_input_1 = malloc_shared<int>(CONST_BLOCK_SIZE, q);
    // shared_mem_input_2 becomes inaccessible
    auto shared_mem_input_2 = malloc_shared<int>(CONST_BLOCK_SIZE, q);

    std::array<int, 1> result = { 0 };
    std::array<int, 1> parameter = { 0 };
    parameter[0] = CONST_QUADRATIC_NUMBER;

    for (int i = 0; i < CONST_BLOCK_SIZE; i++) {
      shared_mem_input_1[i] = i;
    }

    {
      buffer result_buffer{ result };
      buffer parameter_buffer{ parameter };

      int HW_SPECIFIC_ADVICE_RO = 0;
      //auto memEvent1 = q.mem_advise(shared_mem_input_1, CONST_BLOCK_SIZE * sizeof(int), HW_SPECIFIC_ADVICE_RO);
      // Note: prefetch takes a size in bytes, not elements.
      auto memEvent1 = q.prefetch(shared_mem_input_1, CONST_BLOCK_SIZE * sizeof(int));

      event running_event_1 = q.submit([&](handler& h) {
        h.depends_on(memEvent1);
        accessor device_result(result_buffer, h, write_only);
        accessor device_parameter(parameter_buffer, h, read_only);
        h.parallel_for(range{ CONST_BLOCK_SIZE }, [=](id<1> pIndex) {
          int val = shared_mem_input_1[pIndex];
          int quadratic = val * val;
          if (quadratic == device_parameter[0]) {
            device_result[0] = val;
          }
        });
      });

      // Crashes here BEGIN
      for (int i = 0; i < CONST_BLOCK_SIZE; i++) {
        shared_mem_input_2[i] = i;
      }
      // Crashes here END

      //auto memEvent2 = q.mem_advise(shared_mem_input_2, CONST_BLOCK_SIZE * sizeof(int), HW_SPECIFIC_ADVICE_RO);
      auto memEvent2 = q.prefetch(shared_mem_input_2, CONST_BLOCK_SIZE * sizeof(int));

      event running_event_2 = q.submit([&](handler& h) {
        h.depends_on(memEvent2);
        accessor device_result(result_buffer, h, write_only);
        accessor device_parameter(parameter_buffer, h, read_only);
        h.parallel_for(range{ CONST_BLOCK_SIZE }, [=](id<1> pIndex) {
          int val = shared_mem_input_2[pIndex];
          int quadratic = val * val;
          if (quadratic == device_parameter[0]) {
            device_result[0] = val;
          }
        });
      });

      q.wait();
    }  // buffers go out of scope here, copying results back to the host arrays

    outer_result = result[0];
    free(shared_mem_input_1, q);
    free(shared_mem_input_2, q);
  }
  catch (exception const& e) {
    std::cout << "An exception is caught for vector add.\n";
    std::terminate();
  }

  if (outer_result > 0) {
    printf("%i is the square root of %i\n", outer_result, CONST_QUADRATIC_NUMBER);
  }
  else {
    printf("Square root of %i was not found\n", CONST_QUADRATIC_NUMBER);
  }
  return 0;
}
Thanks for the reproducer, I'll investigate this further to see what's going on.
Hi @mhogstrom,
I cannot reproduce the crash, even when running through the VS IDE.
Can you try upgrading to the latest compiler (2025.0.4) and driver (32.0.101.6129) and give it another shot?
Hi, I encountered the same problem on a B580.
Switching to the OpenCL backend did not help; the same issue persists.
Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205)
Target: x86_64-pc-windows-msvc
Thread model: posix
Driver: 32.0.101.6256
#include <vector>
#include <iostream>
#include <sycl/sycl.hpp>

int main() try
{
    sycl::queue q{sycl::default_selector_v};
    constexpr auto size = 128;
    std::cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << '\n'
              << "Backend: " << q.get_backend() << '\n';

    // USM shared allocations, managed through usm_allocator-backed std::vectors
    std::vector a(size, 1, sycl::usm_allocator<int, sycl::usm::alloc::shared>{q});
    std::vector b(size, 1, sycl::usm_allocator<int, sycl::usm::alloc::shared>{q});
    std::vector c(size, 1, sycl::usm_allocator<int, sycl::usm::alloc::shared>{q});

    auto A = a.data();
    auto B = b.data();
    auto C = c.data();

    q.submit([&](sycl::handler& h) {
        h.parallel_for(sycl::range<1>(size),
                       [=](sycl::id<1> idx) { C[idx] = A[idx] + B[idx]; });
    }).wait();

    // Reading c on the host after the kernel triggers migration back from the device
    for (int i = 0; i < size; i++)
        std::cout << c[i] << std::endl;
}
catch (const std::exception& e)
{
    std::cerr << e.what() << '\n';
}
The VS debugger reports an exception: read access violation.
Hi @YiHang and @mhogstrom,
For USM shared memory, the implicit migration of shared allocations between device and host is implemented using access-violation mechanisms to detect access from the host. This is why you see the access violation only when running under the debugger: the debugger intercepts the fault as a first-chance exception before the driver handles it. Pressing "Continue"/"continue debugging" should allow the driver to handle this flow properly without crashing the application.
Thank you for clarifying this; it is indeed the case.