I am writing an application that uses multiple GPUs.
I currently have 4 Intel Arc A750 cards in a rig,
but the problem is reproducible on my dev desktop, which has only 1 GPU.
I haven't found a convenient way to select the GPUs. This is how I do it:
I call platform::get_platforms(),
select the "Intel(R) oneAPI Unified Runtime over Level-Zero" platform,
then call this_platform.get_devices(info::device_type::gpu),
and use those devices when instantiating the queues:
queue q(compute_device, handle_async_error);
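In full, my selection logic looks roughly like this (a minimal sketch; the function name and the substring match on the platform name are my own choices, not anything prescribed by the API):

    #include <sycl/sycl.hpp>
    #include <string>
    #include <vector>
    using namespace sycl;

    // Collect every GPU device on the Level-Zero platform and build one queue per device.
    std::vector<queue> make_gpu_queues(const async_handler& handler) {
        std::vector<queue> queues;
        for (const auto& p : platform::get_platforms()) {
            if (p.get_info<info::platform::name>().find("Level-Zero") == std::string::npos)
                continue; // skip OpenCL / CPU platforms
            for (const auto& d : p.get_devices(info::device_type::gpu))
                queues.emplace_back(d, handler); // one queue per GPU
        }
        return queues;
    }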
constexpr size_t BLOCK_SIZE = 16 * 1024 * 1024 * 4 * 2;
char* gpu_data1 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data2 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data3 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data4 = malloc_shared<char>(BLOCK_SIZE, q);
event memEvent2 = q.prefetch(gpu_data1, BLOCK_SIZE); // only gpu_data1
q.submit([&](handler& h) {
    h.depends_on(memEvent2);
    h.parallel_for(range{ BLOCK_SIZE }, [=](id<1> i) {
        // gpu_data1 available
    });
});
gpu_data2, gpu_data3, and gpu_data4 are still accessible from the host when using gpu_selector_v, but if I choose a device via platform.get_devices(),
all of the shared memory (gpu_data1-4) becomes unavailable from the host. It seems the prefetch prefetched too much: the memory pages are protected (in use by the GPU). I have verified that the device belongs to the same platform, "Intel(R) oneAPI Unified Runtime over Level-Zero".
I don't understand why there is a difference between gpu_selector_v and get_devices if they choose the same device. Either I don't understand prefetch, or there might be a bug.
I worked around it using malloc_host and malloc_device, and by manually copying the data:
event memEvent2 = q.memcpy(GpuData, HostData, BLOCK_SIZE);
h.depends_on(memEvent2);
Works like a charm.
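The workaround in context looks roughly like this (a sketch of my approach; the buffer names and sizes are illustrative):

    #include <sycl/sycl.hpp>
    using namespace sycl;

    int main() {
        queue q{ gpu_selector_v };
        constexpr size_t BLOCK_SIZE = 1024;

        // Pinned host staging buffer plus a device-only buffer.
        char* HostData = malloc_host<char>(BLOCK_SIZE, q);
        char* GpuData  = malloc_device<char>(BLOCK_SIZE, q);

        for (size_t i = 0; i < BLOCK_SIZE; ++i)
            HostData[i] = char(i); // host memory stays writable at all times

        // Explicit async copy replaces prefetch; the kernel waits on the copy event.
        event memEvent2 = q.memcpy(GpuData, HostData, BLOCK_SIZE);
        q.submit([&](handler& h) {
            h.depends_on(memEvent2);
            h.parallel_for(range{ BLOCK_SIZE }, [=](id<1> i) {
                // GpuData[i] is available on the device here
            });
        }).wait();

        free(GpuData, q);
        free(HostData, q);
    }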
I am using DPC++ 2025 on Windows.
Can you share a test code which can reproduce the change in prefetch behavior when using gpu_selector_v vs get_devices?
Hmmm, now I am confused.
I still get a crash, but I see no difference between gpu_selector_v and get_devices.
It seems the crash only happens under the debugger (Visual Studio 2022, DPC++ 2025);
if I run it from the oneAPI command prompt, it seems to work.
Both mem_advise and prefetch seem to crash.
The first call to prefetch on buffer 1 invalidates buffer 1 on the host side (correct),
but it also invalidates buffer 2 on the host side (wrong),
so I cannot populate it.
Below is just a toy program. It sends over an array of numbers,
computes x^2 of each value, and compares it to 2500.
In my real program I have a loop that prepares data in buffers,
submits some work,
then prepares new data to submit,
so I use several buffers.
When I submit a job, I want to prefetch the data for that particular job.
I worked around the crash by using
malloc_host, malloc_device, and q.memcpy.
That variant also works under the debugger.
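The loop in my real program follows a ping-pong pattern like this (sketched with the memcpy workaround; the two-slot layout and names are illustrative, not my production code):

    #include <sycl/sycl.hpp>
    using namespace sycl;

    void pipeline(queue& q, size_t n_jobs) {
        constexpr size_t BLOCK = 4096;
        int* host[2] = { malloc_host<int>(BLOCK, q),   malloc_host<int>(BLOCK, q) };
        int* dev[2]  = { malloc_device<int>(BLOCK, q), malloc_device<int>(BLOCK, q) };
        event done[2]; // default-constructed events are already complete

        for (size_t job = 0; job < n_jobs; ++job) {
            int slot = job % 2;
            done[slot].wait();                 // wait until this slot's previous job finished
            for (size_t i = 0; i < BLOCK; ++i) // prepare the next batch on the host
                host[slot][i] = int(job + i);
            event copied = q.memcpy(dev[slot], host[slot], BLOCK * sizeof(int));
            done[slot] = q.submit([&](handler& h) {
                h.depends_on(copied);
                h.parallel_for(range{ BLOCK }, [=, d = dev[slot]](id<1> i) {
                    d[i] = d[i] * d[i];        // the actual work
                });
            });
        }
        q.wait();
        for (int s = 0; s < 2; ++s) { free(host[s], q); free(dev[s], q); }
    }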
//==============================================================
// Vector Add is the equivalent of a Hello, World! sample for data parallel
// programs. Building and running the sample verifies that your development
// environment is setup correctly and demonstrates the use of the core features
// of SYCL. This sample runs on both CPU and GPU (or FPGA). When run, it
// computes on both the CPU and offload device, then compares results. If the
// code executes on both CPU and offload device, the device name and a success
// message are displayed. And, your development environment is setup correctly!
//
// For comprehensive instructions regarding SYCL Programming, go to
// https://software.intel.com/en-us/oneapi-programming-guide and search based on
// relevant terms noted in the comments.
//
// SYCL material used in the code sample:
// • A one dimensional array of data.
// • A device queue, buffer, accessor, and kernel.
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <cstdio>
#include <array>

using namespace sycl;

// Create an exception handler for asynchronous SYCL exceptions
static auto exception_handler = [](sycl::exception_list e_list) {
    for (std::exception_ptr const& e : e_list) {
        try {
            std::rethrow_exception(e);
        }
        catch (std::exception const& e) {
#if _DEBUG
            std::cout << "Failure: " << e.what() << std::endl;
#endif
            std::terminate();
        }
    }
};

int main(int argc, char* argv[]) {
    constexpr int CONST_QUADRATIC_NUMBER = 2500;
    int outer_result = -1;
    try {
        queue q(default_selector_v, exception_handler);
        std::cout << "Running on device: "
                  << q.get_device().get_info<info::device::name>() << "\n";

        constexpr int CONST_BLOCK_SIZE = 1000;
        auto shared_mem_input_1 = malloc_shared<int>(CONST_BLOCK_SIZE, q);
        // shared_mem_input_2 becomes inaccessible
        auto shared_mem_input_2 = malloc_shared<int>(CONST_BLOCK_SIZE, q);

        std::array<int, 1> result = { 0 };
        std::array<int, 1> parameter = { 0 };
        parameter[0] = CONST_QUADRATIC_NUMBER;

        for (int i = 0; i < CONST_BLOCK_SIZE; i++) {
            shared_mem_input_1[i] = i;
        }

        {
            buffer result_buffer{ result };
            buffer parameter_buffer{ parameter };

            int HW_SPECIFIC_ADVICE_RO = 0;
            //auto memEvent1 = q.mem_advise(shared_mem_input_1, CONST_BLOCK_SIZE * sizeof(int), HW_SPECIFIC_ADVICE_RO);
            // Note: prefetch takes a size in bytes, not an element count.
            auto memEvent1 = q.prefetch(shared_mem_input_1, CONST_BLOCK_SIZE * sizeof(int));
            event running_event_1 = q.submit([&](handler& h) {
                h.depends_on(memEvent1);
                accessor device_result(result_buffer, h, write_only);
                accessor device_parameter(parameter_buffer, h, read_only);
                h.parallel_for(range{ CONST_BLOCK_SIZE }, [=](id<1> pIndex) {
                    int val = shared_mem_input_1[pIndex];
                    int quadratic = val * val;
                    if (quadratic == device_parameter[0]) {
                        device_result[0] = val;
                    }
                });
            });

            // Crashes here BEGIN
            for (int i = 0; i < CONST_BLOCK_SIZE; i++) {
                shared_mem_input_2[i] = i;
            }
            // Crashes here END

            //auto memEvent2 = q.mem_advise(shared_mem_input_2, CONST_BLOCK_SIZE * sizeof(int), HW_SPECIFIC_ADVICE_RO);
            auto memEvent2 = q.prefetch(shared_mem_input_2, CONST_BLOCK_SIZE * sizeof(int));
            event running_event_2 = q.submit([&](handler& h) {
                h.depends_on(memEvent2);
                accessor device_result(result_buffer, h, write_only);
                accessor device_parameter(parameter_buffer, h, read_only);
                h.parallel_for(range{ CONST_BLOCK_SIZE }, [=](id<1> pIndex) {
                    int val = shared_mem_input_2[pIndex];
                    int quadratic = val * val;
                    if (quadratic == device_parameter[0]) {
                        device_result[0] = val;
                    }
                });
            });
            q.wait();
        } // buffers go out of scope here and copy their results back to the host

        outer_result = result[0];
        free(shared_mem_input_1, q);
        free(shared_mem_input_2, q);
    }
    catch (exception const& e) {
        std::cout << "An exception is caught for vector add: " << e.what() << "\n";
        std::terminate();
    }

    if (outer_result > 0) {
        printf("%i is the square root of %i\n", outer_result, CONST_QUADRATIC_NUMBER);
    }
    else {
        printf("Square root of %i was not found\n", CONST_QUADRATIC_NUMBER);
    }
    return 0;
}
Thanks for the reproducer, I'll investigate this further to see what's going on.