I am writing an application that makes use of multiple GPUs.
I currently have 4 Intel Arc A750 cards in a rig,
but the problem is reproducible on my dev desktop, which has only one GPU.
I haven't found a convenient way to select the GPUs. This is how I do it:
I call platform::get_platforms(),
select the "Intel(R) oneAPI Unified Runtime over Level-Zero" platform,
then call this_platform.get_devices(info::device_type::gpu),
and use those devices when instantiating the queues:
queue q(compute_device, handle_async_error);
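A minimal sketch of that selection path (the async handler is a stand-in for my real one, and the exact platform name string may differ between driver versions):

#include <sycl/sycl.hpp>
#include <iostream>
#include <string>
#include <vector>

using namespace sycl;

// Stand-in for the real async error handler.
void handle_async_error(exception_list el) {
    for (auto& e : el) {
        try { std::rethrow_exception(e); }
        catch (exception const& ex) { std::cerr << ex.what() << "\n"; }
    }
}

int main() {
    std::vector<queue> queues;
    for (auto& this_platform : platform::get_platforms()) {
        // Keep only the Level-Zero platform.
        if (this_platform.get_info<info::platform::name>()
                != "Intel(R) oneAPI Unified Runtime over Level-Zero")
            continue;
        // One queue per GPU device on that platform.
        for (auto& compute_device : this_platform.get_devices(info::device_type::gpu)) {
            queues.emplace_back(compute_device, handle_async_error);
        }
    }
    std::cout << "Created " << queues.size() << " GPU queue(s)\n";
    return 0;
}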
constexpr unsigned int BLOCK_SIZE = 16 * 1024 * 1024 * 4 * 2; // 128 MiB
char* gpu_data1 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data2 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data3 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data4 = malloc_shared<char>(BLOCK_SIZE, q);

event memEvent2 = q.prefetch(gpu_data1, BLOCK_SIZE); // only gpu_data1
q.submit([&](handler& h) {
    h.depends_on(memEvent2);
    h.parallel_for(range{ BLOCK_SIZE }, [=](id<1> i) {
        // gpu_data1 is available here
    });
});
gpu_data2, gpu_data3, and gpu_data4 are still accessible from the host if I use gpu_selector_v, but if I choose a device via platform.get_devices(),
all of the shared memory (gpu_data1-4) becomes unavailable from the host. It seems as if the prefetch prefetched too much: the memory pages are protected (in use by the GPU). I have made sure the device belongs to the same platform, "Intel(R) oneAPI Unified Runtime over Level-Zero".
I don't understand why there is a difference between gpu_selector_v and get_devices() if they pick the same device. Either I don't understand prefetch, or there might be a bug.
I worked around it using malloc_host and malloc_device, and copying the data manually:
event memEvent2 = q.memcpy(GpuData, HostData, BLOCK_SIZE);
h.depends_on(memEvent2);
Works like a charm.
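For reference, a fuller sketch of that variant (sizes are illustrative; q and BLOCK_SIZE are the ones from above):

// Explicit host + device USM with a manual copy, instead of malloc_shared + prefetch.
char* HostData = malloc_host<char>(BLOCK_SIZE, q);   // stays accessible on the host
char* GpuData  = malloc_device<char>(BLOCK_SIZE, q); // device-only allocation

// ... fill HostData on the host ...

event memEvent2 = q.memcpy(GpuData, HostData, BLOCK_SIZE);
q.submit([&](handler& h) {
    h.depends_on(memEvent2);
    h.parallel_for(range{ BLOCK_SIZE }, [=](id<1> i) {
        GpuData[i] += 1; // GpuData is valid here; HostData is never touched by the device
    });
});
q.wait();
free(GpuData, q);
free(HostData, q);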
I am using DPC++ 2025 on Windows.
Can you share test code that reproduces the change in prefetch behavior when using gpu_selector_v vs. get_devices()?
Hmmm.
Now I am confused.
I still get a crash, but I see no difference between gpu_selector_v and get_devices().
It seems the crash only happens under the debugger (Visual Studio 2022, DPC++ 2025).
If I run it from the oneAPI command prompt, it seems to work.
Both mem_advise and prefetch seem to crash.
The first call to prefetch on buffer 1 invalidates buffer 1 on the host side (correct),
but it also invalidates buffer 2 on the host side (wrong),
so I cannot populate it.
Below is just a toy program. I send over an array of numbers;
it calculates x^2 for each value and compares it to 2500.
In my real program I have a loop that prepares data in buffers,
then submits some work,
then prepares new data to submit,
so I use several buffers.
When I submit a job, I want to prefetch the data for that particular job.
I worked around the crash by using
malloc_host, malloc_device, and q.memcpy.
That variant also works under the debugger.
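Roughly, the buffered variant of the loop looks like this (two buffers; everything here is illustrative and assumes an existing queue q, not the reproducer below):

constexpr size_t N = 1000;
constexpr int NUM_JOBS = 8;                                   // illustrative
int* host_buf[2] = { malloc_host<int>(N, q),   malloc_host<int>(N, q)   };
int* dev_buf[2]  = { malloc_device<int>(N, q), malloc_device<int>(N, q) };
event done[2];

for (int job = 0; job < NUM_JOBS; ++job) {
    int slot = job % 2;
    if (job >= 2)
        done[slot].wait();                                    // this slot's previous job has finished
    for (size_t i = 0; i < N; ++i)
        host_buf[slot][i] = int(job * N + i);                 // prepare the next batch on the host
    event copy = q.memcpy(dev_buf[slot], host_buf[slot], N * sizeof(int));
    done[slot] = q.submit([&](handler& h) {
        h.depends_on(copy);
        auto data = dev_buf[slot];
        h.parallel_for(range{ N }, [=](id<1> i) {
            data[i] = data[i] * data[i];                      // stand-in for the real kernel
        });
    });
}
q.wait();
for (int s = 0; s < 2; ++s) { free(host_buf[s], q); free(dev_buf[s], q); }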
//==============================================================
// Vector Add is the equivalent of a Hello, World! sample for data parallel
// programs. Building and running the sample verifies that your development
// environment is set up correctly and demonstrates the use of the core features
// of SYCL. This sample runs on both CPU and GPU (or FPGA). When run, it
// computes on both the CPU and offload device, then compares results. If the
// code executes on both CPU and offload device, the device name and a success
// message are displayed. And your development environment is set up correctly!
//
// For comprehensive instructions regarding SYCL Programming, go to
// https://software.intel.com/en-us/oneapi-programming-guide and search based on
// relevant terms noted in the comments.
//
// SYCL material used in the code sample:
// • A one dimensional array of data.
// • A device queue, buffer, accessor, and kernel.
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <cstdio>
#include <vector>
#include <string>
#include <array>

using namespace sycl;

// Create an exception handler for asynchronous SYCL exceptions
static auto exception_handler = [](sycl::exception_list e_list) {
    for (std::exception_ptr const& e : e_list) {
        try {
            std::rethrow_exception(e);
        }
        catch (std::exception const& e) {
#if _DEBUG
            std::cout << "Failure" << std::endl;
#endif
            std::terminate();
        }
    }
};
int main(int argc, char* argv[]) {
    constexpr int CONST_QUADRATIC_NUMBER = 2500;
    int outer_result = -1;

    try {
        queue q(default_selector_v, exception_handler);

        std::cout << "Running on device: "
                  << q.get_device().get_info<info::device::name>() << "\n";

        constexpr int CONST_BLOCK_SIZE = 1000;
        auto shared_mem_input_1 = malloc_shared<int>(CONST_BLOCK_SIZE, q);
        // shared_mem_input_2 becomes inaccessible
        auto shared_mem_input_2 = malloc_shared<int>(CONST_BLOCK_SIZE, q);

        std::array<int, 1> result = { 0 };
        std::array<int, 1> parameter = { 0 };
        parameter[0] = CONST_QUADRATIC_NUMBER;

        for (int i = 0; i < CONST_BLOCK_SIZE; i++) {
            shared_mem_input_1[i] = i;
        }

        {
            buffer result_buffer{ result };
            buffer parameter_buffer{ parameter };

            int HW_SPECIFIC_ADVICE_RO = 0;
            //auto memEvent1 = q.mem_advise(shared_mem_input_1, CONST_BLOCK_SIZE * sizeof(int), HW_SPECIFIC_ADVICE_RO);
            auto memEvent1 = q.prefetch(shared_mem_input_1, CONST_BLOCK_SIZE * sizeof(int));

            event running_event_1 = q.submit([&](handler& h) {
                h.depends_on(memEvent1);
                accessor device_result(result_buffer, h, write_only);
                accessor device_parameter(parameter_buffer, h, read_only);

                h.parallel_for(range{ CONST_BLOCK_SIZE }, [=](id<1> pIndex) {
                    int val = shared_mem_input_1[pIndex];
                    int quadratic = val * val;
                    if (quadratic == device_parameter[0])
                    {
                        device_result[0] = val;
                    }
                });
            });

            // Crashes here BEGIN
            for (int i = 0; i < CONST_BLOCK_SIZE; i++) {
                shared_mem_input_2[i] = i;
            }
            // Crashes here END

            //auto memEvent2 = q.mem_advise(shared_mem_input_2, CONST_BLOCK_SIZE * sizeof(int), HW_SPECIFIC_ADVICE_RO);
            auto memEvent2 = q.prefetch(shared_mem_input_2, CONST_BLOCK_SIZE * sizeof(int));

            event running_event_2 = q.submit([&](handler& h) {
                h.depends_on(memEvent2);
                accessor device_result(result_buffer, h, write_only);
                accessor device_parameter(parameter_buffer, h, read_only);

                h.parallel_for(range{ CONST_BLOCK_SIZE }, [=](id<1> pIndex) {
                    int val = shared_mem_input_2[pIndex];
                    int quadratic = val * val;
                    if (quadratic == device_parameter[0])
                    {
                        device_result[0] = val;
                    }
                });
            });

            q.wait();
        }
        outer_result = result[0];
    }
    catch (exception const& e) {
        std::cout << "An exception is caught for vector add.\n";
        std::terminate();
    }

    if (outer_result > 0)
    {
        printf("%i is the square root of %i\n", outer_result, CONST_QUADRATIC_NUMBER);
    }
    else
    {
        printf("Square root of %i was not found\n", CONST_QUADRATIC_NUMBER);
    }

    return 0;
}
Thanks for the reproducer, I'll investigate this further to see what's going on.
Hi @mhogstrom,
I cannot reproduce the crash, even when run through the VS IDE.
Can you try upgrading to the latest compiler (2025.0.4) and driver (32.0.101.6129) and give this another shot?
Hi, I encountered the same problem on a B580.
Switching to the OpenCL backend did not help; the same issue persists.
Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205)
Target: x86_64-pc-windows-msvc
Thread model: posix
Driver: 32.0.101.6256
#include <vector>
#include <iostream>
#include <sycl/sycl.hpp>

int main() try
{
    sycl::queue q{ sycl::default_selector_v };
    constexpr auto size = 128;

    std::cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << '\n'
              << "Backend: " << q.get_backend() << '\n';

    std::vector a(size, 1, sycl::usm_allocator<int, sycl::usm::alloc::shared>{ q });
    std::vector b(size, 1, sycl::usm_allocator<int, sycl::usm::alloc::shared>{ q });
    std::vector c(size, 1, sycl::usm_allocator<int, sycl::usm::alloc::shared>{ q });

    auto A = a.data();
    auto B = b.data();
    auto C = c.data();

    q.submit([&](sycl::handler& h) {
        h.parallel_for(sycl::range<1>(size),
                       [=](sycl::id<1> idx) { C[idx] = A[idx] + B[idx]; });
    }).wait();

    for (int i = 0; i < size; i++)
        std::cout << c[i] << std::endl;
}
catch (const std::exception& e)
{
    std::cerr << e.what() << '\n';
}
The VS debugger reports an exception: read access violation.
Hi @YiHang and @mhogstrom,
For USM shared memory, the implicit migration of shared allocations between device and host is implemented using an access-violation (page-fault) mechanism to detect access from the host. This is why you see the access violation only when running under the debugger: the fault is always raised, but normally the driver handles it silently. Pressing "Continue" (continue debugging) should allow the driver to handle this flow properly without crashing the application.
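As a small illustration of where that handled fault surfaces (a sketch of the mechanism, not the exact driver flow; assumes a queue q and using namespace sycl as in the snippets above):

int* data = malloc_shared<int>(1024, q);            // shared USM: same pointer on host and device
for (int i = 0; i < 1024; ++i) data[i] = i;         // pages resident on the host
q.parallel_for(range{ 1024 }, [=](id<1> i) {        // first device access migrates pages to the GPU
    data[i] *= 2;
}).wait();
// First host access after the kernel: the driver detects it through a page fault /
// access violation and migrates the pages back. A debugger may break on that
// first-chance exception; pressing "Continue" lets the driver handle it.
std::cout << data[0] << "\n";
free(data, q);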
Thank you for clarifying this; that is indeed the case.