Intel® oneAPI DPC++/C++ Compiler
Talk to fellow users of Intel® oneAPI DPC++/C++ Compiler and companion tools like Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and Intel® Distribution for GDB*

Possible Bug with shared data and queue.prefetch and Platform.getDevices()

mhogstrom
Beginner

I am writing an application that makes use of multiple GPUs.
I currently have four Intel Arc A750 cards in a rig,
but the problem is also reproducible on my dev desktop, which has only one GPU.

I haven't found a convenient way to select the GPUs. This is how I do it (a sketch of the selection is below):
I call platform::get_platforms(),
select the "Intel(R) oneAPI Unified Runtime over Level-Zero" platform,
then call this_platform.get_devices(info::device_type::gpu),
and use the resulting devices when instantiating the queues.
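A minimal sketch of that selection, assuming the platform name string matches what sycl-ls reports on my machines:

std::vector<device> gpu_devices;
for (const auto& p : platform::get_platforms()) {
    if (p.get_info<info::platform::name>() ==
        "Intel(R) oneAPI Unified Runtime over Level-Zero") {
        // all GPUs exposed by the Level-Zero platform
        gpu_devices = p.get_devices(info::device_type::gpu);
    }
}
device compute_device = gpu_devices.at(0); // one queue per device in the real app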

queue q(compute_device, handle_async_error); 

constexpr unsigned int BLOCK_SIZE = 16 * 1024 * 1024 * 4 * 2; // 128 MiB
char* gpu_data1 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data2 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data3 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data4 = malloc_shared<char>(BLOCK_SIZE, q);

event memEvent2 = q.prefetch(gpu_data1, BLOCK_SIZE); // only gpu_data1

q.submit([&](handler& h) {
    h.depends_on(memEvent2);
    h.parallel_for(range{ BLOCK_SIZE }, [=](id<1> i) {

        // gpu_data1 available here

    });
});

gpu_data2, gpu_data3, and gpu_data4 are still accessible from the host when using gpu_selector_v, but if I choose a device via platform.get_devices(),
all of the shared memory (gpu_data1-4) becomes unavailable from the host. It seems as if the prefetch prefetched too much: the memory pages are protected (in use by the GPU). I have verified that the device belongs to the same platform, "Intel(R) oneAPI Unified Runtime over Level-Zero".

I don't understand why there is a difference between gpu_selector_v and get_devices when they select the same device. Either I don't understand prefetch, or there might be a bug.
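For reference, a minimal check (just a sketch, not code from my application) to confirm that both paths resolve to the same device:

queue q_sel(gpu_selector_v);
device d_sel = q_sel.get_device();

device d_plat = d_sel; // overwritten by the platform-based pick below
for (const auto& p : platform::get_platforms()) {
    if (p.get_info<info::platform::name>() ==
        "Intel(R) oneAPI Unified Runtime over Level-Zero") {
        d_plat = p.get_devices(info::device_type::gpu).at(0);
    }
}

std::cout << d_sel.get_info<info::device::name>() << "\n"
          << d_plat.get_info<info::device::name>() << "\n"
          << (d_sel == d_plat ? "same device" : "different devices") << "\n";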

I worked around it by using malloc_host and malloc_device and copying the data manually:
event memEvent2 = q.memcpy(GpuData, HostData, BLOCK_SIZE);
h.depends_on(memEvent2);

Works like a charm.
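A minimal sketch of that workaround (GpuData/HostData stand in for my real buffers):

char* HostData = malloc_host<char>(BLOCK_SIZE, q);
char* GpuData  = malloc_device<char>(BLOCK_SIZE, q);

// ... fill HostData on the CPU ...

event memEvent2 = q.memcpy(GpuData, HostData, BLOCK_SIZE);
q.submit([&](handler& h) {
    h.depends_on(memEvent2);
    h.parallel_for(range{ BLOCK_SIZE }, [=](id<1> i) {
        // GpuData[i] is safe to read here
    });
});
q.wait();

free(HostData, q);
free(GpuData, q);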

I am using DPC++ 2025 on Windows

Sravani_K_Intel
Moderator

Could you share test code that reproduces the change in prefetch behavior when using gpu_selector_v vs. get_devices?

mhogstrom
Beginner

Hmmm, now I am confused.
I still get a crash, but I see no difference between gpu_selector_v and get_devices.

The crash seems to happen only under the debugger (Visual Studio 2022, DPC++ 2025).

mhogstrom_0-1733827211926.png

If I run it from the oneAPI command prompt, it seems to work.

mhogstrom_1-1733827275963.png


Both mem_advise and prefetch seem to crash.
The first call to prefetch on buffer 1 invalidates buffer 1 on the host side (correct),
but it also invalidates buffer 2 on the host side (wrong),
so I cannot populate buffer 2.

Below is just a toy program. I send over an array of numbers;
the kernel computes x^2 for each value and compares it to 2500.
In my real program I have a loop that prepares data in buffers,
submits some work,
and then prepares new data for the next submit,
so I use several buffers.

So when I submit a job, I want to prefetch the data for that particular job.
I worked around the crash by using
malloc_host, malloc_device, and q.memcpy (as sketched in my first post).
That variant also works under the debugger.


//==============================================================
// Vector Add is the equivalent of a Hello, World! sample for data parallel
// programs. Building and running the sample verifies that your development
// environment is setup correctly and demonstrates the use of the core features
// of SYCL. This sample runs on both CPU and GPU (or FPGA). When run, it
// computes on both the CPU and offload device, then compares results. If the
// code executes on both CPU and offload device, the device name and a success
// message are displayed. And, your development environment is setup correctly!
//
// For comprehensive instructions regarding SYCL Programming, go to
// https://software.intel.com/en-us/oneapi-programming-guide and search based on
// relevant terms noted in the comments.
//
// SYCL material used in the code sample:
// •	A one dimensional array of data.
// •	A device queue, buffer, accessor, and kernel.
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <vector>
#include <string>
#include <array>

using namespace sycl;

// Create an exception handler for asynchronous SYCL exceptions
static auto exception_handler = [](sycl::exception_list e_list) {
    for (std::exception_ptr const &e : e_list) {
        try {
            std::rethrow_exception(e);
        }
        catch (std::exception const &e) {
#if _DEBUG
            std::cout << "Failure" << std::endl;
#endif
            std::terminate();
        }
    }
};

int main(int argc, char* argv[]) {

    constexpr int CONST_QUADRATIC_NUMBER = 2500;

    int outer_result = -1;
    try {
        queue q(default_selector_v, exception_handler);

        std::cout << "Running on device: "
            << q.get_device().get_info<info::device::name>() << "\n";
        constexpr int CONST_BLOCK_SIZE = 1000;
        auto shared_mem_input_1 = malloc_shared<int>(CONST_BLOCK_SIZE, q);

        // shared_mem_input_2 becomes inaccessible
        auto shared_mem_input_2 = malloc_shared<int>(CONST_BLOCK_SIZE, q);

        std::array<int, 1> result = { 0 };
        

        std::array<int, 1> parameter = { 0 };
        parameter[0] = CONST_QUADRATIC_NUMBER;
        


        for (int i = 0; i < CONST_BLOCK_SIZE; i++) {
            shared_mem_input_1[i] = i;
        }

        {
            buffer result_buffer{ result };
            buffer parameter_buffer{ parameter };
            
            int HW_SPECIFIC_ADVICE_RO = 0;
            // Note: prefetch/mem_advise take a size in bytes, not an element count
            //auto memEvent1 = q.mem_advise(shared_mem_input_1, CONST_BLOCK_SIZE * sizeof(int), HW_SPECIFIC_ADVICE_RO);
            auto memEvent1 = q.prefetch(shared_mem_input_1, CONST_BLOCK_SIZE * sizeof(int));
            event running_event_1 = q.submit([&](handler& h) {
                h.depends_on(memEvent1);

                accessor device_result(result_buffer, h, write_only);
                accessor device_parameter(parameter_buffer, h, read_only);


                h.parallel_for(range{ CONST_BLOCK_SIZE }, [=](id<1> pIndex) {
                    int val = shared_mem_input_1[pIndex];
                    int quadratic = val * val;
                    if (quadratic == device_parameter[0])
                    {
                        device_result[0] = val;
                    }
                    });
                });


            // Crashes here BEGIN
            for (int i = 0; i < CONST_BLOCK_SIZE; i++) {
                shared_mem_input_2[i] = i;
            }
            // Crashes here END

            //auto memEvent2 = q.mem_advise(shared_mem_input_2, CONST_BLOCK_SIZE * sizeof(int), HW_SPECIFIC_ADVICE_RO);
            auto memEvent2 = q.prefetch(shared_mem_input_2, CONST_BLOCK_SIZE * sizeof(int));
            event running_event_2 = q.submit([&](handler& h) {
                h.depends_on(memEvent2);

                accessor device_result(result_buffer, h, write_only);
                accessor device_parameter(parameter_buffer, h, read_only);

                h.parallel_for(range{ CONST_BLOCK_SIZE }, [=](id<1> pIndex) {
                    int val = shared_mem_input_2[pIndex];
                    int quadratic = val * val;
                    if (quadratic == device_parameter[0])
                    {
                        device_result[0] = val;
                    }
                    });
                });


            q.wait();
        }

        // release the USM allocations
        free(shared_mem_input_1, q);
        free(shared_mem_input_2, q);

        outer_result = result[0];

    }
    catch (exception const& e) {
        std::cout << "An exception is caught for vector add.\n";
        std::terminate();
    }

    if (outer_result > 0)
    {
        printf("%i is the square root of %i\n", outer_result, CONST_QUADRATIC_NUMBER);
    }
    else
    {
        printf("Square root of %i was not found\n", CONST_QUADRATIC_NUMBER);
    }

    return 0;
}





Sravani_K_Intel
Moderator

Thanks for the reproducer, I'll investigate this further to see what's going on.

Sravani_K_Intel
Moderator

Hi @mhogstrom,

I cannot reproduce the crash even when running through the VS IDE.

Sravani_K_Intel_0-1734994656037.png

 

Can you try upgrading to the latest compiler (2025.0.4) and driver (32.0.101.6129) and give this another shot?

YiHang
Beginner

Hi, I encountered the same problem on a B580.

Switching to the OpenCL backend did not help; the same issue persists.

@Sravani_K_Intel 

 

Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205)
Target: x86_64-pc-windows-msvc
Thread model: posix

Driver: 32.0.101.6256

 

 

#include <vector>
#include <iostream>
#include <sycl/sycl.hpp>

int main()try
{
    sycl::queue q{sycl::default_selector_v};

    constexpr auto size = 128;

    std::cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << '\n'
        << "Backend: " << q.get_backend() << '\n';
    std::vector a(size, 1, sycl::usm_allocator<int, sycl::usm::alloc::shared>{q});
    std::vector b(size, 1, sycl::usm_allocator<int, sycl::usm::alloc::shared>{q});
    std::vector c(size, 1, sycl::usm_allocator<int, sycl::usm::alloc::shared>{q});

    auto A = a.data();
    auto B = b.data();
    auto C = c.data();

    q.submit([&] (sycl::handler& h) {
        h.parallel_for(sycl::range<1>(size),
                       [=] (sycl::id<1> idx) { C[idx] = A[idx] + B[idx]; });
    }).wait();

    for (int i = 0; i < size; i++)
        std::cout << c[i] << std::endl;

} catch (const std::exception& e)
{
    std::cerr << e.what() << '\n';
}

 

 


The VS debugger reports an exception: read access violation.

image.png

image.png

Sravani_K_Intel
Moderator

Hi @YiHang and @mhogstrom

 

For USM shared memory, the implicit migration of shared allocations between device and host is implemented using access-violation mechanisms to detect access from the host. This is why the access violation is only surfaced when running under the debugger: the debugger breaks on the first-chance exception before the driver gets to handle it. Pressing "Continue"/"continue debugging" should allow the driver to handle this flow properly without crashing the application.
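As a small illustration of that flow (a sketch, not a new reproducer), the first-chance access violation shows up at the first host access of a shared allocation after the device has touched it:

#include <sycl/sycl.hpp>

int main() {
    sycl::queue q{sycl::gpu_selector_v};

    int* data = sycl::malloc_shared<int>(16, q);
    q.fill(data, 0, 16).wait();   // the device touches the shared allocation

    // The next host access is where the driver's migration path kicks in.
    // Under a debugger it surfaces as a first-chance access violation;
    // pressing "Continue" lets the driver migrate the pages and execution proceeds.
    int first = data[0];

    sycl::free(data, q);
    return first;
}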

YiHang
Beginner

Thank you for clarifying this; that is indeed the case.
