Intel® oneAPI DPC++/C++ Compiler
Talk to fellow users of Intel® oneAPI DPC++/C++ Compiler and companion tools like Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and Intel® Distribution for GDB*

Possible Bug with shared data and queue.prefetch and Platform.getDevices()

mhogstrom
Beginner

I am writing an application that makes use of multiple GPUs.
I currently have four Intel Arc A750 cards in a rig,
but the problem is also reproducible on my dev desktop, which has only one GPU.

I haven't found a convenient way to select the GPUs. This is how I do it (a sketch of the selection is below):
I call platform::get_platforms(),
select the "Intel(R) oneAPI Unified Runtime over Level-Zero" platform,
then call this_platform.get_devices(info::device_type::gpu),
and use the resulting devices when instantiating the queues.
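A minimal sketch of that selection, assuming the platform name string matches what sycl-ls reports on my machines:

std::vector<device> gpu_devices;
for (const auto& p : platform::get_platforms()) {
    if (p.get_info<info::platform::name>() ==
        "Intel(R) oneAPI Unified Runtime over Level-Zero") {
        // all GPUs exposed by the Level-Zero platform
        gpu_devices = p.get_devices(info::device_type::gpu);
    }
}
device compute_device = gpu_devices.at(0); // one queue per device in the real app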

queue q(compute_device, handle_async_error); 

constexpr unsigned int BLOCK_SIZE = 16 * 1024 * 1024 * 4 * 2; // 128 MiB
char* gpu_data1 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data2 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data3 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data4 = malloc_shared<char>(BLOCK_SIZE, q);

event memEvent2 = q.prefetch(gpu_data1, BLOCK_SIZE); // only gpu_data1

q.submit([&](handler& h) {
    h.depends_on(memEvent2);
    h.parallel_for(range{ BLOCK_SIZE }, [=](id<1> i) {

        // gpu_data1 available here

    });
});

gpu_data2, gpu_data3, and gpu_data4 are still accessible from the host when using gpu_selector_v, but if I choose a device via platform.get_devices(),
all of the shared memory (gpu_data1-4) becomes unavailable from the host. It seems as if the prefetch prefetched too much: the memory pages are protected (in use by the GPU). I have verified that the device belongs to the same platform, "Intel(R) oneAPI Unified Runtime over Level-Zero".

I don't understand why there is a difference between gpu_selector_v and get_devices when they select the same device. Either I don't understand prefetch, or there might be a bug.
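For reference, a minimal check (just a sketch, not code from my application) to confirm that both paths resolve to the same device:

queue q_sel(gpu_selector_v);
device d_sel = q_sel.get_device();

device d_plat = d_sel; // overwritten by the platform-based pick below
for (const auto& p : platform::get_platforms()) {
    if (p.get_info<info::platform::name>() ==
        "Intel(R) oneAPI Unified Runtime over Level-Zero") {
        d_plat = p.get_devices(info::device_type::gpu).at(0);
    }
}

std::cout << d_sel.get_info<info::device::name>() << "\n"
          << d_plat.get_info<info::device::name>() << "\n"
          << (d_sel == d_plat ? "same device" : "different devices") << "\n";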

I worked around it by using malloc_host and malloc_device and copying the data manually:
event memEvent2 = q.memcpy(GpuData, HostData, BLOCK_SIZE);
h.depends_on(memEvent2);

Works like a charm.
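A minimal sketch of that workaround (GpuData/HostData stand in for my real buffers):

char* HostData = malloc_host<char>(BLOCK_SIZE, q);
char* GpuData  = malloc_device<char>(BLOCK_SIZE, q);

// ... fill HostData on the CPU ...

event memEvent2 = q.memcpy(GpuData, HostData, BLOCK_SIZE);
q.submit([&](handler& h) {
    h.depends_on(memEvent2);
    h.parallel_for(range{ BLOCK_SIZE }, [=](id<1> i) {
        // GpuData[i] is safe to read here
    });
});
q.wait();

free(HostData, q);
free(GpuData, q);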

I am using DPC++ 2025 on Windows

Sravani_K_Intel
Moderator

Could you share test code that reproduces the change in prefetch behavior when using gpu_selector_v vs. get_devices?

mhogstrom
Beginner

Hmmm, now I am confused.
I still get a crash, but I see no difference between gpu_selector_v and get_devices.

The crash seems to happen only under the debugger (Visual Studio 2022, DPC++ 2025).

mhogstrom_0-1733827211926.png

If I run it from the oneAPI command prompt, it seems to work.

mhogstrom_1-1733827275963.png


Both mem_advise and prefetch seem to crash.
The first call to prefetch on buffer 1 invalidates buffer 1 on the host side (correct),
but it also invalidates buffer 2 on the host side (wrong),
so I cannot populate buffer 2.

Below is just a toy program. I send over an array of numbers;
the kernel computes x^2 for each value and compares it to 2500.
In my real program I have a loop that prepares data in buffers,
submits some work,
and then prepares new data for the next submit,
so I use several buffers.

So when I submit a job, I want to prefetch the data for that particular job.
I worked around the crash by using
malloc_host, malloc_device, and q.memcpy (as sketched in my first post).
That variant also works under the debugger.


//==============================================================
// Vector Add is the equivalent of a Hello, World! sample for data parallel
// programs. Building and running the sample verifies that your development
// environment is setup correctly and demonstrates the use of the core features
// of SYCL. This sample runs on both CPU and GPU (or FPGA). When run, it
// computes on both the CPU and offload device, then compares results. If the
// code executes on both CPU and offload device, the device name and a success
// message are displayed. And, your development environment is setup correctly!
//
// For comprehensive instructions regarding SYCL Programming, go to
// https://software.intel.com/en-us/oneapi-programming-guide and search based on
// relevant terms noted in the comments.
//
// SYCL material used in the code sample:
// •	A one dimensional array of data.
// •	A device queue, buffer, accessor, and kernel.
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <vector>
#include <string>
#include <array>

using namespace sycl;

// Create an exception handler for asynchronous SYCL exceptions
static auto exception_handler = [](sycl::exception_list e_list) {
    for (std::exception_ptr const &e : e_list) {
        try {
            std::rethrow_exception(e);
        }
        catch (std::exception const &e) {
#if _DEBUG
            std::cout << "Failure" << std::endl;
#endif
            std::terminate();
        }
    }
};

int main(int argc, char* argv[]) {

    constexpr int CONST_QUADRATIC_NUMBER = 2500;

    int outer_result = -1;
    try {
        queue q(default_selector_v, exception_handler);

        std::cout << "Running on device: "
            << q.get_device().get_info<info::device::name>() << "\n";
        constexpr int CONST_BLOCK_SIZE = 1000;
        auto shared_mem_input_1 = malloc_shared<int>(CONST_BLOCK_SIZE, q);

        // shared_mem_input_2 becomes inaccessible
        auto shared_mem_input_2 = malloc_shared<int>(CONST_BLOCK_SIZE, q);

        std::array<int, 1> result = { 0 };
        

        std::array<int, 1> parameter = { 0 };
        parameter[0] = CONST_QUADRATIC_NUMBER;
        


        for (int i = 0; i < CONST_BLOCK_SIZE; i++) {
            shared_mem_input_1[i] = i;
        }

        {
            buffer result_buffer{ result };
            buffer parameter_buffer{ parameter };
            
            int HW_SPECIFIC_ADVICE_RO = 0;
            // Note: prefetch/mem_advise take a size in bytes, not an element count
            //auto memEvent1 = q.mem_advise(shared_mem_input_1, CONST_BLOCK_SIZE * sizeof(int), HW_SPECIFIC_ADVICE_RO);
            auto memEvent1 = q.prefetch(shared_mem_input_1, CONST_BLOCK_SIZE * sizeof(int));
            event running_event_1 = q.submit([&](handler& h) {
                h.depends_on(memEvent1);

                accessor device_result(result_buffer, h, write_only);
                accessor device_parameter(parameter_buffer, h, read_only);


                h.parallel_for(range{ CONST_BLOCK_SIZE }, [=](id<1> pIndex) {
                    int val = shared_mem_input_1[pIndex];
                    int quadratic = val * val;
                    if (quadratic == device_parameter[0])
                    {
                        device_result[0] = val;
                    }
                    });
                });


            // Crashes here BEGIN
            for (int i = 0; i < CONST_BLOCK_SIZE; i++) {
                shared_mem_input_2[i] = i;
            }
            // Crashes here END

            //auto memEvent2 = q.mem_advise(shared_mem_input_2, CONST_BLOCK_SIZE * sizeof(int), HW_SPECIFIC_ADVICE_RO);
            auto memEvent2 = q.prefetch(shared_mem_input_2, CONST_BLOCK_SIZE * sizeof(int));
            event running_event_2 = q.submit([&](handler& h) {
                h.depends_on(memEvent2);

                accessor device_result(result_buffer, h, write_only);
                accessor device_parameter(parameter_buffer, h, read_only);

                h.parallel_for(range{ CONST_BLOCK_SIZE }, [=](id<1> pIndex) {
                    int val = shared_mem_input_2[pIndex];
                    int quadratic = val * val;
                    if (quadratic == device_parameter[0])
                    {
                        device_result[0] = val;
                    }
                    });
                });


            q.wait();
        }

        // release the USM allocations
        free(shared_mem_input_1, q);
        free(shared_mem_input_2, q);

        outer_result = result[0];

    }
    catch (exception const& e) {
        std::cout << "An exception is caught for vector add.\n";
        std::terminate();
    }

    if (outer_result > 0)
    {
        printf("%i is the square root of %i\n", outer_result, CONST_QUADRATIC_NUMBER);
    }
    else
    {
        printf("Square root of %i was not found\n", CONST_QUADRATIC_NUMBER);
    }

    return 0;
}





Sravani_K_Intel
Moderator

Thanks for the reproducer, I'll investigate this further to see what's going on.

Sravani_K_Intel
Moderator

Hi @mhogstrom,

I cannot reproduce the crash even when running through the VS IDE.

Sravani_K_Intel_0-1734994656037.png

 

Can you try upgrading to the latest compiler (2025.0.4) and driver (32.0.101.6129) and give this another shot?

YiHang
Beginner

Hi, I encountered the same problem on a B580.

Switching to the OpenCL backend did not help; the same issue persists.

@Sravani_K_Intel 

 

Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205)
Target: x86_64-pc-windows-msvc
Thread model: posix

Driver: 32.0.101.6256

 

 

#include <vector>
#include <iostream>
#include <sycl/sycl.hpp>

int main()try
{
    sycl::queue q{sycl::default_selector_v};

    constexpr auto size = 128;

    std::cout << "Device: " << q.get_device().get_info<sycl::info::device::name>() << '\n'
        << "Backend: " << q.get_backend() << '\n';
    std::vector a(size, 1, sycl::usm_allocator<int, sycl::usm::alloc::shared>{q});
    std::vector b(size, 1, sycl::usm_allocator<int, sycl::usm::alloc::shared>{q});
    std::vector c(size, 1, sycl::usm_allocator<int, sycl::usm::alloc::shared>{q});

    auto A = a.data();
    auto B = b.data();
    auto C = c.data();

    q.submit([&] (sycl::handler& h) {
        h.parallel_for(sycl::range<1>(size),
                       [=] (sycl::id<1> idx) { C[idx] = A[idx] + B[idx]; });
    }).wait();

    for (int i = 0; i < size; i++)
        std::cout << c[i] << std::endl;

} catch (const std::exception& e)
{
    std::cerr << e.what() << '\n';
}

 

 


The VS debugger reports an exception: read access violation.

image.png

image.png

Sravani_K_Intel
Moderator

Hi @YiHang and @mhogstrom

 

For USM shared memory, the implicit migration of shared allocations between device and host is implemented using access-violation mechanisms to detect access from the host. This is why the access violation is only surfaced when running under the debugger: the debugger breaks on the first-chance exception before the driver gets to handle it. Pressing "Continue"/"continue debugging" should allow the driver to handle this flow properly without crashing the application.
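As a small illustration of that flow (a sketch, not a new reproducer), the first-chance access violation shows up at the first host access of a shared allocation after the device has touched it:

#include <sycl/sycl.hpp>

int main() {
    sycl::queue q{sycl::gpu_selector_v};

    int* data = sycl::malloc_shared<int>(16, q);
    q.fill(data, 0, 16).wait();   // the device touches the shared allocation

    // The next host access is where the driver's migration path kicks in.
    // Under a debugger it surfaces as a first-chance access violation;
    // pressing "Continue" lets the driver migrate the pages and execution proceeds.
    int first = data[0];

    sycl::free(data, q);
    return first;
}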

YiHang
Beginner

Thank you for clarifying this; that is indeed the case.
