Intel® oneAPI DPC++/C++ Compiler

Possible Bug with shared data and queue.prefetch and Platform.getDevices()

mhogstrom
Beginner

I am writing an application that makes use of multiple GPUs.
I currently have four Intel Arc A750 cards in a rig, but the problem is reproducible on my dev desktop, which has only one GPU.

I haven't found a convenient way to select the GPUs. This is how I do it:
1. Call platform::get_platforms()
2. Select the "Intel(R) oneAPI Unified Runtime over Level-Zero" platform
3. Call this_platform.get_devices(info::device_type::gpu)
4. Use those devices when instantiating the queues
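A minimal sketch of that selection (the name comparison against the Level-Zero platform string is exactly what I do; handle_async_error is my async handler, used below):

std::vector<device> gpus;
for (const auto& p : platform::get_platforms()) {
    if (p.get_info<info::platform::name>() ==
        "Intel(R) oneAPI Unified Runtime over Level-Zero") {
        for (const auto& d : p.get_devices(info::device_type::gpu))
            gpus.push_back(d);
    }
}
// Then one queue per GPU, e.g. compute_device = gpus[n]: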

queue q(compute_device, handle_async_error);

constexpr size_t BLOCK_SIZE = 16 * 1024 * 1024 * 4 * 2; // 128 MiB
char* gpu_data1 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data2 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data3 = malloc_shared<char>(BLOCK_SIZE, q);
char* gpu_data4 = malloc_shared<char>(BLOCK_SIZE, q);

event memEvent2 = q.prefetch(gpu_data1, BLOCK_SIZE); // prefetch only gpu_data1

q.submit([&](handler& h) {
    h.depends_on(memEvent2);
    h.parallel_for(range{ len }, [=](id<1> i) {
        // gpu_data1 available here
    });
});

gpu_data2, gpu_data3, and gpu_data4 are still accessible from the host when the queue is created with gpu_selector_v, but if I choose a device via platform.get_devices(), all of the shared allocations (gpu_data1 through gpu_data4) become inaccessible from the host. It seems as if the prefetch prefetched too much: the memory pages are protected (in use by the GPU). I have verified that the device belongs to the same platform, "Intel(R) oneAPI Unified Runtime over Level-Zero".

I don't understand why there is a difference between gpu_selector_v and get_devices if they pick the same device. Either I don't understand prefetch, or there might be a bug.

I worked around it using malloc_host and malloc_device and copying the data manually:

event memEvent2 = q.memcpy(GpuData, HostData, BLOCK_SIZE);
h.depends_on(memEvent2);

Works like a charm.
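In full, the workaround looks roughly like this (a sketch; GpuData, HostData, and len as in the fragments above):

char* HostData = malloc_host<char>(BLOCK_SIZE, q);
char* GpuData = malloc_device<char>(BLOCK_SIZE, q);

// ... fill HostData on the host ...

event memEvent2 = q.memcpy(GpuData, HostData, BLOCK_SIZE);
q.submit([&](handler& h) {
    h.depends_on(memEvent2);
    h.parallel_for(range{ len }, [=](id<1> i) {
        // GpuData is usable on the device here
    });
});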

I am using DPC++ 2025 on Windows.

Sravani_K_Intel
Moderator

Could you share test code that reproduces the change in prefetch behavior between gpu_selector_v and get_devices?

mhogstrom
Beginner

Hmmm, now I am confused.

I still get a crash, but I see no difference between gpu_selector_v and get_devices.

It seems the crash happens only under the debugger (Visual Studio 2022, DPC++ 2025).

[screenshot mhogstrom_0-1733827211926.png: crash when running under the Visual Studio 2022 debugger]

If I run it from the oneAPI command prompt, it seems to work.

[screenshot mhogstrom_1-1733827275963.png: successful run from the oneAPI command prompt]


Both mem_advise and prefetch seem to crash.
The first call to prefetch on buffer 1 invalidates buffer 1 on the host side (correct),
but it also invalidates buffer 2 on the host side (wrong),
so I cannot populate it.

Below is just a toy program: I send over an array of numbers, and the kernel computes x^2 for each value and compares it to 2500. In my real program I have a loop that prepares data in a buffer, submits some work, and then prepares new data for the next submit while the previous job runs, so I use several buffers (see the sketch below).
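A hypothetical sketch of that loop (fill_on_host, num_jobs, and buffers are placeholder names, not my actual code):

for (int job = 0; job < num_jobs; ++job) {
    char* buf = buffers[job % 2];          // alternate between two buffers
    fill_on_host(buf, BLOCK_SIZE);         // prepare this job's data on the host
    event e = q.prefetch(buf, BLOCK_SIZE); // prefetch only this job's data
    q.submit([&](handler& h) {
        h.depends_on(e);
        h.parallel_for(range{ len }, [=](id<1> i) {
            // work on buf[i]
        });
    });
}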

So when I submit a job, I want to prefetch the data for that particular job.
I worked around the crash by using malloc_host, malloc_device, and q.memcpy.
That variant also works under the debugger.


//==============================================================
// Toy reproducer, adapted from the oneAPI "Vector Add" sample.
// It fills shared USM arrays with 0..N-1, squares each value on
// the device, and reports which input squares to 2500.
//
// SYCL features used: a device queue, USM shared allocations,
// prefetch, buffers, accessors, and a parallel_for kernel.
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <array>
#include <cstdio>   // printf
#include <iostream> // std::cout

using namespace sycl;

// Create an exception handler for asynchronous SYCL exceptions
static auto exception_handler = [](sycl::exception_list e_list) {
    for (std::exception_ptr const &e : e_list) {
        try {
            std::rethrow_exception(e);
        }
        catch (std::exception const &e) {
#if _DEBUG
            std::cout << "Failure" << std::endl;
#endif
            std::terminate();
        }
    }
};

int main(int argc, char* argv[]) {

    constexpr int CONST_QUADRATIC_NUMBER = 2500;

    int outer_result = -1;
    try {
        queue q(default_selector_v, exception_handler);

        std::cout << "Running on device: "
            << q.get_device().get_info<info::device::name>() << "\n";
        constexpr int CONST_BLOCK_SIZE = 1000;
        auto shared_mem_input_1 = malloc_shared<int>(CONST_BLOCK_SIZE, q);

        // shared_mem_input_2 becomes inaccessible
        auto shared_mem_input_2 = malloc_shared<int>(CONST_BLOCK_SIZE, q);

        std::array<int, 1> result = { 0 };

        std::array<int, 1> parameter = { 0 };
        parameter[0] = CONST_QUADRATIC_NUMBER;

        for (int i = 0; i < CONST_BLOCK_SIZE; i++) {
            shared_mem_input_1[i] = i;
        }

        {
            buffer result_buffer{ result };
            buffer parameter_buffer{ parameter };
            
            int HW_SPECIFIC_ADVICE_RO = 0;
            //auto memEvent1 = q.mem_advise(shared_mem_input_1, CONST_BLOCK_SIZE, HW_SPECIFIC_ADVICE_RO);
            auto memEvent1 = q.prefetch(shared_mem_input_1, CONST_BLOCK_SIZE);
            event running_event_1 = q.submit([&](handler& h) {
                h.depends_on(memEvent1);

                accessor device_result(result_buffer, h, write_only);
                accessor device_parameter(parameter_buffer, h, read_only);


                h.parallel_for(range{ CONST_BLOCK_SIZE }, [=](id<1> pIndex) {
                    int val = shared_mem_input_1[pIndex];
                    int quadratic = val * val;
                    if (quadratic == device_parameter[0])
                    {
                        device_result[0] = val;
                    }
                    });
                });


            // Crashes here BEGIN
            for (int i = 0; i < CONST_BLOCK_SIZE; i++) {
                shared_mem_input_2[i] = i;
            }
            // Crashes here END

            //auto memEvent2 = q.mem_advise(shared_mem_input_2, CONST_BLOCK_SIZE, HW_SPECIFIC_ADVICE_RO);
            auto memEvent2 = q.prefetch(shared_mem_input_2, CONST_BLOCK_SIZE);
            event running_event_2 = q.submit([&](handler& h) {
                h.depends_on(memEvent2);

                accessor device_result(result_buffer, h, write_only);
                accessor device_parameter(parameter_buffer, h, read_only);

                h.parallel_for(range{ CONST_BLOCK_SIZE }, [=](id<1> pIndex) {
                    int val = shared_mem_input_2[pIndex];
                    int quadratic = val * val;
                    if (quadratic == device_parameter[0])
                    {
                        device_result[0] = val;
                    }
                    });
                });


            q.wait();
        }
        outer_result = result[0];

        // Free the USM allocations while the queue is still alive.
        free(shared_mem_input_1, q);
        free(shared_mem_input_2, q);

    }
    catch (exception const& e) {
        std::cout << "An exception was caught: " << e.what() << "\n";
        std::terminate();
    }

    if (outer_result > 0)
    {
        printf("%i is the square root of %i\n", outer_result, CONST_QUADRATIC_NUMBER);
    }
    else
    {
        printf("Square root of %i was not found\n", CONST_QUADRATIC_NUMBER);
    }

    return 0;
}
Sravani_K_Intel
Moderator

Thanks for the reproducer; I'll investigate further to see what's going on.
