Intel® oneAPI DPC++/C++ Compiler

icpx 2024.2.1 sycl event dependency ignored across queues

Jakub_H
New Contributor I

Hello,

I think I encountered a possible bug in the 2024.2.1 icpx compiler (or the associated SYCL/DPC++ libraries). In short, an event dependency of a single_task is being ignored.

I am using SYCL to GPU-accelerate parts of our application, and stumbled upon a problem. I was able to boil it down to the example code below.

The code uses two in_order queues. First, I submit a single_task into the first queue, which repeatedly increments a variable for a given number of iterations. At the end of this task, `*val` should contain the number of iterations, i.e., 100000000 (100 million). This task takes ~250ms for me. I keep the event returned for that task.

Into the second queue, I submit a short task which simply reads the value of `*val` and stores it in `*observed`. I pass the aforementioned event as a dependency of this task, so this second task should only start after the first task has finished. Therefore, the value this task reads from `*val` should always be the number of iterations, 100000000.

However, this is not the output I observe. The value read in the second task is much lower, which means that the second task must have started execution while the first task was still executing. Sometimes the observed value is a complete mess, as if the second task started even before the q1.fill, which initializes `*val`.

But the tasks are connected with events, so this should not happen.

I came across `sycl_ext_oneapi_in_order_queue_events` (https://github.com/intel/llvm/blob/19072756e/sycl/doc/extensions/experimental/sycl_ext_oneapi_in_order_queue_events.asciidoc), and tried to use that to create the dependency between the queues (see the commented part of the code), but the behavior is still wrong.

I compile the code using

icpx -O0 -fsycl source.cpp -o program.x

and run it with a plain `./program.x`.

Expected output: `Observed 100000000`

Actual output: `Observed 28931`, or some other number that is not 100000000

I primarily use the training nodes on the Tiber devcloud (ITDC, https://console.cloud.intel.com/training -> launch JupyterLab), with Intel toolkit version 2024.2.1:

$ icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2024.2/bin/compiler
Configuration file: /opt/intel/oneapi/compiler/2024.2/bin/compiler/../icpx.cfg

I also tried the 2024.1.4 version on the devcloud (`source /opt/intel/oneapi/2024.1/oneapi-vars.sh --force`), and it ran without any issue, as expected. On another cluster, I tried 2024.2.0 and other versions, and the issue did not appear there either. So the issue seems to be specific to the 2024.2.1 version.

In the example I use a CPU as the device (`sycl::cpu_selector_v`). On the devcloud I also tried `sycl::gpu_selector_v`, but the behavior was the same (with a lower number of iterations to make it quicker).

Thanks in advance for help/investigation. Feel free to ask for clarifications.

Jakub

And here is the example code:

#include <cstdio>
#include <sycl/sycl.hpp>

int main()
{
    int num_iters = 100000000;

    sycl::device d(sycl::cpu_selector_v);

    sycl::queue q1(d, sycl::property::queue::in_order());
    sycl::queue q2(d, sycl::property::queue::in_order());

    int * val = sycl::malloc_device<int>(1, q1);
    int * observed = sycl::malloc_device<int>(1, q1);

    q1.fill<int>(val, 0, 1);

    { // connecting the queues using a usual event
        // q1 long running task
        sycl::event e = q1.single_task([=](){
            for(int i = 0; i < num_iters; i++) {
                (*val)++;
            }
        });

        // q2 task
        q2.single_task(e, [=](){
            *observed = *val;
        });
    }
    // { // connecting the queues using sycl_ext_oneapi_in_order_queue_events
    //     // q1 long running task
    //     q1.single_task([=](){
    //         for(int i = 0; i < num_iters; i++) {
    //             (*val)++;
    //         }
    //     });
    //     sycl::event e = q1.ext_oneapi_get_last_event();
    //     q2.ext_oneapi_set_external_event(e);
    //     // q2 task
    //     q2.single_task([=](){
    //         *observed = *val;
    //     });
    // }

    // copy observed value
    int observed_host;
    q2.copy<int>(observed, &observed_host, 1);

    q1.wait();
    q2.wait();

    printf("Observed %d\n", observed_host);

    sycl::free(val, q1);
    sycl::free(observed, q1);

    return 0;
}

Alex_Y_Intel
Moderator

Thanks for the detailed explanation and reproducer. I'm escalating your issue to our internal team for further investigation. 

Alex_Y_Intel
Moderator

In the meantime, can you try adding a `q1.wait();` before

q2.single_task(e, [=](){
    *observed = *val;
});

in line 26?

Jakub_H
New Contributor I

I tried that already. With the `q1.wait()` between the tasks, it works correctly. But the host then blocks until the first task finishes, so the program is no longer fully asynchronous.
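
For A/B testing on different toolkit versions, the reproducer and the `q1.wait()` workaround can be folded into one function. This is only a minimal sketch: `run_reproducer`, `dependency_respected` and the `REPRO_HAVE_SYCL` macro are hypothetical names that do not come from the thread, and the `__has_include` guard is there solely so the file also builds with a plain C++17 compiler (in which case only the checking helper is compiled). The SYCL calls are the same ones used in the original example.

```cpp
#include <cstdio>

// Assumption: this guard only lets the file build without the SYCL headers;
// with icpx -fsycl the full reproducer below is compiled.
#if __has_include(<sycl/sycl.hpp>)
#include <sycl/sycl.hpp>
#define REPRO_HAVE_SYCL 1
#endif

// Hypothetical helper: did the q2 task see the final counter value?
inline bool dependency_respected(int observed, int expected)
{
    return observed == expected;
}

#ifdef REPRO_HAVE_SYCL
// Hypothetical wrapper around the example code from the original post.
// host_wait = true inserts the q1.wait() suggested in the thread.
// Returns the value the q2 task read from *val.
inline int run_reproducer(int num_iters, bool host_wait)
{
    sycl::device d(sycl::cpu_selector_v);

    sycl::queue q1(d, sycl::property::queue::in_order());
    sycl::queue q2(d, sycl::property::queue::in_order());

    int * val = sycl::malloc_device<int>(1, q1);
    int * observed = sycl::malloc_device<int>(1, q1);

    q1.fill<int>(val, 0, 1);

    // q1 long running task
    sycl::event e = q1.single_task([=](){
        for(int i = 0; i < num_iters; i++) {
            (*val)++;
        }
    });

    if(host_wait) {
        q1.wait(); // workaround: the host blocks until q1 has drained
    }

    // q2 task, which should not start before e has completed
    q2.single_task(e, [=](){
        *observed = *val;
    });

    // copy observed value back to the host
    int observed_host = -1;
    q2.copy<int>(observed, &observed_host, 1);

    q1.wait();
    q2.wait();

    sycl::free(val, q1);
    sycl::free(observed, q1);

    return observed_host;
}
#endif
```

Calling `run_reproducer(n, false)` and `run_reproducer(n, true)` from `main` and passing each result to `dependency_respected(result, n)` gives a pass/fail answer that can be turned into an exit code. As in the original, build with `-O0` so the increment loop is not collapsed.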

Alex_Y_Intel
Moderator

This issue has been fixed. The fix will ship in a future release next year, likely around March.

