Hello,
I think I encountered a possible bug in the 2024.2.1 icpx compiler (or the associated SYCL/DPC++ libraries). In short, an event dependency of a single_task is being ignored.
I am using SYCL to GPU-accelerate parts of our application, and stumbled upon a problem. I was able to boil it down to the example code below.
The code uses two in_order queues. First, I submit a single_task into the first queue, which repeatedly increments a variable for a given number of iterations. At the end of this task, `*val` should contain the number of iterations, i.e., 100000000 (100 million). This task takes ~250ms for me. I take the event representing that task.
Into the second queue, I submit a short task which simply reads the value of `*val` and stores it in `*observed`. I pass the aforementioned event as a dependency of this task, so this second task should only start after the first task has finished. Therefore, the value this task reads from `*val` should always be the number of iterations, 100000000.
However, this is not the output I observe. The value read in the second task is much lower, which means that the second task must have started execution while the first task was still executing. Sometimes the observed value is a complete mess, as if the second task started even before the q1.fill, which initializes `*val`.
But the tasks are connected with events, so this should not happen.
I came across `sycl_ext_oneapi_in_order_queue_events` (https://github.com/intel/llvm/blob/19072756e/sycl/doc/extensions/experimental/sycl_ext_oneapi_in_order_queue_events.asciidoc), and tried to use that to create the dependency between the queues (see the commented part of the code), but the behavior is still wrong.
I compile the code using
icpx -O0 -fsycl source.cpp -o program.x
Run using a simple `./program.x`
Expected output: `Observed 100000000`
Actual output: `Observed 28931`, or some other number different from 100000000
I primarily use the training nodes on the tiber devcloud (ITDC, https://console.cloud.intel.com/training -> launch JupyterLab). Intel toolkit version 2024.2.1:
$ icpx --version
Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2024.2/bin/compiler
Configuration file: /opt/intel/oneapi/compiler/2024.2/bin/compiler/../icpx.cfg
I also tried the 2024.1.4 version on the devcloud (`source /opt/intel/oneapi/2024.1/oneapi-vars.sh --force`), and it ran fine without any issue, as expected. On another cluster, I tried this with 2024.2.0 and other versions, and the issue was also not there, it worked just fine. So the issue seems to be specific only to the 2024.2.1 version.
In the example I use a CPU as the device (`sycl::cpu_selector_v`). On the devcloud I also tried `sycl::gpu_selector_v`, but the behavior was the same (with a lower number of iterations to make it quicker).
Thanks in advance for help/investigation. Feel free to ask for clarifications.
Jakub
and here is the example code:
#include <cstdio>
#include <sycl/sycl.hpp>
int main()
{
    int num_iters = 100000000;
    sycl::device d(sycl::cpu_selector_v);
    sycl::queue q1(d, sycl::property::queue::in_order());
    sycl::queue q2(d, sycl::property::queue::in_order());
    int * val = sycl::malloc_device<int>(1, q1);
    int * observed = sycl::malloc_device<int>(1, q1);
    q1.fill<int>(val, 0, 1);
    { // connecting the queues using a usual event
        // q1 long running task
        sycl::event e = q1.single_task([=](){
            for(int i = 0; i < num_iters; i++) {
                (*val)++;
            }
        });
        // q2 task
        q2.single_task(e, [=](){
            *observed = *val;
        });
    }
    // { // connecting the queues using sycl_ext_oneapi_in_order_queue_events
    //     // q1 long running task
    //     q1.single_task([=](){
    //         for(int i = 0; i < num_iters; i++) {
    //             (*val)++;
    //         }
    //     });
    //     sycl::event e = q1.ext_oneapi_get_last_event();
    //     q2.ext_oneapi_set_external_event(e);
    //     // q2 task
    //     q2.single_task([=](){
    //         *observed = *val;
    //     });
    // }
    // copy observed value
    int observed_host;
    q2.copy<int>(observed, &observed_host, 1);
    q1.wait();
    q2.wait();
    printf("Observed %d\n", observed_host);
    sycl::free(val, q1);
    sycl::free(observed, q1);
    return 0;
}
Thanks for the detailed explanation and reproducer. I'm escalating your issue to our internal team for further investigation.
In the meantime, can you try adding a q1.wait(); before
    q2.single_task(e, [=](){
        *observed = *val;
    });
in line 26?
I tried that already. With the q1.wait in between the tasks, it works correctly. But the program is then not fully asynchronous.
This issue has been fixed, and the fix will be included in a future release next year, likely around March.
