Intel® oneAPI DPC++/C++ Compiler
Talk to fellow users of Intel® oneAPI DPC++/C++ Compiler and companion tools like Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and Intel® Distribution for GDB*

Unable to Simultaneously Execute Multiple Kernels on Device Using Out-of-Order Queue

-Light-
Novice

Hello,
I have been using SYCL to perform parallel computations, but I have noticed that my kernel functions are not executing in parallel on the device. I found a relevant document (the Intel Optimization Guide for GPU) and ran tests based on the example program provided there. For some reason I cannot attach the code to this post, so please see my reply below.

I ran the tests in two different system environments and found that the execution times with the in-order and out-of-order queues were nearly identical. Inspecting the timeline view in VTune Profiler, I saw that none of the GPU computing tasks executed in parallel, regardless of whether an in-order or out-of-order queue was used.
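For reference, the only difference between the two configurations is how the queue is constructed; stripped down to just that (the full test code is in my reply below):

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  default_selector d_selector;

  // In-order queue: kernels execute strictly in submission order.
  queue q_in(d_selector, property_list{ property::queue::in_order() });

  // Out-of-order queue (the default when no property is passed):
  // independent kernels may, in principle, execute concurrently.
  queue q_out(d_selector);
  return 0;
}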

System Environment 1:
12th Gen Intel(R) Core(TM) i7-12700H @ 2.30 GHz
Intel(R) Arc(TM) A370M Graphics
Windows 10 IoT Enterprise 22H2
System Environment 2:
13th Gen Intel(R) Core(TM) i7-13700 @ 2.10 GHz
Intel(R) UHD Graphics 770
Windows 11 Home Chinese Edition

I would appreciate any insights into the reasons behind this issue and possible solutions. Thank you.

-Light-
Novice

The test code is as follows:

#include <CL/sycl.hpp>
#include <chrono>
#include <iostream>
#include <thread> // for std::this_thread::sleep_for
#include <vector> // for std::vector

using namespace sycl;

// IntArray is the array type used by the guide's example
using IntArray = std::vector<int>;

const size_t array_size = 1024; // array size
const int iter = 10;            // number of iterations

int multi_queue(sycl::queue& q, const IntArray& a, const IntArray& b) {
  IntArray s1(array_size), s2(array_size), s3(array_size);

  //buffer<int, 1> a_buf(a.data(), range<1>(array_size));
  //buffer<int, 1> b_buf(b.data(), range<1>(array_size));
  buffer<int, 1> sum_buf1(s1.data(), range<1>(array_size));
  buffer<int, 1> sum_buf2(s2.data(), range<1>(array_size));
  buffer<int, 1> sum_buf3(s3.data(), range<1>(array_size));

  size_t num_groups = 1;
  size_t wg_size = 256;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    // The three kernels below write to distinct buffers, so there are no
    // data dependencies between them; on an out-of-order queue they should
    // be eligible to run concurrently.
    q.submit([&](sycl::handler& h) {
      //sycl::accessor a_acc(a_buf, h, sycl::read_only);
      //sycl::accessor b_acc(b_buf, h, sycl::read_only);
      sycl::accessor sum_acc(sum_buf1, h, sycl::write_only, sycl::no_init);

      h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size),
                     [=](sycl::nd_item<1> index) {
                       size_t loc_id = index.get_local_id();
                       sum_acc[loc_id] = 0;
                       for (int j = 0; j < 1000; j++)
                         for (size_t i = loc_id; i < array_size; i += wg_size) {
                           sum_acc[loc_id] += /*a_acc[i] + b_acc[i]*/1;
                         }
                     });
    });
    q.submit([&](sycl::handler& h) {
      //sycl::accessor a_acc(a_buf, h, sycl::read_only);
      //sycl::accessor b_acc(b_buf, h, sycl::read_only);
      sycl::accessor sum_acc(sum_buf2, h, sycl::write_only, sycl::no_init);

      h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size),
                     [=](sycl::nd_item<1> index) {
                       size_t loc_id = index.get_local_id();
                       sum_acc[loc_id] = 0;
                       for (int j = 0; j < 1000; j++)
                         for (size_t i = loc_id; i < array_size; i += wg_size) {
                           sum_acc[loc_id] += /*a_acc[i] + b_acc[i]*/2;
                         }
                     });
    });
    q.submit([&](sycl::handler& h) {
      //sycl::accessor a_acc(a_buf, h, sycl::read_only);
      //sycl::accessor b_acc(b_buf, h, sycl::read_only);
      sycl::accessor sum_acc(sum_buf3, h, sycl::write_only, sycl::no_init);

      h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size),
                     [=](sycl::nd_item<1> index) {
                       size_t loc_id = index.get_local_id();
                       sum_acc[loc_id] = 0;
                       for (int j = 0; j < 1000; j++)
                         for (size_t i = loc_id; i < array_size; i += wg_size) {
                           sum_acc[loc_id] += /*a_acc[i] + b_acc[i]*/3;
                         }
                     });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
  std::cout << "multi_queue completed on device - took "
            << duration.count() << " u-secs\n";

  host_accessor result1(sum_buf1, read_only);
  host_accessor result2(sum_buf2, read_only);
  host_accessor result3(sum_buf3, read_only);

  //for (size_t i = 0; i < 2; ++i) {
  //  std::cout << "s1[" << i << "] = " << result1[i] << ",";
  //  std::cout << "s2[" << i << "] = " << result2[i] << ",";
  //  std::cout << "s3[" << i << "] = " << result3[i] << "\n";
  //}

  // check results
  return ((end - start).count()); // elapsed steady_clock ticks
} // end multi_queue


int main() {
  // Define the device selector
  default_selector d_selector;

  // Define arrays a and b of type IntArray and populate them
  IntArray a(array_size, 1), b(array_size, 1);

  // Create an in-order queue via the queue property
  property_list q_prop{ property::queue::in_order() };

  std::cout << "In order queue: Jitting+Execution time\n";
  queue q1(d_selector, q_prop);
  multi_queue(q1, a, b);
  std::this_thread::sleep_for(std::chrono::milliseconds(500));
  //usleep(500 * 1000);
  std::cout << "In order queue: Execution time\n";
  multi_queue(q1, a, b);

  // Create an out-of-order queue (the default, no queue properties)
  queue q2(d_selector);
  std::cout << "Out of order queue: Jitting+Execution time\n";
  multi_queue(q2, a, b);
  std::this_thread::sleep_for(std::chrono::milliseconds(500));
  std::cout << "Out of order queue: Execution time\n";
  multi_queue(q2, a, b);

  return 0;
}
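
One way to cross-check the VTune timeline is to read per-kernel device timestamps via SYCL event profiling. Below is a minimal sketch (not part of the test above, and assuming the queue is created with property::queue::enable_profiling()); collecting the [start, end] window for each of the three submissions would show directly whether the device ever overlaps them:

#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  // Queue with profiling enabled so its events carry device timestamps.
  queue q(default_selector{},
          property_list{ property::queue::enable_profiling() });

  // Submit a trivial kernel and keep the returned event.
  event e = q.submit([&](handler& h) {
    h.parallel_for(range<1>(256), [=](id<1>) { /* kernel body */ });
  });
  e.wait();

  // Device-side start/end timestamps in nanoseconds.
  auto t0 = e.get_profiling_info<info::event_profiling::command_start>();
  auto t1 = e.get_profiling_info<info::event_profiling::command_end>();
  std::cout << "kernel device time: [" << t0 << ", " << t1 << "] ns\n";

  // If the [t0, t1] windows of independent kernels never overlap, the
  // device serialized them even on an out-of-order queue.
  return 0;
}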

-Light-
Novice
Refer to the relevant documentation (Intel Optimization Guide for GPU): Executing Multiple Kernels on the Device at the Same Time (intel.com)
