Intel® oneAPI DPC++/C++ Compiler
Talk to fellow users of Intel® oneAPI DPC++/C++ Compiler and companion tools like Intel® oneAPI DPC++ Library, Intel® DPC++ Compatibility Tool, and Intel® Distribution for GDB*

Unable to Simultaneously Execute Multiple Kernels on Device Using Out-of-Order Queue

-Light-
Novice

Hello,
I have been using SYCL programming to perform parallel computations, but I have noticed that kernel functions are not executing in parallel. I found a relevant document (Intel Optimization Guide for GPU) and conducted tests based on the example program provided there.I don't know why the code can't be attached, please see the comments.

I ran the tests on two different systems and found that the execution times with in-order and out-of-order queues were nearly identical. Inspecting the timeline view in VTune Profiler, I saw that none of the GPU computing tasks executed in parallel, regardless of whether an in-order or out-of-order queue was used.

System Environment 1:
12th Gen Intel(R) Core(TM) i7-12700H @ 2.30 GHz
Intel(R) Arc(TM) A370M Graphics
Windows 10 IoT Enterprise 22H2
System Environment 2:
13th Gen Intel(R) Core(TM) i7-13700 @ 2.10 GHz
Intel(R) UHD Graphics 770
Windows 11 Home Chinese Edition

I would appreciate any insights into the reasons behind this issue and possible solutions. Thank you.

2 Replies
-Light-
Novice

The test code is as follows:

#include <CL/sycl.hpp>
#include <chrono>
#include <iostream>
#include <thread> // for std::this_thread::sleep_for
#include <vector>

using namespace sycl;

// IntArray is simply a std::vector of int
using IntArray = std::vector<int>;

const size_t array_size = 1024; // number of elements per array
const int iter = 10;            // number of timing iterations

int multi_queue(sycl::queue& q, const IntArray& a, const IntArray& b) {
  IntArray s1(array_size), s2(array_size), s3(array_size);

  //buffer<int, 1> a_buf(a.data(), range<1>(array_size));
  //buffer<int, 1> b_buf(b.data(), range<1>(array_size));
  buffer<int, 1> sum_buf1(s1.data(), range<1>(array_size));
  buffer<int, 1> sum_buf2(s2.data(), range<1>(array_size));
  buffer<int, 1> sum_buf3(s3.data(), range<1>(array_size));

  size_t num_groups = 1;
  size_t wg_size = 256;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iter; i++) {
    // Three submissions writing to different buffers: no data dependencies,
    // so an out-of-order queue is free to run them concurrently.
    q.submit([&](sycl::handler& h) {
      //sycl::accessor a_acc(a_buf, h, sycl::read_only);
      //sycl::accessor b_acc(b_buf, h, sycl::read_only);
      sycl::accessor sum_acc(sum_buf1, h, sycl::write_only, sycl::no_init);

      h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size),
                     [=](sycl::nd_item<1> index) {
                       size_t loc_id = index.get_local_id();
                       sum_acc[loc_id] = 0;
                       for (int j = 0; j < 1000; j++)
                         for (size_t k = loc_id; k < array_size; k += wg_size) {
                           sum_acc[loc_id] += /*a_acc[k] + b_acc[k]*/ 1;
                         }
                     });
    });
    q.submit([&](sycl::handler& h) {
      //sycl::accessor a_acc(a_buf, h, sycl::read_only);
      //sycl::accessor b_acc(b_buf, h, sycl::read_only);
      sycl::accessor sum_acc(sum_buf2, h, sycl::write_only, sycl::no_init);

      h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size),
                     [=](sycl::nd_item<1> index) {
                       size_t loc_id = index.get_local_id();
                       sum_acc[loc_id] = 0;
                       for (int j = 0; j < 1000; j++)
                         for (size_t k = loc_id; k < array_size; k += wg_size) {
                           sum_acc[loc_id] += /*a_acc[k] + b_acc[k]*/ 2;
                         }
                     });
    });
    q.submit([&](sycl::handler& h) {
      //sycl::accessor a_acc(a_buf, h, sycl::read_only);
      //sycl::accessor b_acc(b_buf, h, sycl::read_only);
      sycl::accessor sum_acc(sum_buf3, h, sycl::write_only, sycl::no_init);

      h.parallel_for(sycl::nd_range<1>(num_groups * wg_size, wg_size),
                     [=](sycl::nd_item<1> index) {
                       size_t loc_id = index.get_local_id();
                       sum_acc[loc_id] = 0;
                       for (int j = 0; j < 1000; j++)
                         for (size_t k = loc_id; k < array_size; k += wg_size) {
                           sum_acc[loc_id] += /*a_acc[k] + b_acc[k]*/ 3;
                         }
                     });
    });
  }
  q.wait();
  auto end = std::chrono::steady_clock::now();
  auto duration =
      std::chrono::duration_cast<std::chrono::microseconds>(end - start);
  std::cout << "multi_queue completed on device - took "
            << duration.count() << " u-secs\n";

  host_accessor result1(sum_buf1, read_only);
  host_accessor result2(sum_buf2, read_only);
  host_accessor result3(sum_buf3, read_only);

  //for (size_t i = 0; i < 2; ++i) {
  //  std::cout << "s1[" << i << "] = " << result1[i] << ",";
  //  std::cout << "s2[" << i << "] = " << result2[i] << ",";
  //  std::cout << "s3[" << i << "] = " << result3[i] << "\n";
  //}

  // check results
  return static_cast<int>((end - start).count());
} // end multi_queue


int main() {
  // Device selector
  default_selector d_selector;

  // Arrays a and b, populated with ones
  IntArray a(array_size, 1), b(array_size, 1);

  // In-order queue property
  property_list q_prop{ property::queue::in_order() };

  std::cout << "In order queue: Jitting+Execution time\n";
  queue q1(d_selector, q_prop);
  multi_queue(q1, a, b);
  std::this_thread::sleep_for(std::chrono::milliseconds(500));
  //usleep(500 * 1000);
  std::cout << "In order queue: Execution time\n";
  multi_queue(q1, a, b);

  // Out-of-order queue (the default when no property is given)
  queue q2(d_selector);
  std::cout << "Out of order queue: Jitting+Execution time\n";
  multi_queue(q2, a, b);
  std::this_thread::sleep_for(std::chrono::milliseconds(500));
  std::cout << "Out of order queue: Execution time\n";
  multi_queue(q2, a, b);

  return 0;
}

-Light-
Novice

 

Refer to the relevant documentation (Intel Optimization Guide for GPU): Executing Multiple Kernels on the Device at the Same Time (intel.com)
