Hello,
I am using DevCloud to run my DPC++ code on FPGA hardware for acceleration, on a node with an Arria 10 and oneAPI. I was able to run the fpga_emu binary and the results were as expected, but when I target the FPGA hardware I get this error:
Caught a SYCL host exception:
Non-uniform work-groups are not supported by the target device -54 (CL_INVALID_WORK_GROUP_SIZE)
terminate called after throwing an instance of 'cl::sycl::nd_range_error'
what(): Non-uniform work-groups are not supported by the target device -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted
I don't see any problem with the work-group sizes. Here is the relevant code:
range<1> num_items{dataset.size()};
res.resize(dataset.size());
buffer dataset_buf(linear_dataset);
buffer curr_test_buf(curr_test);
buffer res_buf(res.data(), num_items);
std::cout << "submit a job" << std::endl;
// auto start = std::chrono::high_resolution_clock::now();
{
    q.submit([&](handler& h) {
        accessor a(dataset_buf, h, read_only);
        accessor b(curr_test_buf, h, read_only);
        accessor dif(res_buf, h, read_write, no_init);
        h.parallel_for_work_group(range<1>(32), range<1>(500), [=](group<1> g) {
            g.parallel_for_work_item([&](h_item<1> item) {
                int i = item.get_global_id(0);
                for (int j = 0; j < 5; ++j) {
                    dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
                }
                // out << "i : " << i << " i[0]: " << i[0] << " b: " << b[0] << cl::sycl::endl;
            });
        });
    }).wait();
}
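To rule out a device limit, one thing I still want to check is the maximum work-group size the runtime reports for this board, since 500 work-items per group may simply be more than the compiled kernel supports. A minimal sketch of that check, assuming the same queue q and namespace setup as above:

// Sketch: print what the device reports before picking (num_groups, group_size).
// Assumes the existing queue `q` and the same using-directive as the code above.
auto dev = q.get_device();
std::cout << "Device: "
          << dev.get_info<info::device::name>() << std::endl;
std::cout << "max_work_group_size: "
          << dev.get_info<info::device::max_work_group_size>() << std::endl;
std::cout << "max_compute_units: "
          << dev.get_info<info::device::max_compute_units>() << std::endl;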
I previously used a plain parallel_for, as shown below, but it took a huge amount of time on the FPGA hardware and gave no acceleration at all, which is why I thought of work-groups:
range<1> num_items{dataset.size()};
std::vector<double> res;
res.resize(dataset.size());
buffer dataset_buf(linear_dataset);
buffer curr_test_buf(curr_test);
buffer res_buf(res.data(), num_items);
std::cout << "submit a job" << std::endl;
// auto start = std::chrono::high_resolution_clock::now();
{
    q.submit([&](handler& h) {
        accessor a(dataset_buf, h, read_only);
        accessor b(curr_test_buf, h, read_only);
        accessor dif(res_buf, h, read_write, no_init);
        h.parallel_for(num_items, [=](auto i) {
            // dif[i] = a[i].size() * 1.0; // a[i];
            for (int j = 0; j < 5; ++j) {
                dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
            }
            // out << "i : " << i << " i[0]: " << i[0] << " b: " << b[0] << cl::sycl::endl;
        });
    }).wait();
}
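Given how slow the plain parallel_for was, I am also wondering whether a single work-item kernel would fit the FPGA better, since (as far as I understand) the FPGA compiler can then pipeline one deep loop instead of scheduling many work-items. A rough sketch of that idea for the same computation, assuming the same buffers and queue (the kernel name SumSqDiffSingleTask is just a placeholder, and I have not compiled this for hardware):

// Sketch: single work-item kernel, one outer loop the FPGA compiler can
// pipeline, with the short inner loop fully unrolled.
// Assumes the same dataset_buf / curr_test_buf / res_buf buffers as above.
q.submit([&](handler& h) {
    accessor a(dataset_buf, h, read_only);
    accessor b(curr_test_buf, h, read_only);
    accessor dif(res_buf, h, write_only, no_init);
    const int n = static_cast<int>(num_items.get(0));
    h.single_task<class SumSqDiffSingleTask>([=]() {
        for (int i = 0; i < n; ++i) {
            double acc = 0.0;
            #pragma unroll
            for (int j = 0; j < 5; ++j) {
                double d = b[j] - a[i * 5 + j];
                acc += d * d;
            }
            dif[i] = acc;   // full sum is computed in-kernel, so no += needed
        }
    });
}).wait();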
Thanks a lot!
Hi amaltaha,
Could you share the project you are trying to run with me through email? I can try to run it on my side and see.
Thanks.
Regards,
Aik Eu
Hello Aik Eu!
I wanted speed efficiency, so I tried to split the 16,000 samples (each with 5 double-precision features) into smaller chunks, but it didn't work.
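If it helps, this is the direction I am thinking of trying next: making the split uniform with an explicit nd_range, padding the global size up to a multiple of the work-group size and masking the extra work-items inside the kernel. Only a sketch, with a hypothetical work-group size of 256 and a placeholder kernel name, assuming the same buffers and queue as before; I have not run it on the hardware yet:

// Sketch: round the global size up to a multiple of WG so every work-group
// has the same size, and guard the padding items inside the kernel.
constexpr size_t WG = 256;                     // hypothetical work-group size
size_t n = dataset.size();                     // 16,000 samples
size_t global = ((n + WG - 1) / WG) * WG;      // padded to a multiple of WG

q.submit([&](handler& h) {
    accessor a(dataset_buf, h, read_only);
    accessor b(curr_test_buf, h, read_only);
    accessor dif(res_buf, h, write_only, no_init);
    h.parallel_for<class PaddedSumSqDiff>(
        nd_range<1>(range<1>(global), range<1>(WG)), [=](nd_item<1> it) {
            size_t i = it.get_global_id(0);
            if (i < n) {                       // ignore the padding work-items
                double acc = 0.0;
                for (int j = 0; j < 5; ++j) {
                    double d = b[j] - a[i * 5 + j];
                    acc += d * d;
                }
                dif[i] = acc;
            }
        });
}).wait();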
Thank you!
Hi amaltaha,
Do you mean the error is still there, or is it due to how your design handles the split?
Thanks.
Regards,
Aik Eu