Hi Intel team! Happy to be here for the first time.
I have this code:
#pragma acc kernels
#pragma acc loop seq
for(i=0; i<bands; i++)
{
    mean=0;
    #pragma acc loop seq
    for(j=0; j<N; j++)
        mean+=(image[(i*N)+j]);
    mean/=N;
    meanSpect[i]=mean;
    #pragma acc loop
    for(j=0; j<N; j++)
        image[(i*N)+j]=image[(i*N)+j]-mean;
}
As you can see, the outer loop and the first inner loop are marked to run sequentially (single-threaded), while the last inner loop can be parallelized, so I parallelize it.
My question is, how do I translate this to SYCL? Do I put everything inside one q.submit() and then inside create a parallel_for() only for the parallel region? Would that be possible (and correct)?
Second question, the above code continues as follows:
#pragma acc parallel loop collapse(2)
for(j=0; j<bands; j++)
    for(i=0; i<bands; i++)
        Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
How do I express the collapse() clause in SYCL? Does it exist, or do I have to program it another way?
Thank you very much in advance.
Thanks for reaching out to us!
>>My question is, how do I translate this to SYCL? Do I put everything inside one q.submit() and then inside create a parallel_for() only for the parallel region? Would that be possible (and correct)?
q.submit() is meant for offloading kernels to the GPU/accelerator. Keeping serial host code inside q.submit() is not recommended, because the command group is used to handle dependencies (memory and accessors) for launching the kernel. The code can be parallelized as follows:
//#pragma acc kernels
//#pragma acc loop seq
for(i=0; i<bands; i++)
{
    mean=0;
    //#pragma acc loop seq
    for(j=0; j<N; j++)
        mean+=(image[(i*N)+j]);
    mean/=N;
    meanSpect[i]=mean;
    // #pragma acc loop
    q.submit([&](auto &h) {
        // Execute kernel.
        h.parallel_for(range(N), [=](auto j) {
            image[(i*N)+j]=image[(i*N)+j]-mean;
        });
    }).wait();
}
>>Second question, the above code continues as follows:
How do I indicate the collapse() tag in SYCL? Does it exist or do I have to program it in other way?
SYCL has no collapse clause. Its functionality can be achieved with a two-dimensional range, as below:
// pragma acc parallel loop collapse(2)
q.submit([&](auto &h) {
    // Execute kernel.
    h.parallel_for(range(bands, bands), [=](item<2> index) {
        int j = index.get_global_id(0);
        int i = index.get_global_id(1);
        Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
    });
}).wait();
Kindly refer to the Intel oneAPI Developer's guide.
Note: Please create memory objects and manage their accessors as needed.
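To make the collapse(2) idea concrete, here is a minimal host-only sketch (plain C++, no SYCL) of what collapsing the two loops means: a single flat index k covers the whole bands x bands space and is split back into (j, i). The function name corr_reference and the use of double are assumptions made for illustration; the thread does not state the element type. It can also serve as a reference to check the device result against.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for the Corr computation.
// One flat index k walks the collapsed (j, i) iteration space,
// which is what collapse(2) or a 2D SYCL range expresses.
std::vector<double> corr_reference(const std::vector<double>& cov,
                                   const std::vector<double>& mean_spect) {
    const std::size_t bands = mean_spect.size();
    assert(cov.size() == bands * bands);  // Cov is assumed to be bands x bands
    std::vector<double> corr(bands * bands);
    for (std::size_t k = 0; k < bands * bands; ++k) {
        const std::size_t j = k / bands;  // index of the original outer loop
        const std::size_t i = k % bands;  // index of the original inner loop
        corr[i * bands + j] = cov[i * bands + j] + mean_spect[i] * mean_spect[j];
    }
    return corr;
}
```

Every (i, j) pair is independent of the others, which is why the two loops can be mapped onto a two-dimensional range without any extra synchronization.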
Regards,
Pradeep
Hi Pradeep, thanks for the detailed answer.
The first part worked great! But I have a problem with the second.
It seems that this:
q.submit([&](auto &h) {
    // Execute kernel.
    h.parallel_for(range(bands, bands), [=](item<2> index) {
        int j = index.get_global_id(0);
        int i = index.get_global_id(1);
isn't possible, as get_global_id(int dimension) is a member of nd_item, not item. The item class has get_id(int dimension); does it return the same value?
If not, I believe I have to change range to nd_range so it can be used with nd_item, right? I tried to, but couldn't find the right way of constructing it.
Hopefully you can help me with the above information.
I ended up coming up with this, but in the execution I get a fatal error:
const int bands_sycl = 188;
my_queue.submit([&](auto &h) {
    h.parallel_for(sycl::nd_range<2>(sycl::range{bands_sycl,bands_sycl},
                                     sycl::range{bands_sycl,bands_sycl}),
                   [=](sycl::nd_item<2> index) {
        int j = index.get_global_id(0);
        int i = index.get_global_id(1);
        Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
    });
}).wait();
The error:
terminate called after throwing an instance of 'cl::sycl::nd_range_error'
what(): Total number of work-items in a work-group cannot exceed 8192 for this kernel -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted
Looking forward to your answer! And thank you!
Hi,
As the error message says, the number of work-items requested per work-group exceeds the maximum work-group size the hardware supports for this kernel. The nd_range in your snippet uses the full 188 x 188 range as the local size, i.e. a single work-group of 35,344 work-items, while the reported limit is 8192.
That's why we are observing this issue.
The number of work-items within a work-group must never exceed the maximum work-group size.
On Intel oneAPI DevCloud, the "clinfo" command prints the relevant device limits (max work-item sizes, max work-group size, ...) from the command line.
The same can be queried from the DPC++/SYCL code using the below APIs.
- device.get_info<cl::sycl::info::device::max_work_group_size>()
- device.get_info<cl::sycl::info::device::max_work_item_sizes>()
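Once the limit is known, a legal local size can be derived from it. The following is only a rough host-side sketch, assuming a square work-group and assuming the local size has to divide the global size evenly in each dimension; pick_local_edge is a hypothetical helper, not a SYCL function, and max_wg_size stands for the value returned by the max_work_group_size query.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of SYCL): returns the largest edge d
// such that d divides n and a d x d work-group stays within
// max_wg_size work-items.
std::size_t pick_local_edge(std::size_t n, std::size_t max_wg_size) {
    assert(n > 0 && max_wg_size > 0);
    std::size_t best = 1;  // a 1 x 1 work-group is always legal
    for (std::size_t d = 1; d <= n; ++d) {
        if (n % d == 0 && d * d <= max_wg_size) {
            best = d;
        }
    }
    return best;
}
```

For bands = 188 and a limit of 8192 the helper returns 47, which would correspond to sycl::nd_range<2>(sycl::range<2>(188, 188), sycl::range<2>(47, 47)) with 2209 work-items per group. The per-dimension limits from max_work_item_sizes are not checked in this sketch.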
Thanks & Regards,
Pradeep
Hi Pradeep, how could I program this without exceeding the maximum number of work-items per work-group?
const int bands_sycl = 188;
my_queue.submit([&](auto &h) {
    h.parallel_for(sycl::nd_range<2>(sycl::range{bands_sycl,bands_sycl},
                                     sycl::range{bands_sycl,bands_sycl}),
                   [=](sycl::nd_item<2> index) {
        int j = index.get_global_id(0);
        int i = index.get_global_id(1);
        Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
    });
}).wait();
Should I make it one dimensional and then put the "j" original for loop as a single task, for example?
#pragma acc parallel loop collapse(2)
for(j=0; j<bands; j++)      // This loop would be parallelized
    for(i=0; i<bands; i++)  // This loop would be executed serially on the device
        Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
Looking forward to your answer.
Thank you!
Hi,
In the code snippet below, only the global range is given, so the SYCL runtime decides how work-items are divided into work-groups. Hence it won't exceed the maximum number of work-items per work-group.
q.submit([&](auto &h) {
    h.parallel_for(range<2>(bands_sycl,bands_sycl), [=](auto index) {
        int i = index[1];
        int j = index[0];
        Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
    });
}).wait();
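If an explicit work-group size is still wanted later (for example, to use local memory), one common pattern is to pad the global range up to a multiple of the chosen local size and guard the kernel body with a bounds check. Below is a host-only sketch of that arithmetic in plain C++; round_up and count_active_items are hypothetical names, and the local edge of 16 is just an assumed example value.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of m.
std::size_t round_up(std::size_t n, std::size_t m) {
    assert(m > 0);
    return ((n + m - 1) / m) * m;
}

// Emulates the guarded kernel body on the host: iterates over the
// padded range and counts the work-items that pass the bounds check.
std::size_t count_active_items(std::size_t bands, std::size_t local_edge) {
    const std::size_t padded = round_up(bands, local_edge);
    std::size_t active = 0;
    for (std::size_t j = 0; j < padded; ++j) {
        for (std::size_t i = 0; i < padded; ++i) {
            // The same guard would be needed inside the device kernel,
            // otherwise the padding work-items index out of bounds.
            if (i < bands && j < bands) {
                ++active;
            }
        }
    }
    return active;
}
```

With bands = 188 and a local edge of 16, the padded global edge is 192, each work-group has 256 work-items, and exactly 188 x 188 = 35,344 work-items pass the guard.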
Thanks,
Pradeep
Hi,
Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Any further interaction in this thread will be considered community only.
Regards,
Pradeep