Intel® oneAPI DPC++/C++ Compiler

How do I correctly program this kernel in SYCL from OpenACC?

gamersensual14
New Contributor I

Hi Intel team! Happy to be here for the first time.

 

I have this code:

#pragma acc kernels
#pragma acc loop seq
for(i=0; i<bands; i++)
{
  mean=0;

  #pragma acc loop seq
  for(j=0; j<N; j++)
    mean+=(image[(i*N)+j]);

  mean/=N;
  meanSpect[i]=mean;

  #pragma acc loop
  for(j=0; j<N; j++)
    image[(i*N)+j]=image[(i*N)+j]-mean;
}

As you can see, the outer loop is marked to run sequentially (single-threaded), and so is the first inner loop, but the last inner loop can be parallelized, so I do that.

My question is, how do I translate this to SYCL? Do I put everything inside one q.submit() and then inside create a parallel_for() only for the parallel region? Would that be possible (and correct)?

 

Second question, the above code continues as follows:

#pragma acc parallel loop collapse(2)
for(j=0; j<bands; j++)
  for(i=0; i<bands; i++)
    Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);

How do I express the collapse() clause in SYCL? Does an equivalent exist, or do I have to program it in another way?

 

Thank you very much in advance.

7 Replies
PradeepP_Intel
Moderator

Thanks for reaching out to us!

 

>>My question is, how do I translate this to SYCL? Do I put everything inside one q.submit() and then inside create a parallel_for() only for the parallel region? Would that be possible (and correct)?

 

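q.submit() takes a command group, whose job is to declare the kernel's dependencies (memory objects and accessors) and launch it, so keeping serial host-style code inside q.submit() is not recommended. Serial work can stay on the host (a serial kernel is possible with single_task, but it rarely pays off on a GPU), and only the parallel loop needs to be submitted to the GPU/accelerator. The code can be parallelized as below: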

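//#pragma acc kernels
//#pragma acc loop seq
for(i=0; i<bands; i++)
{
  // Serial part: stays on the host.
  mean=0;
  //#pragma acc loop seq
  for(j=0; j<N; j++)
     mean+=(image[(i*N)+j]);

  mean/=N;
  meanSpect[i]=mean;
  // #pragma acc loop
  // Parallel part: one work-item per element of band i.
  // image must be accessible on the device (for example, USM shared memory).
  q.submit([&](auto &h) {
    // Execute kernel.
    h.parallel_for(range(N), [=](auto j) {
       image[(i*N)+j]=image[(i*N)+j]-mean;
    });
  }).wait();
}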
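
To check the kernel's output, a plain serial reference of the same computation can be handy. The sketch below is only an illustration in standard C++ (no SYCL): the function name center_bands, the use of std::vector, and the float element type are assumptions, since the original post does not show its declarations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (name and float type are assumptions).
// For each band i: compute the mean of its N pixels, store it in
// meanSpect[i], then subtract that mean from every pixel of the band.
void center_bands(std::vector<float>& image, std::vector<float>& meanSpect,
                  std::size_t bands, std::size_t N) {
    assert(N > 0 && image.size() == bands * N && meanSpect.size() == bands);
    for (std::size_t i = 0; i < bands; ++i) {
        float mean = 0.0f;
        for (std::size_t j = 0; j < N; ++j)  // serial sum over the band
            mean += image[i * N + j];
        mean /= static_cast<float>(N);
        meanSpect[i] = mean;
        for (std::size_t j = 0; j < N; ++j)  // the loop the kernel parallelizes
            image[i * N + j] -= mean;
    }
}
```

Running this on a copy of the input and comparing it element by element with the device result (within a small floating-point tolerance) is one way to validate the port.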

 

>>Second question, the above code continues as follows:

>>How do I express the collapse() clause in SYCL? Does an equivalent exist, or do I have to program it in another way?

 

SYCL has no collapse clause. Its functionality can be achieved with a two-dimensional range, as below:

// pragma acc parallel loop collapse(2)
q.submit([&](auto &h) {
  // Execute kernel.
  h.parallel_for(range(bands, bands), [=](item<2> index) {
     int j = index.get_global_id(0);
     int i = index.get_global_id(1);
     Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
  });
}).wait();
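
For intuition: collapse(2) merges the two loops into a single iteration space of bands * bands independent iterations, and a two-dimensional range plays the same role. The plain C++ sketch below (serial, no SYCL) walks that merged space with one loop and recovers i and j from the linear index. The names unflatten and correlation_flat and the float element type are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Recover (i, j) from a linear index k in [0, bands * bands).
std::pair<std::size_t, std::size_t> unflatten(std::size_t k, std::size_t bands) {
    assert(bands > 0 && k < bands * bands);
    return {k / bands, k % bands};
}

// Hypothetical serial equivalent of the collapsed loop nest: every value
// of k is independent, which is what makes the nest safe to parallelize.
void correlation_flat(std::vector<float>& Corr, const std::vector<float>& Cov,
                      const std::vector<float>& meanSpect, std::size_t bands) {
    assert(Corr.size() == bands * bands && Cov.size() == bands * bands);
    assert(meanSpect.size() == bands);
    for (std::size_t k = 0; k < bands * bands; ++k) {
        const std::pair<std::size_t, std::size_t> ij = unflatten(k, bands);
        const std::size_t i = ij.first;
        const std::size_t j = ij.second;
        Corr[i * bands + j] = Cov[i * bands + j] + meanSpect[i] * meanSpect[j];
    }
}
```

With a two-dimensional range the runtime hands each work-item its (i, j) pair directly, so no division or modulo is needed in the kernel itself.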

 

Please refer to the Intel oneAPI programming guide below:

https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-programming-model/data-parallel-c-dpc.html

 

Note: The snippets above omit memory management. Please create the memory objects and manage their accessors as needed.

 

Regards,

Pradeep

gamersensual14
New Contributor I

Hi Pradeep, thanks for the detailed answer.

 

The first part worked great! But I have a problem with the second.

 

It seems that this:

q.submit([&](auto &h) {
  // Execute kernel.
  h.parallel_for(range(bands, bands), [=](item<2> index) {
    int j = index.get_global_id(0);
    int i = index.get_global_id(1);

isn't possible, as get_global_id(int dimension) is a member of nd_item, not item. The item class has get_id(int dimension); does it return the same value?

If not, I believe I have to change range to nd_range so that it can be used with nd_item, right? I tried, but couldn't find the right way to construct it.

 

Hopefully you can help me with the above information.

 

I ended up with this, but at run time I get a fatal error:

const int bands_sycl = 188;

my_queue.submit([&](auto &h) {
  h.parallel_for(sycl::nd_range<2>(sycl::range{bands_sycl,bands_sycl}, sycl::range{bands_sycl,bands_sycl}), [=](sycl::nd_item<2> index) {
    int j = index.get_global_id(0);
    int i = index.get_global_id(1);

    Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
  });
}).wait();

The error:

terminate called after throwing an instance of 'cl::sycl::nd_range_error'
what(): Total number of work-items in a work-group cannot exceed 8192 for this kernel -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted

 

Looking forward to your answer! And thank you!

PradeepP_Intel
Moderator

Hi,

 

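As the error message says, the number of work-items requested per work-group exceeds the maximum work-group size of the hardware. The nd_range asks for a single 188 x 188 work-group, i.e. 35,344 work-items, while this kernel allows at most 8192.

That's why we are observing this issue.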

 

The number of work-items in a work-group must never exceed the device's maximum work-group size.

On the Intel oneAPI DevCloud, the "clinfo" command prints all the relevant device information (max work-item sizes, max work-group size, ...) on the command line.

 

The same values can be queried from DPC++/SYCL code using the APIs below:

  • device.get_info<cl::sycl::info::device::max_work_group_size>()
  • device.get_info<cl::sycl::info::device::max_work_item_sizes>()
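
Once the limit is known, one way to keep an nd_range valid is to pick a local size that divides the global size evenly in each dimension and whose product stays within that limit. The helper below is a hypothetical sketch in plain C++: the name pick_local_size and the square-tile strategy are assumptions, not part of SYCL, and the limit passed in would come from the max_work_group_size query above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: for a square 2-D problem of global x global
// work-items, return the largest d that divides `global` evenly and
// satisfies d * d <= max_group_size, so a d x d work-group is allowed.
std::size_t pick_local_size(std::size_t global, std::size_t max_group_size) {
    assert(global > 0 && max_group_size > 0);
    std::size_t best = 1;  // a 1 x 1 work-group is always valid
    for (std::size_t d = 1; d <= global; ++d) {
        if (d * d > max_group_size) break;  // larger d only gets worse
        if (global % d == 0) best = d;
    }
    return best;
}
```

For the 188 x 188 problem with a limit of 8192, this returns 47, i.e. 47 x 47 = 2209 work-items per work-group. The per-dimension limits reported by max_work_item_sizes would need to be respected as well; this sketch does not check them.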

 

Thanks & Regards,

Pradeep

 

gamersensual14
New Contributor I

Hi Pradeep, how could I program this without exceeding the maximum number of work-items per work-group?

const int bands_sycl = 188;

my_queue.submit([&](auto &h) {
  h.parallel_for(sycl::nd_range<2>(sycl::range{bands_sycl,bands_sycl}, sycl::range{bands_sycl,bands_sycl}), [=](sycl::nd_item<2> index) {
    int j = index.get_global_id(0);
    int i = index.get_global_id(1);

    Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
  });
}).wait();

Should I make it one-dimensional and then run one of the original for loops serially on the device, for example like this?

#pragma acc parallel loop collapse(2)
for(j=0; j<bands; j++)    // this loop would be parallelized
  for(i=0; i<bands; i++)  // this loop would be executed serially on the device
    Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);

 

Looking forward to your answer.

 

Thank you!

PradeepP_Intel
Moderator

Hi,

 

In the code snippet below, only the global range is given, so the SYCL runtime decides how to divide the work-items into work-groups. Hence it won't run into the issue of exceeding the maximum number of work-items per work-group.

q.submit([&](auto &h) {
   // Only the global range is specified; the runtime picks the work-group size.
   h.parallel_for(range<2>(bands_sycl,bands_sycl), [=](auto index) {
     int i = index[1];
     int j = index[0];
     Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
   });
}).wait();

 

Thanks,

Pradeep

 

PradeepP_Intel
Moderator

Hi, 


Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.

Any further interaction in this thread will be considered community only.


Regards,

Pradeep

