Solved: How do I correctly program this kernel in SYCL from OpenACC?

gamersensual14 · ‎04-20-2022

Hi Intel team! Happy to be here for the first time.

I have this code:

#pragma acc kernels
#pragma acc loop seq
for(i=0; i<bands; i++)
{
    mean=0;

    #pragma acc loop seq
    for(j=0; j<N; j++)
        mean+=(image[(i*N)+j]);

    mean/=N;
    meanSpect[i]=mean;

    #pragma acc loop
    for(j=0; j<N; j++)
        image[(i*N)+j]=image[(i*N)+j]-mean;
}

As you can see, the outer loop and the first inner loop are marked to run sequentially (single-threaded), but the last inner loop can be parallelized, so I parallelize it.

My question is, how do I translate this to SYCL? Do I put everything inside one q.submit() and then inside create a parallel_for() only for the parallel region? Would that be possible (and correct)?

Second question, the above code continues as follows:

#pragma acc parallel loop collapse(2)
for(j=0; j<bands; j++)
    for(i=0; i<bands; i++)
        Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);

How do I express the collapse() clause in SYCL? Does it exist, or do I have to program it some other way?

Thank you very much in advance.

PradeepP_Intel · ‎04-22-2022

Thanks for reaching out to us!

>>My question is, how do I translate this to SYCL? Do I put everything inside one q.submit() and then inside create a parallel_for() only for the parallel region? Would that be possible (and correct)?

q.submit() takes a command group, whose job is to set up dependencies (memory and accessors) and launch a kernel on the GPU/accelerator. Serial code could technically be run on the device with single_task(), but that is not recommended here: keep the sequential loops on the host and submit only the parallel loop. The code can be parallelized as below:

//#pragma acc kernels
//#pragma acc loop seq
for(i=0; i<bands; i++)
{
  mean=0;
  //#pragma acc loop seq
  for(j=0; j<N; j++)
     mean+=(image[(i*N)+j]);

  mean/=N;
  meanSpect[i]=mean;
  // #pragma acc loop
  q.submit([&](auto &h) {
    // Execute kernel.
    h.parallel_for(range(N), [=](auto j) {
       image[(i*N)+j]=image[(i*N)+j]-mean;
    });
  }).wait();
}
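
As a rough host-only illustration (plain C++, no SYCL required) of why only the last loop is a good candidate for parallel_for: the sum carries a dependency between iterations, while each subtraction touches a single element. The helper name center_band and the use of std::vector are assumptions made for this sketch and are not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: centers one band of `image` in place, returns its mean.
float center_band(std::vector<float>& image, std::size_t band, std::size_t N) {
    assert(N > 0 && (band + 1) * N <= image.size());

    // Sequential part (stays on the host in the answer above): every
    // iteration updates the same accumulator, so iterations depend on
    // each other.
    float mean = 0.0f;
    for (std::size_t j = 0; j < N; ++j)
        mean += image[band * N + j];
    mean /= static_cast<float>(N);

    // Parallel part: iteration j reads and writes only image[band*N + j],
    // so each iteration could be one work-item of a 1-D range(N).
    for (std::size_t j = 0; j < N; ++j)
        image[band * N + j] -= mean;

    return mean;
}
```

Calling center_band(image, i, N) for each i and storing the result in meanSpect[i] reproduces the outer loop of the original code.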

>>Second question, the above code continues as follows:

>>How do I express the collapse() clause in SYCL? Does it exist, or do I have to program it some other way?

In SYCL, there is no collapse clause. Its functionality can be achieved by launching the kernel over a two-dimensional range, as below:

// pragma acc parallel loop collapse(2)
q.submit([&](auto &h) {
  // Execute kernel.
  h.parallel_for(range(bands, bands), [=](item<2> index) {
     int j = index.get_global_id(0);
     int i = index.get_global_id(1);
     Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
  });
}).wait();
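
To make the correspondence with collapse(2) concrete, here is a small host-only sketch in plain C++ (no SYCL required). It fuses the two loops into one flat iteration space of bands*bands iterations and recovers (j, i) from the flat index, which is conceptually what a two-dimensional range provides with one work-item per (j, i) pair. The function name correlation and the std::vector parameters are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host version of the Corr kernel.
std::vector<float> correlation(const std::vector<float>& Cov,
                               const std::vector<float>& meanSpect,
                               std::size_t bands) {
    assert(Cov.size() == bands * bands && meanSpect.size() == bands);
    std::vector<float> Corr(bands * bands);

    // collapse(2): one loop over all bands*bands (j, i) pairs.
    for (std::size_t flat = 0; flat < bands * bands; ++flat) {
        std::size_t j = flat / bands;  // outer loop index (dimension 0)
        std::size_t i = flat % bands;  // inner loop index (dimension 1)
        Corr[i * bands + j] = Cov[i * bands + j] + meanSpect[i] * meanSpect[j];
    }
    return Corr;
}
```

Because every iteration writes a different element of Corr, the order in which the (j, i) pairs are visited does not affect the result.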

For more details, please refer to the Intel oneAPI programming guide:

https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-programming-model/data-parallel-c-dpc.html

Note: The snippets above assume that image, meanSpect, Cov and Corr are accessible on the device. Please create memory objects and manage their accessors as needed.

Regards,

Pradeep

gamersensual14 · ‎04-25-2022

Hi Pradeep, thanks for the detailed answer.

The first part worked great! But I have a problem with the second.

It seems that this:

q.submit([&](auto &h) {
    // Execute kernel.
    h.parallel_for(range(bands, bands), [=](item<2> index) {
        int j = index.get_global_id(0); int i = index.get_global_id(1);

Isn't possible, as get_global_id(int dimension) is from nd_item, not item. In the item class there is get_id(int dimension), does it return the same value?

If not, I believe I have to change range to nd_range in order to be used with nd_item right? I tried to but couldn't find the right way of constructing it.

Hopefully you can help me with the above information.

I ended up with the following, but at run time I get a fatal error:

const int bands_sycl = 188;

my_queue.submit([&](auto &h) {
    h.parallel_for(sycl::nd_range<2>(sycl::range{bands_sycl,bands_sycl}, sycl::range{bands_sycl,bands_sycl}), [=](sycl::nd_item<2> index) {
        int j = index.get_global_id(0);
        int i = index.get_global_id(1);

        Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
    });
}).wait();

The error:

terminate called after throwing an instance of 'cl::sycl::nd_range_error'
what(): Total number of work-items in a work-group cannot exceed 8192 for this kernel -54 (CL_INVALID_WORK_GROUP_SIZE)
Aborted

Looking forward to your answer! And thank you!

PradeepP_Intel · ‎04-27-2022

Hi,

As the error message says, the kernel requests more work-items per work-group than the device allows. The local range in your nd_range is 188 x 188 = 35,344 work-items, while the limit reported for this kernel is 8,192.

At any given moment, the number of work-items within a work-group must not exceed the maximum work-group size.

On the Intel oneAPI DevCloud, the "clinfo" command prints the relevant device limits (max work-item sizes, max work-group size, and so on) from the command line.

The same limits can be queried from DPC++/SYCL code using the calls below:

device.get_info<cl::sycl::info::device::max_work_group_size>()
device.get_info<cl::sycl::info::device::max_work_item_sizes>()
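
The arithmetic behind the error can be sketched in plain C++ (no SYCL required). The limit of 8192 is taken from the error message above; on a real device it would come from the max_work_group_size query. Both function names are hypothetical, and the sketch assumes a square local range that evenly divides the global range.

```cpp
#include <cassert>
#include <cstddef>

// True when a local range of lx * ly work-items fits within the limit.
bool fits_work_group(std::size_t lx, std::size_t ly, std::size_t max_wg) {
    return lx * ly <= max_wg;
}

// Largest d such that d divides n and a d x d work-group fits the limit.
// Returns 1 if no larger divisor qualifies.
std::size_t largest_square_local(std::size_t n, std::size_t max_wg) {
    assert(n > 0 && max_wg > 0);
    std::size_t best = 1;
    for (std::size_t d = 1; d <= n; ++d)
        if (n % d == 0 && d * d <= max_wg)
            best = d;
    return best;
}
```

With n = 188 and a limit of 8192, fits_work_group(188, 188, 8192) is false, and largest_square_local returns 47 (47 x 47 = 2,209 work-items). The per-dimension limits reported by max_work_item_sizes also apply, so a candidate like this should still be checked against them.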

Thanks & Regards,

Pradeep

gamersensual14 · ‎04-28-2022

Hi Pradeep, how could I program this without exceeding the maximum number of work-items per work-group?

const int bands_sycl = 188;

my_queue.submit([&](auto &h) {
    h.parallel_for(sycl::nd_range<2>(sycl::range{bands_sycl,bands_sycl}, sycl::range{bands_sycl,bands_sycl}), [=](sycl::nd_item<2> index) {
        int j = index.get_global_id(0);
        int i = index.get_global_id(1);

        Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
    });
}).wait();

Should I make it one dimensional and then put the "j" original for loop as a single task, for example?

#pragma acc parallel loop collapse(2)
for(j=0; j<bands; j++) <- This loop would be parallelized
    for(i=0; i<bands; i++) <- This loop would be executed serially on the device
        Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);

Looking forward to your answer.

Thank you!

PradeepP_Intel · ‎04-29-2022

Hi,

In the code snippet below, the kernel is launched with a plain range instead of an nd_range, so the SYCL runtime decides how to divide the work-items into work-groups. It therefore won't run into the issue of exceeding the maximum number of work-items per work-group.

q.submit([&](auto &h) {
   h.parallel_for(range<2>(bands_sycl,bands_sycl), [=](auto index) {
     int i = index[1];
     int j = index[0];
     Corr[(i*bands)+j] = Cov[(i*bands)+j]+(meanSpect[i] * meanSpect[j]);
   });
}).wait();
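
In the snippet above, index is deduced as a sycl::item<2>, so index[1] gives the same value as index.get_id(1). If an explicit work-group size is ever needed, one common alternative is to keep nd_range, pick a small local size, round the global size up to a multiple of it, and guard the kernel body against the padding. The host-only C++ sketch below simulates that pattern without SYCL. The names round_up and simulate_padded_launch and the local size of 16 are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rounds n up to the next multiple of m.
std::size_t round_up(std::size_t n, std::size_t m) {
    assert(m > 0);
    return (n + m - 1) / m * m;
}

// Host simulation of a padded 2-D launch: visits every global id of a
// padded x padded grid and counts how often each real (i, j) cell is hit.
std::vector<int> simulate_padded_launch(std::size_t bands, std::size_t local) {
    const std::size_t padded = round_up(bands, local);
    std::vector<int> hits(bands * bands, 0);
    for (std::size_t j = 0; j < padded; ++j) {
        for (std::size_t i = 0; i < padded; ++i) {
            // Guard: ids in the padding do no work.
            if (i >= bands || j >= bands) continue;
            ++hits[i * bands + j];
        }
    }
    return hits;
}
```

For bands = 188 and a local size of 16, the padded global size is 192 per dimension, each work-group holds 16 x 16 = 256 work-items, and every real cell is processed exactly once.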

Thanks,

Pradeep

gamersensual14 · ‎04-29-2022

It worked! Thanks a lot Pradeep!!

PradeepP_Intel · ‎05-01-2022

Hi,

Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.

Any further interaction in this thread will be considered community only.

Regards,

Pradeep