Beginner

Reductions acting weird

Hi,

I have some issues with reductions that I don't really understand. Let's first take a look at the short sample program I used to try to understand these issues:

#include "CL/sycl.hpp"
#include <iostream>
#include <array>

using std::array;
using namespace cl::sycl;

constexpr auto dp_read = access::mode::read;
constexpr auto dp_write = access::mode::write;

int main() {
  cpu_selector device_selector;
  // cl::sycl::gpu_selector device_selector;
  queue q(device_selector);

  constexpr unsigned size = 125000;
  constexpr unsigned workGroupSize = 250;
  range workGroupRange{workGroupSize};

  std::cout << "workGroupSize: " << workGroupSize << '\n';

  array<double, size> a;

  for (unsigned i = 0; i < size; i++) {
    a[i] = 2.5 * i;
  } // this loop ensures the maximum value of the array is at a[size-1]

  buffer a_buf{a};
  double max;
  {
    // max_buf goes out of scope at the end of this block, which forces
    // the result to be written back into max before it is read below.
    buffer max_buf{&max, cl::sycl::range{1}};
    q.submit([&](cl::sycl::handler &h) {
      auto a_acc = a_buf.get_access<dp_read>(h);
      auto max_acc = accessor<double, 0, access::mode::discard_write, access::target::global_buffer>(max_buf, h);

      // Reduce a to its maximum; each work item combines one element
      // into the partial result for its work group.
      h.parallel_for(cl::sycl::nd_range<1>{cl::sycl::range(size), workGroupRange}, ONEAPI::reduction(max_acc, ONEAPI::maximum<double>()),
        [=](nd_item<1> it, auto &part_max) {
          part_max.combine(a_acc[it.get_global_id()]);
      });
    });
  }

  
  std::cout << std::boolalpha << (a[size - 1] == max) << std::endl;

  return 0;
}

The problem is that, depending on the chosen size and workGroupSize and on whether I select the cpu_selector or the gpu_selector, this code either returns true (i.e. the reduction found the maximum value), returns false (i.e. the reduction produced a wrong value; this happens on the GPU), or throws an OpenCL error (-5, CL_OUT_OF_RESOURCES; this happens on the CPU). I don't understand why only some combinations of size and workGroupSize work (I should note that I made sure size is always divisible by workGroupSize).

 

So my questions would be:

  • Why does the number of items per work group influence whether or not I get a CL_OUT_OF_RESOURCES error on the CPU?
  • Similarly, why does the number of items per work group influence whether or not the reduction returns the correct value on the GPU?
  • Why does it sometimes return a wrong number at all on the GPU?

A table with the values I put in for size and workGroupSize and the result (true/false/error) is attached.

 

My environment is as follows:

  • KDE Neon (essentially Ubuntu 20.04 with KDE applications)
  • Intel oneAPI Base Toolkit (installed as intel-basekit via the Intel repo for Ubuntu), version 2021.1-2261.beta10
  • Compiled using dpcpp (I used the CMake sample project for Linux as a basis)

 

I run this on my laptop (the CPU is an Intel Core i5-9300H, the GPU the integrated Intel UHD 630).

 

Moderator

Hi,


The issue is reproducible in my environment with larger input sizes. Please note that I'm investigating this and will get back to you with updates.


Thanks for reporting this issue.


Regards,

Rahul


Beginner

Is there any update on the workgroup size issue on a CPU target?

I'm having the same issue on Windows, tested on beta09 and beta10.

Also, enabling optimization allows a larger workgroup size to pass without CL_OUT_OF_RESOURCES.

Note: Reproducible with just parallel_for(nd_range), without any other intrinsics or patterns.

Moderator

Hi,


Apologies for the delay.


The issue is reproducible at my end even with beta10. I have escalated this issue to the concerned team for a fix.


Thanks for reporting this.


Regards,

Rahul