Intel® oneAPI Data Parallel C++
Support for Intel® oneAPI DPC++ Compiler, Intel® oneAPI DPC++ Library, Intel ICX Compiler , Intel® DPC++ Compatibility Tool, and GDB*
581 Discussions

Some image sizes cause a kernel deadlock when the last pixel is read

Olivier48
Beginner
1,321 Views

Hello,

I think I found a bug on my system : for some image sizes (multiples of 256 on my system), reading the _last_ pixel causes the kernel to deadlock.

But I wonder if it is a bug or if it is me not using DPC++ correctly?

The program hereunder reproduces this bug and can be used to detect image sizes that are affected by the bug:

#include <CL/sycl.hpp>

#include <algorithm>
#include <iostream>
#include <vector>

/*
* This program reproduces a bug found on my system, where
* for _some_ image sizes, reading the _last_ pixel causes a deadlock.
*/
std::vector<int> affectedImageSizes = { {} };
/*
* I used this procedure to detect the image sizes affected by this bug:
* - run the program
* - if the program deadlocks (the last log in the console is: 'writeback ...'):
*     add the corresponding image size to 'affectedImageSizes' and repeat.
*
* On my system the result is: affectedImageSizes = { {256, 512, 768, 1024, 1280, 1536, 1792} };
* i.e every image whose size is a multiple of 256 is affected by the bug.
*/

constexpr uint32_t Red = 0b10101100111100001111011100001000;
constexpr uint32_t Green = 0b10111101101101001101011000111000;
constexpr uint32_t Blue = 0b00101110111100011101011110001001;
constexpr uint32_t Alpha = 0b00101000111100001101011101001000;

int main()
{
  using namespace cl::sycl;

  auto ehandler = [](exception_list exceptionList) {
    for (std::exception_ptr const& e : exceptionList)
    {
      try
      {
        std::rethrow_exception(e);
      }
      catch (exception const& e)
      {
        std::terminate();
      }
    }
  };

  queue q(gpu_selector(), ehandler);

  int const nMaxItemsPerWorkgroup =
    q.get_device().get_info<info::device::max_work_group_size>();

  for (int nPixels = 1; nPixels < 2000; ++nPixels)
  {
    std::cout << "[With " << nPixels << " pixels]" << std::endl;
    bool const readLastPixel = std::find(affectedImageSizes.begin(), affectedImageSizes.end(), nPixels) == affectedImageSizes.end();

    // Build the input data

    std::vector<uint32_t> rgba;
    rgba.reserve(nPixels * 4);

    for (int i = 0; i < nPixels; ++i)
    {
      rgba.push_back(Red);
      rgba.push_back(Green);
      rgba.push_back(Blue);
      rgba.push_back(Alpha);
    }

    image<1> img(
      rgba.data(),
      image_channel_order::rgba,
      image_channel_type::unsigned_int32,
      nPixels);

    // Verify that the image has the same size (in bytes) as the vector
    if (img.get_size() != sizeof(uint32_t) * rgba.size())
      throw std::logic_error("size mismatch");

    // Build the output data

    std::vector<uint32_t> rgbaCopy(rgba.size(), 0);

    // Use a kernel to copy the input data to the output data
    {
      buffer<uint32_t, 1> imgDataOutBuffer(
        rgbaCopy.data(),
        rgbaCopy.size());

      int const nMaxPixelsPerThread = 1 + (nPixels - 1) / nMaxItemsPerWorkgroup;
      int const pixelStride = 1 + (nPixels - 1) / nMaxPixelsPerThread;

      std::cout << "call kernel with nPixels = '" << nPixels << "', pixelStride '" << pixelStride << "' ..." << std::endl;

      q.submit([&](handler& h) {
        auto aImg = img.get_access<cl::sycl::cl_int4, access::mode::read>(h);
        auto aOutput = imgDataOutBuffer.get_access<access::mode::write>(h);
        auto const r = nd_range<1>{
          range(pixelStride),
          range(pixelStride)
        };
        h.parallel_for<class ReproMinimalDeadlock_Kernel>(
          r,
          [=](nd_item<1> it) {
            // do not read the last pixel if the image size is affected by the bug
            int const endPixel = nPixels - (readLastPixel ? 0 : 1);
            for (int i = it.get_local_id(); i < endPixel; i += pixelStride)
            {
              auto pixel = aImg.read(i);
              aOutput[4 * i + 0] = pixel.s0();
              aOutput[4 * i + 1] = pixel.s1();
              aOutput[4 * i + 2] = pixel.s2();
              aOutput[4 * i + 3] = pixel.s3();
            }
          });
        });
      q.wait();
      std::cout << "writeback ..." << std::endl;
    }
    std::cout << "writeback done." << std::endl;

    // Verify that the ouptut data matches the input data.
    // Skip the last pixel if the image size is affected by the bug.
      
    for (int i = 0, sz = rgbaCopy.size() - (readLastPixel ? 0 : 4); i < sz; ++i)
      if (rgba[i] != rgbaCopy[i])
        throw std::logic_error("in != out");
  }

  return 0;
}

 

My system details are:

Intel One API base toolkit 2021:
- version 2021.1.0-2664

Intel GPU (Infos from "Intel Graphics command center"):
- Graphics processor: Intel UHD Graphics
- Microsoft DirectX: 12
- Graphics Driver: 27.20.100.9316 (most current driver)
- Shader version: 5.1
- Vulkan: 1.2.167
- Graphics Memory: Dedicated O GB, Shared 16GB
- Vendor ID : 8086
- Device ID: 9BC4
- Device Revision: 05
- Graphics Output Protocol Version : 9.0.1105
- Max. Supported Monitors : 3

- Device Specifications:
Processor Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz 2.71 GHz
Installed RAM 32.0 GB (31.6 GB usable)
System type 64-bit operating system, x64-based processor
Pen and touch No pen or touch input is available for this display

- Windows specifications:
Edition Windows 10 Pro
Version 20H2
OS build 19042.804
Experience Windows Feature Experience Pack 120.2212.551.0

Any help on that matter is much appreciated.

The zip filed attached to this post contains the source code and the visual studio project.

Thank you,

Olivier

0 Kudos
1 Solution
Subarnarek_G_Intel
1,154 Views

This issue is already fixed in 2021.2 with the latest driver.


View solution in original post

0 Kudos
6 Replies
RahulV_intel
Moderator
1,266 Views

Hi,

 

The issue is reproducible on windows (with Gen9 iGPU). However, it works fine on the CPU side.

On Linux, I haven't noticed any issue with Gen9 iGPU.

 

We are working on this issue and will get back to you.

 

Thanks,

Rahul

Subarnarek_G_Intel
1,197 Views

Hi Oliver,

I am escalating this issue to the engineering as this issue is reproducible at my end.


Regards,

Subarna


Subarnarek_G_Intel
1,155 Views

This issue is already fixed in 2021.2 with the latest driver.


0 Kudos
Subarnarek_G_Intel
1,149 Views

A kernel lambda passed to parallel_for will be called once for each item in the range. This is true for all invocations of parallel_for, so whether you use nd_range and nd_item, or just regular range and item, you can expect every item in the range to be visited. So, in the case that the ranges passed match the size of your image, then each pixel will be visited one time. Your code seems to expect the kernel to visit each workgroup exactly once, but that is incorrect.


I have attached a code demonstrates this.

#include <CL/sycl.hpp>


using namespace cl::sycl;


int main(){


queue q;


int const nMaxItemsPerWorkgroup = q.get_device().get_info<info::device::max_work_group_size>();

std::cout << "nMaxItemsPerWorkgroup: " << nMaxItemsPerWorkgroup << std::endl;



// 12 items total, 3 per group. which means there will be four groups.

auto const r = nd_range<1>{

range(12), //global

range(3) //local

};


q.submit([&](handler& cgh) {

cl::sycl::stream out(1024, 100, cgh);

cgh.parallel_for<class okay>(r, [=](nd_item<1> it) {

int global_id = it.get_global_id();

int local_id = it.get_local_id();

auto local_r = it.get_local_range();

int group_id = it.get_group_linear_id();

auto group_r = it.get_group_range();


out << "global_id: " << global_id

<< " local_id: " << local_id

<< " local_r: " << local_r.get(0) //1 or 0 ?

<< " group_id: " << group_id

<< " group_r: " << group_r.get(0)

<< cl::sycl::endl;



});

});

q.wait();

return 0;

}


/*

SYCL_DEVICE_FILTER=opencl:cpu ./sim.bin

global_id: 0 local_id: 0 local_r: 3 group_id: 0 group_r: 4

global_id: 1 local_id: 1 local_r: 3 group_id: 0 group_r: 4

global_id: 2 local_id: 2 local_r: 3 group_id: 0 group_r: 4

global_id: 9 local_id: 0 local_r: 3 group_id: 3 group_r: 4

global_id: 10 local_id: 1 local_r: 3 group_id: 3 group_r: 4

global_id: 11 local_id: 2 local_r: 3 group_id: 3 group_r: 4

global_id: 6 local_id: 0 local_r: 3 group_id: 2 group_r: 4

global_id: 7 local_id: 1 local_r: 3 group_id: 2 group_r: 4

global_id: 8 local_id: 2 local_r: 3 group_id: 2 group_r: 4

global_id: 3 local_id: 0 local_r: 3 group_id: 1 group_r: 4

global_id: 4 local_id: 1 local_r: 3 group_id: 1 group_r: 4

global_id: 5 local_id: 2 local_r: 3 group_id: 1 group_r: 4


*/


Additionally, in your code when you set up the nd_range you do so like this:


auto const r = nd_range<1>{

     range(pixelStride), //global  

     range(pixelStride) //local

    };

You are essentially telling SYCL that you have pixelStride items, and want exactly one workgroup with pixelStride items in it. Is that the desire? When I first looked at this I thought perhaps you intended range(nPixels) for the global range. But, if so, then you need to ensure pixelStride evenly divides it, or pass the -cl-std=CL2.0 flag to the compiler to enable the OpenCL support for partial workgroups. (Also note, this requirement may vary by device. The CPU device doesn't care, but GPU devices do).



0 Kudos
Olivier48
Beginner
1,116 Views

Hello Subarnarek_G_Intel,

I ran the program again today, and saw that with the latest drivers, there is no deadlock anymore. So Intel must have done something in the drivers that fixes this bug. Thank you for that!

Preview


@Subarnarek_G_Intel wrote:

You are essentially telling SYCL that you have pixelStride items, and want exactly one workgroup with pixelStride items in it. Is that the desire?

Yes, this is the desire. This code is a reduced version of a more complex code that relies on workgroup barriers to synchronize, so I can only use one workgroup, else I will run into race conditions.

The maximum number of items per workgroup is:
nMaxItemsPerWorkgroup = info::device::max_work_group_size.

Hence, to use at most one workgroup, each thread in the workgroup needs to handle (at most) that many pixels:
nMaxPixelsPerThread = 1 + (nPixels - 1) / nMaxItemsPerWorkgroup

Hence, the stride needs to be:
pixelStride = 1 + (nPixels - 1) / nMaxPixelsPerThread;

 


@Subarnarek_G_Intel wrote:

Your code seems to expect the kernel to visit each workgroup exactly once, but that is incorrect.

I think the code is correct. If after reading the explanations above you still don't think the code is correct please explain in more detail what you think the problem is, and how you would fix it.

Thank you,
Olivier

0 Kudos
Subarnarek_G_Intel
1,124 Views

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only


0 Kudos
Reply