<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Some image sizes cause a kernel deadlock when the last pixel is read in Intel® oneAPI DPC++/C++ Compiler</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1264402#M1016</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I think I found a bug on my system : for some image sizes (multiples of 256 on my system), reading the _last_ pixel causes the kernel to deadlock.&lt;/P&gt;
&lt;P&gt;But I wonder if it is a bug or if it is me not using DPC++ correctly?&lt;/P&gt;
&lt;P&gt;The program hereunder reproduces this bug and can be used to detect image sizes that are affected by the bug:&lt;/P&gt;
&lt;LI-CODE lang="cpp"&gt;#include &amp;lt;CL/sycl.hpp&amp;gt;

#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;vector&amp;gt;

/*
* This program reproduces a bug found on my system, where
* for _some_ image sizes, reading the _last_ pixel causes a deadlock.
*/
std::vector&amp;lt;int&amp;gt; affectedImageSizes = { {} };
/*
* I used this procedure to detect the image sizes affected by this bug:
* - run the program
* - if the program deadlocks (the last log in the console is: 'writeback ...'):
*     add the corresponding image size to 'affectedImageSizes' and repeat.
*
* On my system the result is: affectedImageSizes = { {256, 512, 768, 1024, 1280, 1536, 1792} };
* i.e every image whose size is a multiple of 256 is affected by the bug.
*/

constexpr uint32_t Red = 0b10101100111100001111011100001000;
constexpr uint32_t Green = 0b10111101101101001101011000111000;
constexpr uint32_t Blue = 0b00101110111100011101011110001001;
constexpr uint32_t Alpha = 0b00101000111100001101011101001000;

int main()
{
  using namespace cl::sycl;

  auto ehandler = [](exception_list exceptionList) {
    for (std::exception_ptr const&amp;amp; e : exceptionList)
    {
      try
      {
        std::rethrow_exception(e);
      }
      catch (exception const&amp;amp; e)
      {
        std::terminate();
      }
    }
  };

  queue q(gpu_selector(), ehandler);

  int const nMaxItemsPerWorkgroup =
    q.get_device().get_info&amp;lt;info::device::max_work_group_size&amp;gt;();

  for (int nPixels = 1; nPixels &amp;lt; 2000; ++nPixels)
  {
    std::cout &amp;lt;&amp;lt; "[With " &amp;lt;&amp;lt; nPixels &amp;lt;&amp;lt; " pixels]" &amp;lt;&amp;lt; std::endl;
    bool const readLastPixel = std::find(affectedImageSizes.begin(), affectedImageSizes.end(), nPixels) == affectedImageSizes.end();

    // Build the input data

    std::vector&amp;lt;uint32_t&amp;gt; rgba;
    rgba.reserve(nPixels * 4);

    for (int i = 0; i &amp;lt; nPixels; ++i)
    {
      rgba.push_back(Red);
      rgba.push_back(Green);
      rgba.push_back(Blue);
      rgba.push_back(Alpha);
    }

    image&amp;lt;1&amp;gt; img(
      rgba.data(),
      image_channel_order::rgba,
      image_channel_type::unsigned_int32,
      nPixels);

    // Verify that the image has the same size (in bytes) as the vector
    if (img.get_size() != sizeof(uint32_t) * rgba.size())
      throw std::logic_error("size mismatch");

    // Build the output data

    std::vector&amp;lt;uint32_t&amp;gt; rgbaCopy(rgba.size(), 0);

    // Use a kernel to copy the input data to the output data
    {
      buffer&amp;lt;uint32_t, 1&amp;gt; imgDataOutBuffer(
        rgbaCopy.data(),
        rgbaCopy.size());

      int const nMaxPixelsPerThread = 1 + (nPixels - 1) / nMaxItemsPerWorkgroup;
      int const pixelStride = 1 + (nPixels - 1) / nMaxPixelsPerThread;

      std::cout &amp;lt;&amp;lt; "call kernel with nPixels = '" &amp;lt;&amp;lt; nPixels &amp;lt;&amp;lt; "', pixelStride '" &amp;lt;&amp;lt; pixelStride &amp;lt;&amp;lt; "' ..." &amp;lt;&amp;lt; std::endl;

      q.submit([&amp;amp;](handler&amp;amp; h) {
        auto aImg = img.get_access&amp;lt;cl::sycl::cl_int4, access::mode::read&amp;gt;(h);
        auto aOutput = imgDataOutBuffer.get_access&amp;lt;access::mode::write&amp;gt;(h);
        auto const r = nd_range&amp;lt;1&amp;gt;{
          range(pixelStride),
          range(pixelStride)
        };
        h.parallel_for&amp;lt;class ReproMinimalDeadlock_Kernel&amp;gt;(
          r,
          [=](nd_item&amp;lt;1&amp;gt; it) {
            // do not read the last pixel if the image size is affected by the bug
            int const endPixel = nPixels - (readLastPixel ? 0 : 1);
            for (int i = it.get_local_id(); i &amp;lt; endPixel; i += pixelStride)
            {
              auto pixel = aImg.read(i);
              aOutput[4 * i + 0] = pixel.s0();
              aOutput[4 * i + 1] = pixel.s1();
              aOutput[4 * i + 2] = pixel.s2();
              aOutput[4 * i + 3] = pixel.s3();
            }
          });
        });
      q.wait();
      std::cout &amp;lt;&amp;lt; "writeback ..." &amp;lt;&amp;lt; std::endl;
    }
    std::cout &amp;lt;&amp;lt; "writeback done." &amp;lt;&amp;lt; std::endl;

    // Verify that the ouptut data matches the input data.
    // Skip the last pixel if the image size is affected by the bug.
      
    for (int i = 0, sz = rgbaCopy.size() - (readLastPixel ? 0 : 4); i &amp;lt; sz; ++i)
      if (rgba[i] != rgbaCopy[i])
        throw std::logic_error("in != out");
  }

  return 0;
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My system details are:&lt;/P&gt;
&lt;P&gt;Intel One API base toolkit 2021:&lt;BR /&gt;- version 2021.1.0-2664&lt;/P&gt;
&lt;P&gt;Intel GPU (Infos from "Intel Graphics command center"):&lt;BR /&gt;- Graphics processor: Intel UHD Graphics&lt;BR /&gt;- Microsoft DirectX: 12&lt;BR /&gt;- Graphics Driver: 27.20.100.9316 (most current driver)&lt;BR /&gt;- Shader version: 5.1&lt;BR /&gt;- Vulkan: 1.2.167&lt;BR /&gt;- Graphics Memory: Dedicated O GB, Shared 16GB&lt;BR /&gt;- Vendor ID : 8086&lt;BR /&gt;- Device ID: 9BC4&lt;BR /&gt;- Device Revision: 05&lt;BR /&gt;- Graphics Output Protocol Version : 9.0.1105&lt;BR /&gt;- Max. Supported Monitors : 3&lt;/P&gt;
&lt;P&gt;- Device Specifications:&lt;BR /&gt;Processor Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz 2.71 GHz&lt;BR /&gt;Installed RAM 32.0 GB (31.6 GB usable)&lt;BR /&gt;System type 64-bit operating system, x64-based processor&lt;BR /&gt;Pen and touch No pen or touch input is available for this display&lt;/P&gt;
&lt;P&gt;- Windows specifications:&lt;BR /&gt;Edition Windows 10 Pro&lt;BR /&gt;Version 20H2&lt;BR /&gt;OS build 19042.804&lt;BR /&gt;Experience Windows Feature Experience Pack 120.2212.551.0&lt;/P&gt;
&lt;P&gt;Any help on that matter is much appreciated.&lt;/P&gt;
&lt;P&gt;The zip filed attached to this post contains the source code and the visual studio project.&lt;/P&gt;
&lt;P&gt;Thank you,&lt;/P&gt;
&lt;P&gt;Olivier&lt;/P&gt;</description>
    <pubDate>Mon, 15 Mar 2021 12:21:53 GMT</pubDate>
    <dc:creator>Olivier48</dc:creator>
    <dc:date>2021-03-15T12:21:53Z</dc:date>
    <item>
      <title>Some image sizes cause a kernel deadlock when the last pixel is read</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1264402#M1016</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I think I found a bug on my system : for some image sizes (multiples of 256 on my system), reading the _last_ pixel causes the kernel to deadlock.&lt;/P&gt;
&lt;P&gt;But I wonder if it is a bug or if it is me not using DPC++ correctly?&lt;/P&gt;
&lt;P&gt;The program hereunder reproduces this bug and can be used to detect image sizes that are affected by the bug:&lt;/P&gt;
&lt;LI-CODE lang="cpp"&gt;#include &amp;lt;CL/sycl.hpp&amp;gt;

#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;vector&amp;gt;

/*
* This program reproduces a bug found on my system, where
* for _some_ image sizes, reading the _last_ pixel causes a deadlock.
*/
std::vector&amp;lt;int&amp;gt; affectedImageSizes = { {} };
/*
* I used this procedure to detect the image sizes affected by this bug:
* - run the program
* - if the program deadlocks (the last log in the console is: 'writeback ...'):
*     add the corresponding image size to 'affectedImageSizes' and repeat.
*
* On my system the result is: affectedImageSizes = { {256, 512, 768, 1024, 1280, 1536, 1792} };
* i.e every image whose size is a multiple of 256 is affected by the bug.
*/

constexpr uint32_t Red = 0b10101100111100001111011100001000;
constexpr uint32_t Green = 0b10111101101101001101011000111000;
constexpr uint32_t Blue = 0b00101110111100011101011110001001;
constexpr uint32_t Alpha = 0b00101000111100001101011101001000;

int main()
{
  using namespace cl::sycl;

  auto ehandler = [](exception_list exceptionList) {
    for (std::exception_ptr const&amp;amp; e : exceptionList)
    {
      try
      {
        std::rethrow_exception(e);
      }
      catch (exception const&amp;amp; e)
      {
        std::terminate();
      }
    }
  };

  queue q(gpu_selector(), ehandler);

  int const nMaxItemsPerWorkgroup =
    q.get_device().get_info&amp;lt;info::device::max_work_group_size&amp;gt;();

  for (int nPixels = 1; nPixels &amp;lt; 2000; ++nPixels)
  {
    std::cout &amp;lt;&amp;lt; "[With " &amp;lt;&amp;lt; nPixels &amp;lt;&amp;lt; " pixels]" &amp;lt;&amp;lt; std::endl;
    bool const readLastPixel = std::find(affectedImageSizes.begin(), affectedImageSizes.end(), nPixels) == affectedImageSizes.end();

    // Build the input data

    std::vector&amp;lt;uint32_t&amp;gt; rgba;
    rgba.reserve(nPixels * 4);

    for (int i = 0; i &amp;lt; nPixels; ++i)
    {
      rgba.push_back(Red);
      rgba.push_back(Green);
      rgba.push_back(Blue);
      rgba.push_back(Alpha);
    }

    image&amp;lt;1&amp;gt; img(
      rgba.data(),
      image_channel_order::rgba,
      image_channel_type::unsigned_int32,
      nPixels);

    // Verify that the image has the same size (in bytes) as the vector
    if (img.get_size() != sizeof(uint32_t) * rgba.size())
      throw std::logic_error("size mismatch");

    // Build the output data

    std::vector&amp;lt;uint32_t&amp;gt; rgbaCopy(rgba.size(), 0);

    // Use a kernel to copy the input data to the output data
    {
      buffer&amp;lt;uint32_t, 1&amp;gt; imgDataOutBuffer(
        rgbaCopy.data(),
        rgbaCopy.size());

      int const nMaxPixelsPerThread = 1 + (nPixels - 1) / nMaxItemsPerWorkgroup;
      int const pixelStride = 1 + (nPixels - 1) / nMaxPixelsPerThread;

      std::cout &amp;lt;&amp;lt; "call kernel with nPixels = '" &amp;lt;&amp;lt; nPixels &amp;lt;&amp;lt; "', pixelStride '" &amp;lt;&amp;lt; pixelStride &amp;lt;&amp;lt; "' ..." &amp;lt;&amp;lt; std::endl;

      q.submit([&amp;amp;](handler&amp;amp; h) {
        auto aImg = img.get_access&amp;lt;cl::sycl::cl_int4, access::mode::read&amp;gt;(h);
        auto aOutput = imgDataOutBuffer.get_access&amp;lt;access::mode::write&amp;gt;(h);
        auto const r = nd_range&amp;lt;1&amp;gt;{
          range(pixelStride),
          range(pixelStride)
        };
        h.parallel_for&amp;lt;class ReproMinimalDeadlock_Kernel&amp;gt;(
          r,
          [=](nd_item&amp;lt;1&amp;gt; it) {
            // do not read the last pixel if the image size is affected by the bug
            int const endPixel = nPixels - (readLastPixel ? 0 : 1);
            for (int i = it.get_local_id(); i &amp;lt; endPixel; i += pixelStride)
            {
              auto pixel = aImg.read(i);
              aOutput[4 * i + 0] = pixel.s0();
              aOutput[4 * i + 1] = pixel.s1();
              aOutput[4 * i + 2] = pixel.s2();
              aOutput[4 * i + 3] = pixel.s3();
            }
          });
        });
      q.wait();
      std::cout &amp;lt;&amp;lt; "writeback ..." &amp;lt;&amp;lt; std::endl;
    }
    std::cout &amp;lt;&amp;lt; "writeback done." &amp;lt;&amp;lt; std::endl;

    // Verify that the ouptut data matches the input data.
    // Skip the last pixel if the image size is affected by the bug.
      
    for (int i = 0, sz = rgbaCopy.size() - (readLastPixel ? 0 : 4); i &amp;lt; sz; ++i)
      if (rgba[i] != rgbaCopy[i])
        throw std::logic_error("in != out");
  }

  return 0;
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My system details are:&lt;/P&gt;
&lt;P&gt;Intel One API base toolkit 2021:&lt;BR /&gt;- version 2021.1.0-2664&lt;/P&gt;
&lt;P&gt;Intel GPU (Infos from "Intel Graphics command center"):&lt;BR /&gt;- Graphics processor: Intel UHD Graphics&lt;BR /&gt;- Microsoft DirectX: 12&lt;BR /&gt;- Graphics Driver: 27.20.100.9316 (most current driver)&lt;BR /&gt;- Shader version: 5.1&lt;BR /&gt;- Vulkan: 1.2.167&lt;BR /&gt;- Graphics Memory: Dedicated O GB, Shared 16GB&lt;BR /&gt;- Vendor ID : 8086&lt;BR /&gt;- Device ID: 9BC4&lt;BR /&gt;- Device Revision: 05&lt;BR /&gt;- Graphics Output Protocol Version : 9.0.1105&lt;BR /&gt;- Max. Supported Monitors : 3&lt;/P&gt;
&lt;P&gt;- Device Specifications:&lt;BR /&gt;Processor Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz 2.71 GHz&lt;BR /&gt;Installed RAM 32.0 GB (31.6 GB usable)&lt;BR /&gt;System type 64-bit operating system, x64-based processor&lt;BR /&gt;Pen and touch No pen or touch input is available for this display&lt;/P&gt;
&lt;P&gt;- Windows specifications:&lt;BR /&gt;Edition Windows 10 Pro&lt;BR /&gt;Version 20H2&lt;BR /&gt;OS build 19042.804&lt;BR /&gt;Experience Windows Feature Experience Pack 120.2212.551.0&lt;/P&gt;
&lt;P&gt;Any help on that matter is much appreciated.&lt;/P&gt;
&lt;P&gt;The zip filed attached to this post contains the source code and the visual studio project.&lt;/P&gt;
&lt;P&gt;Thank you,&lt;/P&gt;
&lt;P&gt;Olivier&lt;/P&gt;</description>
      <pubDate>Mon, 15 Mar 2021 12:21:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1264402#M1016</guid>
      <dc:creator>Olivier48</dc:creator>
      <dc:date>2021-03-15T12:21:53Z</dc:date>
    </item>
    <item>
      <title>Re: Some image sizes cause a kernel deadlock when the last pixel is read</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1264781#M1023</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The issue is reproducible on windows (with Gen9 iGPU). However, it works fine on the CPU side.&lt;/P&gt;
&lt;P&gt;On Linux, I haven't noticed any issue with Gen9 iGPU.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We are working on this issue and will get back to you.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks,&lt;/P&gt;
&lt;P&gt;Rahul&lt;/P&gt;</description>
      <pubDate>Tue, 16 Mar 2021 12:33:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1264781#M1023</guid>
      <dc:creator>RahulV_intel</dc:creator>
      <dc:date>2021-03-16T12:33:56Z</dc:date>
    </item>
    <item>
      <title>Re:Some image sizes cause a kernel deadlock when the last pixel is read</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1279076#M1139</link>
      <description>&lt;P&gt;Hi Oliver,&lt;/P&gt;&lt;P&gt;I am escalating this issue to the engineering as this issue is reproducible at my end.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Subarna&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 05 May 2021 13:44:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1279076#M1139</guid>
      <dc:creator>Subarnarek_G_Intel</dc:creator>
      <dc:date>2021-05-05T13:44:45Z</dc:date>
    </item>
    <item>
      <title>Re:Some image sizes cause a kernel deadlock when the last pixel is read</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1286548#M1249</link>
      <description>&lt;P&gt;This issue is already fixed in 2021.2 with the latest driver. &lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 02 Jun 2021 14:42:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1286548#M1249</guid>
      <dc:creator>Subarnarek_G_Intel</dc:creator>
      <dc:date>2021-06-02T14:42:46Z</dc:date>
    </item>
    <item>
      <title>Re:Some image sizes cause a kernel deadlock when the last pixel is read</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1286551#M1250</link>
      <description>&lt;P&gt;A kernel lambda passed to parallel_for will be called once for each item in the range. This is true for all invocations of parallel_for, so whether you use nd_range and nd_item, or just regular range and item, you can expect every item in the range to be visited. So, in the case that the ranges passed match the size of your image, then each pixel will be visited one time. Your code seems to expect the kernel to visit each workgroup exactly once, but that is incorrect.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;I have attached a code demonstrates this.&lt;/P&gt;&lt;P&gt;#include &amp;lt;CL/sycl.hpp&amp;gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;using namespace cl::sycl;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;int main(){&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;    queue q;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;    int const nMaxItemsPerWorkgroup = q.get_device().get_info&amp;lt;info::device::max_work_group_size&amp;gt;();&lt;/P&gt;&lt;P&gt;    std::cout &amp;lt;&amp;lt; "nMaxItemsPerWorkgroup: " &amp;lt;&amp;lt; nMaxItemsPerWorkgroup &amp;lt;&amp;lt; std::endl;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;    // 12 items total, 3 per group. which means there will be four groups.&lt;/P&gt;&lt;P&gt;    auto const r = nd_range&amp;lt;1&amp;gt;{&lt;/P&gt;&lt;P&gt;        range(12), //global&lt;/P&gt;&lt;P&gt;        range(3)  //local&lt;/P&gt;&lt;P&gt;    };&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;    q.submit([&amp;amp;](handler&amp;amp; cgh) {&lt;/P&gt;&lt;P&gt;        cl::sycl::stream out(1024, 100, cgh);&lt;/P&gt;&lt;P&gt;        cgh.parallel_for&amp;lt;class okay&amp;gt;(r, [=](nd_item&amp;lt;1&amp;gt; it) {&lt;/P&gt;&lt;P&gt;            int global_id = it.get_global_id();&lt;/P&gt;&lt;P&gt;            int local_id  = it.get_local_id();&lt;/P&gt;&lt;P&gt;            auto local_r  = it.get_local_range();&lt;/P&gt;&lt;P&gt;            int group_id  = it.get_group_linear_id();&lt;/P&gt;&lt;P&gt;            auto group_r  = it.get_group_range();&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;            out &amp;lt;&amp;lt; "global_id: " &amp;lt;&amp;lt; global_id&lt;/P&gt;&lt;P&gt;                &amp;lt;&amp;lt; " local_id: " &amp;lt;&amp;lt; local_id&lt;/P&gt;&lt;P&gt;                &amp;lt;&amp;lt; "  local_r: " &amp;lt;&amp;lt; local_r.get(0)  //1 or 0 ?&lt;/P&gt;&lt;P&gt;                &amp;lt;&amp;lt; " group_id: " &amp;lt;&amp;lt; group_id&lt;/P&gt;&lt;P&gt;                &amp;lt;&amp;lt; "  group_r: " &amp;lt;&amp;lt; group_r.get(0) &lt;/P&gt;&lt;P&gt;                &amp;lt;&amp;lt; cl::sycl::endl;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;        });&lt;/P&gt;&lt;P&gt;    });&lt;/P&gt;&lt;P&gt;    q.wait();&lt;/P&gt;&lt;P&gt;    return 0;&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;/*&lt;/P&gt;&lt;P&gt;SYCL_DEVICE_FILTER=opencl:cpu ./sim.bin&lt;/P&gt;&lt;P&gt;global_id: 0 local_id: 0  local_r: 3 group_id: 0  group_r: 4&lt;/P&gt;&lt;P&gt;global_id: 1 local_id: 1  local_r: 3 group_id: 0  group_r: 4&lt;/P&gt;&lt;P&gt;global_id: 2 local_id: 2  local_r: 3 group_id: 0  group_r: 4&lt;/P&gt;&lt;P&gt;global_id: 9 local_id: 0  local_r: 3 group_id: 3  group_r: 4&lt;/P&gt;&lt;P&gt;global_id: 10 local_id: 1  local_r: 3 group_id: 3  group_r: 4&lt;/P&gt;&lt;P&gt;global_id: 11 local_id: 2  local_r: 3 group_id: 3  group_r: 4&lt;/P&gt;&lt;P&gt;global_id: 6 local_id: 0  local_r: 3 group_id: 2  group_r: 4&lt;/P&gt;&lt;P&gt;global_id: 7 local_id: 1  local_r: 3 group_id: 2  group_r: 4&lt;/P&gt;&lt;P&gt;global_id: 8 local_id: 2  local_r: 3 group_id: 2  group_r: 4&lt;/P&gt;&lt;P&gt;global_id: 3 local_id: 0  local_r: 3 group_id: 1  group_r: 4&lt;/P&gt;&lt;P&gt;global_id: 4 local_id: 1  local_r: 3 group_id: 1  group_r: 4&lt;/P&gt;&lt;P&gt;global_id: 5 local_id: 2  local_r: 3 group_id: 1  group_r: 4&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;*/&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Additionally, in your code when you set up the nd_range you do so like this:&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;auto const r = nd_range&amp;lt;1&amp;gt;{&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;range(pixelStride), //global&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;range(pixelStride)&amp;nbsp;//local&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;};&lt;/P&gt;&lt;P&gt;You are essentially telling SYCL that you have pixelStride items, and want exactly one workgroup with pixelStride items in it. Is that the desire? When I first looked at this I thought perhaps you intended range(nPixels) for the global range. But, if so, then you need to ensure pixelStride evenly divides it, or pass the -cl-std=CL2.0 flag to the compiler to enable the OpenCL support for partial workgroups. (Also note, this requirement may vary by device. The CPU device doesn't care, but GPU devices do).&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 02 Jun 2021 14:57:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1286551#M1250</guid>
      <dc:creator>Subarnarek_G_Intel</dc:creator>
      <dc:date>2021-06-02T14:57:59Z</dc:date>
    </item>
    <item>
      <title>Re:Some image sizes cause a kernel deadlock when the last pixel is read</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1299078#M1420</link>
      <description>&lt;P&gt;&lt;I style="font-family: Calibri, sans-serif; font-size: 11pt;"&gt;This issue has been resolved and we will no longer respond to this thread.&amp;nbsp;If you require additional assistance from Intel, please start a new thread.&amp;nbsp;Any further interaction in this thread will be considered community only&lt;/I&gt;&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 16 Jul 2021 06:14:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1299078#M1420</guid>
      <dc:creator>Subarnarek_G_Intel</dc:creator>
      <dc:date>2021-07-16T06:14:54Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Some image sizes cause a kernel deadlock when the last pixel is read</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1299106#M1421</link>
      <description>&lt;P&gt;Hello Subarnarek_G_Intel,&lt;/P&gt;
&lt;P&gt;I ran the program again today, and saw that with the latest drivers, there is no deadlock anymore. So Intel must have done something in the drivers that fixes this bug. Thank you for that!&lt;/P&gt;
&lt;P&gt;&lt;A id="previewButton_ca2f54f862060c_4f734" class="lia-link-navigation lia-message-editor-preview-button" href="https://community.intel.com/t5/Intel-oneAPI-Data-Parallel-C/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1299078/emcs_t/S2h8ZW1haWx8dG9waWNfc3Vic2NyaXB0aW9ufEtSNVk1RTY4VkYzRUdCfDEyOTkwNzh8U1VCU0NSSVBUSU9OU3xoSw#" target="_blank"&gt;Preview&lt;/A&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/48567"&gt;@Subarnarek_G_Intel&lt;/a&gt;&amp;nbsp;wrote:
&lt;P&gt;You are essentially telling SYCL that you have pixelStride items, and want exactly one workgroup with pixelStride items in it. Is that the desire?&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Yes, this is the desire. This code is a reduced version of a more complex code that relies on workgroup barriers to synchronize, so I can only use one workgroup, else I will run into race conditions.&lt;/P&gt;
&lt;P&gt;The maximum number of items per workgroup is:&lt;BR /&gt;nMaxItemsPerWorkgroup = info::device::max_work_group_size.&lt;/P&gt;
&lt;P&gt;Hence, to use at most one workgroup, each thread in the workgroup needs to handle (at most) that many pixels:&lt;BR /&gt;nMaxPixelsPerThread = 1 + (nPixels - 1) / nMaxItemsPerWorkgroup&lt;/P&gt;
&lt;P&gt;Hence, the stride needs to be:&lt;BR /&gt;pixelStride = 1 + (nPixels - 1) / nMaxPixelsPerThread;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.intel.com/t5/user/viewprofilepage/user-id/48567"&gt;@Subarnarek_G_Intel&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Your code seems to expect the kernel to visit each workgroup exactly once, but that is incorrect.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I think the code is correct. If after reading the explanations above you still don't think the code is correct please explain in more detail what you think the problem is, and how you would fix it.&lt;/P&gt;
&lt;P&gt;Thank you,&lt;BR /&gt;Olivier&lt;/P&gt;</description>
      <pubDate>Fri, 16 Jul 2021 07:42:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/Some-image-sizes-cause-a-kernel-deadlock-when-the-last-pixel-is/m-p/1299106#M1421</guid>
      <dc:creator>Olivier48</dc:creator>
      <dc:date>2021-07-16T07:42:15Z</dc:date>
    </item>
  </channel>
</rss>

