<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic I was just informed by an in Intel® oneAPI DPC++/C++ Compiler</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179703#M352</link>
    <description>&lt;P&gt;I was just informed by an Intel colleague that the reason this works is because I'm running on an Intel integrated GPU, and these do not have their own dedicated memory; the memory is shared with the host and sits on the same silicon.&lt;/P&gt;</description>
    <pubDate>Thu, 07 May 2020 18:36:29 GMT</pubDate>
    <dc:creator>Pascuzzi__Vincent</dc:creator>
    <dc:date>2020-05-07T18:36:29Z</dc:date>
    <item>
      <title>USM allocation and access</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179702#M351</link>
      <description>&lt;P&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;I am using `malloc_device` to allocate device-side (Iris iGPU) memory through USM. An example of this is,&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt; int hostArray[42];
 int* deviceArray = (int*)malloc_device(42 * sizeof(int), dev, ctx_);
 for (int i = 0; i &amp;lt; 42; i++) hostArray&lt;I&gt; = 42;
 queue_.submit([&amp;amp;](cl::sycl::handler&amp;amp; h) {
   // copy hostArray to deviceArray
   h.memcpy(deviceArray, &amp;amp;hostArray[0], 42 * sizeof(int));
 });
 queue_.wait();&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;Now, what’s odd to me is that I can access `deviceArray` outside a kernel, like this:&lt;/P&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;std::cout &amp;lt;&amp;lt; "deviceArray[10] = " &amp;lt;&amp;lt; deviceArray[10] &amp;lt;&amp;lt; std::endl;&lt;/PRE&gt;

&lt;P&gt;Because of “unified addressing” [1], and since all USM allocations are done “on the host” [2], I assumed that I was outputting the `hostArray` memory address. Fine. But then I modified the value of `deviceArray[10]` inside a kernel:&lt;/P&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;&amp;nbsp;queue_.submit([&amp;amp;](cl::sycl::handler&amp;amp; h) {
&amp;nbsp;&amp;nbsp;&amp;nbsp;h.parallel_for&amp;lt;class foo&amp;gt;(cl::sycl::range&amp;lt;1&amp;gt;{42}, [=](cl::sycl::id&amp;lt;1&amp;gt; ID) {
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;int i = ID[0];
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;deviceArray&lt;I&gt;++;
&amp;nbsp;&amp;nbsp;&amp;nbsp;});
&amp;nbsp;});
&amp;nbsp;queue_.wait();&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;and I was still able to access `deviceArray` outside a kernel in the same way as above, and modified value was printed. I can understand this working fine if I was using the “host device” but can’t wrap my head around accessing “GPU device” memory outside a kernel, especially if it has been modified inside a kernel. A complete MWE is at the bottom of this email. I suggest running on an Iris node.&lt;BR /&gt;&lt;BR /&gt;Can someone explain this to me? Why/how is it possible to access memory allocated with `malloc_device` outside a kernel? If it’s because it’s allocated by the host — and is duplicated on the host — why when modifications are made inside a kernel does it affect the host memory? `malloc_shared` — wherein data is migrated back and forth between the host and device — would be a different story.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;Vince&lt;BR /&gt;&lt;BR /&gt;[1]&amp;nbsp;https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/USM/USM.adoc&lt;BR /&gt;[2]&amp;nbsp;https://www.colfax-intl.com/downloads/oneAPI_module04_DPCplusplusFundamentals2of2.pdf&amp;nbsp;(slide 23)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;// Compile with:&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;[dpcpp | clang++] -fsycl -o mwe_malloc_device&amp;nbsp;mwe_malloc_device.cc&lt;/PRE&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;//
// mwe_malloc_device.cc
//

#include &amp;lt;CL/sycl.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;

#ifdef USE_PI_CUDA
class CUDASelector : public cl::sycl::device_selector {
 public:
  int operator()(const cl::sycl::device&amp;amp; Device) const override {
    using namespace cl::sycl::info;

    const std::string DeviceName = Device.get_info&amp;lt;device::name&amp;gt;();
    const std::string DeviceVendor = Device.get_info&amp;lt;device::vendor&amp;gt;();
    const std::string DeviceDriver =
        Device.get_info&amp;lt;cl::sycl::info::device::driver_version&amp;gt;();

    if (Device.is_gpu() &amp;amp;&amp;amp; (DeviceVendor.find("NVIDIA") != std::string::npos) &amp;amp;&amp;amp;
        (DeviceDriver.find("CUDA") != std::string::npos)) {
      return 1;
    };
    return -1;
  }
};
#endif

int main() {
  // Catch asynchronous exceptions
  auto exception_handler = [](cl::sycl::exception_list exceptions) {
    for (std::exception_ptr const&amp;amp; e : exceptions) {
      try {
        std::rethrow_exception(e);
      } catch (cl::sycl::exception const&amp;amp; e) {
        std::cout &amp;lt;&amp;lt; "Caught asynchronous SYCL exception during generation:\n"
                  &amp;lt;&amp;lt; e.what() &amp;lt;&amp;lt; std::endl;
      }
    }
  };
  // Initialize device, queue and context
  cl::sycl::device dev;
#ifdef USE_PI_CUDA
  CUDASelector cuda_selector;
  try {
    dev = cl::sycl::device(cuda_selector);
  } catch (...) {
  }
#elif USE_SYCL_CPU
  dev = cl::sycl::device(cl::sycl::cpu_selector());
#elif USE_SYCL_GPU
  dev = cl::sycl::device(cl::sycl::gpu_selector());
#else
  dev = cl::sycl::device(cl::sycl::default_selector());
#endif
  cl::sycl::queue queue = cl::sycl::queue(dev, exception_handler);
  cl::sycl::context ctx = queue.get_context();
  // Name of the device to run on
  std::string dev_name =
      queue.get_device().get_info&amp;lt;cl::sycl::info::device::name&amp;gt;();
  std::cout &amp;lt;&amp;lt; "Using device \"" &amp;lt;&amp;lt; dev_name &amp;lt;&amp;lt; "\"" &amp;lt;&amp;lt; std::endl;

  // Ensure device can handle USM device allocations.
  if (!queue.get_device()
           .get_info&amp;lt;cl::sycl::info::device::usm_device_allocations&amp;gt;()) {
    std::cout &amp;lt;&amp;lt; "ERROR :: device \"" &amp;lt;&amp;lt; dev_name
              &amp;lt;&amp;lt; "\" does not support usm_device_allocations!" &amp;lt;&amp;lt; std::endl;
    return 1;
  }
  int hostArray[42];
  int* deviceArray = (int*)malloc_device(42 * sizeof(int), dev, ctx);
  for (int i = 0; i &amp;lt; 42; i++) hostArray&lt;I&gt; = 42;
  queue
      .submit([&amp;amp;](cl::sycl::handler&amp;amp; h) {
        // copy hostArray to deviceArray
        h.memcpy(deviceArray, &amp;amp;hostArray[0], 42 * sizeof(int));
      })
      .wait();

  std::cout &amp;lt;&amp;lt; "[Before mod] deviceArray[10] = " &amp;lt;&amp;lt; deviceArray[10]
            &amp;lt;&amp;lt; std::endl;

  queue.submit([&amp;amp;](cl::sycl::handler&amp;amp; h) {
    h.parallel_for&amp;lt;class foo&amp;gt;(
        cl::sycl::range&amp;lt;1&amp;gt;{42},
        // lambda-capture so we get the actual device memory
        [=](cl::sycl::id&amp;lt;1&amp;gt; ID) {
          int i = ID[0];
          dev_arr&lt;I&gt;++;
        });
  });
  queue.wait();

  std::cout &amp;lt;&amp;lt; "[After mod] deviceArray[10] = " &amp;lt;&amp;lt; deviceArray[10] &amp;lt;&amp;lt; std::endl;

  return 0;
}&lt;/I&gt;&lt;/I&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 07 May 2020 17:54:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179702#M351</guid>
      <dc:creator>Pascuzzi__Vincent</dc:creator>
      <dc:date>2020-05-07T17:54:00Z</dc:date>
    </item>
    <item>
      <title>I was just informed by an</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179703#M352</link>
      <description>&lt;P&gt;I was just informed by an Intel colleague that the reason this works is because I'm running on an Intel integrated GPU, and these do not have their own dedicated memory; the memory is shared with the host and sits on the same silicon.&lt;/P&gt;</description>
      <pubDate>Thu, 07 May 2020 18:36:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179703#M352</guid>
      <dc:creator>Pascuzzi__Vincent</dc:creator>
      <dc:date>2020-05-07T18:36:29Z</dc:date>
    </item>
    <item>
      <title>Hi Vincent,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179704#M353</link>
      <description>&lt;P&gt;Hi Vincent,&lt;/P&gt;&lt;P&gt;Integrated GPU(iGPU) shares memory with the host and that could be one of the reasons for such behavior.&amp;nbsp;However, this shouldn't be the case on a discrete GPU.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;--Rahul&lt;/P&gt;</description>
      <pubDate>Fri, 08 May 2020 05:51:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179704#M353</guid>
      <dc:creator>RahulV_intel</dc:creator>
      <dc:date>2020-05-08T05:51:15Z</dc:date>
    </item>
    <item>
      <title>Quote:Vaidya, Rahul (Intel)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179705#M354</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Vaidya, Rahul (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Integrated GPU(iGPU) shares memory with the host and that could be one of the reasons for such behavior.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Indeed, this is what I figured -- makes sense (and gives me piece of mind!).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Vaidya, Rahul (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However, this shouldn't be the case on a discrete GPU.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Right again!&lt;/P&gt;&lt;P&gt;I can build and run the same code, modulo the `cout`s, using the Intel llvm CUDA support:&lt;/P&gt;
&lt;PRE class="brush:bash; class-name:dark;"&gt;[dpcpp | clang++] -O2 -fsycl -std=c++17 -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -Wno-unknown-cuda-version -o mwe_malloc_device&amp;nbsp;mwe_malloc_device.cc&amp;nbsp;
&lt;/PRE&gt;

&lt;P&gt;I get a segfault when trying to access the device memory outside a kernel — e.g.&amp;nbsp;using `cout` here. If I replace the `cout` calls with `cl::sycl::stream`s, it works as expected on the CUDA device:&lt;/P&gt;

&lt;PRE class="brush:cpp; class-name:dark;"&gt;…
&amp;nbsp;// Outputs 42
&amp;nbsp;queue
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.submit([&amp;amp;](cl::sycl::handler&amp;amp; cgh) {
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;cl::sycl::stream out(1024, 256, cgh);
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;cgh.single_task&amp;lt;class print1&amp;gt;(
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[=] { out &amp;lt;&amp;lt; deviceArray[10] &amp;lt;&amp;lt; cl::sycl::endl; });
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;})
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.wait_and_throw();

&amp;nbsp;queue.submit([&amp;amp;](cl::sycl::handler&amp;amp; h) {
&amp;nbsp;&amp;nbsp;&amp;nbsp;h.parallel_for&amp;lt;class foo&amp;gt;(
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;cl::sycl::range&amp;lt;1&amp;gt;{42},
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;// lambda-capture so we get the actual device memory
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[=](cl::sycl::id&amp;lt;1&amp;gt; ID) {
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;int i = ID[0];
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;deviceArray&lt;I&gt;++;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;});
&amp;nbsp;});
&amp;nbsp;queue.wait();

&amp;nbsp;// Outputs 43
&amp;nbsp;queue
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.submit([&amp;amp;](cl::sycl::handler&amp;amp; cgh) {
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;cl::sycl::stream out(1024, 256, cgh);
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;cgh.single_task&amp;lt;class print2&amp;gt;(
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;[=] { out &amp;lt;&amp;lt; deviceArray[10] &amp;lt;&amp;lt; cl::sycl::endl; });
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;})
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.wait_and_throw();

return 0;&lt;/I&gt;&lt;/PRE&gt;

&lt;P&gt;So indeed USM works fine using a discrete CUDA device in this simple example. Looking forward to getting my hands on an Intel Xe device!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Cheers,&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Vince&lt;/P&gt;</description>
      <pubDate>Fri, 08 May 2020 14:25:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179705#M354</guid>
      <dc:creator>Pascuzzi__Vincent</dc:creator>
      <dc:date>2020-05-08T14:25:00Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179706#M355</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;That's great! Thanks for the confirmation.&lt;/P&gt;&lt;P&gt;Let us know if we can close this thread.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;--Rahul&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 12 May 2020 05:25:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179706#M355</guid>
      <dc:creator>RahulV_intel</dc:creator>
      <dc:date>2020-05-12T05:25:36Z</dc:date>
    </item>
    <item>
      <title>Quote:Vaidya, Rahul (Intel)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179707#M356</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Vaidya, Rahul (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let us know if we can close this thread.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please do.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Cheers,&lt;/P&gt;&lt;P&gt;Vince&lt;/P&gt;</description>
      <pubDate>Tue, 12 May 2020 15:21:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179707#M356</guid>
      <dc:creator>Pascuzzi__Vincent</dc:creator>
      <dc:date>2020-05-12T15:21:13Z</dc:date>
    </item>
    <item>
      <title>Thanks for the update, Vince.</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179708#M357</link>
      <description>&lt;P&gt;Thanks for the update, Vince. We will go ahead and close this thread. Feel free to post a new thread&amp;nbsp;if you have any further queries.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;--Rahul&lt;/P&gt;</description>
      <pubDate>Wed, 13 May 2020 04:34:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/USM-allocation-and-access/m-p/1179708#M357</guid>
      <dc:creator>RahulV_intel</dc:creator>
      <dc:date>2020-05-13T04:34:00Z</dc:date>
    </item>
  </channel>
</rss>

