Intel® oneAPI DPC++/C++ Compiler

Encountered problems when using the GPU to run a program on DevCloud

HongbinBao
Novice

I encountered a problem when running a program on the GPU on DevCloud:

terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -1 (PI_ERROR_DEVICE_NOT_FOUND) -1 (PI_ERROR_DEVICE_NOT_FOUND)

It is hard to say how to reproduce it; in many runs the error does not occur. The message carries so little useful information that I cannot locate the problem.

I am not sure whether this is caused by the computation on the GPU never returning a result. Does anyone know more about this error and can give me more information?

JaideepK_Intel
Employee

Hi,

Thank you for posting in Intel Communities.

This error occurs when you try to run a GPU binary on a non-GPU node on DevCloud, i.e. the node you are using does not have a GPU device.

To list the devices available on a particular node, use the following command:

sycl-ls

To request a GPU node, use the following command:

qsub -I -l nodes=1:gpu:ppn=2 -d .

To learn more about job submission commands, see the following link:

https://devcloud.intel.com/oneapi/documentation/job-submission/

If this resolves your issue, make sure to accept this as a solution. This would help others with similar issues. Thank you!

Regards,

Jaideep

HongbinBao
Novice

Hi,

I used qsub -I -l nodes=1:gpu:ppn=2 -d . to get a compute node and then ran sycl-ls. The output is as follows:

[opencl:cpu:0] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i9-11900KB @ 3.30GHz 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:1] Intel(R) OpenCL HD Graphics, Intel(R) UHD Graphics [0x9a60] 3.0 [22.43.24595.35]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) UHD Graphics [0x9a60] 1.3 [1.3.24595]

The error occurs about 20 seconds after the program starts:

Running on: Intel(R) UHD Graphics [0x9a60]
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): Native API failed. Native API returns: -1 (PI_ERROR_DEVICE_NOT_FOUND) -1 (PI_ERROR_DEVICE_NOT_FOUND)
Aborted

real 0m20.771s
user 0m8.174s
sys 0m12.573s

I then checked the device information again with sycl-ls:

[opencl:cpu:0] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i9-11900KB @ 3.30GHz 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:1] Intel(R) OpenCL HD Graphics, Intel(R) UHD Graphics [0x9a60] 3.0 [22.43.24595.35]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) UHD Graphics [0x9a60] 1.3 [1.3.24595]

My SYCL device selection code is as follows:

cl::sycl::queue deviceQueue(cl::sycl::default_selector_v);
std::cout << "Running on: "
          << deviceQueue.get_device().get_info<cl::sycl::info::device::name>()
          << std::endl;

The queue task submission code is roughly as follows:

u->queue.submit([&](cl::sycl::handler& cgh) {
  cgh.single_task<class my_kernel>([=]() {
    // (kernel body omitted)
  });
});
u->queue.wait_and_throw();

In fact, the error arises in the following situation. Previously I allocated shared memory through USM and computed on the GPU. After each computation finished, the kernel ended, the result was output, and the kernel was launched again for the next computation, and so on. That version ran without error, but it was very inefficient. So I removed the output code that ran after each computation, because I want the GPU to keep running without interruption until all computations are complete. With that change, the error appears.

VaishnaviV_Intel
Employee

Hi,


Could you please share a sample reproducer with us so that we can investigate your issue more thoroughly?


Thanks & Regards,

Vankudothu Vaishnavi.


HongbinBao
Novice

Hi,

Here is a minimal reproduction of the problem.

The problem is triggered by an infinite loop in the SYCL kernel code. The same error occurs even if the loop would eventually stop on some condition, and I am not sure how long a for loop has to run before the error appears. This does not seem to be caused by DevCloud, but by dpcpp? I am not sure why this is a problem.

Create a test file, infinite_loop.cpp:

#include <CL/sycl.hpp>
#include <vector>

int main() {
  // Default-constructed queue: the runtime picks the device.
  cl::sycl::queue queue;

  // One-element host vector wrapped in a SYCL buffer.
  std::vector<int> data(1, 42);
  cl::sycl::buffer<int, 1> buffer(data.data(), data.size());

  queue.submit([&](cl::sycl::handler& cgh) {
      auto acc = buffer.get_access<cl::sycl::access::mode::read_write>(cgh);

      cgh.parallel_for<class infinite_loop>(
          cl::sycl::range<1>(data.size()),
          [=](cl::sycl::id<1> idx) {
              // The kernel never returns; this is what triggers the error.
              for(;;) {
              }
          });
  });

  // Blocks until the kernel finishes and rethrows any asynchronous errors.
  queue.wait_and_throw();

  return 0;
}

Compile and run:

icpx -fsycl infinite_loop.cpp -o infinite_loop
./infinite_loop

This is not caused by a problem with the DevCloud environment.

Best regards

HongbinBao
Novice

The cause of the problem seems to be: A workload that takes more than four seconds for GPU hardware to execute is a long-running workload. By default, individual threads that qualify as long-running workloads are considered hung and are terminated.

https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-2/gpu-disable-hangcheck.html
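One possible way to stay under such a limit without changing system settings is to split the work into several shorter kernel launches instead of one long-running kernel, similar to the earlier version of the program that relaunched the kernel for each computation. The following is only a minimal sketch of the host-side chunking logic, written in plain C++ so that it compiles without SYCL. The helper `run_in_chunks` and its parameters are hypothetical names, not part of SYCL or of the code posted in this thread. In a real program, the `launch` callback would submit one kernel covering iterations [begin, end) and then call `wait_and_throw()` on the queue. A suitable chunk size is an assumption that would have to be found by measurement on the target GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper (not a SYCL API): splits `total_iterations` of work
// into consecutive ranges of at most `chunk_size` iterations and calls
// `launch(begin, end)` once per range, in order.
//
// Assumption: in a SYCL program, `launch` would submit one kernel that
// processes [begin, end) and wait for it, so that no single kernel runs
// long enough to be treated as hung.
//
// Returns the number of launches performed.
std::size_t run_in_chunks(
    std::size_t total_iterations,
    std::size_t chunk_size,
    const std::function<void(std::size_t, std::size_t)>& launch) {
  assert(chunk_size > 0 && "chunk_size must be positive");

  std::size_t launches = 0;
  std::size_t begin = 0;
  while (begin < total_iterations) {
    // The last range may be shorter than chunk_size.
    const std::size_t remaining = total_iterations - begin;
    const std::size_t length = remaining < chunk_size ? remaining : chunk_size;
    launch(begin, begin + length);
    begin += length;
    ++launches;
  }
  return launches;
}
```

For example, `run_in_chunks(10, 4, launch)` calls `launch` with the ranges [0, 4), [4, 8) and [8, 10) and returns 3. Each launch adds submission and synchronization overhead, so the chunk size is a trade-off between that overhead and the time a single kernel spends on the GPU.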

 

VaishnaviV_Intel
Employee

Hi,


Thanks for sharing the reproducer with us.

>>The cause of the problem seems to be: A workload that takes more than four seconds for GPU hardware to execute is a long-running workload. By default, individual threads that qualify as long-running workloads are considered hung and are terminated.

https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-2/gpu-disable-hangcheck.html


Did disabling the GPU hang check resolve the problem for you?


Thanks and Regards,

Vankudothu Vaishnavi.


HongbinBao
Novice

Hi,

I don't have permission to execute this command on DevCloud:

sudo sh -c "echo N> /sys/module/i915/parameters/enable_hangcheck"

 

 

 

 





 

VaishnaviV_Intel
Employee

Hi,

 

>>I don't have permission to execute this command on devcloud

Intel DevCloud is a shared environment that comes with pre-installed, validated Intel oneAPI software and complementary packages. As a policy, we do not install custom (open source or 3rd party licensed) software in the environment.

So we can't help much here. If you still have any issues, do let us know.

 

Thanks & Regards,

Vankudothu Vaishnavi.


VaishnaviV_Intel
Employee

Hi,


We have not heard back from you.

Do you have any other issues? If no, could you please confirm whether we can close this thread from our end?


Thanks & Regards,

Vankudothu Vaishnavi.


VaishnaviV_Intel
Employee

Hi,


We haven't heard back from you. If you have any issues, please post a new question as this thread will no longer be monitored by Intel.


Thanks & Regards,

Vankudothu Vaishnavi.

