Encountered problems when using the GPU to run a program on DevCloud

HongbinBao · 08-19-2023

I encountered problems when using the GPU to run the program on the devcloud:

terminate called after throwing an instance of 'sycl::_V1::runtime_error'

what(): Native API failed. Native API returns: -1 (PI_ERROR_DEVICE_NOT_FOUND) -1 (PI_ERROR_DEVICE_NOT_FOUND)

It is hard to say how to reproduce it, because in many cases the error does not occur.
The message carries so little useful information that I cannot locate the problem.

I'm not sure if this is due to the calculation being done on the GPU without returning a result.
Does anyone know more about this error and can provide me with more information?

JaideepK_Intel · 08-21-2023

Hi,

Thank you for posting in Intel Communities.

The error above occurs because you are trying to run a GPU binary on a non-GPU node on DevCloud, i.e. the node you are using has no GPU device.

To list the devices available on a particular node, use the command below:

sycl-ls

To request a GPU node, use the command below:

qsub -I -l nodes=1:gpu:ppn=2 -d .

To learn more about job submission commands, see the link below:

https://devcloud.intel.com/oneapi/documentation/job-submission/

If this resolves your issue, make sure to accept this as a solution. This would help others with similar issues. Thank you!

Regards,

Jaideep

HongbinBao · 08-21-2023

Hi,

I used qsub -I -l nodes=1:gpu:ppn=2 -d . to request a compute node, then ran:

sycl-ls

The output was:

[opencl:cpu:0] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i9-11900KB @ 3.30GHz 3.0 [2023.16.7.0.21_160000]

[opencl:gpu:1] Intel(R) OpenCL HD Graphics, Intel(R) UHD Graphics [0x9a60] 3.0 [22.43.24595.35]

[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) UHD Graphics [0x9a60] 1.3 [1.3.24595]

This error occurs about 20 seconds after running the program:

Running on: Intel(R) UHD Graphics [0x9a60]

terminate called after throwing an instance of 'sycl::_V1::runtime_error'

what(): Native API failed. Native API returns: -1 (PI_ERROR_DEVICE_NOT_FOUND) -1 (PI_ERROR_DEVICE_NOT_FOUND)

Aborted

real 0m20.771s

user 0m8.174s

sys 0m12.573s

Then I checked the device information again:

sycl-ls

[opencl:cpu:0] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i9-11900KB @ 3.30GHz 3.0 [2023.16.7.0.21_160000]

[opencl:gpu:1] Intel(R) OpenCL HD Graphics, Intel(R) UHD Graphics [0x9a60] 3.0 [22.43.24595.35]

[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) UHD Graphics [0x9a60] 1.3 [1.3.24595]

My SYCL device-selection code is as follows:

cl::sycl::queue deviceQueue(cl::sycl::default_selector_v);
std::cout << "Running on: "
          << deviceQueue.get_device().get_info<cl::sycl::info::device::name>()
          << std::endl;

The queue task submission code is roughly as follows:

u->queue.submit([&](cl::sycl::handler& cgh) {
    cgh.single_task<class my_kernel>([=]() {
        // kernel body omitted
    });
});
u->queue.wait_and_throw();

The error occurs in the following situation. Previously I allocated shared memory through USM and ran the calculation on the GPU. After each calculation finished, the kernel returned, the result of that step was printed, and the kernel was launched again for the next calculation, and so on. That version ran without errors but was very inefficient. So I removed the per-step output so that the GPU would run without interruption until all calculations were finished, and that is when this error appears.

VaishnaviV_Intel · 08-22-2023

Hi,

Could you please share a sample reproducer with us so that we can investigate your issue more thoroughly?

Thanks & Regards,

Vankudothu Vaishnavi.

HongbinBao · 08-23-2023

Hi,

Here's a minimal reproduction of the problem:

The problem is triggered by an infinite loop in the SYCL kernel code. The same error occurs even if the loop would eventually stop once some condition is met. I am also not sure how long a for loop has to run before this error appears. This does not seem to be caused by DevCloud, but by dpcpp? I'm not sure why this is a problem.

Create a test file:

infinite_loop.cpp

#include <CL/sycl.hpp>
#include <vector>

int main() {
    cl::sycl::queue queue;

    // One-element buffer so the kernel has something to bind to.
    std::vector<int> data(1, 42);
    cl::sycl::buffer<int, 1> buffer(data.data(), data.size());

    queue.submit([&](cl::sycl::handler& cgh) {
        auto acc = buffer.get_access<cl::sycl::access::mode::read_write>(cgh);

        cgh.parallel_for<class infinite_loop>(
            cl::sycl::range<1>(data.size()),
            [=](cl::sycl::id<1> idx) {
                // Never returns on its own.
                for (;;) {
                }
            });
    });

    // Blocks until the kernel finishes; device errors surface here.
    queue.wait_and_throw();

    return 0;
}

Compile and run:

icpx -fsycl infinite_loop.cpp -o infinite_loop

./infinite_loop

This is not caused by a problem in the DevCloud environment.

Best regards

HongbinBao · 08-23-2023

The cause of the problem seems to be: A workload that takes more than four seconds for GPU hardware to execute is a long-running workload. By default, individual threads that qualify as long-running workloads are considered hung and are terminated.

https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-2/gpu-disable-hangcheck.html
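
If the hang check cannot be turned off, one possible workaround is to bound the amount of work done per kernel launch so that no single launch runs long enough to be treated as hung. The sketch below shows only the host-side control flow in plain C++, with no SYCL dependency. The names `run_in_chunks` and `launch` are hypothetical; in a real program `launch` would submit one kernel covering steps [begin, end) and wait for it. The chunk size that keeps a launch under the limit is an assumption and would have to be measured on the target GPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>

// Hypothetical helper: split `total_steps` into launches of at most
// `steps_per_launch` steps each. `launch(begin, end)` stands in for
// submitting one kernel that processes steps [begin, end) and waiting
// for it to finish. Returns the number of launches performed.
std::size_t run_in_chunks(
    std::size_t total_steps,
    std::size_t steps_per_launch,
    const std::function<void(std::size_t, std::size_t)>& launch) {
    if (steps_per_launch == 0) {
        return 0;  // a zero chunk size would never make progress
    }
    std::size_t launches = 0;
    std::size_t begin = 0;
    while (begin < total_steps) {
        // The last chunk may be shorter than steps_per_launch.
        const std::size_t len = std::min(steps_per_launch, total_steps - begin);
        launch(begin, begin + len);
        begin += len;
        ++launches;
    }
    return launches;
}
```

Compared with launching once per step, this keeps the number of host round trips low while still returning control to the host regularly; results only need to be read back after the last chunk.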

VaishnaviV_Intel · 08-28-2023

Hi,

Thanks for sharing the reproducer with us.

>>The cause of the problem seems to be: A workload that takes more than four seconds for GPU hardware to execute is a long-running workload. By default, individual threads that qualify as long-running workloads are considered hung and are terminated.

https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-2/gpu-disable-hangcheck.html

Did disabling the GPU Hang check resolve the problem for you?

Thanks and Regards,

Vankudothu Vaishnavi.

HongbinBao · 08-29-2023

Hi,

I don't have permission to execute this command on DevCloud:

sudo sh -c "echo N> /sys/module/i915/parameters/enable_hangcheck"
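
Reading the parameter, unlike writing it, normally does not need root, so the current setting can at least be inspected. This is a minimal sketch, assuming the sysfs file is readable by ordinary users on the node, which has not been verified on DevCloud. `read_hangcheck` is a hypothetical helper name.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: read the first line of a sysfs parameter file into
// `value`. Returns false if the file cannot be opened or is empty.
bool read_hangcheck(const std::string& path, std::string& value) {
    std::ifstream in(path);
    if (!in) {
        return false;
    }
    return static_cast<bool>(std::getline(in, value));
}
```

Calling it with "/sys/module/i915/parameters/enable_hangcheck" should yield the current setting. Boolean kernel module parameters are usually reported as Y or N, so N would correspond to the disabled state that the command above tries to set.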

VaishnaviV_Intel · 09-01-2023

Hi,

>>I don't have permission to execute this command on devcloud

Intel DevCloud is a shared environment which comes with pre-installed, validated Intel oneAPI software and complementary packages. As a policy, we do not install custom (open source or 3rd party licensed) software in the environment.

So we can't help much here. If you still have any issues, do let us know.

Thanks & Regards,

Vankudothu Vaishnavi.

VaishnaviV_Intel · 09-08-2023

Hi,

We have not heard back from you.

Do you have any other issues? If not, could you please confirm whether we can close this thread from our end?

Thanks & Regards,

Vankudothu Vaishnavi.

VaishnaviV_Intel · 09-13-2023

Hi,

We haven't heard back from you. If you have any issues, please post a new question as this thread will no longer be monitored by Intel.

Thanks & Regards,

Vankudothu Vaishnavi.
