Intel® oneAPI Base Toolkit
Support for the core tools and libraries within the base toolkit that are used to build and deploy high-performance data-centric applications.

GPU freezes until rebooting (USM)

RN1
New Contributor I

Hi,

 

I have found that when a single huge buffer (3.5 GiB) is used for the output of a computation with USM, the program either segfaults (if it prints the values) or freezes (if it does not).

When not printing the values:

#0  0x00007ffff6e7550b in ioctl () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007ffff1247251 in NEO::Drm::ioctl(unsigned long, void*) ()
   from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#2  0x00007ffff123e627 in NEO::BufferObject::wait(long) ()
   from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#3  0x00007ffff122fd92 in NEO::MemoryManager::freeGraphicsMemory(NEO::GraphicsAllocation*) ()
   from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#4  0x00007ffff101f2b2 in L0::FenceImp::~FenceImp() () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#5  0x00007ffff101f2cd in L0::FenceImp::~FenceImp() () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#6  0x00007ffff108e2ee in zeFenceDestroy () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#7  0x00007ffff4b601dd in piQueueRelease ()
   from /opt/intel/oneapi/compiler/2021.1-beta10/linux/lib/libpi_level_zero.so
#8  0x00007ffff6fbfae3 in std::_Sp_counted_ptr_inplace<cl::sycl::detail::queue_impl, std::allocator<cl::sycl::detail::queue_impl>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /opt/intel/oneapi/compiler/2021.1-beta10/linux/lib/libsycl.so.5
#9  0x000000000040b0c9 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x82afb0)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:155
#10 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fffffffc7a8)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:730
#11 std::__shared_ptr<cl::sycl::detail::queue_impl, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (
    this=0x7fffffffc7a0)
    at /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/shared_ptr_base.h:1169
#12 cl::sycl::queue::~queue (this=0x7fffffffc7a0)
    at /opt/intel/oneapi/compiler/2021.1-beta10/linux/include/sycl/CL/sycl/queue.hpp:46
#13 Options::~Options (this=0x7fffffffc6c0)
    at /home/user/pone.cpp:54
#14 main (argc=<optimized out>, argv=<optimized out>)
    at /home/user/pone.cpp:676

The segfault happens, for example, after the GPU computation, when I try to read the values on the host (a simple std::cout of a few specific values from the huge buffer).

The real problem is this: the GPU cannot be used again (even a simple `intel_gpu_frequency` blocks forever). I am using an i5 with a 630 GPU, running at a slightly reduced frequency and with /sys/module/i915/parameters/enable_hangcheck set to N.

I waited up to 7 hours. According to `intel_gpu_top`, the Render/3D/0 engine bar stayed at 99-100% the whole time (2100 MiB/s IMC reads, around 2.2 W).

If I try to run the program again, it freezes here:

Thread 1 "pone_auto" received signal SIGINT, Interrupt.
0x00007ffff6e6389b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
78      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007ffff6e6389b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007ffff101e08e in L0::EventImp::hostSynchronize(unsigned long) ()
   from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#2  0x00007ffff4b641e7 in piEventsWait ()
   from /opt/intel/oneapi/compiler/2021.1-beta10/linux/lib/libpi_level_zero.so
#3  0x00007ffff70b4ec8 in cl::sycl::detail::event_impl::waitInternal() const ()
   from /opt/intel/oneapi/compiler/2021.1-beta10/linux/lib/libsycl.so.5
#4  0x00007ffff70b5d80 in cl::sycl::detail::event_impl::wait(std::shared_ptr<cl::sycl::detail::event_impl>) const () from /opt/intel/oneapi/compiler/2021.1-beta10/linux/lib/libsycl.so.5
#5  0x00007ffff71523cd in cl::sycl::event::wait() ()
   from /opt/intel/oneapi/compiler/2021.1-beta10/linux/lib/libsycl.so.5
#6  0x0000000000405f1b in process_pone (cpu=false, opts=opts@entry=0x7fffffffc6f0)
    at /home/user/pone.cpp:150
#8  main (argc=<optimized out>, argv=<optimized out>)
    at /home/user/pone.cpp:573

The only solution is to reboot.

Any idea how I can reuse the GPU without rebooting? I can accept the segfault, but not that the GPU stays frozen forever.

I tried `intel_gpu_abrt` (in case it helps, I don't know), but it fails with `bad substitution`.


Of course, no processes (zombie or alive) remain after the segfault, so there is nothing I can force-kill.


An important detail is that the max allocation size is 3.06 GiB, so I don't understand why the computation is allowed to run at all. On the other hand, with OpenCL I can run and print the values (no idea why).

 

Global memory size 6577778688 (6.126GiB)
Max memory allocation 3288889344 (3.063GiB)

 

So, in summary: in theory I can allocate a single buffer of up to 3.06 GiB, but I allocate a buffer of 3.5 GiB.

  • USM: freezes (when destroying the queue without printing the values) or segfaults (when printing values computed on the GPU).
  • No USM: finishes correctly, whether or not the values are printed.
  • OpenCL: finishes correctly, whether or not the values are printed.
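
The sizes involved can be checked with a few lines of arithmetic. The sketch below is plain C++ (no SYCL) and uses only the two byte counts reported above. The helper names `to_gib` and `fits_single_allocation` are hypothetical and introduced for illustration. In a real program the limit would be queried from the device (for example via `sycl::info::device::max_mem_alloc_size`) rather than hard-coded.

```cpp
#include <cstdint>

// Values reported by the device in this thread, in bytes.
constexpr std::uint64_t kGlobalMemSize = 6577778688ULL;  // 6.126 GiB
constexpr std::uint64_t kMaxMemAlloc   = 3288889344ULL;  // 3.063 GiB

constexpr std::uint64_t kGiB = 1024ULL * 1024ULL * 1024ULL;

// The 3.5 GiB output buffer from the report, expressed in bytes.
constexpr std::uint64_t kRequested = 7ULL * kGiB / 2ULL;

// Hypothetical helper: convert a byte count to GiB (e.g. for printing).
constexpr double to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / static_cast<double>(kGiB);
}

// Hypothetical helper: true if one allocation of `bytes` stays within
// the given limit.
constexpr bool fits_single_allocation(std::uint64_t bytes,
                                      std::uint64_t limit) {
    return bytes <= limit;
}
```

With these numbers, `fits_single_allocation(kRequested, kMaxMemAlloc)` is false while `fits_single_allocation(kRequested, kGlobalMemSize)` is true: the buffer is larger than the per-allocation limit but smaller than total global memory.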
6 Replies
RahulV_intel
Moderator

Hi,

 

Kindly mention your oneAPI base toolkit version and OS details.

By definition, the max memory allocation is the largest amount of memory that can be allocated for a single data structure.

 

In the buffer/accessor model, an attempt to allocate more than this limit for a single buffer should ideally fail. In your case it did not. Could you please share reproducer code for this?

 

Also, for the USM case, could you add exception handling to your code and let me know whether it helps? Please share the reproducer code for USM as well.
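
One possible reading of this suggestion, as a minimal sketch: USM allocation functions such as `sycl::malloc_device` signal failure by returning a null pointer rather than by throwing, so a failed allocation goes unnoticed unless the result is checked. The helper below is plain C++ and turns a null result into an exception. The name `require_allocation` is hypothetical and not part of SYCL or oneAPI.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: pass through a valid pointer, or throw if the
// allocation that produced it failed (returned nullptr).
template <typename T>
T* require_allocation(T* ptr, std::size_t bytes) {
    if (ptr == nullptr) {
        throw std::runtime_error("allocation of " + std::to_string(bytes) +
                                 " bytes failed");
    }
    return ptr;
}
```

In a SYCL program, the result of the allocation call would be wrapped in `require_allocation(...)` inside a `try` block, so that an oversized request surfaces as a catchable error on the host instead of a later crash when the pointer is used.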

 

 

Regards,

Rahul

RahulV_intel
Moderator

Hi,


Any updates on this?


Thanks,

Rahul


RN1
New Contributor I

Sorry for the delay.

I have run so many tests that I can no longer find the exact case from which to extract a minimal reproducer. Give me a few days and I will try again.

I am currently using the latest oneAPI Base Toolkit, beta10, on Ubuntu 20.04.

But what I have tested today is:
- execute GPU code that needs 5 seconds to compute
- kill the process
- check intel_gpu_top: the Render bar keeps showing activity until those 5 seconds have elapsed.
This is what I observed when I opened this thread: even though no host process is running, the GPU keeps running what was submitted. In my case, something went wrong on the GPU and it got stuck running forever (Render at 99-100%). There were no OS processes left, and my only solution was to reboot.

How can we force the GPU to drop any running jobs and free it?

RahulV_intel
Moderator

Hi,


I've run iGPU kernels for more than 5 seconds and also tried oversubscribing the iGPU max memory (USM); I did not observe any hang. Could you try upgrading your iGPU drivers to the latest version and let me know whether it helps?


Meanwhile, if you could send a reproducer for your test case, that'd be great.


Also, try both the Level Zero and OpenCL backends, and let me know if you notice any anomaly.

Environment variable to select the backend (Level Zero is the default):

SYCL_BE=PI_OPENCL

SYCL_BE=PI_LEVEL_ZERO



Thanks,

Rahul


RahulV_intel
Moderator

Hi,

 

Please find below sample code in which I try to allocate more than the max memory allocation using USM.

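#include <CL/sycl.hpp>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    sycl::queue gpuQ(sycl::gpu_selector{});

    // Largest single allocation the device reports, in bytes
    auto max_mem_alloc_gpu = gpuQ.get_device().get_info<sycl::info::device::max_mem_alloc_size>();
    std::cout << "Max Mem alloc GPU: " << max_mem_alloc_gpu << std::endl;

    // Deliberately request twice the reported per-allocation limit
    const std::size_t max_ints_gpu = max_mem_alloc_gpu / sizeof(int);
    const std::size_t try_max = max_ints_gpu * 2;
    std::cout << "Ints to allocate on GPU: " << try_max << "\n";

    std::vector<int> max_vector_gpu(try_max, 2);

    int *gpu_alloc = static_cast<int *>(
        sycl::malloc_device(try_max * sizeof(int), gpuQ.get_device(), gpuQ.get_context()));
    if (gpu_alloc == nullptr) {  // a failed USM allocation returns a null pointer
        std::cout << "Device allocation failed\n";
        return 1;
    }

    // Host -> device copy
    gpuQ.memcpy(gpu_alloc, max_vector_gpu.data(), try_max * sizeof(int)).wait_and_throw();

    // Multiply every element by 10 on the device
    gpuQ.submit([&](sycl::handler &h) {
        h.parallel_for(sycl::range<1>{try_max}, [=](sycl::id<1> i) {
            gpu_alloc[i] *= 10;
        });
    }).wait_and_throw();

    // Device -> host copy, then release the device memory
    gpuQ.memcpy(max_vector_gpu.data(), gpu_alloc, try_max * sizeof(int)).wait_and_throw();
    sycl::free(gpu_alloc, gpuQ.get_context());

    // Every element should now be 2 * 10 = 20
    std::vector<int> gpu_val_chk(try_max, 20);
    std::cout << ((max_vector_gpu == gpu_val_chk) ? "Success\n" : "Failure\n");

    return 0;
}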

 

I did not notice any hang with the above code sample. Please try running it and let me know if you see any issues.

 

Thanks,

Rahul 

RahulV_intel
Moderator

Hi,


I have not heard back from you, so I will go ahead and close this thread from my end. Intel will no longer monitor this thread. Further responses on this thread will be considered community only.


Please post a new question once your test case is ready.



Thanks,

Rahul

