Media (Intel® oneAPI Video Processing Library, Intel Media SDK)
Access community support with transcoding, decoding, and encoding in applications using media tools from Intel. This includes Intel® oneAPI Video Processing Library and Intel® Media SDK.

GPU Hang

Steve_S_5
Beginner
278 Views

Hi,

I'm experiencing fairly frequent GPU hangs when running ~20 processes, each doing an MPEG-2 decode, a rescale, and then H.264 and JPEG encodes. This is on Windows 8.1 with the latest drivers. Is anyone else experiencing this sort of thing? If so, is there anything that can be done to reduce (and ideally stop) it? If it's unusual, does anyone have any pointers as to what we might be doing to cause it? If there are any logging/tracing tools I can run to provide further details, let me know.

Cheers,

Steve

Mark_L_Intel1
Moderator

Hi Steve,

Have you solved this issue? It is not yet clear what the root cause is, but can you first narrow it down? For example, can you decrease the number of processes to 10 to see whether the GPU hang still happens?

If this was caused by multiple processes, you should be able to find a number of processes which doesn't cause the hang.

Mark.

Steve_S_5
Beginner

Hi,

Not solved it yet - dropping the number of encoding processes isn't an option, since the customer requires that many independent encoders to be running.

I am working on reproducing the customer's setup locally, and will test on both Windows and CentOS to see whether the problem is common to both systems or just a Windows thing.

As a general question, what is the normal cause of a GPU hang?

Mark_L_Intel1
Moderator

Hi Steve,

In general, a GPU hang should be a hardware problem. During configuration, the video processing pipeline is set up, but that is not where the video processing happens; the processing happens in the loop after buffer initialization has completed. So the cause of the problem is in the configuration or the initialization stage.

The cause might be the memory configuration or the decoder parameters; since multiple processes share the memory pool, that might also be the problem. So my suggestion to reduce the number of processes is for investigation, not a solution: if we can find a number of processes that doesn't cause the problem, then we would know that running multiple processes is the cause.

Mark Liu

Steve_S_5
Beginner

Hi Mark,

Thanks for the quick responses. Right now I am setting up a local environment on which I can reproduce the problem; this may take a day or two, since the hangs don't happen all the time. Once I have a repro, I'll do a binary chop on the number of processes to see if we can work out what level of processing is causing the issues.
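A binary chop like this only works if the hang is monotonic in load: if N processes run hang-free, any smaller count does too. A minimal sketch under that assumption is below. `runs_without_hang` is a hypothetical callback standing in for "launch n processes and soak them long enough to trust a clean run", which is the expensive part given how sporadic the hangs are.

```python
from typing import Callable


def max_stable_process_count(lo: int, hi: int,
                             runs_without_hang: Callable[[int], bool]) -> int:
    """Binary chop for the largest process count in [lo, hi] that runs hang-free.

    Assumes monotonicity (if n is stable, every count below n is stable too)
    and that `lo` is already known to be stable. Returns lo if nothing above it passes.
    `runs_without_hang` is a hypothetical soak-test callback, not a real tool.
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the interval always shrinks
        if runs_without_hang(mid):
            lo = mid              # mid survived the soak: the limit is at or above mid
        else:
            hi = mid - 1          # mid hung: the limit is below mid
    return lo


# Example with a simulated system that is stable up to 13 processes.
print(max_stable_process_count(1, 20, lambda n: n <= 13))
```

With a known-good lower bound of 1 and the customer's 20 processes as the upper bound, this needs at most five soak runs (ceil(log2 20)). A clean run is only evidence, not proof, so each trial should last long enough relative to the observed hang rate.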

Is there any trace I can get out of the driver that would help narrow this down?

Tamer_Assad
Innovator

Hi Steve,

 

Did you experience this on different Windows or driver versions?

Did you try another GPU?

What is the maximum/average GPU memory occupation, considering all active sessions and streams?

Are you scaling frames on the GPU, using VPP?

Intel GPA (Graphics Performance Analyzers) and Intel VTune should help with profiling.

 

Best regards,

Tamer Assad

Steve_S_5
Beginner

Hi Tamer,

GPU and Windows versions are set by our customer, so we don't have much choice there.  GPU is the Iris Pro 580.  GPU load is around 60%, with memory usage of ~630MB (as reported by GPU-Z).

We are scaling frames with VPP in the GPU.

Cheers,

Steve

Tamer_Assad
Innovator

Hi Steve,

As Mark mentioned, applying some changes to your development environment and current implementation is aimed at narrowing down the possible causes and eventually identifying the root cause.

In other words, this might be a driver issue, an overheating device, or a problem in the implementation.

For now, we can tentatively assume that this is not a GPU memory usage issue.

 

Best regards,

Tamer Assad

Steve_S_5
Beginner

An update on this issue.  On my original system (with the 20 concurrent processes), I am seeing GPU hangs roughly once per day.  

I have set up another system that runs only 10 concurrent processes, and have not yet seen that one go wrong.
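Because the 20-process system hangs only about once a day, a clean run on the 10-process system needs to last a while before it means much. As a rough sketch, assuming (this is not established in the thread) that hangs arrive independently at a constant average rate λ per day, i.e. a Poisson process, the probability of seeing no hang at all in t days is exp(-λt). The helper names below are hypothetical.

```python
import math


def prob_no_hang(rate_per_day: float, days: float) -> float:
    """P(zero hangs in `days`), assuming hangs form a Poisson process at `rate_per_day`."""
    return math.exp(-rate_per_day * days)


def days_needed(rate_per_day: float, confidence: float) -> float:
    """Soak time after which a hang-free run makes `rate_per_day` unlikely,
    i.e. the t for which P(no hang) = 1 - confidence under that rate."""
    return -math.log(1.0 - confidence) / rate_per_day


print(prob_no_hang(1.0, 1.0))   # chance a system hanging ~once/day survives one day
print(days_needed(1.0, 0.95))   # days hang-free needed to reject ~once/day at 95%
```

With λ = 1 hang/day, `days_needed(1.0, 0.95)` is ln 20, about 3 days, so roughly three hang-free days on the 10-process box are needed before one could say, at about 95% confidence, that it is not hanging at the 20-process system's rate. A single clean day still leaves about a 37% chance (e^-1) of simply having been lucky, and no finite clean run shows the rate is zero.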

The pipeline that I am running is:

mpeg2 decode -> vpp rescale -> jpeg encode
             -> h264 encode

So the MPEG-2 decode output goes to both a VPP and an encode job; as required by the docs, I have two joined sessions. The first session is used for the decode, VPP and JPEG encode, and the second for the H.264 encode.

I briefly ran one additional test on the 20-concurrent process system with a more complex pipeline:

mpeg2 decode -> vpp rescale -> jpeg encode
             -> vpp rescale -> jpeg encode
             -> vpp rescale -> jpeg encode
             -> h264 encode

So that is 2 additional rescales and JPEG encodes (with 4 joined sessions). This caused GPU hangs pretty much straight away. Interestingly, although possibly unrelated and a subject for a different thread, all of the JPEG encodes are warning about partial acceleration; I thought that on 6th-generation (Skylake) hardware, JPEG encodes would be hardware-based.
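To put the two pipelines side by side, a back-of-the-envelope count of concurrent components summed over all processes may help. The sketch assumes one joined session per output branch, which matches the two data points given (2 sessions for one JPEG branch plus H.264, 4 sessions for three JPEG branches plus H.264). `Pipeline` and `totals` are hypothetical names for illustration, not Media SDK APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    jpeg_branches: int      # "vpp rescale -> jpeg encode" branches fed by the decoder
    h264_branches: int = 1  # H.264 encode branches fed by the decoder

    def sessions(self) -> int:
        # Assumption: one joined session per output branch
        # (matches 2 sessions for 1+1 branches and 4 for 3+1 branches).
        return self.jpeg_branches + self.h264_branches


def totals(p: Pipeline, processes: int) -> dict[str, int]:
    """Concurrent component counts summed over all processes."""
    return {
        "decodes": processes,
        "sessions": p.sessions() * processes,
        "vpp_rescales": p.jpeg_branches * processes,
        "jpeg_encodes": p.jpeg_branches * processes,
        "h264_encodes": p.h264_branches * processes,
    }


original = Pipeline(jpeg_branches=1)
extended = Pipeline(jpeg_branches=3)
for name, p, n in [("original x20", original, 20),
                   ("original x10", original, 10),
                   ("extended x20", extended, 20)]:
    print(name, totals(p, n))
```

Under that assumption, the extended pipeline at 20 processes triples the VPP and JPEG work (20 to 60 each) and doubles the session count (40 to 80) relative to the original, while the 10-process system runs half the original load (20 sessions). The counts only describe load; they say nothing about which component actually triggers the hang.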

Any ideas on the GPU hang issue? The system is for a video surveillance package that will run 24x7, so periodic hangs are unlikely to be acceptable to our customer. Clearly it is load-related, but GPU usage is only showing around 54%, so nowhere near maxed out. And even if it were, I would expect it to fail to keep up with real time, not to cause sporadic hangs.

Ideas would be much appreciated!

Cheers,

Steve
