h264_qsv encoder can't keep up with live capture?

Jason_P_ · ‎10-26-2016

We are using ffmpeg to capture live video from a Blackmagic DeckLink Mini Recorder and encode to multiple h264 outputs (streamed over RTMP and written to disk). We're running CentOS 7.2 with Media Server Studio 2017 on a Core i7-6700K.

We were very impressed with h264_qsv performance when we did multiple transcodes of a 1920x1080 MP4 file -- we were able to do 4 independent transcodes to 7 different locations (3 streams, 4 files) and it all ran in 3x real time. And the CPU impact was minimal (maybe 10-15%).

Based on this performance, we thought that it would be trivial to do real-time encoding of HD video.

Unfortunately, we have found that the 1 GB frame buffer will fill up, causing frames to drop. Sometimes it takes an hour before this behavior happens, and sometimes it only takes a few minutes. But it seems to always happen.

Here is a sample invocation:

https://gist.github.com/jpriebe/9acdc3beb50547449bd47f4fb46de214

(Normally we would "tee" the low/medium/high encodings to an RTMP streaming server. I have removed that to make the example a little simpler.)

We have tried everything to streamline/speed up the QuickSync encoding:

- use veryfast preset
- fix the min/max bitrates to use CBR
- force the GPU clock to 1150MHz

No matter what settings we use, the buffer will overrun eventually. The strange thing is that the GPU load (as measured by the metrics_monitor utility) is less than 20% and clock speed is confirmed to be at 1150MHz. So it doesn't seem like the GPU is overloaded.

We tried moving some of the encodings to the CPU with libx264, to no avail. If even one of the four encodings is done on the GPU, we will get an overrun.

By comparison, if we run all 4 encodings on the CPU, we can run indefinitely with no buffer overrun.

It really seems like there is some sort of bottleneck in the h264_qsv encoding. I don't know if it's a problem in ffmpeg, with QuickSync itself, or just in my understanding of how it all works.

Any insight would be much appreciated. Thanks!

Tamer_Assad · ‎10-26-2016

Hi Jason

The processing load, as I understand it, might look like this:

---------

( 1920 x 1080 x 1.5 ) x ( 4 ) x ( 60 ) = 711.9141 MB/s (with 1 MB = 2^20 bytes)

(YUV frame size in bytes) x (transcodes) x (FPS)

--------
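As a quick check of this estimate, here is a minimal Python sketch. It assumes 4:2:0 frames at 1.5 bytes per pixel, as the formula does; the function name is only illustrative.

```python
def raw_yuv420_rate(width, height, streams, fps):
    """Raw 4:2:0 video throughput in bytes per second (1.5 bytes per pixel)."""
    bytes_per_frame = width * height * 1.5
    return bytes_per_frame * streams * fps

# Figures from the estimate: four 1080p transcodes at 60 fps.
rate = raw_yuv420_rate(1920, 1080, 4, 60)
print(f"{rate / 2**20:.4f} MiB/s")  # 711.9141 MiB/s
print(f"{rate / 1e6:.3f} MB/s")     # 746.496 MB/s
```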

Apparently there are a couple of scaling operations as well, I suppose in VPP (which needs additional GPU memory).

What is the GOP size? Reducing it, if possible, might lower the GPU memory usage.

Best regards,

Tamer

Mark_L_Intel1 · ‎10-26-2016

Hi Jason,

h264_qsv is a plugin to the ffmpeg framework. Debugging and resolving an ffmpeg usage issue takes time and is outside the scope of this forum; we focus on problems in the Media SDK itself.

So I suggest submitting the question to the FFmpeg mailing list. In any case, if you can find solid proof that this is a Media SDK issue, you can come back and ask again.

Mark Liu

Jason_P_ · ‎10-26-2016

Tamer, thanks for the reply!

I believe that the filter we're applying does the scaling on the CPU -- once to 1280x720 and once to 960x540. Two of the h264_qsv encodes are using the 1280x720 video, and two are using the 960x540.

Our frame rate is 29.97, so it's not as bad as 60 fps.
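To put numbers on that, here is a rough Python sketch of the raw pixel rate reaching the four encoders with the sizes and frame rate given in this post. It assumes 4:2:0 frames at 1.5 bytes per pixel, the same assumption as the earlier estimate; the variable names are illustrative.

```python
BYTES_PER_PIXEL = 1.5  # assumption: 4:2:0 frames, as in the earlier estimate
FPS = 29.97

# (width, height, number of encodes) as described in this post
outputs = [(1280, 720, 2), (960, 540, 2)]

total = sum(w * h * BYTES_PER_PIXEL * n * FPS for w, h, n in outputs)
reference = 1920 * 1080 * 1.5 * 4 * 60  # the four-stream 1080p, 60 fps estimate

print(f"{total / 2**20:.1f} MiB/s")                        # about 123.5 MiB/s
print(f"{total / reference:.0%} of the 60 fps estimate")   # about 17%
```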

We aren't specifying the GOP size, but when I've looked at it, it seems to be 250 frames.

I'm hesitant to explicitly set the GOP size, because every time I've done that, I get nasty pulsing effects in my h264 video. But maybe I could try reducing it to something like 150 or 180 to see if that makes a difference.

But to your overall theory, if I were simply sending too much pixel data for the GPU to handle, wouldn't it fall behind almost instantly, and overrun my capture buffer within a minute or so? I don't see how it can run just fine for an hour, then start falling behind.
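A back-of-the-envelope Python sketch of that argument follows. It assumes the "1 GB" buffer is 2**30 bytes and holds 1920x1080 frames at 1.5 bytes per pixel (4:2:0); if the capture frames are stored as 4:2:2 the buffer would fill faster. The function name and the encoder speeds tried are hypothetical.

```python
import math

def seconds_until_overrun(buffer_bytes, frame_bytes, capture_fps, encoder_speed):
    """Time for the capture buffer to fill when the encoder drains it at
    encoder_speed times real time (1.0 means it keeps up exactly)."""
    if encoder_speed >= 1.0:
        return math.inf  # the buffer never grows
    capacity_frames = buffer_bytes / frame_bytes
    fill_rate = capture_fps * (1.0 - encoder_speed)  # net frames queued per second
    return capacity_frames / fill_rate

BUFFER = 2**30             # assumption: "1 GB" means 2**30 bytes
FRAME = 1920 * 1080 * 1.5  # assumption: 4:2:0 frames; 4:2:2 would fill faster

for speed in (0.0, 0.5, 0.99, 0.999):
    t = seconds_until_overrun(BUFFER, FRAME, 29.97, speed)
    print(f"encoder at {speed:.1%} of real time: overrun after {t / 60:.1f} min")
```

Under these assumptions the buffer holds about 345 frames, roughly 11.5 seconds of video, and a sustained shortfall of only 1% would overrun it in about 19 minutes. An hour-long run before the overrun would therefore imply that the encoder keeps up most of the time.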

One thing I forgot to mention is that the buffer often seems to start filling up when we encounter a complex scene. Most of what we encode is talking heads from newscasts. But some video has lots of visual complexity, and I feel like that's when the encoder starts to choke. It's almost like the encoding algorithm has some non-deterministic elements to it, and maybe it's recursing or iterating to squeeze the bitrate down. I just want it to "try its best" and process the scenes in a constant amount of time, no matter how complex the video is.

Tamer_Assad · ‎10-27-2016

Hi Jason,

You are welcome!

Actually, I was suggesting that you estimate the processing load so that you can understand the hardware limitations and adapt your solution to best fit the hardware.

To work this out, I suggest the following experiments:

1. Migrate your solution to Media SDK directly. -> further support + better resolution.

2. Try various GOP sizes (min to max) and profile your solution. -> understand the effect of GOP size.

3. Try various bitrates with CBR, then try variable bitrate settings, and profile. -> bitrate effect + additional freedom for the encoder.

4. Process recorded media (a range of known samples) from a file, not live. -> accelerate your test process + unify use cases.

5. If none of the above helps, break the problem down into smaller blocks: isolate each sub-process, profile it, and validate it, e.g. separate the decoding process (to files), the image processing, and the encoding process.
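Experiments 2 to 4 could be scripted along these lines. This is only a sketch: the sample file name, the GOP sizes, and the bitrates are hypothetical, and it assumes that setting `-maxrate` equal to `-b:v` requests CBR from h264_qsv. The script only builds and prints the ffmpeg command lines; it does not run them.

```python
import itertools
import shlex

SAMPLE = "sample_1080p.mp4"  # hypothetical recorded test clip

def build_command(gop, bitrate_k, cbr):
    """Build one ffmpeg invocation that encodes SAMPLE and discards the output."""
    cmd = ["ffmpeg", "-benchmark", "-i", SAMPLE, "-an",
           "-c:v", "h264_qsv", "-preset", "veryfast",
           "-g", str(gop), "-b:v", f"{bitrate_k}k"]
    if cbr:
        # Assumption: pinning maxrate to the target bitrate requests CBR.
        cmd += ["-maxrate", f"{bitrate_k}k"]
    return cmd + ["-f", "null", "-"]

gops = [30, 60, 150, 250]   # illustrative values; 250 is the observed default
bitrates_k = [1500, 3000]   # illustrative values in kbit/s
commands = [build_command(g, b, cbr)
            for g, b, cbr in itertools.product(gops, bitrates_k, [True, False])]

for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be run by hand or passed to `subprocess.run`, and the timing that `-benchmark` reports can then be compared across GOP sizes and rate-control settings.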

A video encoder uses reference frames to code the dependent frames within the same GOP. In standard video processing, a "complex scene" might cause a new I-frame to be generated. I assume that, in extreme situations, generating I-frames under a strict GOP size and limited bitrate settings could be problematic.

Best regards,

Tamer