I'm trying to determine how many QuickSync H.264 encodes I can run simultaneously (in real time) on a 3rd-gen i7 @ 2.1GHz w/ 16GB of RAM. My source content is HEVC, 640x360, 30f/s @ 550kb/s, packaged in MPEG-2 TS over UDP. I am transcoding that material (in real time) to H.264-MP, 640x360, 30f/s @ 900kb/s, packaged in HLS. I am using a slightly modified version of ffmpeg 2.7 that includes a highly optimized CPU-only HEVC decoder, and a version of the h264_qsv encoder for ffmpeg that still supports the 3rd-gen i7. The box runs 64-bit Ubuntu 12 w/ VA-API version 0.34, driver version 184.108.40.20683. I have confirmed that QuickSync is configured and running properly (vainfo returns supported profiles for H.264 encoding).
As an initial test, I tried a non-real-time transcode (HEVC, 640x360, 30f/s @ 550kb/s, packaged in MP4, to H.264-MP, 640x360, 30f/s @ 900kb/s, packaged in MP4). I am achieving roughly 700f/s in this configuration. While ffmpeg is running, overall CPU utilization is ~25% (spread across the 4 cores). This suggests to me that 700f/s is the upper limit of what QuickSync can do on this platform (i.e., the GPU is probably pegged at 100%).
Dividing 700 by 30 gives about 23, so shouldn't the platform be capable of at least 15 *simultaneous* real-time HEVC-to-H.264 transcodes?
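The arithmetic behind that estimate, as a small Python sketch. The function name and the `headroom` parameter are illustrative assumptions; the sketch also assumes the throughput of a single offline transcode divides evenly across concurrent real-time instances, which is optimistic, since per-instance overhead and the CPU-side HEVC decode also compete for resources.

```python
import math

def max_realtime_streams(measured_fps, target_fps, headroom=0.0):
    """Rough upper bound on simultaneous real-time streams.

    measured_fps: throughput of one offline (as-fast-as-possible) transcode.
    target_fps:   frame rate each live stream must sustain.
    headroom:     fraction of measured throughput held in reserve (0..1).
    Assumes throughput splits evenly across instances (optimistic).
    """
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_fps = measured_fps * (1.0 - headroom)
    return math.floor(usable_fps / target_fps)

print(max_realtime_streams(700, 30))        # 23
print(max_realtime_streams(700, 30, 0.35))  # 15
```

Reserving roughly a third of the measured throughput as headroom is what brings the estimate down from 23 to 15.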
I then spun up 15 instances of ffmpeg in my 'production' configuration (i.e., real-time HEVC/MPEG-2 TS/UDP in, real-time H.264/HLS out). Everything appears to be working, and my overall CPU utilization (spread across the 4 cores) is again about 25%.
What is interesting is that the load average, as reported by top, hits around 4 if I don't pin the ffmpeg instances to specific cores, or around 10 if I force all 15 ffmpeg instances to run on the same hyper-threaded core (e.g., via taskset -c 0,1).
Obviously, a load average of 10 is concerning, even though my CPU utilization is quite reasonable. I understand that load average is the number of processes sitting in a queue waiting on (I thought) CPU or I/O resources to become available. Does the load average on Linux also take into account a process waiting on GPU availability (e.g., QuickSync availability)? If so, does that indicate that QuickSync cannot keep up with my 15 real-time requests? Spot-checking the output, everything looks OK.
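For reference, on Linux the load average is an exponentially damped average of the number of tasks (threads) that are either runnable (state R) or in uninterruptible sleep (state D). Threads in interruptible sleep (state S), such as one inside usleep() or blocked on a futex, are not counted, whereas a thread blocked uninterruptibly inside a kernel driver call is counted even though it uses no CPU. The Python sketch below snapshots /proc to show how many threads are contributing at a given moment; the helper names are hypothetical, and it assumes the standard Linux layout with per-thread stat files under /proc/<pid>/task/<tid>/stat.

```python
import os

def parse_state(stat_line):
    """Return the one-letter state field from a /proc/.../stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so the state is located relative to the LAST ')'.
    """
    return stat_line[stat_line.rfind(")") + 2]

def count_load_contributors(proc_root="/proc"):
    """Snapshot how many threads are runnable (R) or in uninterruptible sleep (D).

    The R count includes the thread running this scan.
    """
    counts = {"R": 0, "D": 0}
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue  # skip non-process entries such as /proc/self
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    state = parse_state(f.read())
            except OSError:
                continue  # thread exited while we were scanning
            if state in counts:
                counts[state] += 1
    return counts
```

Calling count_load_contributors() a few times while the 15 instances run gives a rough picture: a consistently large D count points at threads blocked inside the kernel (for example in a driver), while a large R count points at CPU contention.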
So should I be worried? Is the high load average expected in this scenario? Will it impact other things? Does it suggest I'm 'over budget' on QuickSync resources on this machine? Is there another way to determine if QuickSync is able to maintain real-time encoding?
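One way to check whether each instance keeps up is to measure its actual output frame rate over time. ffmpeg prints periodic status lines on stderr containing a `frame=` counter (unless run with -nostats); sampling that counter against the wall clock gives the rate an instance is really achieving, and for a live 30f/s input a sustained rate below 30 means that instance is falling behind. The function names and the 5% tolerance in this Python sketch are assumptions, not part of ffmpeg.

```python
import re

FRAME_RE = re.compile(r"frame=\s*(\d+)")

def frame_rates(samples):
    """samples: iterable of (wall_clock_seconds, ffmpeg_status_line).

    Returns the frames/second achieved between consecutive samples that
    contain a frame= counter; other log lines are ignored.
    """
    points = []
    for t, line in samples:
        m = FRAME_RE.search(line)
        if m:
            points.append((t, int(m.group(1))))
    rates = []
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if t1 > t0:
            rates.append((f1 - f0) / (t1 - t0))
    return rates

def keeps_up(rates, target_fps, tolerance=0.05):
    """True if every measured interval stays within `tolerance` of target_fps."""
    return all(r >= target_fps * (1 - tolerance) for r in rates)

samples = [
    (0.0, "frame=  300 fps= 30 q=28.0 size=     512kB time=00:00:10.00"),
    (10.0, "frame=  600 fps= 30 q=28.0 size=    1024kB time=00:00:20.00"),
    (20.0, "frame=  870 fps= 29 q=28.0 size=    1490kB time=00:00:29.00"),
]
rates = frame_rates(samples)
print(rates, keeps_up(rates, 30))
```

ffmpeg terminates its status lines with a carriage return rather than a newline, so when reading its stderr from another process, split on '\r' before passing the lines in.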
Sorry it took us so long to respond. Thanks for the detailed information.
I believe what you mean by "'load average', as reported by top, hits around 4 (if I don't force the ffmpeg instances to run on a specific core)" is that you ran 1 ffmpeg instance in the first test and 15 ffmpeg instances in the second. In your pipeline, most of the load is on the CPU, from demuxing onward: demux the content to get the HEVC bitstream (on the cores) ---> decode (on the cores) ---> encode (on the GPU: Execution Units and fixed-function hardware) ---> HLS streaming.
A high load average is to be expected, since the pipeline decodes HEVC on the CPU cores. No, load average doesn't count processes waiting on GPU availability. I don't believe your GPU is over budget from the QuickSync encodes. A quick test to verify whether QuickSync can sustain real-time transcoding would be to run the complete transcode pipeline on the GPU (H.264-to-H.264 or MPEG-2-to-H.264), or to move everything to the CPU (HEVC to H.264), and see how many instances you can run in each case. This will give you a rough idea of where the bottleneck in your pipeline is. Also check what frequency your GPU is configured to run at, since this directly affects GPU performance.
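On Linux, the i915 kernel driver can expose the GPU's current and configured frequencies through sysfs, provided the kernel is recent enough to have these attributes. This Python sketch assumes the GPU is card0 and that the attribute names below exist; it simply skips any that are missing. For engine utilization rather than frequency, the intel_gpu_top tool from intel-gpu-tools is another option.

```python
import os

# i915 sysfs attribute names; availability depends on the kernel version.
FREQ_ATTRS = ("gt_cur_freq_mhz", "gt_min_freq_mhz", "gt_max_freq_mhz", "gt_RP0_freq_mhz")

def read_gpu_freqs(card_dir="/sys/class/drm/card0"):
    """Return {attribute: MHz} for each i915 frequency attribute present."""
    freqs = {}
    for name in FREQ_ATTRS:
        try:
            with open(os.path.join(card_dir, name)) as f:
                freqs[name] = int(f.read().strip())
        except (OSError, ValueError):
            pass  # attribute missing or unreadable on this kernel/driver
    return freqs
```

If gt_cur_freq_mhz stays well below gt_max_freq_mhz while all instances are running, the GPU is not being clocked up, which would limit encode throughput.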
Please correct me if I got the wrong idea about your pipeline. Hope this helps.
Thanks! Looking at the ffmpeg QuickSync code, it checks the QuickSync busy status, and if the device is busy, it sleeps for a very short period (about 1000 microseconds) and then tries again. When I run 15 ffmpeg instances simultaneously, they spend a lot of time in that sleep/retry loop. My thinking is that if we instead run 1 ffmpeg process (which handles multiple input/output threads), we could modify the qsv encoder code to better marshal the QuickSync resources, rather than having so many threads wait/retry/fail/wait/retry/fail, which contributes to the load average.
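The single-process idea can be sketched as an arbiter that caps the number of encode operations in flight and makes callers block, rather than poll, when that cap is reached. This is a minimal Python illustration, not the ffmpeg qsv code: `EncoderGate`, the `depth` limit and `fake_encode` are hypothetical stand-ins, and it assumes the application picks a fixed in-flight limit itself (for example, matching the encoder's async depth). A thread waiting on the semaphore sleeps until a slot is released instead of waking every millisecond to retry.

```python
import threading
import time

class EncoderGate:
    """Hypothetical arbiter allowing at most `depth` encode calls in flight."""

    def __init__(self, depth):
        self._slots = threading.Semaphore(depth)
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, for checking the cap

    def encode(self, encode_fn, frame):
        with self._slots:  # blocks (no polling) while all slots are taken
            with self._lock:
                self.in_flight += 1
                self.peak = max(self.peak, self.in_flight)
            try:
                return encode_fn(frame)
            finally:
                with self._lock:
                    self.in_flight -= 1

def fake_encode(frame):
    """Stand-in for the real QuickSync encode call."""
    time.sleep(0.001)
    return frame

gate = EncoderGate(depth=4)
results = []
results_lock = threading.Lock()

def stream_worker(stream_id, n_frames=20):
    for i in range(n_frames):
        out = gate.encode(fake_encode, (stream_id, i))
        with results_lock:
            results.append(out)

threads = [threading.Thread(target=stream_worker, args=(s,)) for s in range(15)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results), gate.peak)
```

Fifteen stream threads share one gate here, mirroring 15 inputs inside a single process; the cap, not a sleep interval, decides when a stream may submit its next frame.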
On a semi-related note, we are running on a third-gen Intel Core i7-3612QE CPU @ 2.10GHz. It appears this is not supported by the current Media Server Studio Linux SDK. What is the last version of the Linux QuickSync SDK to support the third-gen i7s? Is it still available for download somewhere?
Yes, when you get the device-busy warning (MFX_WRN_DEVICE_BUSY), it is advisable to sync or sleep for a few milliseconds. However, you should not run into the device-busy condition often; with proper surface allocation you should see it only rarely. For details on how the MSDK pipeline works, I would recommend checking this article: https://software.intel.com/en-us/articles/framework-for-developing-applications-using-media-sdk
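The "sync instead of sleep" option can be sketched like this: when a submit reports that the device is busy, wait on the oldest outstanding operation (which frees a slot) and then resubmit. It loosely mirrors the Media SDK pattern in which MFXVideoENCODE_EncodeFrameAsync may return MFX_WRN_DEVICE_BUSY and MFXVideoCORE_SyncOperation waits for a submitted task, but `FakeAsyncEncoder`, `DeviceBusy` and `encode_all` are hypothetical Python stand-ins, not SDK calls, and frames are assumed to be unique hashable values.

```python
from collections import deque

class DeviceBusy(Exception):
    """Hypothetical stand-in for a 'device busy' status from the encoder."""

class FakeAsyncEncoder:
    """Hypothetical async encoder that accepts at most `depth` pending tasks."""

    def __init__(self, depth):
        if depth < 1:
            raise ValueError("depth must be >= 1")
        self.depth = depth
        self.active = set()
        self.busy_events = 0

    def submit(self, frame):
        if len(self.active) >= self.depth:
            self.busy_events += 1
            raise DeviceBusy()
        self.active.add(frame)
        return frame  # in this sketch the frame doubles as its sync handle

    def sync(self, handle):
        # A real implementation would block until the task completes.
        self.active.discard(handle)
        return handle

def encode_all(encoder, frames):
    in_flight = deque()  # sync handles in submission order
    out = []
    for frame in frames:
        while True:
            try:
                in_flight.append(encoder.submit(frame))
                break
            except DeviceBusy:
                # Rather than sleeping and retrying, wait for the oldest
                # outstanding task; completing it frees a slot.
                out.append(encoder.sync(in_flight.popleft()))
    while in_flight:  # drain whatever is still pending
        out.append(encoder.sync(in_flight.popleft()))
    return out

encoder = FakeAsyncEncoder(depth=4)
print(encode_all(encoder, range(10)), encoder.busy_events)
```

Every busy report is answered by retiring real work, so the loop never spins; output also comes back in submission order.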
We currently support 4th- and 5th-generation processors on Linux. I believe the last release with 3rd-generation support was earlier this year, i.e., Media Server Studio 2015 R3, which is not available for public download since we no longer support the 3rd generation.
Hope it answers your query.
Unfortunately, it isn't feasible for us to simply upgrade our hardware (we supply equipment to the airborne market). We will be on a 3rd-gen i7 for the foreseeable future. Given that the hardware platform supports QuickSync, we would like to take advantage of that feature, understanding that there will be some limitations given the age of the platform.
Would it be possible to obtain a copy of the last SDK to support third-gen i7 under a support contract? Or privately? It seems like there must be a solution here for 3rd-gen i7 customers.