Skylake HW HEVC encoding too slow to keep up with live video, depending on the frame rate and target usage

AaronL

I've finally incorporated Skylake HW HEVC encoding and decoding into my software, and it is clear that Skylake HW HEVC encoding isn't fast enough to keep up with live video, depending on the frame rate and target usage.  For example, with 720p59.94, if I use Skylake HW HEVC encoding with a target usage of MFX_TARGETUSAGE_BEST_QUALITY, each SyncOperation() call takes about 17-21 ms to complete.  With a frame rate of 59.94 fps, the time gap between each frame is about 16.6833 ms, so 17-21 ms is too slow to keep up with live video in this case.  If I use HW H.264 encoding on the same system, each SyncOperation() call instead takes about 1.8-2.0 ms.  If I discard every other frame, for an effective frame rate of 29.97 fps, I can get by.  Skylake HW HEVC encoding is also not fast enough to keep up with live 1080i59.94 using the best quality target usage.
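
As a rough check of the arithmetic above, the following is a minimal sketch that derives the per-frame time budget from the exact frame rate (60000/1001) and compares it with the measured SyncOperation() durations. The function names are hypothetical, and the sketch assumes frames are encoded strictly one after another, so that the SyncOperation() duration is the whole per-frame cost.

```python
from fractions import Fraction


def frame_interval_ms(fps_num: int, fps_den: int) -> float:
    """Time between consecutive frames, in milliseconds."""
    return float(Fraction(fps_den, fps_num) * 1000)


def keeps_up(sync_ms: float, fps_num: int, fps_den: int) -> bool:
    """True if one frame's encode time fits inside the frame interval.

    Assumption: frames are encoded serially, with no pipelining.
    """
    return sync_ms <= frame_interval_ms(fps_num, fps_den)


budget_5994 = frame_interval_ms(60000, 1001)  # 720p59.94
budget_2997 = frame_interval_ms(30000, 1001)  # every other frame discarded

print(f"59.94 fps budget: {budget_5994:.4f} ms")
print(f"29.97 fps budget: {budget_2997:.4f} ms")
print("HEVC best quality, 17 ms, at 59.94 fps:", keeps_up(17.0, 60000, 1001))
print("HEVC best quality, 21 ms, at 29.97 fps:", keeps_up(21.0, 30000, 1001))
print("H.264, 2 ms, at 59.94 fps:", keeps_up(2.0, 60000, 1001))
```

Using a rational frame rate rather than 59.94 avoids rounding the budget; even the fastest measured best-quality frame (17 ms) exceeds it.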

However, it is possible to get this to work, if a different target usage is used.  Using data from Concurrency Viewer, I've determined the average times for the call to SyncOperation() for 720p video based on the target usage:

  • Best quality (1): 17-21 ms
  • 2:  16-22 ms
  • 3:  1.5-2.6 ms
  • Balanced (4):  1.5-2.4 ms
  • 5, 6:  between the times for 4 and 7
  • Best speed (7):  1.3-2.2 ms

I find it suspicious that the SyncOperation() durations for 1-2 and 3-7 are so similar.  Does each of the target usages result in different encoding settings for Skylake HW HEVC encoding, or are there really only 2 target usages that are implemented?
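
To make the "two groups" observation concrete, here is a minimal sketch that groups the target usages by the midpoints of the measured ranges. The grouping rule and its `ratio` threshold are arbitrary choices made for illustration, not anything the encoder or the SDK defines, and target usages 5 and 6 are left out because no numeric range was recorded for them.

```python
# Midpoints of the measured SyncOperation() ranges, in milliseconds.
midpoint_ms = {
    1: (17 + 21) / 2,
    2: (16 + 22) / 2,
    3: (1.5 + 2.6) / 2,
    4: (1.5 + 2.4) / 2,
    7: (1.3 + 2.2) / 2,
}


def group_by_timing(times, ratio=2.0):
    """Group keys whose time is within `ratio` times the fastest member
    of the current group. `ratio` is an illustrative threshold only."""
    groups = []
    for tu, ms in sorted(times.items(), key=lambda kv: kv[1]):
        if groups and ms <= groups[-1][0][1] * ratio:
            groups[-1].append((tu, ms))
        else:
            groups.append([(tu, ms)])
    # Return only the target usage numbers, sorted within each group.
    return [sorted(tu for tu, _ in group) for group in groups]


print(group_by_timing(midpoint_ms))
```

With these numbers the grouping yields one fast group (3, 4, 7) and one slow group (1, 2), roughly a factor of ten apart.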

Is there any expectation that the results for target usages 1-2 will improve over time?  Perhaps there are some settings I can change to improve these results when using target usages 1 or 2?

Also, here's information about the system that I used to get these results:

  • Processor:  i5-6600K
  • OS:  64-bit Windows 10
  • Intel graphics driver:  15.40.7.4300
  • Intel Media SDK:  6.0.0.388

I also experienced similar results (that is, not being able to keep up with live video with the best quality target usage) on a similar system with an i7-6700T processor.

5 Replies
Sravanthi_K_Intel

Hi Aaron, Thanks for the detailed evaluation - very helpful. I am going to perform some tests myself and get back to you. The best quality target usage is not recommended for performance testing in general. Our TU4 mode is a good mode for balanced speed and quality, as you are aware. May I ask whether you are comparing the TU performance more for educational purposes, or whether you are observing a significant quality improvement in TU1 mode that warrants using it instead of TU4 or the other modes?

(On a side note: since you are talking about live video, I wanted to make sure you are also aware of the VDEnc, or low-power, encoder for video conferencing, for a complete picture. You can find the detailed slides and the talk about it in the IDF 2015 Technical Sessions, titled "Enhancing 4K Media Experience in Power Optimized Intel® Graphics, Gen9".)

Sravanthi_K_Intel

Hi Aaron - Can you also let us know which sample and which bitrate mode you used for the experiment? Instead of ms, can you report the performance numbers in FPS so that we are on the same page? Thanks.

AaronL

Could you answer my question as to whether each of the different target usages is actually meaningful with Skylake HW HEVC?  As I mentioned, the SyncOperation() times for 1-2 and 3-7 are suspiciously similar, which makes me wonder if they really are different.  In addition, the times for 3-7 are basically the same as what I get when I do H.264 encoding, which is also suspicious.  It makes me wonder if target usages 3-7 are really doing HEVC--perhaps they are instead doing H.264 encoding that is being disguised as HEVC.  I've read that HEVC encoding is roughly 10x the complexity of H.264 encoding, and given that target usages 3-7 take the same amount of time as H.264 on best quality, I have to wonder if I'm really getting the benefits of HEVC when I use target usages 3-7.  The SyncOperation() times for target usages 1-2, however, are roughly 10x those for H.264.

These results were determined using my own software, but it should be a simple matter to see this with the sample_encode sample.  All you need is a 720p YV12 or NV12 input file, with the frame rate specified as 60 (it can't do 59.94, or rather 60000 / 1001, with the way sample_encode is currently set up).  There is already a way to configure the target usage, and it would be a simple matter to add support for target usages 2-3 and 5-6, in addition to 1, 4, and 7, which are already supported.  I don't have the means to report the FPS, but using my numbers, it is fairly easy to calculate it.  Also, my numbers are associated with using VBR and system memory.  I found that when video memory is used, it is usually even slower.  I also did some basic testing with CBR, and that doesn't change anything--that is, with CBR, the encoder isn't fast enough to keep up with the 59.94 fps frame rate, at least for target usages 1-2.  I also tried AVBR, but it appears that AVBR isn't currently supported with Skylake HW HEVC encoding, since after calling Query(), the rate control method is changed from AVBR to CBR.
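
For reference, the conversion from the reported millisecond ranges to FPS can be sketched as follows. This is only an approximation under a stated assumption: it treats each SyncOperation() duration as the full, non-overlapping cost of one frame, so it ignores any pipelining between frames. The function name `serial_fps` is made up for this illustration.

```python
def serial_fps(sync_ms: float) -> float:
    """Frames per second if every frame takes sync_ms and frames never overlap."""
    return 1000.0 / sync_ms


# SyncOperation() ranges from the original post: (fastest ms, slowest ms).
ranges_ms = {
    "TU1": (17.0, 21.0),
    "TU2": (16.0, 22.0),
    "TU3": (1.5, 2.6),
    "TU4": (1.5, 2.4),
    "TU7": (1.3, 2.2),
}

live_fps = 60000 / 1001  # 59.94 fps

for name, (fast_ms, slow_ms) in ranges_ms.items():
    # The slowest frame time gives the lowest rate, and vice versa.
    low, high = serial_fps(slow_ms), serial_fps(fast_ms)
    verdict = "keeps up" if low >= live_fps else "falls behind at times"
    print(f"{name}: {low:.0f}-{high:.0f} fps ({verdict})")
```

Under this assumption, target usage 1 works out to roughly 48-59 fps, which stays below the 59.94 fps needed for live video.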

I couldn't find the IDF 2015 presentation that you mentioned.  Can you provide a link to it?  However, I probably won't be that interested in VDEnc, since I suspect that you are talking about the new H.264 encoding mode available in Skylake that was mentioned in an Anandtech article.  I'm interested in HEVC because I can supposedly get the same quality as H.264 at half the bit rate.  Regular HW H.264 encoding and decoding are already plenty fast for live video, and this has been true since at least 3rd Generation Intel Core processors.

Sravanthi_K_Intel

Hi Aaron - Regarding your original question, "Could you answer my question as to whether each of the different target usages is actually meaningful with Skylake HW HEVC?", I do not have the complete answer yet and I am working with folks to understand your observation. I wanted to gather as much information as possible beforehand, including your experiment parameters and reproducibility on our end. One thing worth mentioning is that the HEVC HW encoder in SKL uses the VME module from H.264 to simplify the HEVC motion search. I am not confident enough to claim that this is affecting the TU performance you are observing - again, I have to consult and get back to you.

As you understand, the SKL HEVC HW is quite new and we are still ramping up on it (both the HW understanding and the SW), so please bear with us here.

Sravanthi_K_Intel

Hi Aaron - Based on some experiments and discussions, here is the info regarding target usages. "Does each of the target usages result in different encoding settings for Skylake HW HEVC encoding, or are there really only 2 target usages that are implemented?" - For the HEVC HW encoder, there are 3 target usage modes (TU1, TU4, and TU7); the other modes are mapped onto these. So your general observation that the target usages fall into a few buckets rather than 7 distinct ones is correct. (You found 2 buckets instead of 3 - see below.)

"Is there any expectation that the results for target usages 1-2 will improve over time?  Perhaps there are some settings I can change to improve these results when using target usages 1 or 2?" -->Yes, going forward, all target usages will see improvement in quality and performance. Regarding the performance of each of the target modes - frame encoding time depends on frame type, bitrate, number of reference frames and parallelization from the asynchronous mode. Our experiment numbers are not in line with yours - we used video memory instead of the system memory (so that's one reason). Nevertheless, here is the methodology we used for our performance numbers. Our numbers show distinct jump from TU1->TU4->TU7 in CQP/CBR modes for low delay as well as regular use-cases. In our experiments, we performed transcoding (h264->h265) instead of encoding to minimize file i/o overhead - sample_multi_transcode. For the low delay mode and regular mode, we used GopPicSize=256, asyncDepth=3. While the GopRefDist=1 and numRefFrame=1 (or 3) for low-delay mode and GopRefDist=3 and numRefFrame=2 (or 4) for the regular B-frame mode. 

Hope this helps.
