Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

H.264 encoder questions

Intel_C_Intel
Employee
1,089 Views
Hi there!

I spent half an hour typing a message, and after I pushed "Post" it just disappeared, so here is a shortened version.

I've looked into the H.264 encoder sample and found a lot of interesting primitives, such as 4:2:2, more than 8-bit depth coding, deprecated lossless coding, etc. I was hoping that basic primitives like the ones for intra prediction and motion estimation would be optimized so well that there would be nothing to worry about, but it seems I was wrong (at least for 5.2).

What I've noticed:
- There is no accelerated version of 8x8 intra prediction.
- The SAD functions (which, as you know, take a lot of encoding time) are badly organized: there are 16x16, 8x8 and 4x4, but nothing in between. Also, the functions have no specification of data alignment, so they either check it on every call or use inefficient unaligned reads.
- There is no way to perform efficient sub-pixel motion estimation, as the only tool for accessing sub-pixel locations is the full interpolation process, which recalculates nearly 90% of the half-pixel locations using the 6-tap filter (both vertical and horizontal).
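To make the SAD point concrete, here is a minimal plain-C sketch of a sum of absolute differences over an arbitrary width x height block, which covers the "in between" sizes (16x8, 8x16, 8x4, 4x8) as well. The name `sad_block_ref` and its parameters are hypothetical and are not part of IPP; this is an unoptimized reference, not a substitute for a SIMD routine.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference routine (not an IPP function): SAD between two
   width x height blocks of 8-bit samples, each with its own row stride. */
static unsigned sad_block_ref(const uint8_t *cur, int cur_stride,
                              const uint8_t *ref, int ref_stride,
                              int width, int height)
{
    unsigned sum = 0;

    /* assert() disappears under NDEBUG, so release builds pay nothing. */
    assert(width > 0 && height > 0);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x)
            sum += (unsigned)abs((int)cur[x] - (int)ref[x]);
        cur += cur_stride;   /* advance both blocks by one row */
        ref += ref_stride;
    }
    return sum;
}
```

A routine like this is mainly useful as a correctness baseline against which size-specific, alignment-aware versions can be checked.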

There are several deeper technical issues regarding the performance and quality of the H.264 encoder primitives, but I'm leaving them for further discussion.

I'd be glad to hear your opinion and to know whether there are plans to work these issues out, and when to expect them.

With regards,
ru
13 Replies
Intel_C_Intel
Employee
Hi Nikolay,

First of all, thanks for the reply!

1. Yes, I do mean that. The interface of the intra prediction functions is rather rigid: it requires the adjacent pixels (from the left and top macroblocks) to be located as they appear in the picture. So we either have to pass a pointer into the picture to these functions (which means we can't run the deblocking filter on the fly) or pass a pointer to a different buffer with the same layout of adjacent pixels (in which case it is easier to pass pointers to the adjacent pixels directly, so that the left pixels can be read using SIMD instructions). So I'd prefer the functions to look like this:
h264_intra_predict_XXX(unsigned char * pred_buffer, unsigned char * left_buffer, unsigned char * top_buffer, int mode, unsigned int flags)
The upper-left pixel can be either in left_buffer[-1] or in top_buffer[-1], whichever is more convenient.
Also, the intra prediction functions support 8-bit depth only.
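As an illustration of the interface proposed above, here is a minimal sketch of 4x4 luma intra prediction that reads its neighbours from separate left and top buffers. The function name, the contiguous 4x4 output layout, and the omission of the `flags` argument (both neighbours are assumed available) are assumptions made for the example; only the vertical, horizontal and DC modes are covered, using the mode numbers 0, 1 and 2 that H.264 assigns to them.

```c
#include <assert.h>
#include <stdint.h>

enum { PRED4x4_VERTICAL = 0, PRED4x4_HORIZONTAL = 1, PRED4x4_DC = 2 };

/* Hypothetical function, not part of IPP.
   pred: 16 bytes, a contiguous 4x4 block in row-major order.
   left: the 4 reconstructed pixels of the column left of the block.
   top:  the 4 reconstructed pixels of the row above the block.
   Returns 0 on success, -1 for a mode this sketch does not handle. */
static int h264_intra_predict_4x4_sketch(uint8_t *pred, const uint8_t *left,
                                         const uint8_t *top, int mode)
{
    assert(pred && left && top);

    switch (mode) {
    case PRED4x4_VERTICAL:            /* copy the top row downwards */
        for (int y = 0; y < 4; ++y)
            for (int x = 0; x < 4; ++x)
                pred[y * 4 + x] = top[x];
        return 0;
    case PRED4x4_HORIZONTAL:          /* copy the left column rightwards */
        for (int y = 0; y < 4; ++y)
            for (int x = 0; x < 4; ++x)
                pred[y * 4 + x] = left[y];
        return 0;
    case PRED4x4_DC: {                /* rounded mean of the 8 neighbours */
        int sum = 4;
        for (int i = 0; i < 4; ++i)
            sum += top[i] + left[i];
        for (int i = 0; i < 16; ++i)
            pred[i] = (uint8_t)(sum >> 3);
        return 0;
    }
    default:
        return -1;
    }
}
```

Because `left` is a packed array rather than a strided column inside the picture, a SIMD version could load it with a single read, which is the benefit argued for above.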

2. a. For example, using two 4x4 SADs to compute a single 8x4 SAD will be significantly slower, since MMX operates on 8 bytes at once. The same goes for using two 8x8 SADs for a 16x8 SAD with SSE2. There is also the additional call, plus the same alignment and parameter checks all over again. I really see no point in performing any error checks on parameters: IPP (especially the video/audio coding part of it) is a very low-level SDK used only by people who really know what they're doing, and SADs and some other related routines are so small that no one will ever check the error codes they return. Even if someone does, it will lead to even more overhead (conditional jumps etc.).
b. All the motion estimation/compensation functions differ only in block size, so it would be very convenient to have a structure of SAD/MC routines and to pass a pointer to the SAD or MC function as a parameter for motion estimation or motion compensation of a macroblock. This would save 80% of identical code for MC and would reduce the motion estimation code. (Of course it is possible to write a wrapper around the SADs for all block sizes, but this adds another function call on top of all the alignment and error checks.)
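The idea in 2b can be sketched as a table of SAD routines indexed by block shape, with the search written once and handed a function pointer. Everything here (`sad_fn`, `sad_table`, `full_search`, the `BLK_*` names) is hypothetical and only illustrates the structure; the per-size bodies are plain C where a real library would supply SIMD versions.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

typedef unsigned (*sad_fn)(const uint8_t *cur, int cur_stride,
                           const uint8_t *ref, int ref_stride);

/* Generates one fixed-size SAD routine per block shape. */
#define DEFINE_SAD(W, H)                                                    \
    static unsigned sad_##W##x##H(const uint8_t *cur, int cur_stride,       \
                                  const uint8_t *ref, int ref_stride)       \
    {                                                                       \
        unsigned sum = 0;                                                   \
        for (int y = 0; y < (H); ++y, cur += cur_stride, ref += ref_stride) \
            for (int x = 0; x < (W); ++x)                                   \
                sum += (unsigned)abs((int)cur[x] - (int)ref[x]);            \
        return sum;                                                         \
    }

DEFINE_SAD(16, 16)
DEFINE_SAD(16, 8)
DEFINE_SAD(8, 16)
DEFINE_SAD(8, 8)
DEFINE_SAD(8, 4)
DEFINE_SAD(4, 8)
DEFINE_SAD(4, 4)

enum block_shape { BLK_16x16, BLK_16x8, BLK_8x16, BLK_8x8,
                   BLK_8x4, BLK_4x8, BLK_4x4, BLK_COUNT };

/* Same order as enum block_shape. */
static const sad_fn sad_table[BLK_COUNT] = {
    sad_16x16, sad_16x8, sad_8x16, sad_8x8, sad_8x4, sad_4x8, sad_4x4
};

/* Exhaustive integer-pel search in a +/-range window, written once for all
   block shapes. The caller guarantees the window lies inside the reference. */
static unsigned full_search(sad_fn sad, const uint8_t *cur, int cur_stride,
                            const uint8_t *ref, int ref_stride, int range,
                            int *best_dx, int *best_dy)
{
    unsigned best = UINT_MAX;
    assert(sad != NULL && range >= 0);
    for (int dy = -range; dy <= range; ++dy)
        for (int dx = -range; dx <= range; ++dx) {
            unsigned cost = sad(cur, cur_stride,
                                ref + dy * ref_stride + dx, ref_stride);
            if (cost < best) { best = cost; *best_dx = dx; *best_dy = dy; }
        }
    return best;
}
```

A caller would pick the routine once per partition, for example `full_search(sad_table[BLK_16x8], ...)`, so the search code is not duplicated per block size.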

3. There are (at least) two possibilities.
- Build half-pixel planes for each picture in the DPB. This will work fine for the frame and field cases without explicit weighted prediction, and will of course require two different buffers for MBAFF. It is pretty fast, but will require roughly 3x (6x for MBAFF) the memory the encoder uses now.
- Build a half-pixel buffer around the point where the encoder wants to search: 2 half-pixels to the left/top/right/bottom of the given position. This will work fine in all configurations without additional memory requirements, but will be a bit slower than the first option, due to recalculation of the corner positions and recalculation of half-pixels for different block sizes.
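For reference, both options rest on the H.264 luma half-sample filter with taps (1, -5, 20, 20, -5, 1), rounding and a divide by 32. The sketch below shows the horizontal and vertical half-pel samples and a helper that fills a small buffer of horizontal half-pels for a block; the function names are hypothetical, and the centre ("diagonal") position, which needs unrounded intermediate values, is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* 6-tap filter (1, -5, 20, 20, -5, 1) with +16 rounding and a shift by 5. */
static uint8_t tap6_round(int a, int b, int c, int d, int e, int f)
{
    int v = a - 5 * b + 20 * c + 20 * d - 5 * e + f + 16;
    if (v < 0)
        return 0;               /* clip before shifting a negative value */
    return clip_u8(v >> 5);
}

/* Half-pel sample between p[0] and p[1]; p[-2] .. p[3] must be readable. */
static uint8_t halfpel_h(const uint8_t *p)
{
    return tap6_round(p[-2], p[-1], p[0], p[1], p[2], p[3]);
}

/* Half-pel sample between p[0] and p[stride]; rows -2 .. 3 must be readable. */
static uint8_t halfpel_v(const uint8_t *p, int stride)
{
    return tap6_round(p[-2 * stride], p[-stride], p[0],
                      p[stride], p[2 * stride], p[3 * stride]);
}

/* Hypothetical helper: fills a w x h buffer with the horizontal half-pels
   of the block starting at src (second option above, horizontal part only). */
static void halfpel_h_block(const uint8_t *src, int src_stride,
                            uint8_t *dst, int dst_stride, int w, int h)
{
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            dst[y * dst_stride + x] = halfpel_h(src + y * src_stride + x);
}
```

With the first option, helpers like these would run once per reference picture; with the second, they would run per search position over a small window.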

Thanks in advance for your attention to the subject,
ru.

P.S. We can talk on Skype if you'd like; just drop me an email.
registered_user
Beginner
I'm just curious: is anybody at Intel interested in improving encoder performance or not?

ru.
Vladimir_Dudnik
Employee

Hello,

If you look carefully at the evolution of the UMC codecs, you may notice that we try to improve them with each new version. So of course Intel is interested in improving H.264 codec quality and performance, and we are working on it. Sorry if we did not solve every issue immediately.

Regards,
Vladimir

registered_user
Beginner
With all due respect, Vladimir, it's not about solving every issue immediately. It's about at least paying attention to the problem: I offered to help your developers by pointing out 3 major issues, and nobody answered for a week.
Believe me, what I really want is to get our product to market faster and to be competitive with other commercial products. I believe that pointing out problems in the design might help you improve IPP, which in turn might help developers using IPP improve their products.

What I'd like to hear is this: do I need to code all of this in assembler myself to make the encoder comparable to good implementations, or are improvements in these directions expected in IPP in the near future?

Thank you.
Vladimir_Dudnik
Employee

Actually, your points are currently under active discussion within our team of H.264 experts. We have to take many different factors into account during design, so it is simply impossible to implement every outside suggestion immediately. We need to find a way to satisfy many contradictory conditions. In the end, our experts will come back with at least comments on your points. And of course we continue to invest in the development of the H.264 codec (considering many different directions, from a complete software implementation optimized for future generations of Intel processors up to possible hardware assistance from either an Intel chipset or a GPU).

Regards,
Vladimir

registered_user
Beginner
Thanks. If I may be of any help, please do not hesitate to contact me.
Vladimir_Dudnik
Employee

Sure, thank you.

Vladimir

andrewk88
Beginner

Hi All,

In my view, it would be beneficial to stay focused on the design and efficient implementation of low-level and mid-level sets of H.264 "encoding primitives" representing the major processing blocks required by an encoder. Examples of the low-level blocks discussed here would be SADs, interpolations, transforms, etc. Ideally, depending on the requirements of the encoding application/system (e.g. encoding speed (real-time?) vs. quality, spatial frame resolution vs. frame rate, number of reference frames, CPU speed vs. amount of memory available, etc.), I can imagine you could build any encoding system you need by appropriate composition and configuration of "encoding primitives". That is more or less what is happening already, which is why I'd wish for a growing set of flexible/configurable "encoding primitives" to be provided. Finally, multiple code samples should show how to compose and configure each "typical category" of encoding application (e.g. real-time performance vs. "high quality" file encoding, encoding on an unconstrained system like a PC vs. a portable device).

Best regards,

Andrew

evgenyka
Beginner

Hi Vladimir,

I've finished a comparative evaluation of the H.264 encoder compiled with MSVS 2005 using IPP V5.2 and V5.3.

The following summarizes the test results (baseline profile, one reference frame, SA=16, Intel Core 2 Duo 1.9 GHz):

Single thread (foreman_qcif_30Hz.yuv, 300 frames):

IPP 5.2: 465 FPS (up to P8x8, without any dependency on the ME method)

IPP 5.3: 250 FPS (up to P8x8, EPZS ME algorithm)

Single thread (foreman_qcif_30Hz.yuv, 300 frames):

IPP 5.2: 30 FPS (up to P4x4, without any dependency on the ME method)

IPP 5.3: 220 FPS (up to P4x4, EPZS ME algorithm)

Two threads (foreman_qcif_30Hz.yuv, 300 frames):

IPP 5.2: not supported

IPP 5.3: not supported

I'm quite surprised to see that the performance of IPP 5.3 is much worse than that of IPP 5.2.

Other minor issues:

  • Encoder performance depends dramatically on the output bitrate (lowering the bitrate greatly increases H.264 IPP performance).

  • The new release still uses only a single virtual CPU (one thread); multiple slices, t -2 etc. do not change anything.

So, what should I do to improve the performance of V 5.3?

Best Regards,

Evgeny

V 5.3

Fixed QP=28,

up to P8x8

Starting H264 encoding foreman_qcif_30Hz.yuv to file test.264

Source video width = 176, height = 144, frameRate = 30.00

Max frames to encode = 300

Encoding bit rate = 100000 bits per second

0.10.20.30.40.50.60.70.80.90.100.110.120.130.140.150.160.170.180.190.200.210.220.230.240.250.260.270.280.

290.300.

Summary:

Num frames encoded = 300

Encoding Time = 1.30 sec, 230.40 fps

Overall Time = 1.68 sec, 178.32 fps

Average CPU usage = 49.53%

Encoded Size = 180998 bytes

Compression Ratio = 63.01

EncodedSize/ExpectedSize = 1.45

up to P4x4

Starting H264 encoding foreman_qcif_30Hz.yuv to file test.264

Source video width = 176, height = 144, frameRate = 30.00

Max frames to encode = 300

Encoding bit rate = 100000 bits per second

0.10.20.30.40.50.60.70.80.90.100.110.120.130.140.150.160.170.180.190.200.210.220.230.240.250.260.270.280.

290.300.

Summary:

Num frames encoded = 300

Encoding Time = 1.56 sec, 191.90 fps

Overall Time = 1.91 sec, 156.77 fps

Average CPU usage = 50.00%

Encoded Size = 177089 bytes

Compression Ratio = 64.40

EncodedSize/ExpectedSize = 1.42

Test with V 5.2

up to P8x8

1. fixed QP=30 , Inter partitions: 16x16, 8x16, 16x8 and 8x8.

D:\IPP_test>h264enc_ipp.exe encoder_P8x8_QP30.par test.264
Option 'encoder_P8x8_QP30.par'
Option 'test.264'

Starting H264 encoding foreman_qcif_30Hz.yuv to file test.264
Source video width = 176, height = 144, frameRate = 30.00
Max frames to encode = 300
Encoding bit rate = 100000 bits per second
0.10.20.30.40.50.60.70.80.90.100.110.120.130.140.150.160.170.180.190.200.210.220.230.240.250.260.270.280.
290.300.
Summary:
Num frames encoded = 300
Encoding Time = 0.56 sec, 533.90 fps
OverallTime = 0.68 sec, 441.84 fps
Average CPU usage = 50.00%
Encoded Size = 188934 bytes
Compression Ratio = 60.36
EncodedSize/ExpectedSize = 1.51

up to P4x4
2. fixed QP=30 , Inter partitions: 16x16, 8x16, 16x8, 8x8, 8x4,4x8,4x4

D:\IPP_test>h264enc_ipp.exe encoder_P4x4_QP30.par test.264
Option 'encoder_P4x4_QP30.par'
Option 'test.264'

Starting H264 encoding foreman_qcif_30Hz.yuv to file test.264
Source video width = 176, height = 144, frameRate = 30.00
Max frames to encode = 300
Encoding bit rate = 100000 bits per second
0.10.20.30.40.50.60.70.80.90.100.110.120.130.140.150.160.170.180.190.200.210.220.230.240.250.260.270.280.
290.300.
Summary:
Num frames encoded = 300
Encoding Time = 12.74 sec, 23.55 fps
OverallTime = 12.88 sec, 23.30 fps
Average CPU usage = 49.94%
Encoded Size = 173237 bytes
Compression Ratio = 65.83
EncodedSize/ExpectedSize = 1.39


3. Bitrate =100 Kbitps , Inter partitions: 16x16, 8x16, 16x8 and 8x8.

D:\IPP_test>h264enc_ipp.exe encoder_P8x8_B100Kbitps.par test.264
Option 'encoder_P8x8_B100Kbitps.par'
Option 'test.264'

Starting H264 encoding foreman_qcif_30Hz.yuv to file test.264
Source video width = 176, height = 144, frameRate = 30.00
Max frames to encode = 300
Encoding bit rate = 100000 bits per second
0.10.20.30.40.50.60.70.80.90.100.110.120.130.140.150.160.170.180.190.200.210.220.230.240.250.260.270.280.
290.300.
Summary:
Num frames encoded = 300
Encoding Time = 0.51 sec, 585.45 fps
OverallTime = 0.65 sec, 464.67 fps
Average CPU usage = 50.00%
Encoded Size = 126833 bytes
Compression Ratio = 89.92
EncodedSize/ExpectedSize = 1.01

4. Bitrate =100 Kbitps , Inter partitions: 16x16, 8x16, 16x8, 8x8,
8x4,4x8, and 4x4.

D:\IPP_test>h264enc_ipp.exe encoder_P4x4_B100Kbitps.par test.264
Option 'encoder_P4x4_B100Kbitps.par'
Option 'test.264'

Starting H264 encoding foreman_qcif_30Hz.yuv to file test.264
Source video width = 176, height = 144, frameRate = 30.00
Max frames to encode = 300
Encoding bit rate = 100000 bits per second
0.10.20.30.40.50.60.70.80.90.100.110.120.130.140.150.160.170.180.190.200.210.220.230.240.250.260.270.280.
290.300.
Summary:
Num frames encoded = 300
Encoding Time = 11.11 sec, 27.01 fps
OverallTime = 11.24 sec, 26.70 fps
Average CPU usage = 50.00%
Encoded Size = 126881 bytes
Compression Ratio = 89.89
EncodedSize/ExpectedSize = 1.02
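
The "Compression Ratio" and "EncodedSize/ExpectedSize" figures in these logs can be reproduced with simple arithmetic, assuming the source is 8-bit 4:2:0 (1.5 bytes per pixel) and that the expected size is the target bit rate times the clip duration, divided by 8. These formulas match the printed numbers, which suggests, but does not confirm, that the sample computes them this way. The function names below are hypothetical.

```c
#include <assert.h>

/* Size in bytes of an uncompressed 8-bit 4:2:0 sequence (assumed format). */
static double raw_yuv420_bytes(int width, int height, int frames)
{
    assert(width > 0 && height > 0 && frames > 0);
    return (double)width * height * 1.5 * frames;
}

/* Raw size divided by encoded size. */
static double compression_ratio(int width, int height, int frames,
                                double encoded_bytes)
{
    return raw_yuv420_bytes(width, height, frames) / encoded_bytes;
}

/* Encoded size relative to what the target bit rate would produce. */
static double size_vs_target(double encoded_bytes, double bitrate_bps,
                             int frames, double fps)
{
    double expected_bytes = bitrate_bps * (frames / fps) / 8.0;
    return encoded_bytes / expected_bytes;
}
```

For the first V 5.3 run, 176 x 144 x 1.5 x 300 = 11,404,800 raw bytes and 100000 x 10 / 8 = 125,000 expected bytes, so 180998 encoded bytes give a ratio of about 63.01 and a size factor of about 1.45, as printed.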

andrewk88
Beginner

Hi Nikolay,

The performance for 8x8 partitioning became worse in IPP 5.3, but quality has increased significantly.

Also, you can see that for 4x4 partitioning the speed is much better than in 5.2, and of course quality is better.

That's an interesting and positive change. About a year ago I submitted this problem as an "issue" to Intel support. Last year we experimented a lot with different encoders (including our own) and learnt that for moderate bitrate ranges (<= 2 Mbps at D1 resolution) you rarely get more than marginal improvements in quality and bitrate from enabling subpartitioning (either 8x8 or 4x4) while employing a simplified mode decision scheme of the form J = D + lambda*R.

It would be interesting if you could provide more details about your experiments.
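
A minimal sketch of a mode decision of the form J = D + lambda*R follows: each candidate partitioning carries a distortion D and a rate R in bits, and the candidate with the smallest cost wins. The struct, the function name and the numbers in the usage note are made up for illustration and do not come from any encoder.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate partitioning with its measured distortion and bit cost. */
typedef struct {
    const char *name;       /* e.g. "16x16", "8x8", "4x4" */
    unsigned distortion;    /* D, e.g. SAD or SSD of the residual */
    unsigned rate_bits;     /* R, bits for motion vectors, modes, etc. */
} mode_candidate;

/* Returns the index of the candidate minimising J = D + lambda * R. */
static int pick_mode(const mode_candidate *cand, int count, double lambda)
{
    int best_idx = 0;
    double best_cost;

    assert(cand != NULL && count > 0);
    best_cost = cand[0].distortion + lambda * cand[0].rate_bits;

    for (int i = 1; i < count; ++i) {
        double cost = cand[i].distortion + lambda * cand[i].rate_bits;
        if (cost < best_cost) {
            best_cost = cost;
            best_idx = i;
        }
    }
    return best_idx;
}
```

With made-up candidates (D, R) of (1000, 20) for 16x16, (900, 60) for 8x8 and (850, 150) for 4x4, lambda = 5 selects 16x16, lambda = 1 selects 8x8, and lambda = 0.1 selects 4x4. A large lambda, as used at lower bitrates, penalises the extra bits of subpartitions, which is in line with the marginal gains described above.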

Best regards,

Andrew

PS. I don't know the latest code changes

evgenyka
Beginner

Hello Nikolay,

Thank you for your answer!

However, I still have one more question:

What should I do (i.e. in MSVS 2005 or in the C/C++ code) to force the H.264 encoder to use multiple CPUs?

Best regards,

Evgeny

evgenyka
Beginner

Hi Nikolay,

Thank you for your help, it works now.

I still have two questions:

1. When OpenMP is selected I start to receive many warnings as follows:

1>c1xx : warning C4005: '_OPENMP' : macro redefinition

1> command-line arguments: see previous definition of '_OPENMP'

Do you know how to get rid of it?

2. Is there a way to use multiple threads with a single slice?

Thank you in advance.

Evgeny

llx
Beginner

According to readme.htm, the only valid profile values are 77 (main) and 100 (high). When I tried to put 66, the output was 100. How do I configure the baseline profile? Thanks.
