IPP H264 performance

peter2 · ‎06-08-2004

Hi!

I've evaluated the H264 routines from the IPP 4.0 trial on a PC and found that performance doesn't improve over my own C code. Since I used the merged libs and called the w7_ and a6_ routines directly, as well as the px_ ones, this basically means that all three contain equivalent code.
So, my questions:
1. I'm sure that optimized H264 routines will be available very soon for all platforms. Could you give me any hint when?
2. Will these routines be available for regular XScale and WMMX?
3. Is there any way to participate in pre-release code testing?

Thanks in advance!

Peter

Ying_S_Intel · ‎06-15-2004

Hi, Peter,

To get general H.264 performance data, you can run the Intel IPP performance benchmark tool "perfsys", located in the directory ipp40\tools\perfsys. Run ps_ippvc.exe to get the H.264 performance data on your target system.

We will consider H.264 support for Intel XScale and WMMX in future releases as well. You may periodically check our web site at http://www.intel.com/software/products/ipp for updates.

If you are interested in participating in the pre-release testing, please submit a request under Intel IPP products via Intel Premier Support.

Thanks,
Ying S
Intel Corp.

peter2 · ‎06-15-2004

Thanks! I'll put in a request.

marc_ba · ‎06-15-2004

Hello,

If you do run this test, it would be great if you could share the results here when you have time.

Thanks a lot

Marc

peter2 · ‎06-15-2004

I've run the tests. Nothing new. See my results in the attachment. You can find similar results in tools/perfsys/data. For instance, compare the worst-case horizontal quarter-pixel interpolation for luma with the regular 8x8 IDCT.

ps_ippvcpx.csv:
CPU,Intel Pentium 4 Processor HT 2x3192 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,35,px,0.719
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,60,e,4.84
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,54,e,4.34
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,41,e,3.32
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,55,e,4.45
-----------------------

ps_ippvca6.csv:
CPU,Intel Pentium 4 Processor HT 2x3192 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,11,px,0.236
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,57,e,4.58
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,57,e,4.58
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,3.45
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,53,e,4.25
-----------------------

ps_ippvcw7.csv:
CPU,Intel Pentium 4 Processor HT 2x3192 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=32,10,px,0.214
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,61,e,4.89
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,51,e,4.12
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,3.43
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,52,e,4.25
-----------------------

ps_ippvct7.csv:
CPU,Intel Pentium 4 Processor HT 1x2128 MHz, L1=8/12K, L2=1024K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=32,12,px,0.365
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,25,e,3.03
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,41,e,5.03
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,5.08
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,41,e,5.03
-----------------------

my box:
CPU,Intel Pentium 4 Processor HT 2x2594 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,10,px,0.256
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,61,e,6.04
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,51,e,5.13
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,4.24
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,52,e,5.2
-----------------------

To my mind, these results are unambiguous.
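
To put numbers on that, here is a minimal sketch that divides the px figures by the w7 figures for the rows quoted above. It assumes that the last CSV column is a time-like measure where lower is better (the thread does not define the columns), and the helper names `parse` and `ratios` are made up for this illustration.

```python
# Rows copied verbatim from the px and w7 result files quoted above.
PX = [
    "ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,35,px,0.719",
    "ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,60,e,4.84",
    "ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,54,e,4.34",
    "ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,41,e,3.32",
    "ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,55,e,4.45",
]
W7 = [
    "ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=32,10,px,0.214",
    "ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,61,e,4.89",
    "ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,51,e,4.12",
    "ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,3.43",
    "ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,52,e,4.25",
]


def parse(rows):
    """Map 'function + parameters' (first 9 fields) to the last column.

    Assumption: the last column is a cost figure (lower is better).
    """
    table = {}
    for row in rows:
        fields = row.split(",")
        table[",".join(fields[:9])] = float(fields[-1])
    return table


def ratios(baseline_rows, optimized_rows):
    """Return baseline / optimized for every row present in both files."""
    base, opt = parse(baseline_rows), parse(optimized_rows)
    return {key: base[key] / opt[key] for key in base if key in opt}


px_over_w7 = ratios(PX, W7)
for key, ratio in px_over_w7.items():
    print(f"{ratio:5.2f}x  {key}")
```

Under that reading, the IDCT row comes out more than three times better in the w7 file than in the px file, while the four luma interpolation rows stay within a few percent of 1.0x.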

Vladimir_Dudnik · ‎06-15-2004

You can see that the performance of the ippiDCT function improves on the latest architectures. That is because this function was tightly optimized by hand at the assembly level. Yes, you are right, ippiInterpolateLuma_H264 does not show a performance gain. That is because this function was initially optimized in C code only; we are now working on optimizing it at the assembly level. You will see improved performance in the next version of the libraries.

Regards,
Vladimir

peter2 · ‎10-12-2004

I've tried the 4.1 beta and I can confirm that luma and chroma interpolation were MMX Ext and SSE2 optimized, as was deblocking. Dequant was not MMX enhanced. I haven't tried the 4.1 release for x86 yet, but the 4.1 release for XScale doesn't differ much from the 4.1 beta for XScale. At least I haven't noticed any differences in the H264 part.

Vladimir_Dudnik · ‎10-12-2004

Hi,

You are right, this function has 15 different branches inside. Each branch has its own special conditions and was optimized separately. So we are still working on some of the branches, and we hope to improve this function in the future.

Regards,

Vladimir
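
For readers wondering why the fractional positions need separate code paths, here is a minimal plain-Python sketch of the horizontal cases of H.264 luma interpolation as the standard defines them: a 6-tap filter (1, -5, 20, 20, -5, 1) for the half sample, and a rounded average with the nearest integer sample for the quarter samples. This is not Intel's implementation. The function names, the edge replication at row boundaries, and the reading of the benchmark's "3,0" to "3,3" columns as (dx, dy) are all assumptions of this sketch, and the thread does not say which cases the 15 branches cover.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264 luma half-sample filter coefficients


def clip_u8(value):
    """Clamp to the 8-bit sample range."""
    return max(0, min(255, value))


def sample(row, x):
    """Read row[x], replicating edge pixels (an assumption of this sketch)."""
    return row[max(0, min(len(row) - 1, x))]


def half_sample(row, x):
    """Half sample between row[x] and row[x + 1] via the 6-tap filter."""
    acc = sum(t * sample(row, x + k) for t, k in zip(TAPS, range(-2, 4)))
    return clip_u8((acc + 16) >> 5)


def interpolate_h(row, x, dx):
    """Horizontal luma interpolation at integer position x plus dx/4."""
    if dx not in (0, 1, 2, 3):
        raise ValueError("dx must be 0..3 (quarter-sample units)")
    if dx == 0:
        return sample(row, x)      # integer position: plain copy
    b = half_sample(row, x)
    if dx == 2:
        return b                   # half-sample position
    # Quarter samples average the half sample with the nearest integer sample.
    nearest = sample(row, x) if dx == 1 else sample(row, x + 1)
    return (nearest + b + 1) >> 1


edge = [0, 0, 0, 0, 255, 255, 255, 255]
print([interpolate_h(edge, 3, dx) for dx in range(4)])  # [0, 64, 128, 192]
```

Even in this one-dimensional form, each value of dx does a different amount of work: a copy, a filter, or a filter plus an average. Positions with a vertical offset as well add further cases, which fits the picture of many separately optimized branches.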