Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

IPP H264 performance

peter2
Beginner
463 Views
Hi!

I've evaluated the H264 routines from the IPP 4.0 trial on a PC and found that performance doesn't improve compared to my C code. Since I used the merged libs and called the w7_ and a6_ routines directly, as well as px_, this basically means that all three contain equivalent code.
So, my questions:
1. I'm sure that optimized H264 routines will be available very soon for all platforms. Could you give me any hint when?
2. Will these routines be available for regular XScale and WMMX?
3. Is there any way to participate in pre-release code testing?

Thanks in advance!

Peter
7 Replies
Ying_S_Intel
Employee
Hi, Peter,

To get general H.264 performance figures, you can run the Intel IPP performance benchmark tool "perfsys", located in the directory ipp40\tools\perfsys. Run ps_ippvc.exe to collect the H.264 performance data on your target system.

We will consider H.264 support for Intel XScale and WMMX in future releases as well. You may periodically check our web site at http://www.intel.com/software/products/ipp for updates.

If you are interested in participating in the pre-release test, please submit a request under Intel IPP products via Intel Premier Support.

Thanks,
Ying S
Intel Corp.
peter2
Beginner
Thanks! I'll put in a request.
marc_ba
Beginner

Hello,

Should you try to run this test, it would be great to share the results here if you have time ...

Thanks a lot

Marc

peter2
Beginner
I've run the tests. Nothing new. See my results in the attachment. You can find similar results in tools/perfsys/data. For instance, compare the worst-case horizontal quarter-pixel interpolation for luma with the regular 8x8 IDCT.

ps_ippvcpx.csv:
CPU,Intel Pentium 4 Processor HT 2x3192 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,35,px,0.719
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,60,e,4.84
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,54,e,4.34
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,41,e,3.32
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,55,e,4.45
-----------------------

ps_ippvca6.csv:
CPU,Intel Pentium 4 Processor HT 2x3192 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,11,px,0.236
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,57,e,4.58
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,57,e,4.58
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,3.45
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,53,e,4.25
-----------------------

ps_ippvcw7.csv:
CPU,Intel Pentium 4 Processor HT 2x3192 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=32,10,px,0.214
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,61,e,4.89
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,51,e,4.12
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,3.43
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,52,e,4.25
-----------------------

ps_ippvct7.csv:
CPU,Intel Pentium 4 Processor HT 1x2128 MHz, L1=8/12K, L2=1024K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=32,12,px,0.365
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,25,e,3.03
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,41,e,5.03
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,5.08
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,41,e,5.03
-----------------------

my box:
CPU,Intel Pentium 4 Processor HT 2x2594 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,10,px,0.256
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,61,e,6.04
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,51,e,5.13
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,4.24
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,52,e,5.2
-----------------------

To my mind, these results are unambiguous.
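
For anyone who wants to put numbers on that comparison, here is a minimal Python sketch that takes the px, a6 and w7 rows quoted above and computes the ratio of the last field between builds. It assumes the last field of each row is a time-like figure where lower is better; the thread does not state its unit. The names `RESULTS`, `parse` and `ratios` are made up for this illustration and are not part of perfsys.

```python
import csv
from io import StringIO

# Rows copied from the perfsys excerpts quoted above (3192 MHz machine).
# Assumption: the last field is a time-like figure where lower is better;
# the thread does not state its unit.
RESULTS = {
    "px": """\
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,35,px,0.719
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,60,e,4.84
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,54,e,4.34
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,41,e,3.32
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,55,e,4.45""",
    "a6": """\
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,11,px,0.236
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,57,e,4.58
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,57,e,4.58
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,3.45
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,53,e,4.25""",
    "w7": """\
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=32,10,px,0.214
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,61,e,4.89
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,51,e,4.12
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,3.43
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,52,e,4.25""",
}


def parse(text):
    """Map (function name + argument fields) to the last numeric field."""
    table = {}
    for fields in csv.reader(StringIO(text)):
        if not fields:
            continue
        # The first nine fields identify the test case; nLps is left out
        # of the key because it differs between builds.
        table[tuple(fields[:9])] = float(fields[-1])
    return table


def ratios(baseline, other):
    """Return baseline / other per row; above 1 means 'other' is faster."""
    base, oth = parse(RESULTS[baseline]), parse(RESULTS[other])
    return {key: base[key] / oth[key] for key in base if key in oth}


if __name__ == "__main__":
    for build in ("a6", "w7"):
        for key, ratio in ratios("px", build).items():
            args = ",".join(key[3:7])
            print(f"{build} {key[0]} [{args}]: px/{build} = {ratio:.2f}")
```

With these rows, the ippiDCTInv_8x8 ratio is above 3 for both a6 and w7, while every ippiInterpolateLuma_H264 ratio stays within roughly 6% of 1.0.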
Vladimir_Dudnik
Employee

You can see that the performance of the ippiDCT function improves on the latest architectures. That is because this function was tightly optimized by hand at the assembly level. Yes, you are right, ippiInterpolateLuma_H264 does not show a performance gain. That is because this function was initially optimized in C code; we are now working on optimizing it at the assembly level. You will see improved performance in the next version of the libraries.

Regards,
Vladimir

peter2
Beginner
I've tried 4.1 beta and I can confirm that luma and chroma interpolation were optimized with MMX Ext and SSE2, as was deblocking. Dequant was not MMX-enhanced. I haven't tried the 4.1 release for x86 yet, but the 4.1 release for XScale doesn't differ much from the 4.1 beta for XScale. At least I haven't noticed any differences in the H264 part.
Vladimir_Dudnik
Employee
Hi,
You are right, this function has 15 different branches inside. Each branch has its own special conditions and was optimized separately. We are still working on some of the branches and hope to improve this function in the future.
Regards,
Vladimir