Hi!
I've evaluated the H264 routines from the IPP 4.0 trial on PC and found that performance doesn't increase compared to my C code. Since I used the merged libs and called the w7_ and a6_ routines directly, as well as px_, this basically means that all three contain equivalent code.
So, my questions:
1. I'm sure that optimized H264 routines will be available very soon for all platforms. Could you give me any hint when?
2. Will these routines be available for regular XScale and WMMX?
3. Is there any way to participate in pre-release code testing?
Thanks in advance!
Peter
7 Replies
Hi, Peter,
To get general H.264 performance figures, you can run the Intel IPP performance benchmark tool "perfsys", located in the directory ipp40\tools\perfsys. Run ps_ippvc.exe to get the H.264 performance data on your target system.
We will consider H.264 support for Intel XScale and WMMX in future releases as well. You may periodically check our web site at http://www.intel.com/software/products/ipp for updates.
If you are interested in participating in the pre-release test, please submit a request under Intel IPP products via Intel Premier Support.
Thanks,
Ying S
Intel Corp.
Thanks! I'll put in a request.
Hello,
If you do run this test, it would be great if you could share the results here when you have time.
Thanks a lot
Marc
I've run the tests. Nothing new. See my results in the attachment. You can find similar results in tools/perfsys/data. For instance, look at worst-case horizontal quarter-pixel interpolation for luma, with the regular 8x8 IDCT for comparison.
ps_ippvcpx.csv:
CPU,Intel Pentium 4 Processor HT 2x3192 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,35,px,0.719
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,60,e,4.84
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,54,e,4.34
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,41,e,3.32
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,55,e,4.45
-----------------------
ps_ippvca6.csv:
CPU,Intel Pentium 4 Processor HT 2x3192 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,11,px,0.236
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,57,e,4.58
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,57,e,4.58
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,3.45
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,53,e,4.25
-----------------------
ps_ippvcw7.csv:
CPU,Intel Pentium 4 Processor HT 2x3192 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=32,10,px,0.214
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,61,e,4.89
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,51,e,4.12
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,3.43
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,52,e,4.25
-----------------------
ps_ippvct7.csv:
CPU,Intel Pentium 4 Processor HT 1x2128 MHz, L1=8/12K, L2=1024K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=32,12,px,0.365
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,25,e,3.03
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,41,e,5.03
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,5.08
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,41,e,5.03
-----------------------
my box:
CPU,Intel Pentium 4 Processor HT 2x2594 MHz, L1=8/12K, L2=512K
...
ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,10,px,0.256
...
ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,61,e,6.04
ippiInterpolateLuma_H264,8u,C1R,16,16,3,1,-,-,nLps=16,51,e,5.13
ippiInterpolateLuma_H264,8u,C1R,16,16,3,2,-,-,nLps=16,42,e,4.24
ippiInterpolateLuma_H264,8u,C1R,16,16,3,3,-,-,nLps=16,52,e,5.2
-----------------------
To my mind, these results are unambiguous.
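For readers who want to check the comparison numerically, here is a minimal Python sketch that reads a few of the rows quoted above and computes each variant's ratio against the px build. It assumes that the third field from the end of each row is a clocks-per-unit figure; the thread does not document the perfsys column layout, so treat that reading as a guess. The names `RESULTS`, `clocks_per_unit` and `speedup_over_px` are made up for this illustration.

```python
import csv

# Rows copied verbatim from the results above, keyed by library variant.
RESULTS = {
    "px": [
        "ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,35,px,0.719",
        "ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,60,e,4.84",
    ],
    "a6": [
        "ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=16,11,px,0.236",
        "ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,57,e,4.58",
    ],
    "w7": [
        "ippiDCTInv_8x8,16s8u,-,8x8,-,-,-,-,-,nLps=32,10,px,0.214",
        "ippiInterpolateLuma_H264,8u,C1R,16,16,3,0,-,-,nLps=16,61,e,4.89",
    ],
}


def clocks_per_unit(row):
    """Return (function name, clocks value) for one CSV row."""
    fields = next(csv.reader([row]))
    # Assumption: the third field from the end is the per-unit clock count.
    return fields[0], float(fields[-3])


def speedup_over_px(results):
    """Map (variant, function) to the px clock count divided by the variant's."""
    baseline = dict(clocks_per_unit(row) for row in results["px"])
    table = {}
    for variant, rows in results.items():
        for row in rows:
            name, clocks = clocks_per_unit(row)
            table[(variant, name)] = baseline[name] / clocks
    return table


if __name__ == "__main__":
    for (variant, name), ratio in sorted(speedup_over_px(RESULTS).items()):
        print(f"{variant} {name}: {ratio:.2f}x vs px")
```

Under that assumption, the IDCT ratio for w7 comes out well above 1, while the luma interpolation ratios for a6 and w7 stay close to 1, which is the pattern described in the post.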
You can see that the performance of the ippiDCT function improves on the latest architectures. This is because the function was tightly hand-optimized at the assembly level. Yes, you are right that ippiInterpolateLuma_H264 does not show a performance gain: this function was initially optimized in C code only, and we are now working on optimizing it at the assembly level. You will see improved performance in the next version of the libraries.
Regards,
Vladimir
I've tried the 4.1 beta and I can confirm that luma and chroma interpolation were MMX Ext and SSE2 optimized, as was deblocking. Dequant was not MMX enhanced. I haven't tried the 4.1 release for x86 yet, but the 4.1 release for XScale doesn't differ much from the 4.1 beta for XScale. At least I haven't noticed any differences in the H264 part.
Hi,
You are right, this function has 15 different branches inside. Each branch has its own special conditions and was optimized separately. We are still working on some of the branches and hope to improve this function in the future.
Regards,
Vladimir
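One plausible reading of the "15 branches" is that they match the 15 fractional quarter-pixel positions H.264 defines for luma: dx and dy each range over 0..3, and (0, 0) is the plain integer sample. The thread does not say how the branches are actually divided, so this is only an assumption. The Python sketch below illustrates the horizontal cases with the standard H.264 6-tap half-sample filter (1, -5, 20, 20, -5, 1) and rounded averaging for quarter samples. It is not IPP code, and the function names are invented for this example.

```python
def clip8(value):
    """Clamp a value to the 8-bit sample range 0..255."""
    return max(0, min(255, value))


def half_pel_h(row, x):
    """Half-sample value between row[x] and row[x + 1] (H.264 6-tap filter)."""
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(tap * row[x - 2 + i] for i, tap in enumerate(taps))
    return clip8((acc + 16) >> 5)


def quarter_pel_h(row, x, dx):
    """Luma sample dx/4 of a pixel to the right of row[x], for dx in 0..3."""
    if dx == 0:
        return row[x]
    half = half_pel_h(row, x)
    if dx == 2:
        return half
    # Quarter samples average the half sample with the nearer integer sample.
    neighbour = row[x] if dx == 1 else row[x + 1]
    return (neighbour + half + 1) >> 1


# All (dx, dy) offsets except the integer position: 15 fractional cases.
FRACTIONAL_POSITIONS = [
    (dx, dy) for dy in range(4) for dx in range(4) if (dx, dy) != (0, 0)
]

if __name__ == "__main__":
    edge = [0, 0, 0, 0, 255, 255, 255, 255]
    print(len(FRACTIONAL_POSITIONS))
    print([quarter_pel_h(edge, 3, dx) for dx in range(4)])
```

The quarter-sample cases (dx = 1 or 3) need the 6-tap filter plus an extra averaging step, which would fit the earlier description of quarter-pixel interpolation as the worst case. Positions with both dx and dy non-zero add vertical filtering on top, so it makes sense for an implementation to handle each position on its own code path.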