Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

H.264 performance issues in IPP 7.0.2

PhilipH
Beginner
580 Views

Hello,

After changing from IPP version 6.1 to 7.0.2, we encountered a performance issue with our video decoder implementation.

We implemented the H.264 decoder functionality as a DLL using static linking, so we changed the project settings to use the newly renamed libraries without multithreading.

The new DLL decodes about 4 times slower than the old DLL, which uses IPP 6.1. However, this performance issue only occurs when decoding H.264 video data. Our implementation of the MPEG-4 decoder shows the same performance with both IPP versions. ippInit is called on startup.

Do you have any idea what the cause of this problem might be?

Philip

10 Replies
shyaki
Beginner
How many threads are used in decoding? Are all the CPUs busy during decompression? Can you print out the results of ippiGetLibVersion()?

PhilipH
Beginner

Hi,

In both cases we built against IPP without OpenMP (using static linking with the serial parameter).

We also tried version 7.0.1. It shows the same performance issue.

Here is the result of ippiGetLibVersion:

Intel Integrated Performance Primitives

version: 7.0 build 205.40, [7.0.1037.205]

name: ippiw7_l.lib

date: Jan 5 2011

We are feeding the decoder directly with raw H.264 data without rendering the frames. The CPU usage on a Core i7 PC while running our test application is about 3% with the old DLL and about 15% with the new one.

Philip

PhilipH
Beginner
Hi again,

We tested the H.264 decoder again using the simple_player application compiled with versions 6.1.1 and 7.0.2, and compared the CPU usage of both players using the attached H.264 file. The result was the same performance difference as in our application.

Can you reproduce this?



Gennady_F_Intel
Moderator
No, I cannot reproduce it, but I checked on a different CPU type [Core2 Duo].
The results I got:

simple_player.exe rec.264
Video Render : NULL
-RenderFormat: YV12
Stream Type : H264PV
Video Info :
-Video Type : H264
-Resolution : 1280x960
-Frame Rate : 15.00
=== ipp 6.1 Update5:
DecRate:71.19 fps(Dec 11.05ms/f + Conv 3.00ms/f = 14.05ms/f) RndrRate 15.01fps Audio Dec 0.00chnls
DecRate:79.02 fps(Dec 10.97ms/f + Conv 1.68ms/f = 12.66ms/f) RndrRate 14.98fps Audio Dec 0.00chnls
DecRate:79.10 fps(Dec 11.04ms/f + Conv 1.60ms/f = 12.64ms/f) RndrRate 15.00fps Audio Dec 0.00chnls
DecRate:80.50 fps(Dec 10.88ms/f + Conv 1.55ms/f = 12.42ms/f) RndrRate 15.00fps Audio Dec 0.00chnls

and the same with IPP 7.0.2 (bundled with Composer XE 2011):
DecRate:113.93 fps(Dec 5.18ms/f + Conv 3.60ms/f = 8.78ms/f) RndrRate 15.07fps Audio Dec 0.00chnls
DecRate:358.07 fps(Dec 1.10ms/f + Conv 1.70ms/f = 2.79ms/f) RndrRate 15.00fps Audio Dec 0.00chnls
DecRate:411.53 fps(Dec 0.83ms/f + Conv 1.60ms/f = 2.43ms/f) RndrRate 15.00fps Audio Dec 0.
DecRate:418.73 fps(Dec 0.77ms/f + Conv 1.62ms/f = 2.39ms/f) RndrRate 15.00fps Audio Dec 0.00chnls
As you can see, the decoding rate for 7.0.2 is much higher than for 6.1.5.
It makes sense to check the behavior on a different CPU type, like the one you use.
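
For anyone who wants to compare such logs without reading them by eye, here is a minimal Python sketch. It is not part of simple_player; the regular expression is only an assumption based on the status lines quoted in this thread, and the helper names are made up for illustration. It extracts the reported rate and per-frame times, and checks that the reported fps is roughly 1000 divided by the Dec + Conv time per frame.

```python
import re

# Pattern for the status lines quoted in this thread (an assumption based on
# the pasted output, not on the simple_player source).
DEC_RATE = re.compile(
    r"DecRate:(?P<fps>[\d.]+) fps\(Dec (?P<dec>[\d.]+)ms/f \+ Conv (?P<conv>[\d.]+)ms/f"
)


def parse_dec_rate(line):
    """Return (fps, dec_ms, conv_ms) from one status line, or None if it does not match."""
    m = DEC_RATE.search(line)
    if m is None:
        return None
    return float(m["fps"]), float(m["dec"]), float(m["conv"])


def fps_from_frame_time(dec_ms, conv_ms):
    """Frames per second implied by the per-frame decode + convert time."""
    return 1000.0 / (dec_ms + conv_ms)


# Last status line of each run on the Core2 Duo, copied from the post above.
ipp61 = "DecRate:80.50 fps(Dec 10.88ms/f + Conv 1.55ms/f = 12.42ms/f) RndrRate 15.00fps Audio Dec 0.00chnls"
ipp70 = "DecRate:418.73 fps(Dec 0.77ms/f + Conv 1.62ms/f = 2.39ms/f) RndrRate 15.00fps Audio Dec 0.00chnls"

fps61, dec61, conv61 = parse_dec_rate(ipp61)
fps70, dec70, conv70 = parse_dec_rate(ipp70)

print(f"implied fps, 6.1:   {fps_from_frame_time(dec61, conv61):.1f} (reported {fps61})")
print(f"implied fps, 7.0.2: {fps_from_frame_time(dec70, conv70):.1f} (reported {fps70})")
print(f"reported rate ratio 7.0.2 / 6.1: {fps70 / fps61:.1f}x")
print(f"decode-only time ratio 6.1 / 7.0.2: {dec61 / dec70:.1f}x")
```

Note that this only compares the numbers the player prints. It says nothing about how much CPU time the process actually consumed.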

PhilipH
Beginner

Hi,
These are the results of our test on a Core i5 PC. They look similar to your results, but our CPU usage as displayed in the Task Manager is much higher in test #2 (IPP 7.0.2) than in test #1 (IPP 6.1.1).

simple_player_611_cl8.exe rec.264

Video Render : NULL

-RenderFormat: YV12

Stream Type : H264PV

Video Info :

-Video Type : H264

-Resolution : 1280x960

-Frame Rate : 15.00

DecRate:135.06 fps(Dec 6.80ms/f + Conv 0.60ms/f = 7.40ms/f) RndrRate 14.95fps Audio Dec 0.00chnls
DecRate:182.83 fps(Dec 4.97ms/f + Conv 0.50ms/f = 5.47ms/f) RndrRate 15.00fps Audio Dec 0.00chnls
DecRate:186.76 fps(Dec 4.86ms/f + Conv 0.49ms/f = 5.35ms/f) RndrRate 15.00fps Audio Dec 0.00chnls
DecRate:189.24 fps(Dec 4.80ms/f + Conv 0.49ms/f = 5.28ms/f) RndrRate 15.00fps Audio Dec 0.00chnls

simple_player_702_cl8.exe rec.264

Video Render : NULL

-RenderFormat: YV12

Stream Type : H264PV

Video Info :

-Video Type : H264

-Resolution : 1280x960

-Frame Rate : 15.00

DecRate:406.46 fps(Dec 1.81ms/f + Conv 0.65ms/f = 2.46ms/f) RndrRate 14.98fps Audio Dec 0.00chnls
DecRate:1118.73 fps(Dec 0.47ms/f + Conv 0.43ms/f = 0.89ms/f) RndrRate 15.00fps Audio Dec 0.00chnls
DecRate:1262.16 fps(Dec 0.38ms/f + Conv 0.41ms/f = 0.79ms/f) RndrRate 15.00fps Audio Dec 0.00chnls
DecRate:1326.14 fps(Dec 0.35ms/f + Conv 0.41ms/f = 0.75ms/f) RndrRate 15.00fps Audio Dec 0.00chnls


PhilipH
Beginner

Hello,

After further investigation, we found that the high CPU usage is caused by the "numThreads" parameter of the H.264 video decoder. With "numThreads" set to 1 instead of 0, the CPU usage is much lower, both in simple_player and in our custom implementation.

(The IPP 6.1 and IPP 7.0.2 solutions are both compiled without OpenMP and use static linkage, as mentioned before.)

Summary:

simple_player with IPP 6.1:
- CPU usage with numThreads set to 1 is about the same as with numThreads set to 0.
- The decoding rates with numThreads set to 0 are higher.

simple_player with IPP 7.0.2:
- CPU usage with numThreads set to 1 is much lower than with numThreads set to 0 (with numThreads = 1, usage is the same as with IPP 6.1).
- The decoding rates with numThreads set to 0 are higher.

IDZ_A_Intel
Employee
Hello Philip,

I am not sure I fully understand your conclusions:

Is it merely a question of whether or not threading is in use? Or do you see additional performance differences between 6.1 and 7.0 beyond that?

So let me write my understanding of your conclusions:

  • 6.1 using numThreads>1 is equal in perf. to 7.0 using the same number of threads > 1.
  • 6.1 using numThreads=0 means automatic choice of number of threads equal to number of HW threads. This is equal in performance to 7.0 using numThreads=0 with the same automatic choice.
  • 6.1 using numThreads=1 is equal in perf. to 7.0 using numThreads=0 (and thus equal to 6.1 with numThreads=0). This means that there is a bug in 6.1 that incorrectly translates numThreads=1 to use an automatic choice of number of threads (i.e. equal to specifying numThreads=0) instead of using a single thread. This should be easy to verify in the TaskManager and looking in the code (frankly, I have not).
Is this the correct understanding?


- Jay

johnscreek
Beginner
We found the same problem when running H264 decoder with simple player from IPP7.0 sample code.

On our Westmere system running Linux, with threadNum > 1, the total CPU usage is much higher than with single-threaded H.264 decoding. We also found that with threadNum > 1 the system time is very high, while for single-threaded H.264 decoding the system time is almost zero.


IPP7.0 with 1 thread
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
55225 root 20 0 147m 60m 3376 S 54 0.5 0:30.57 simple_player7V

IPP7.0 with 3 threads
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
55362 root 20 0 242m 81m 4736 R 69 0.7 0:52.33 simple_player
55363 root 20 0 242m 81m 4736 R 69 0.7 0:52.62 simple_player
55364 root 20 0 242m 81m 4736 S 3 0.7 0:02.25 simple_player
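
To put a number on the difference, the per-thread %CPU values from the top output above can simply be added up. The following is a small Python sketch with a hypothetical helper name, assuming the column layout of the header line shown above, where %CPU is the ninth field.

```python
def total_cpu_percent(top_rows):
    """Sum the %CPU column (9th field) over the given top rows."""
    total = 0.0
    for row in top_rows.strip().splitlines():
        fields = row.split()
        total += float(fields[8])  # PID USER PR NI VIRT RES SHR S %CPU ...
    return total


# Rows copied from the top output above.
one_thread = "55225 root 20 0 147m 60m 3376 S 54 0.5 0:30.57 simple_player7V"
three_threads = """
55362 root 20 0 242m 81m 4736 R 69 0.7 0:52.33 simple_player
55363 root 20 0 242m 81m 4736 R 69 0.7 0:52.62 simple_player
55364 root 20 0 242m 81m 4736 S 3 0.7 0:02.25 simple_player
"""

single = total_cpu_percent(one_thread)
multi = total_cpu_percent(three_threads)
print(f"1 thread: {single:.0f}%  3 threads: {multi:.0f}%  ratio: {multi / single:.2f}")
```

This is only a snapshot of one top refresh, so it should be read as a rough indication rather than a measurement.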

There are 6 processors. Here is the info for processor 0:
cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel Xeon CPU X5660 @ 2.80GHz
stepping : 2
cpu MHz : 2793.182
cache size : 12288 KB
physical id : 0
siblings : 6
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 5586.36
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

Joe_Monaco
Beginner

A good way to demonstrate this problem is to look at the output of the Unix time command (the timing numbers coming from simple_player do not appear to be correct). For IPP6 the results look as expected for either 1 or 3 threads. For IPP7 the results only make sense for 1 thread.

This is on a Westmere-based platform with a 64-bit openSUSE Linux distribution. Both versions of simple_player are built with gcc 4.5.1.

The command used in all cases below is as follows:

time simple_player -s -tN -fyuv_420 -vnul -anul /tmp/file.mp4

where N is either 1 or 3.


  1. Decoding with N=1
  • IPP6 results are as follows
DecRate:52.11 fps(Dec 18.38ms/f + Conv 0.81ms/f = 19.19ms/f) RndrRate 29.96fps Audio Dec 0.00chnls
DecRate:59.19 fps(Dec 16.31ms/f + Conv 0.59ms/f = 16.89ms/f) RndrRate 29.98fps Audio Dec 0.00chnls
DecRate:57.95 fps(Dec 16.68ms/f + Conv 0.57ms/f = 17.26ms/f) RndrRate 29.97fps Audio Dec 0.00chnls

real 0m9.958s
user 0m5.004s
sys 0m0.071s

  • IPP7 results are as follows
DecRate:54.49 fps(Dec 17.45ms/f + Conv 0.90ms/f = 18.35ms/f) RndrRate 30.09fps Audio Dec 0.00chnls
DecRate:67.65 fps(Dec 14.09ms/f + Conv 0.69ms/f = 14.78ms/f) RndrRate 29.98fps Audio Dec 0.00chnls
DecRate:66.61 fps(Dec 14.34ms/f + Conv 0.67ms/f = 15.01ms/f) RndrRate 29.97fps Audio Dec 0.00chnls

real 0m9.958s
user 0m4.310s
sys 0m0.058s


  2. Decoding with N=3

  • IPP6 results are as follows
DecRate:178.02 fps(Dec 4.69ms/f + Conv 0.93ms/f = 5.62ms/f) RndrRate 30.01fps Audio Dec 0.00chnls
DecRate:253.50 fps(Dec 3.24ms/f + Conv 0.71ms/f = 3.94ms/f) RndrRate 29.97fps Audio Dec 0.00chnls
DecRate:256.97 fps(Dec 3.20ms/f + Conv 0.69ms/f = 3.89ms/f) RndrRate 29.97fps Audio Dec 0.00chnls

real 0m9.959s
user 0m5.435s
sys 0m0.083s

  • IPP7 results are as follows
DecRate:224.16 fps(Dec 3.48ms/f + Conv 0.98ms/f = 4.46ms/f) RndrRate 30.07fps Audio Dec 0.00chnls
DecRate:680.72 fps(Dec 0.69ms/f + Conv 0.77ms/f = 1.47ms/f) RndrRate 29.98fps Audio Dec 0.00chnls
DecRate:821.99 fps(Dec 0.46ms/f + Conv 0.76ms/f = 1.22ms/f) RndrRate 29.97fps Audio Dec 0.00chnls

real 0m9.953s
user 0m6.239s
sys 0m6.640s


In all cases the real time (i.e. "wall clock time") is the same because the rendering rate of 30 fps gates the decoding process.

For IPP6, total CPU usage increases only slightly for 3 threads (5.004+0.071 vs 5.435+0.083), consistent with a small overhead for distributing tasks to the 3 cores.

However, IPP7 does not scale properly. The total CPU usage for IPP7 with 3 threads is ~3X the CPU usage with one thread (i.e. 6.239+6.640 for 3 threads vs 4.310+0.058 for 1 thread).
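
The arithmetic behind this comparison can be written out as a short Python sketch. The class and variable names are made up for illustration, and the numbers are the user and sys values from the time output above. Total CPU time is taken as user + sys, and the scaling factor is the 3-thread total divided by the 1-thread total.

```python
from dataclasses import dataclass


@dataclass
class TimeResult:
    """user and sys seconds as printed by the time command."""
    user: float
    system: float

    @property
    def cpu(self):
        # Total CPU time consumed by the process.
        return self.user + self.system


# (library, thread count) -> values copied from the runs above.
runs = {
    ("IPP6", 1): TimeResult(5.004, 0.071),
    ("IPP6", 3): TimeResult(5.435, 0.083),
    ("IPP7", 1): TimeResult(4.310, 0.058),
    ("IPP7", 3): TimeResult(6.239, 6.640),
}

# CPU cost of going from 1 to 3 threads, per library version.
ratio = {
    lib: runs[(lib, 3)].cpu / runs[(lib, 1)].cpu
    for lib in ("IPP6", "IPP7")
}

# Share of the IPP7 3-thread CPU time that is spent in the kernel.
sys_share = runs[("IPP7", 3)].system / runs[("IPP7", 3)].cpu

for lib, value in ratio.items():
    print(f"{lib}: 3 threads cost {value:.2f}x the CPU time of 1 thread")
print(f"IPP7, 3 threads: {sys_share:.0%} of CPU time is sys time")
```

Laid out this way, the IPP7 3-thread run stands out mainly through its sys time, which is roughly half of its total CPU time, whereas sys time is negligible in the other three runs.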






Ying_H_Intel
Employee
Dear All,

I heard back from the IPP developer team. The problem should be fixed in the IPP 7.1 beta. Please check the IPP 7.1 beta at http://software.intel.com/en-us/forums/showthread.php?t=106105&o=a&s=lr

and let us know if you run into any problems.

Best Regards,
Ying