ippiYCbCr420ToBGR_8u_P3C3R accounts for 67% of process time

jaegermeister · ‎02-26-2008

From start to finish, ippiYCbCr420ToBGR_8u_P3C3R (that one routine) uses about 67% of the total processor cycles used to read, decode, and display mpeg4 frames. I've attached an image showing how far ahead this routine is of the next several 'heavy' routines.

This is the a6_ build, but w7_ is similar. According to the profile sampler, the process itself is using about 10% of the total CPU(s) (dual core) time. What's the problem then? I'd like multiple mpeg4 sources running, not just one. At 10% per stream I'm looking at maybe 5 to 10 streams. If I could somehow improve, or bypass, this conversion I could see getting twice as many running. The RGB32 version is similar.

Is this just the way it is? Or have I hit on some weird combination?

Vladimir_Dudnik · ‎02-26-2008

Hello,

What is the version of IPP you've linked with? Are you using the IPP dynamic libraries or the static libraries? On an Intel Core2Duo processor IPP should launch V8 code, not A6 or W7. What is your platform configuration?

Regards,
Vladimir

jaegermeister · ‎02-27-2008

Hello to you too. Platform? ipp5.2 static on XP. Processor is ancient AMD x2 running 2.5 GHz w/sse2 and maybe partial sse3 and I am pretty sure v8 is not suitable. I switch between a6 and w7 by include file and nothing special there at this time.

Is such high use by the color conversion routine normal? I notice that the UMC GDI render driver uses its own color conversion (i.e., plain compiler source code with no special coding) and not 420ToBGR. I am working on this to see if I can use that instead. It's not a drop-in, and I first have to figure out how to get it to work correctly (today's task). In preliminary testing, though, it runs as fast as the ipp routine, but the output is not quite right, so that's not conclusive.

I am not so quick to blame the amd processor for slow mmx, or praise its x87 speed, since it's still 67% of the entire process and that doesn't seem right.

30FPS @ 640x480 mpeg4 sp to rgb24 (or rgb32, no real difference), from device pull (ethernet) to parse to decode (including ColorConv) to render, and 67% of that effort is spent in 420ToBGR. I take it that is not expected? Tangent: why does UMC use its own routines, umc_gdi_YV12_to_RGB24() and umc_gdi_CookUX()? [I just noticed, those are using integer not FP math - huh - so AMD's fine (relative) x87 FP performance is not a factor]
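For anyone wondering what an integer-only 4:2:0 to BGR24 conversion looks like, here is a minimal sketch. It is not the UMC source and not what IPP does internally; it assumes the widely used BT.601 studio-range fixed-point coefficients (298, 409, 208, 100, 516 with 8 fraction bits), and the function names are made up for illustration. Each chroma sample is shared by a 2x2 block of luma samples.

```c
#include <assert.h>
#include <stddef.h>

/* Round, drop the 8 fraction bits, and clamp to 0..255.
 * Negative values are clamped before shifting, so no negative
 * value is ever right-shifted. */
static unsigned char scale_clip(int v)
{
    v += 128;                       /* round to nearest */
    if (v < 0)
        return 0;
    v >>= 8;
    return (unsigned char)(v > 255 ? 255 : v);
}

/* One pixel, integer math only (assumed BT.601 studio-range constants). */
static void ycbcr_to_bgr_pixel(int y, int cb, int cr, unsigned char bgr[3])
{
    int c = 298 * (y - 16);
    int d = cb - 128;
    int e = cr - 128;

    bgr[0] = scale_clip(c + 516 * d);             /* B */
    bgr[1] = scale_clip(c - 100 * d - 208 * e);   /* G */
    bgr[2] = scale_clip(c + 409 * e);             /* R */
}

/* Planar 4:2:0 in, packed BGR24 out. Steps are in bytes per row.
 * The chroma planes are half the luma width and height. */
static void ycbcr420_to_bgr24(const unsigned char *y_plane, int y_step,
                              const unsigned char *cb_plane, int cb_step,
                              const unsigned char *cr_plane, int cr_step,
                              unsigned char *dst, int dst_step,
                              int width, int height)
{
    int row, col;

    assert(y_plane && cb_plane && cr_plane && dst);
    assert(width > 0 && height > 0);

    for (row = 0; row < height; ++row) {
        const unsigned char *y_row  = y_plane  + (size_t)row * y_step;
        const unsigned char *cb_row = cb_plane + (size_t)(row / 2) * cb_step;
        const unsigned char *cr_row = cr_plane + (size_t)(row / 2) * cr_step;
        unsigned char *out = dst + (size_t)row * dst_step;

        for (col = 0; col < width; ++col)
            ycbcr_to_bgr_pixel(y_row[col], cb_row[col / 2], cr_row[col / 2],
                               out + 3 * col);
    }
}
```

A plain loop like this does the same amount of arithmetic per pixel regardless of the CPU, which makes it a handy baseline when comparing against an optimized routine.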

Ideas on what I should do regarding 420ToBGR from ipp are welcome. Is it possible I have a bad ipp library release? I think it's version 5.2. r57 but I haven't looked for sure. I hear war stories on 5.3 mpeg4 so I haven't looked beyond 5.2. It's working fine except for this color conversion routine.

Vladimir_Dudnik · ‎02-27-2008

Well, I think it would make sense to update to IPP 5.3. You may not see an additional performance gain for that particular function, but you will get new functionality and the latest bug fixes.

The reason why UMC in IPP 5.2 did not use that function is quite simple: the sample developers just missed that an appropriate function already exists in IPP. In the IPP 5.3 sample you will see that this function is called in the ColorConvert method.

It is not clear from the profiler report how much absolute time a call of this function takes on average. If you can measure the absolute time this function takes to process a frame, then you can compare these numbers, to some extent, with the performance measurement system which comes with the IPP release (you may find it in your IPP/tools/perfsys folder).
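One way to get such an absolute number is a small timing harness around the call. The following is only a sketch using standard C `clock()`; `frame_fn`, `avg_usec_per_call`, and `count_call` are hypothetical names, and `count_call` is a placeholder where a wrapper around the real conversion call would go. `clock()` has coarse resolution, so many iterations are needed for a usable average.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Any per-frame routine, wrapped so it takes a single context pointer. */
typedef void (*frame_fn)(void *ctx);

/* Average time per call in microseconds, as reported by clock().
 * One untimed warm-up call is made first so caches are primed. */
static double avg_usec_per_call(frame_fn fn, void *ctx, int iterations)
{
    clock_t start, stop;
    int i;

    assert(fn != NULL && iterations > 0);

    fn(ctx);                        /* warm-up, not timed */
    start = clock();
    for (i = 0; i < iterations; ++i)
        fn(ctx);
    stop = clock();

    return (double)(stop - start) * 1e6 / CLOCKS_PER_SEC / iterations;
}

/* Placeholder workload: counts how often it was called. */
static void count_call(void *ctx)
{
    unsigned int *calls = (unsigned int *)ctx;
    ++*calls;
}
```

Dividing the result by the number of pixels per frame and multiplying by the clock rate in MHz gives clocks per pixel, which is the unit the perfsys reports use.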

By the way, it may also be related to non-cacheable memory (if you call this function directly on video memory). In other words, the IPP optimized color conversion function is basically not expected to take most of the time in an MPEG4 decoding pipeline, so you need to analyze the conditions under which this function is called.

Vladimir

jaegermeister · ‎02-27-2008

Yes, when faced with such a problem that seems like the first course. But, investigating again, I see my supposed "w7 is no better" result was not well investigated. I had only assumed that was the case since XP taskmgr showed similar CPU use when running either a6 or w7, but this only shows what I've seen before: taskmgr can sometimes be fooled (maybe when a thread leaves/blocks before its quantum is due? dunno). I put the w7_ release on the profiler and it showed an improvement, and about what I am looking for: the entire process is now twice as fast as before. That is, instead of 5-10 streams I could, in theory, do 10-20 streams.

I've attached a w7_ profile sampler result which could be compared against the original, but keep in mind the run times were different so the TOTAL "samples hit" count can't be directly compared between the two, only the relative counts in the module shown (for example, compare the first two routines in each of the attached pics: the one attached here and the one attached to the first).

In the end, a6_ is quite a dog, at least on this CPU.

For the record, the px_ version was a lot faster than the a6_.

a6_ was 7300 for 420ToBGR and 1200 for fastcopy_I
px_ was 4125 for 420ToBGR and 1000 for fastcopy_I
w7_ was 1200 for 420ToBGR and 1100 for fastcopy_I
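Since the run lengths differ, one rough way to compare these counts is to divide each 420ToBGR count by the fastcopy_I count from the same run, then feed the resulting speedup into Amdahl's law with the 67% share measured for the a6_ build. This is a sketch under the assumption that fastcopy_I costs about the same per frame in every build; the function names are made up.

```c
#include <assert.h>

/* Samples of a routine relative to a reference routine from the same run. */
static double relative_cost(double routine_samples, double reference_samples)
{
    assert(routine_samples > 0.0 && reference_samples > 0.0);
    return routine_samples / reference_samples;
}

/* Amdahl's law: whole-process speedup when a part that takes `fraction`
 * of the time becomes `part_speedup` times faster. */
static double overall_speedup(double fraction, double part_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0 && part_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / part_speedup);
}
```

With the numbers above, `relative_cost(7300, 1200) / relative_cost(1200, 1100)` is about 5.6, and `overall_speedup(0.67, 5.6)` comes out a little above 2, which is in line with the "twice as fast" seen in the profiler.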

So, if you must use a6_, do something like this after your regular ipp include:

#undef ippiYCbCr420ToBGR_8u_P3C3R
#define ippiYCbCr420ToBGR_8u_P3C3R px_ippiYCbCr420ToBGR_8u_P3C3R

This will work even if all the other routines are a6_.
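To show why the order matters, here is a self-contained sketch of the same trick with stand-in functions. `demoConvert`, `a6_demoConvert`, and `px_demoConvert` are hypothetical and only mimic the kind of name mapping a per-CPU include file does; no IPP headers are involved.

```c
/* Stand-ins for two CPU-specific builds of the same routine. */
static const char *a6_demoConvert(void) { return "a6"; }
static const char *px_demoConvert(void) { return "px"; }

/* What the per-CPU include file is assumed to do: map the generic
 * name to the a6_ build. */
#define demoConvert a6_demoConvert

/* Code compiled before the override still gets the a6_ build. */
static const char *call_before_override(void) { return demoConvert(); }

/* The override: redirect this one name to the px_ build. */
#undef demoConvert
#define demoConvert px_demoConvert

/* Code compiled after the override gets the px_ build. */
static const char *call_after_override(void) { return demoConvert(); }
```

The macro is expanded where the call is compiled, so the #undef/#define pair has to appear before any code that calls the routine, and the px_ object code still has to be linked in.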

Vladimir_Dudnik · ‎02-27-2008

Thanks for the update. By the way, I would recommend that you find a chance to test this on a modern Intel Core2 Duo architecture. I believe you will be excited to see how much more efficient the latest Intel architecture is in comparison with previous processor generations.

Regards,
Vladimir

jaegermeister · ‎02-29-2008

vdudnik:

It is not clear what absolute time take call of this function on average from profiler report. If you can measure absolute time this function takes to process frame then you can compare these

2008 02 19 07:36 1,163,264 ps_ippcc.exe

Function                    Len/Size  Comment  Clks per px  Time (usec)  Code

AMD x2 (Manchester) 2.5GHz L1=64/64KB L2=512KB
ippiYCbCr420ToBGR 8u P3C3R  720x480   nLps=4      6.74          931      (w7)

(from perfsys folder .csv)
Core 2 Quad processor 8x2400 MHz L1=32/32K
ippiYCbCr420ToBGR 8u P3C3R  720x480   nLps=4      3.88          559      (w7)
ippiYCbCr420ToBGR 8u P3C3R  720x480   nLps=8      3.79          546      (v8)
ippiYCbCr420ToBGR 8u P3C3R  720x480   nLps=16     1.89          259      (p8)

ippiYCbCr420ToBGR 8u P3C3R  720x480   nLps=4     24.40         4960      (a6)
ippiYCbCr420ToBGR 8u P3C3R  720x480   nLps=4     42.50         6120      (px)

Not too sure about the 24.4/4960 and 42.5/6120. The times don't jibe with the clks/px
since 42.5 is 1.75x 24.4 but 6120 is only 1.23x 4960.  hm
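A possible explanation, worked out as a small sketch: if the time column is simply clocks per pixel times pixels divided by the clock rate, then each row implies a clock rate. The helper names are made up, and the formula itself is an assumption about how perfsys derives its time column, not something stated in the .csv.

```c
#include <assert.h>

/* Time per frame in microseconds: clocks/px * pixels / MHz. */
static double frame_usec(double clk_per_px, int width, int height,
                         double cpu_mhz)
{
    assert(clk_per_px > 0.0 && width > 0 && height > 0 && cpu_mhz > 0.0);
    return clk_per_px * (double)width * (double)height / cpu_mhz;
}

/* Inverse: the clock rate in MHz implied by a (clocks/px, usec) pair. */
static double implied_mhz(double clk_per_px, int width, int height,
                          double usec)
{
    assert(clk_per_px > 0.0 && width > 0 && height > 0 && usec > 0.0);
    return clk_per_px * (double)width * (double)height / usec;
}
```

`frame_usec(6.74, 720, 480, 2500.0)` gives about 932, matching the AMD row. `implied_mhz(42.50, 720, 480, 6120.0)` gives 2400, matching the Core 2 Quad, but `implied_mhz(24.40, 720, 480, 4960.0)` gives about 1700. So the a6 row may have been measured on a machine with a different clock rate, which would explain why the times and the clks/px don't scale together. That is an inference from the arithmetic only.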

Vladimir_Dudnik · ‎02-29-2008

Thanks,

So you see, code optimized specifically for the Core2 architecture takes 3 to 4 CPU clocks per pixel, and code optimized for the previous generation of processors (A6) or generic C code (PX) takes about 10X more.

Note also the additional, almost 2X, performance gain on the latest Intel 45nm Core2 processor, Penryn (P8).

Vladimir