Format Conversion

BatterseaSteve · ‎05-20-2011

Hi
I'm evaluating IPP for MPEG encode/decode and am new to IPP.
Our platform is a dual Intel X5650 system under RedHat 5.5 (24 CPUs) using gcc 4.1.2.

We are currently using ffmpeg to decode movies, but want to conform all decoded output to packed UYVY.
We are using IPP to convert from YUV420p, YUV422p, etc. into packed CbYCr.
We are also resizing video, using ippiResizeYUV422_8u_C2R to do this.
However, this outputs YUYV, so we use ippiYCbCr422ToCbYCr422_8u_C2R to transform the result into
UYVY. My questions are:

1) Is there a Resize that we can use that resizes directly into CbYCr?
2) The Resize is slow even for Nearest Neighbour (4-5 msec for HD) and unusable for Cubic. I notice that even though we are linking in the threaded libraries, no threading takes place. Is this expected for these functions? I also see no difference whether or not we call ippInit (or ippStaticInit), which makes me suspicious that we are (for some reason) not even calling optimised functions.
3) Can anyone suggest what strategy we should use when resizing interlaced video? Is it normal to de-interlace before resizing, or is there another way of dealing with this?

Any help/comments or abuse is welcome.

BatterseaSteve · ‎05-20-2011

Apologies - seem to have submitted this thread twice - firefox hung - anyone know how to delete one of them
Cheers

Joseph_S_Intel · ‎05-20-2011

Hi, I deleted the duplicate post for you.

Joseph_S_Intel · ‎05-20-2011

Hi,

1) Is there a Resize that we can use that resizes direct into CbYCr?

No, we don't have a function that resizes directly into CbYCr.
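
For reference, the extra conversion step is cheap in principle: packed YCbCr422 stores each pixel pair as Y0 Cb Y1 Cr (YUYV), while CbYCr422 stores it as Cb Y0 Cr Y1 (UYVY), so the conversion swaps the two bytes of every pixel. A minimal plain C++ sketch of that swap follows. It is not the IPP implementation, and `yuyvToUyvy` is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Swap each (luma, chroma) byte pair so that YUYV (Y0 Cb Y1 Cr) becomes
// UYVY (Cb Y0 Cr Y1). Both formats use 2 bytes per pixel.
// srcStep and dstStep are row strides in bytes; width is in pixels.
void yuyvToUyvy(const std::uint8_t* src, std::size_t srcStep,
                std::uint8_t* dst, std::size_t dstStep,
                std::size_t width, std::size_t height)
{
    assert(width % 2 == 0);  // 4:2:2 stores pixels in pairs
    for (std::size_t y = 0; y < height; ++y) {
        const std::uint8_t* s = src + y * srcStep;
        std::uint8_t* d = dst + y * dstStep;
        for (std::size_t x = 0; x < width * 2; x += 2) {
            d[x]     = s[x + 1];  // chroma byte moves to the front
            d[x + 1] = s[x];      // luma byte follows
        }
    }
}
```

Because the swap is a pure per-pixel operation, it can run on any subset of rows independently.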

2) The Resize is slow even for Nearest Neighbour (4-5 msec for HD) and unusable for Cubic. I notice that even though we are linking in the threaded libraries, no threading takes place. Is this expected for these functions? I also see no difference whether or not we call ippInit (or ippStaticInit), which makes me suspicious that we are (for some reason) not even calling optimised functions.
Are you linking statically? ippInit is only necessary if you are linking statically; if you are linking dynamically you do not need to call any initialization function. Only about twenty percent of the functions in the Intel IPP shared and static threaded libraries are actually threaded. There is a file called ThreadedFunctionsList.txt in the Intel IPP documentation that lists the threaded functions, and the functions you mentioned above are not in that list. In some cases you can use a TBB, Cilk Plus, or OpenMP wrapper to thread primitive functions yourself, so that might be an option for those functions.
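
As a rough illustration of the wrapper idea, here is a minimal OpenMP sketch that splits an image into horizontal bands and calls a non-threaded routine once per band. `processRows` is a hypothetical stand-in for the primitive (it just inverts bytes) and `processInBands` is an invented name; neither is an IPP function. The sketch assumes a per-pixel operation where source and destination rows line up one to one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a non-threaded primitive that processes
// `rows` rows of a packed image (here it just inverts every byte).
void processRows(const std::uint8_t* src, std::uint8_t* dst,
                 std::size_t stepBytes, std::size_t rows)
{
    for (std::size_t i = 0; i < rows * stepBytes; ++i)
        dst[i] = static_cast<std::uint8_t>(255 - src[i]);
}

// Split the image into horizontal bands and hand each band to the primitive.
// With OpenMP enabled (e.g. g++ -fopenmp) the bands run in parallel;
// without it the pragma is ignored and the loop runs serially.
void processInBands(const std::uint8_t* src, std::uint8_t* dst,
                    std::size_t stepBytes, std::size_t height, int bands)
{
    assert(bands > 0);
    const std::size_t rowsPerBand = (height + bands - 1) / bands;
    #pragma omp parallel for
    for (int b = 0; b < bands; ++b) {
        const std::size_t first = std::min(height, b * rowsPerBand);
        const std::size_t rows = std::min(rowsPerBand, height - first);
        if (rows > 0)
            processRows(src + first * stepBytes, dst + first * stepBytes,
                        stepBytes, rows);
    }
}
```

A resize is harder to band than this, because each destination band needs the matching source rows and offsets; the sketch only shows the splitting pattern.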

BatterseaSteve · ‎05-22-2011

Hi

Thanks for the reply. I am still finding my way around IPP and did not see the threaded function list. I am linking statically. So I guess my question is why I see no difference in timing regardless of whether I call ippInit, which, as I say, makes me suspicious.

My understanding was that without ippInit the call would reduce to optimised C. Is there any way I can find out which CPU variation I am actually calling?

Are there any perf figures I can compare against to see if mine are in the ballpark?

Cheers

Steve

Chao_Y_Intel · ‎05-23-2011

Steve,

You can use the following function to check which optimized version you are using:
ippiGetLibVersion()

The function returns version information for the library in use, which tells you which optimized version was selected.
Also, you may consider using the following two functions to resize the image:
ippiResizeSqrPixel() // Resize first. If there are three YUV planes, you may need to call it three times.
ippiYCrCb420ToCbYCr422_8u_P3C2R // Color conversion.

ippiResizeSqrPixel is better optimized for performance.
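
To make the "call it three times" structure concrete, here is a small plain C++ sketch for a planar 4:2:0 frame. `resizePlaneNN` is a hypothetical nearest-neighbour stand-in for the single-plane resize call, and `Yuv420` and `resizeYuv420` are invented names; the planes are assumed to be tightly packed with even dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a single-plane resize (nearest neighbour).
void resizePlaneNN(const std::uint8_t* src, std::size_t srcW, std::size_t srcH,
                   std::uint8_t* dst, std::size_t dstW, std::size_t dstH)
{
    assert(srcW > 0 && srcH > 0);
    for (std::size_t y = 0; y < dstH; ++y) {
        const std::size_t sy = y * srcH / dstH;  // nearest source row
        for (std::size_t x = 0; x < dstW; ++x)
            dst[y * dstW + x] = src[sy * srcW + x * srcW / dstW];
    }
}

struct Yuv420 {
    std::size_t width, height;          // luma size, both even
    std::vector<std::uint8_t> y, u, v;  // tightly packed planes
};

// One resize call per plane: Y at full size, U and V at half size
// in each dimension.
Yuv420 resizeYuv420(const Yuv420& in, std::size_t outW, std::size_t outH)
{
    assert(in.width % 2 == 0 && in.height % 2 == 0);
    assert(outW % 2 == 0 && outH % 2 == 0);
    Yuv420 out{outW, outH,
               std::vector<std::uint8_t>(outW * outH),
               std::vector<std::uint8_t>((outW / 2) * (outH / 2)),
               std::vector<std::uint8_t>((outW / 2) * (outH / 2))};
    resizePlaneNN(in.y.data(), in.width, in.height, out.y.data(), outW, outH);
    resizePlaneNN(in.u.data(), in.width / 2, in.height / 2,
                  out.u.data(), outW / 2, outH / 2);
    resizePlaneNN(in.v.data(), in.width / 2, in.height / 2,
                  out.v.data(), outW / 2, outH / 2);
    return out;
}
```

The resized planes would then go through the planar-to-packed color conversion as a separate step.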

Thanks,
Chao

Thomas_Jensen1 · ‎05-24-2011

To be more clear, ippiResizeSqrPixel can utilize multiple cores (when threading is enabled in IPP), ippiResize cannot.

BatterseaSteve · ‎05-24-2011

Thanks guys,
That's very useful - I will take a look at ResizeSqrPixel.
I did get to see what version I was calling - in a crash!
It seems to be using y8, which is correct for the CPUs I am running.

So back to my original question - I see no difference in speed
regardless of whether I call ippInit or not.

If I run the t2.cpp example:

/* static non-threaded lib
g++ t2.cpp -I /opt/intel/composerxe-2011.4.191/ipp/include -o t2 \
/opt/intel/composerxe-2011.4.191/ipp/lib/intel64/libipps_t.a \
/opt/intel/composerxe-2011.4.191/ipp/lib/intel64/libippcore_t.a \
/opt/intel/composerxe-2011.4.191/compiler/lib/intel64/libiomp5.a -lpthread

static threaded
g++ t2.cpp -I /opt/intel/composerxe-2011.4.191/ipp/include -o t2 \
/opt/intel/composerxe-2011.4.191/ipp/lib/intel64/libipps_l.a \
/opt/intel/composerxe-2011.4.191/ipp/lib/intel64/libippcore_l.a

dynamic
g++ t2.cpp -I /opt/intel/composerxe-2011.4.191/ipp/include -o t2 -L /opt/intel/composerxe-2011.4.191/ipp/lib/intel64 -lippcore -lipps -lpthread
*/

#include <stdio.h>
#include <ipp.h>

int main()
{
    const int N = 20000, loops = 100;
    Ipp32f src[N], dst[N];
    unsigned int seed = 12345678, i;
    Ipp64s t1, t2;

    /// no StaticInit call, means PX code, not optimized
    ippsRandUniform_Direct_32f(src, N, 0.0, 1.0, &seed);
    t1 = ippGetCpuClocks();
    for (i = 0; i < loops; i++) ippsSqrt_32f(src, dst, N);
    t2 = ippGetCpuClocks();
    printf("without StaticInit: %.1f clocks/element\n", (float)(t2 - t1) / loops / N);
    ippInit();
    t1 = ippGetCpuClocks();
    for (i = 0; i < loops; i++) ippsSqrt_32f(src, dst, N);
    t2 = ippGetCpuClocks();
    printf("with StaticInit: %.1f clocks/element\n", (float)(t2 - t1) / loops / N);
    return 0;
}

a) static non-threaded, I get
without StaticInit: 1.4 clocks/element
with StaticInit: 2.8 clocks/element

b) static threaded
without StaticInit: 1.4 clocks/element
with StaticInit: 2.3 clocks/element

c) dynamic
without StaticInit: 2.5 clocks/element
with StaticInit: 1.0 clocks/element

The only one that is faster after ippInit is the dynamically loaded one!
Am I missing something here?
Steve

Joseph_S_Intel · ‎05-25-2011

Hi Steve,
The minimum instruction set supported in IPP has changed since Intel IPP 7.0; for instance, it is now SSE3 on the 64-bit version of Intel IPP. So the ippsSqrt function will use at least SSE3 even without ippInit, and the performance of that version is probably not that different from the y8 (SSE4.1, SSE4.2, AES-NI) version.

See this article:
http://software.intel.com/en-us/articles/understanding-simd-optimization-layers-and-dispatching-in-the-intel-ipp-70-library/

In addition, I built your code and see the decrease in clocks per element after calling ippInit in the dynamically loaded version, but I also see the same speed-up after commenting out the ippInit call, so the difference is due to something else.

For the static version I did not see an increase in the number of clocks per element after calling ippInit. You should probably separate the experiments into different program runs to eliminate any cache warming or other effects.
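
Within a single run, one way to reduce the cache-warming effect is to do a few untimed passes before the timed loop. A minimal sketch of that pattern in plain C++ follows, using std::chrono rather than ippGetCpuClocks; `sqrtAll` is a hypothetical stand-in for the routine under test and `nsPerElement` is an invented helper name.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical workload standing in for the routine under test.
void sqrtAll(const std::vector<float>& src, std::vector<float>& dst)
{
    assert(src.size() == dst.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = std::sqrt(src[i]);
}

// Run `warmup` untimed passes, then time `loops` passes and
// return the average cost in nanoseconds per element.
double nsPerElement(const std::vector<float>& src, std::vector<float>& dst,
                    int warmup, int loops)
{
    assert(loops > 0 && !src.empty());
    for (int i = 0; i < warmup; ++i)
        sqrtAll(src, dst);  // touch caches and page in both buffers
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < loops; ++i)
        sqrtAll(src, dst);
    const auto t1 = std::chrono::steady_clock::now();
    const double ns =
        std::chrono::duration<double, std::nano>(t1 - t0).count();
    return ns / loops / static_cast<double>(src.size());
}
```

In the posted test, the first timed loop runs on cold buffers and the second on warm ones, so the two numbers are not measured under the same conditions.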

Chao_Y_Intel · ‎05-27-2011

Steve,

Another note on the benchmark code concerns data alignment. Since each element only needs 1 or 2 CPU clock ticks, memory access becomes an important factor in the performance. For the src/dst data, you can use the ippsMalloc_ functions to allocate aligned buffers, so that the test code behaves similarly from run to run.
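
If you want to see the effect of alignment without changing the allocator in your IPP code yet, the sketch below shows the idea using POSIX posix_memalign as a stand-in for the ippsMalloc_ functions. The 64-byte alignment is an assumption chosen for illustration, and `allocAlignedFloats` and `isAligned` are invented helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Allocate `count` floats aligned to `alignment` bytes. The alignment must
// be a power of two and a multiple of sizeof(void*).
// Returns nullptr on failure; release the buffer with free().
float* allocAlignedFloats(std::size_t count, std::size_t alignment = 64)
{
    assert(alignment % sizeof(void*) == 0);
    assert((alignment & (alignment - 1)) == 0);
    void* p = nullptr;
    if (posix_memalign(&p, alignment, count * sizeof(float)) != 0)
        return nullptr;
    return static_cast<float*>(p);
}

// True if the pointer address is a multiple of `alignment`.
bool isAligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With stack arrays like those in the test program, the alignment can differ between builds, which is one reason timings may vary between the static and dynamic versions.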

Thanks,
Chao