Evaluating IPP for MPEG encode/decode. New to IPP.
Our platform is a dual-socket Intel Xeon X5650 (24 logical CPUs) under RedHat 5.5, using gcc 4.1.2.
We are currently using ffmpeg to decode movies, but we want to conform all decoded output to packed UYVY.
We are using IPP to convert formats such as YUV420p, YUV422p, etc. into packed CbYCr.
We are also resizing video and are using ippiResizeYUV422_8u_C2R to do this.
However, this produces YUYV, so we then use ippiYCbCr422ToCbYCr422_8u_C2R to repack into UYVY (a sketch of this two-step path follows the questions below). My questions are:
1) Is there a resize that we can use that resizes directly into CbYCr?
2) The resize is slow even for nearest neighbour (4-5 ms for HD) and unusable for cubic. I notice that even though we are linking in the threaded libraries, no threading takes place - is this expected for these functions? I also see no difference whether I call ippInit (or ippStaticInit) or not, which makes me suspicious that we are (for some reason) not even calling the optimised functions.
3) Can anyone suggest what strategy we should use when resizing interlaced video - is it normal to de-interlace before resizing, or is there another way of dealing with this?
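For reference, here is a rough sketch of our current two-step path. Buffer names and strides are placeholders, and the parameter order is taken from the IPP 7.x reference manual, so check it against your headers:

#include "ippi.h"

/* Sketch of our current path (placeholder names/steps): resize in YUYV
   order, then repack to UYVY. Error handling trimmed for brevity. */
static IppStatus resizeToUyvy(const Ipp8u* pSrcYuyv, IppiSize srcSize, int srcStep,
                              Ipp8u* pTmpYuyv, Ipp8u* pDstUyvy,
                              IppiSize dstSize, int dstStep)
{
    IppiRect srcRoi = { 0, 0, srcSize.width, srcSize.height };
    IppStatus st = ippiResizeYUV422_8u_C2R(
        pSrcYuyv, srcSize, srcStep, srcRoi,
        pTmpYuyv, dstStep, dstSize,
        (double)dstSize.width  / srcSize.width,
        (double)dstSize.height / srcSize.height,
        IPPI_INTER_NN);                      /* nearest neighbour */
    if (st != ippStsNoErr) return st;
    /* byte-order repack: YUYV (YCbCr422) -> UYVY (CbYCr422) */
    return ippiYCbCr422ToCbYCr422_8u_C2R(pTmpYuyv, dstStep, pDstUyvy, dstStep, dstSize);
}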
Any help/comments or abuse is welcome.
Cheers
1) Is there a resize that we can use that resizes directly into CbYCr?
No, we don't have a function that resizes directly into CbYCr.
2) The resize is slow even for nearest neighbour (4-5 ms for HD) and unusable for cubic. I notice that even though we are linking in the threaded libraries, no threading takes place - is this expected for these functions? I also see no difference whether I call ippInit (or ippStaticInit) or not, which makes me suspicious that we are (for some reason) not even calling the optimised functions.
Are you linking statically? ippInit is only necessary if you are linking statically; if you are linking dynamically you do not need to call any initialization function. Only about twenty percent of the functions in the Intel IPP shared and static threaded libraries are actually threaded. There is a file called ThreadedFunctionsList.txt in the Intel IPP documentation that lists the functions which are threaded, and the functions you mention above are not in that list. In some cases you can use a TBB, Cilk Plus, or OpenMP wrapper to thread the primitive functions yourself, so that might be an option, as in the sketch below.
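For example, since the 4:2:2 repack is pointwise, a simple OpenMP wrapper can split the image into horizontal bands. This is only a sketch with made-up function and variable names; note also that gcc supports OpenMP only from version 4.2 onward, so with gcc 4.1.2 you would need icc or an equivalent TBB/pthreads split:

#include <omp.h>
#include "ippi.h"

/* Sketch: thread ippiYCbCr422ToCbYCr422_8u_C2R by slicing the image
   into horizontal bands; safe because the conversion is pointwise. */
static void convertUyvyThreaded(const Ipp8u* pSrc, int srcStep,
                                Ipp8u* pDst, int dstStep, IppiSize roi)
{
    int nBands = omp_get_max_threads();
    int band   = (roi.height + nBands - 1) / nBands;  /* rows per band */
    #pragma omp parallel for
    for (int b = 0; b < nBands; b++) {
        int y0 = b * band;
        int h  = (y0 + band > roi.height) ? roi.height - y0 : band;
        if (h > 0) {
            IppiSize r = { roi.width, h };
            ippiYCbCr422ToCbYCr422_8u_C2R(pSrc + y0 * srcStep, srcStep,
                                          pDst + y0 * dstStep, dstStep, r);
        }
    }
}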
Steve,
You can use the following function to check which optimized code path you are using:
ippiGetLibVersion()
It returns the library version, from which you can tell which optimized version was dispatched.
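For example, a minimal sketch:

#include <stdio.h>
#include "ipp.h"

int main(void)
{
    const IppLibraryVersion* v;
    ippInit();                    /* needed for static linking; without it the generic px code is reported */
    v = ippiGetLibVersion();
    printf("%s %s\n", v->Name, v->Version);   /* e.g. a "y8" build on SSE4.x CPUs */
    return 0;
}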
Also, you may consider using the following two functions to resize the image:
ippiResizeSqrPixel() // resize first; with three YUV planes you may need to call it three times
ippiYCrCb420ToCbYCr422_8u_P3C2R // color conversion
ippiResizeSqrPixel is better optimized for performance.
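Something like the following sketch (plane pointers, strides, and sizes are placeholders; check the parameter order against the IPP 7.x manual):

#include "ippi.h"
#include "ipps.h"

/* Sketch: resize one plane of planar 4:2:0 with ippiResizeSqrPixel.
   Call once for Y and once for each half-size chroma plane. */
static IppStatus resizePlaneNN(const Ipp8u* pSrc, IppiSize srcSize, int srcStep,
                               Ipp8u* pDst, IppiSize dstSize, int dstStep)
{
    IppiRect srcRoi = { 0, 0, srcSize.width, srcSize.height };
    IppiRect dstRoi = { 0, 0, dstSize.width, dstSize.height };
    int bufSize; Ipp8u* pBuf; IppStatus st;
    ippiResizeGetBufSize(srcRoi, dstRoi, 1, IPPI_INTER_NN, &bufSize);
    pBuf = ippsMalloc_8u(bufSize);            /* work buffer required by the API */
    st = ippiResizeSqrPixel_8u_C1R(pSrc, srcSize, srcStep, srcRoi,
                                   pDst, dstStep, dstRoi,
                                   (double)dstSize.width  / srcSize.width,
                                   (double)dstSize.height / srcSize.height,
                                   0.0, 0.0, IPPI_INTER_NN, pBuf);
    ippsFree(pBuf);
    return st;
}

/* Then pack the three resized planes into UYVY, e.g.:
   ippiYCrCb420ToCbYCr422_8u_P3C2R(pPlanes, planeSteps, pUyvy, uyvyStep, dstSize); */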
Thanks,
Chao
That's very useful - I will take a look at ResizeSqrPixel.
I did get to see which version I was calling - in a crash!
It seems it is using y8, which is correct for the CPUs I am running.
So back to my original question: I see no difference in speed regardless of whether I call ippInit or not.
If I run the t2.cpp example:
/* static threaded lib (the _t libraries; need libiomp5)
g++ t2.cpp -I /opt/intel/composerxe-2011.4.191/ipp/include -o t2 \
/opt/intel/composerxe-2011.4.191/ipp/lib/intel64/libipps_t.a \
/opt/intel/composerxe-2011.4.191/ipp/lib/intel64/libippcore_t.a \
/opt/intel/composerxe-2011.4.191/compiler/lib/intel64/libiomp5.a -lpthread

static non-threaded lib (the _l libraries)
g++ t2.cpp -I /opt/intel/composerxe-2011.4.191/ipp/include -o t2 \
/opt/intel/composerxe-2011.4.191/ipp/lib/intel64/libipps_l.a \
/opt/intel/composerxe-2011.4.191/ipp/lib/intel64/libippcore_l.a

dynamic
g++ t2.cpp -I /opt/intel/composerxe-2011.4.191/ipp/include -o t2 -L /opt/intel/composerxe-2011.4.191/ipp/lib/intel64 -lippcore -lipps -lpthread
*/
#include <stdio.h>
#include "ipp.h"

int main()
{
    const int N = 20000, loops = 100;
    Ipp32f src[N], dst[N];
    unsigned int seed = 12345678, i;
    Ipp64s t1, t2;
    /* no ippInit/ippStaticInit call yet, so the generic px code runs */
    ippsRandUniform_Direct_32f(src, N, 0.0, 1.0, &seed);
    t1 = ippGetCpuClocks();
    for (i = 0; i < loops; i++) ippsSqrt_32f(src, dst, N);
    t2 = ippGetCpuClocks();
    printf("without StaticInit: %.1f clocks/element\n", (float)(t2 - t1) / loops / N);
    ippInit();   /* dispatch to the optimised code for this CPU */
    t1 = ippGetCpuClocks();
    for (i = 0; i < loops; i++) ippsSqrt_32f(src, dst, N);
    t2 = ippGetCpuClocks();
    printf("with StaticInit: %.1f clocks/element\n", (float)(t2 - t1) / loops / N);
    return 0;
}
a) static threaded (_t) I get
without StaticInit: 1.4 clocks/element
with StaticInit: 2.8 clocks/element
b) static non-threaded (_l)
without StaticInit: 1.4 clocks/element
with StaticInit: 2.3 clocks/element
c) dynamic
without StaticInit: 2.5 clocks/element
with StaticInit: 1.0 clocks/element
The only one that is faster with ippInit is the dynamically linked one!
Am I missing something here?
Steve
Hi Steve,
The minimum instruction set supported has changed as of Intel IPP 7.0; for instance, it is now SSE3 in the 64-bit version of Intel IPP, so the ippsSqrt function will use at least SSE3, and the performance is probably not that different from that version to the y8 (SSE4.1/4.2, AES-NI) version.
See this article:
http://software.intel.com/en-us/articles/understanding-simd-optimization-layers-and-dispatching-in-the-intel-ipp-70-library/
In addition, I built your code and see the decrease in clocks per element after calling ippInit in the dynamically linked version, but I also see the same speed-up after commenting out the ippInit; the differential is due to something else.
For the static version I did not see an increase in the number of clocks per element after calling ippInit. You should probably separate the experiments into different program runs to eliminate any cache-warming or other effects.
Steve,
Another note on the benchmark code concerns data alignment. Since each element takes only 1 or 2 CPU clock ticks, memory access becomes the dominant factor in performance. For the src/dst data you can use ippsMalloc_32f to allocate aligned buffers, so the test code shows similar performance behavior from run to run.
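For example, inside main() in t2.cpp (a minimal sketch):

#include "ipps.h"

/* replace the stack arrays with 32-byte aligned IPP buffers */
Ipp32f* src = ippsMalloc_32f(N);
Ipp32f* dst = ippsMalloc_32f(N);
/* ... timing loops as before ... */
ippsFree(dst);
ippsFree(src);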
Thanks,
Chao
