1) Is there a Resize that we can use that resizes directly into CbYCr?
No, we don't have a function that resizes directly into CbYCr.
2) The Resize is slow even for Nearest Neighbour (4-5 msec for HD) and unusable for Cubic. I notice that even though we are linking in threaded libraries, no threading takes place - is this expected for these functions? I also see no difference between calling ippInit (or ippStaticInit) and not calling it, which makes me suspicious that we are (for some reason) not even calling the optimised functions.
Are you linking statically? ippInit is only necessary if you are linking statically; if you are linking dynamically you do not need to call any initialization function. Only about twenty percent of the functions in the Intel IPP shared and static threaded libraries are actually threaded. There is a file called ThreadedFunctionsList.txt in the Intel IPP documentation that lists the functions which are threaded, and the functions you mention above are not in that list. In some cases you can use a TBB, Cilk Plus, or OpenMP wrapper to thread the primitive functions yourself, so that might be an option for these functions.
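In case it helps, here is a hedged sketch of one such OpenMP wrapper: it splits the image into horizontal strips and processes one strip per thread (compile with OpenMP enabled, e.g. -fopenmp). ippiCopy_8u_C1R stands in for any row-independent primitive; a resize would additionally need per-strip source ROI mapping, so treat this as the pattern only, not a drop-in solution:

#include <ipp.h>
#include <omp.h>

/* Thread a single-threaded IPP primitive by tiling the image into
   horizontal strips, one strip per OpenMP thread. */
void threaded_copy_8u(const Ipp8u* pSrc, int srcStep,
                      Ipp8u* pDst, int dstStep, IppiSize roi)
{
    int nThreads = omp_get_max_threads();
    #pragma omp parallel num_threads(nThreads)
    {
        int t      = omp_get_thread_num();
        int stripH = (roi.height + nThreads - 1) / nThreads; /* ceiling */
        int y0     = t * stripH;
        if (y0 < roi.height) {
            IppiSize strip;
            strip.width  = roi.width;
            strip.height = (y0 + stripH <= roi.height) ? stripH
                                                       : roi.height - y0;
            /* steps are in bytes, so plain pointer offsets work for 8u data */
            ippiCopy_8u_C1R(pSrc + y0 * srcStep, srcStep,
                            pDst + y0 * dstStep, dstStep, strip);
        }
    }
}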
You can use the following function to check which optimized version you are using; it returns the library version, from which you can tell which optimized code path was dispatched:
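A minimal sketch, assuming the function meant here is ippGetLibVersion() from the ippcore domain (it reports which CPU-specific branch of the library was dispatched):

#include <stdio.h>
#include <ipp.h>

int main(void)
{
    /* With static linking, call ippInit() first so the dispatcher can
       select the best code path for the current CPU. */
    ippInit();

    const IppLibraryVersion* v = ippGetLibVersion();
    printf("%s %s\n", v->Name, v->Version); /* e.g. the name of the y8 branch */
    return 0;
}

If the printed name corresponds to the generic (px) branch, the optimized code is not being dispatched.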
Also, you may consider using the following two functions to resize the image:
ippiResizeSqrPixel() // resize first; if there are three YUV planes, you may need to call it three times
ippiYCrCb420ToCbYCr422_8u_P3C2R() // color conversion
ippiResizeSqrPixel is the better-optimized function for performance; a sketch of the resize step follows below.
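For illustration, here is a hedged sketch of that resize step for one 8u plane; the nearest-neighbour interpolation, ROI setup, and error handling are my assumptions, so check the prototypes in your IPP version:

#include <ipp.h>

/* Resize one 8-bit plane with ippiResizeSqrPixel. For YCrCb420 input
   this would run once on the Y plane at full resolution and once on
   each half-resolution chroma plane. */
static IppStatus resize_plane_8u(const Ipp8u* pSrc, IppiSize srcSize, int srcStep,
                                 Ipp8u* pDst, IppiSize dstSize, int dstStep)
{
    IppiRect srcRoi = { 0, 0, srcSize.width, srcSize.height };
    IppiRect dstRoi = { 0, 0, dstSize.width, dstSize.height };
    double xFactor  = (double)dstSize.width  / srcSize.width;
    double yFactor  = (double)dstSize.height / srcSize.height;
    int bufSize = 0;

    IppStatus st = ippiResizeGetBufSize(srcRoi, dstRoi, 1, IPPI_INTER_NN, &bufSize);
    if (st != ippStsNoErr) return st;

    Ipp8u* pBuffer = ippsMalloc_8u(bufSize); /* aligned scratch buffer */
    if (!pBuffer) return ippStsMemAllocErr;

    st = ippiResizeSqrPixel_8u_C1R(pSrc, srcSize, srcStep, srcRoi,
                                   pDst, dstStep, dstRoi,
                                   xFactor, yFactor, 0.0, 0.0,
                                   IPPI_INTER_NN, pBuffer);
    ippsFree(pBuffer);
    return st;
}

Once the three planes are resized, a single call to ippiYCrCb420ToCbYCr422_8u_P3C2R (taking the three plane pointers and their steps) produces the packed CbYCr422 output.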
The minimum instruction set supported by Intel IPP changed in version 7.0; for instance, it is now SSE3 in the 64-bit version, so the ippSqrt function will use at least SSE3 code, and its performance is probably not much different from that version to the y8 (SSE4.1, SSE4.2, AES-NI) version.
See this article:
In addition, I built your code and saw the decrease in clocks per element after calling ippInit in the dynamically linked version, but I also saw the same speedup after commenting out the ippInit; the differential is due to something else.
For the static version I did not see an increase in the number of clocks per element after calling ippInit. You should probably separate the experiments into different program runs to eliminate any cache warming or other effects.
One more note on the benchmark code concerns data alignment. Since each element needs only 1 or 2 CPU clock ticks, memory access becomes the dominant factor in performance. For the src/dst data, you can use ippsMalloc_<type> to allocate aligned buffers, so the test code shows similar performance behavior from run to run.
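A minimal sketch of that allocation pattern (the sizes are illustrative):

#include <ipp.h>

int main(void)
{
    /* ippsMalloc_8u returns a buffer aligned to a 32/64-byte boundary
       (version-dependent), unlike plain malloc, which keeps the timed
       runs from being skewed by unaligned accesses. */
    int width = 1920, height = 1080;
    Ipp8u* pSrc = ippsMalloc_8u(width * height);
    Ipp8u* pDst = ippsMalloc_8u(width * height);
    if (pSrc && pDst) {
        /* ... run the timed loops on pSrc/pDst here ... */
    }
    if (pSrc) ippsFree(pSrc);
    if (pDst) ippsFree(pDst);
    return 0;
}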