I am wondering whether IPP (7.1) in general, and ippiWarpAffine* in particular, takes advantage of TBB's parallel_for and, if so, how to enable it. When I enabled TBB in OpenCV I got a significant speed boost in warpAffine().
My test images (CT medical images) are 512x512 (8u), and I am using CUBIC interpolation with a destination size of 1590x820. OpenCV (with TBB) is more than 3 times faster than IPP for exactly the same affine transform. It is worth mentioning that in both cases (IPP and OpenCV) I am using Java wrappers under 64-bit Linux (RH6). For IPP I compiled the Java language support (from IPP 7.0.7) against 7.1 and I am using jipp.ip.ippiWarpAffine_8u_C1R(). From OpenCV I am using Imgproc.warpAffine().
Any ideas? Please note that I am new to IPP and TBB and I am evaluating different products in order to find a good basis for a rendering library (64-bit: Win7, Linux, Mac). From Intel I downloaded Intel C++ Composer XE 2013, which bundles IPP and TBB along with MKL and Intel's compiler, and it seems a nice fit for us so far.
ippiWarpAffine is not internally threaded (see Documentation\en_US\ipp\ThreadedFunctionsList.txt for the
list of threaded functions), so it cannot benefit from IPP's internal threading. If you want threaded
performance, you need to implement the high-level threading yourself, with TBB or by other means.
I did notice ThreadedFunctionsList.txt, and I decomposed my affine transform into mirror, rotate, and resize. Overall I got pretty good results; however, I am wondering if you can be a little more specific about how to proceed in using TBB's parallel_for with ippiWarpAffine_8u_C1R(). Are you suggesting that I decompose the source image into smaller parts (with some overlap, perhaps)?
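One possible way to do the high-level threading, sketched here under assumptions rather than taken from the IPP documentation: split the destination into horizontal strips instead of splitting the source. Each strip is computed from the whole source image, so no overlap between pieces should be needed. This assumes ippiWarpAffine_8u_C1R accepts a destination ROI rectangle, so that each task can pass its own strip. The sketch below does not call IPP or TBB at all: warpRows is a hypothetical stand-in for the IPP call (nearest neighbour only, for brevity), Image and warpParallel are illustrative names, and std::thread is used so the example builds without extra libraries. With TBB, the loop over strips would become a tbb::parallel_for over a tbb::blocked_range<int> of destination rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Minimal single-channel 8-bit image (illustrative type, not an IPP type).
struct Image {
    int width, height;
    std::vector<std::uint8_t> pixels;  // row-major, one byte per pixel
    Image(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0) {}
};

// Hypothetical stand-in for an ROI-aware warp call. Fills destination rows
// [rowBegin, rowEnd) by mapping each destination pixel back into the FULL
// source image. inv is the 2x3 inverse affine matrix (destination -> source).
void warpRows(const Image& src, Image& dst, const double inv[2][3],
              int rowBegin, int rowEnd) {
    for (int y = rowBegin; y < rowEnd; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            const double sx = inv[0][0] * x + inv[0][1] * y + inv[0][2];
            const double sy = inv[1][0] * x + inv[1][1] * y + inv[1][2];
            const long ix = std::lround(sx);
            const long iy = std::lround(sy);
            // Pixels that map outside the source are left untouched.
            if (ix >= 0 && ix < src.width && iy >= 0 && iy < src.height) {
                dst.pixels[static_cast<std::size_t>(y) * dst.width + x] =
                    src.pixels[static_cast<std::size_t>(iy) * src.width + ix];
            }
        }
    }
}

// Splits the destination into numStrips horizontal strips and warps each
// strip on its own thread. Strips write to disjoint rows, so no locking
// is required; the source is only read.
void warpParallel(const Image& src, Image& dst, const double inv[2][3],
                  int numStrips) {
    assert(numStrips > 0);
    const int rowsPerStrip = (dst.height + numStrips - 1) / numStrips;
    std::vector<std::thread> workers;
    for (int s = 0; s < numStrips; ++s) {
        const int begin = s * rowsPerStrip;
        const int end = std::min(dst.height, begin + rowsPerStrip);
        if (begin >= end) break;  // more strips requested than rows available
        workers.emplace_back([&src, &dst, inv, begin, end] {
            warpRows(src, dst, inv, begin, end);
        });
    }
    for (auto& w : workers) w.join();
}
```

Calling warpParallel(src, dst, inv, 4) should give the same pixels as a single warpRows(src, dst, inv, 0, dst.height) pass, because every destination pixel is computed independently of its neighbours. Whether the real IPP call behaves identically at strip boundaries with CUBIC interpolation is something to verify on your own images.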