Parallelizing 2D integer FFTs using Intel IPP

shehrzad-qureshi · ‎01-02-2007

This question is probably best directed at the authors of the Intel whitepaper "Intel IPP Performance Tips and Tricks" (http://www.intel.com/support/performancetools/libraries/ipp/win/ia/sb/cs-010671.htm). My application runs on a dual-core Intel PC, and fast image filtering (with large 2D kernels) is an integral part of my processing loop. I easily observe a 3-4x speedup by performing the filtering in fixed-point and in the frequency domain (i.e. precomputing the kernel FFT, multiplying the kernel FFT by the FFT of the incoming image, and then taking the inverse FFT), however I would like the input image FFT computation to take advantage of the dual-core architecture.

I am using the ippiFFTFwd_RToPack_8u32s_C1RSfs() function to compute the FFT of theinput image. According to the list of parallelized Intel IPP functions (http://downloadfinder.intel.com/scripts-df-external/Detail_Desc.aspx?agr=N&ProductID=574&DwnldID=8643&strOSs=All&OSFullname=All%20Operating%20Systems〈=eng), this function does not at this time utilize OpenMP, although the floating-point variant does.

Therefore I am very interested in crafting my own OpenMP parallelized radix-2 decimation-in-time 2D FFT, as described on pages 5-6 of the aforementioned Tips and Tricks document (Multi-Thread Processing section). Unfortunately the sample code listed in that PDF is not very clear, as the arguments passed into the functions aren't even documented. I am assuming "tw" stands for twiddle factors. Anyways, I recall in the old Intel Image Processing Library and Signal Processing Library the APIs provided access to the twiddle factors. You need those to perform a radix 2 decimation-in-time, furthermore you need the butterfly indexing tables - the new Intel IPP does not provide access to these data elements, as far as I know. With respect to the twiddle factors, an added complication arises in that the twiddle factors need to be converted into the "packed" format.

Basically what I am asking is, is there any sample code available that illustrates how to perform a radix-2 FFT in parallel using OpenMP, and will subsequent versions of the Intel IPP library include paralellized versions of the fixed-point 2D FFT functions?

Regards,

Shehrzad Qureshi

Vladimir_Dudnik · ‎01-16-2007

Hello,

there is comment from our expert:
Ifyou use large 2D kernels,you shouldnt use ippiFilter (or compare with ippiFilter). Ifyour kernel is greater than 11x11 better to use ippiConv functionality it is already based on FFT and parallelized inside. In addition: there is no need to parallelize FFT internally for 2D processing better to run all FFTs in different threads, parallelizing on the butterfly level is too expensive and non-efficient.

Regards,
Vladimir

shehrzad-qureshi · ‎01-19-2007

Thanks for the response - makes sense except I still have a further question. I understand that for large kernels it makes sense to use ippiConv (FYI my kernels are 21x21). If I were to use ippiConv, where I am repeatedly filtering my incoming video frames by these large kernels, won't ippiConv be continually recalculating the FFT of the kernel? Were that to be true, I would think I would quickly lose a major part of the advantage with frequency-domain filtering (i.e. the part where one can pre-compute the FFT of the kernel).

Thanks again for youe help,

-SQ

Vladimir_Dudnik · ‎01-20-2007

Hi,

there is comment from our expert:

Let us consider filtering of frames with 640x480 size. Filter size is 21x21 rounding to the nearest power of 2 leads to 32x32. In ippiConv is used blocking algorithm, and block size is determined by kernel size for this particular case it is 32x32 (or may be 64x64 it depends on data type and platform). A number of blocks thus is (640/32)x(480/32) = 20x15 = 300 (or 75 for 64x64 block size). So it is clear that time required for FFT from filtering kernel is at least less than 100/75 = 1.3% (we also should take into account that in addition to forward FFT from filtering kernel each image block requires multiplication in freq domain and inverse FFT therefore these 1.3% must be divided at least by 2) so for 640x480 frames and 64x64 block size forward FFT from filtering kernel takes less than 1% of the total filtering time. Is 1% a critical performance issue? For 1280x960 frame size overhead will be less than 100/2x300 ~0.17%...

Regards,
Vladimir

Sharschchandra_Bidar · ‎09-15-2010

Hi Vladimir,

I happened to come across this post, as I am "illegal instruction" error for kernel sizes greater than 10x10 when used with ippiConvFull_8u_C1R function.

I ran a for-loop which varied the kernel sizes from 1x1 to 10x10, but when the kernel size reached 11x11 I got "illegal instruction error".

When I debugged using GDB, I see that the illegal instruction error was raised from "ipps_cRadix4FwdNorm_32fc () from /usr/local/lib/libcv.so.2.0"

Can you tell me what I might be doing wrong?

Thanks,
Sharsch

Vladimir_Dudnik · ‎09-15-2010

Hi Sharsch,

could you please provide a simple test case in order for us to take deep look into the issue? An additional info could also help. What version of IPP do you use, how do you link with IPP, what is your target cpu (32-bit or 64-bit). What version of OpenCV do you use. How do you build OpenCV.

Regards,
Vladimir

Ying_H_Intel · ‎09-20-2010

Hi Sharsch,

The problem seemsrelavent toOpenCV library built with IPP support on Atom processor and em64t ipp library. If use n8 code and remove the opencv library, then everything is ok. A small test case will be helpful to locate the problem.

Best Regards,
Ying