Which is the fastest way to use the IPPS FFT?

Reppel__Niklas · ‎02-20-2018

Hi,

I've recently been porting a convolution library that uses Apple's vDSP to IPP to make it cross-platform.

In general, I got it working and the results seem fine, but I'm somewhat stunned that it is significantly slower than vDSP on the same machine (compared over various block sizes).

The platform I'm currently working on is OS X El Capitan, with clang as the compiler, on an Intel i7 (Broadwell) processor.

I'm currently using the real-valued in-place FFT. Calling ippInit() or not doesn't seem to change anything.

I was wondering if I'm making a mistake somewhere, or if there's anything I've overlooked?

Best,

n

Gennady_F_Intel · ‎02-20-2018

Yes, calling ippInit() will dispatch the appropriate IPP code.

What do you mean by "significantly" slower? What is the problem size, and which version of IPP are you using?

You may try the IPP perf system tool (IPPROOT\tool\intel64\perfsys) to check the right performance numbers for this ipps function.

Reppel__Niklas · ‎02-20-2018

I'm benchmarking real-valued FFTs over block sizes between 4 and 16384 points, orders 2 to 16.

Library version, as determined by ps_ipps: ippSP AVX2 (l9), 2017.0.3 (r55431), Apr 12 2017

"Significantly" means the IPP version needs on average about twice the time.
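
A timing loop for this kind of per-block-size comparison might be sketched as follows. This is an illustration under assumptions, not the benchmark used in this thread: `transform_fn`, `time_per_call`, and `negate_standin` are hypothetical names, and the stand-in simply negates the samples where the IPP or vDSP call would go. The idea is that setup such as FFT spec initialization and buffer allocation stays outside the timed loop, and one untimed warm-up call precedes the measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one in-place transform of n samples.
   Anything built during setup (specs, work buffers) is passed via ctx. */
typedef void (*transform_fn)(float *data, size_t n, void *ctx);

/* Returns the average CPU seconds per call of fn on a block of n samples.
   Only the calls themselves are timed; setup is expected to be done already. */
double time_per_call(transform_fn fn, float *data, size_t n, void *ctx, int reps)
{
    assert(fn != NULL && data != NULL && reps > 0);

    fn(data, n, ctx);                 /* untimed warm-up call */

    clock_t start = clock();
    for (int i = 0; i < reps; ++i)
        fn(data, n, ctx);
    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}

/* Stand-in transform used only for illustration: negates every sample.
   A real benchmark would call the FFT under test here instead. */
void negate_standin(float *data, size_t n, void *ctx)
{
    (void)ctx;                        /* unused in this stand-in */
    for (size_t i = 0; i < n; ++i)
        data[i] = -data[i];
}
```

Calling `time_per_call` once per block size, with a large enough `reps` for the small sizes, gives numbers that can be compared between the two libraries. `clock()` measures CPU time, which is adequate for a single-threaded transform; a wall-clock timer would be the better choice otherwise.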

Adriaan_van_Os · ‎02-21-2018

I have been using 2D FFTs from Apple DSP, from the Apple MatrixFFT package, from IPP and from an experimental Apple package that utilizes the GPU. The latter method was unstable. Apple DSP 2D FFT is very slow for large images (Apple admits that it is not optimized for data larger than the cache size). Performance of MatrixFFT and IPP was about the same. I use IPP now.

None of these methods fully uses multithreading, which is a shame. So, for good performance, you have to write your own 2D FFT, which is easier than doing "application multithreading" for a 2D FFT, as Intel recommends.
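
One common way to structure a hand-written 2D transform is the row-column decomposition: transform every row, transpose, transform every row again, transpose back. The sketch below is only an illustration of that structure, not the code referred to above. `fft1d_fn`, `transform_2d`, and `dft2_standin` are hypothetical names; the callback marks the place where a 1D FFT from IPP or another library would be called, and the stand-in is an exact 2-point DFT so the sketch stays self-contained. The OpenMP pragma is an assumption about the threading model; without `-fopenmp` it is ignored and the loop runs serially.

```c
#include <assert.h>
#include <complex.h>
#include <stdlib.h>

/* Hypothetical stand-in for a 1D in-place transform of n contiguous samples. */
typedef void (*fft1d_fn)(float complex *row, int n);

/* Transforms each row independently, so rows can go to different threads. */
static void transform_rows(float complex *img, int rows, int cols, fft1d_fn fft)
{
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r)
        fft(img + (size_t)r * cols, cols);
}

/* Writes the transpose of the rows x cols matrix src into dst (cols x rows). */
static void transpose(const float complex *src, float complex *dst,
                      int rows, int cols)
{
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            dst[(size_t)c * rows + r] = src[(size_t)r * cols + c];
}

/* 2D transform of a row-major rows x cols image, in place.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int transform_2d(float complex *img, int rows, int cols, fft1d_fn fft)
{
    assert(img != NULL && fft != NULL && rows > 0 && cols > 0);

    float complex *tmp = malloc((size_t)rows * cols * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    transform_rows(img, rows, cols, fft);  /* along each row */
    transpose(img, tmp, rows, cols);       /* columns become contiguous rows */
    transform_rows(tmp, cols, rows, fft);  /* along each original column */
    transpose(tmp, img, cols, rows);       /* back to rows x cols layout */

    free(tmp);
    return 0;
}

/* Illustration only: exact 2-point DFT, [a, b] -> [a + b, a - b]. */
void dft2_standin(float complex *row, int n)
{
    assert(n == 2);
    float complex a = row[0], b = row[1];
    row[0] = a + b;
    row[1] = a - b;
}
```

The transposes make every 1D call operate on contiguous memory, which matters for images larger than the cache; whether this beats a strided column pass depends on the sizes involved and would need measuring.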

Regards,

Adriaan van Os