- Tags:
- Development Tools
- General Support
- Intel® Integrated Performance Primitives
- Parallel Computing
- Vectorization
That's true: this function is not threaded internally, and it is not part of the threading layer (aka TL) yet.
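Since it is not threaded internally, one workaround is to run several independent IPP FFT calls from the application's own threads. Below is a minimal sketch (not from this thread), assuming the function in question is a 1D single-precision complex FFT (ippsFFTFwd_CToC_32fc_I) and using OpenMP: the initialized FFT spec is shared, and only the work buffer is allocated per thread, since the spec is not modified during the transform.

```cpp
// Minimal sketch: threading independent IPP 1D FFTs externally with OpenMP.
// Assumption: numSignals contiguous signals of length 2^order, transformed
// in place. Error checking of the IppStatus return codes is omitted.
#include <ipp.h>
#include <omp.h>

void fft_batch(Ipp32fc* data, int order, int numSignals)
{
    const size_t len = (size_t)1 << order;

    int specSize = 0, initSize = 0, workSize = 0;
    ippsFFTGetSize_C_32fc(order, IPP_FFT_NODIV_BY_ANY, ippAlgHintNone,
                          &specSize, &initSize, &workSize);

    Ipp8u* specMem = ippsMalloc_8u(specSize);
    Ipp8u* initMem = ippsMalloc_8u(initSize);
    IppsFFTSpec_C_32fc* spec = nullptr;
    ippsFFTInit_C_32fc(&spec, order, IPP_FFT_NODIV_BY_ANY, ippAlgHintNone,
                       specMem, initMem);
    ippsFree(initMem);                           // only needed during init

    #pragma omp parallel
    {
        Ipp8u* work = ippsMalloc_8u(workSize);   // per-thread work buffer
        #pragma omp for schedule(static)
        for (int i = 0; i < numSignals; ++i)
            ippsFFTFwd_CToC_32fc_I(data + (size_t)i * len, spec, work);
        ippsFree(work);
    }
    ippsFree(specMem);
}
```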
Gennady F. (Blackbelt) wrote: That's true: this function is not threaded internally, and it is not part of the threading layer (aka TL) yet.
Thank you for the reply. So, for computing multiple larger FFTs (2^20 - 2^24), where external parallelisation would most likely be limited by cache size, is it better to use Intel MKL or FFTW3? (I assume MKL uses FFTW, or am I wrong?)
It is absurd that IPP doesn't have an internally threaded FFT. Here is how to make it (see the source code of https://github.com/nickoneill/MatrixFFT):
1. Do all the 1D row FFTs threaded. For optimal speed, use a vectorized 1D FFT, such as the one in vDSP (https://developer.apple.com/documentation/accelerate/vdsp?language=objc) or MKL (https://software.intel.com/en-us/mkl).
2. Call IPP to transpose the entire image.
3. Do all the 1D column (now row) FFTs threaded again.
4. Call IPP to transpose the entire image (or do further work on the result image as if it were transposed).
This way, the 2D FFT is threaded and not memory-bound.
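A minimal sketch of these four steps, assuming single-precision complex data, OpenMP for the threading over chunks of rows, and MKL's DFTI interface for the unthreaded 1D row FFTs; the plain transpose loops are only stand-ins for an optimized transpose such as IPP's:

```cpp
// Sketch of the row-FFT / transpose / row-FFT / transpose scheme above.
// Error checking of the DFTI return codes is omitted for brevity.
#include <mkl_dfti.h>
#include <omp.h>
#include <complex>

// Transpose a rows x cols row-major matrix of complex floats into dst (cols x rows).
// Stand-in for an optimized transpose (e.g. IPP's image transpose).
static void transpose(const std::complex<float>* src, std::complex<float>* dst,
                      int rows, int cols)
{
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            dst[(size_t)c * rows + r] = src[(size_t)r * cols + c];
}

// In-place 1D FFT of every row, threaded over contiguous chunks of rows.
// Each thread owns its own DFTI descriptor; MKL runs each 1D FFT sequentially
// when called from inside an active OpenMP parallel region.
static void fft_rows(std::complex<float>* data, int rows, int cols)
{
    #pragma omp parallel
    {
        DFTI_DESCRIPTOR_HANDLE h = nullptr;
        DftiCreateDescriptor(&h, DFTI_SINGLE, DFTI_COMPLEX, 1, (MKL_LONG)cols);
        DftiCommitDescriptor(h);

        #pragma omp for schedule(static)      // static: contiguous chunk per thread
        for (int r = 0; r < rows; ++r)
            DftiComputeForward(h, data + (size_t)r * cols);

        DftiFreeDescriptor(&h);
    }
}

// 2D FFT of a rows x cols image built from the four steps in the post.
void fft2d(std::complex<float>* img, std::complex<float>* scratch, int rows, int cols)
{
    fft_rows(img, rows, cols);               // 1. threaded row FFTs
    transpose(img, scratch, rows, cols);     // 2. transpose
    fft_rows(scratch, cols, rows);           // 3. threaded column (now row) FFTs
    transpose(scratch, img, cols, rows);     // 4. transpose back
}
```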
Regards,
Adriaan van Os
Adriaan van Os wrote: It is absurd that IPP doesn't have an internally threaded FFT. Here is how to make it (see the source code of https://github.com/nickoneill/MatrixFFT)
Thank you for the answer. I am actually writing a paper on an internally threaded FFT, so I am looking for some comparison material on threaded 1D FFTs. Which one do you think is faster, MKL or vDSP?
I haven't tried the MKL 1D FFT so far. The vDSP 1D FFT is not internally threaded, and that is what we need here, because the most efficient threading is per row.
The following paper is quite interesting: https://github.com/nickoneill/MatrixFFT/raw/master/FFTapps.pdf
Regards,
Adriaan van Os
"because the most efficient threading is per row here."
Clarification: I mean subdividing the rows into a contiguous chunk of rows for each thread to chew on. In general, that is faster than interleaving rows.
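In OpenMP terms, for example, schedule(static) with no chunk size gives each thread one contiguous block of rows, while schedule(static, 1) interleaves rows across threads. A small illustration, with a hypothetical fft_row() standing in for any unthreaded 1D FFT call:

```cpp
#include <omp.h>
#include <complex>

// Hypothetical per-row transform; stands in for any unthreaded 1D FFT call.
void fft_row(std::complex<float>* row, int n);

// Contiguous chunk of rows per thread (the faster split described above).
void fft_rows_chunked(std::complex<float>* data, int rows, int cols)
{
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < rows; ++r)
        fft_row(data + (size_t)r * cols, cols);
}

// Interleaved rows: thread t gets rows t, t + nthreads, ..., so each thread
// touches rows scattered across the image, which is generally less cache-friendly.
void fft_rows_interleaved(std::complex<float>* data, int rows, int cols)
{
    #pragma omp parallel for schedule(static, 1)
    for (int r = 0; r < rows; ++r)
        fft_row(data + (size_t)r * cols, cols);
}
```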
Regards,
Adriaan van Os
I just read the article and I see how it is done; I will test it.
Thank you,
Adam Simek