Intel® Integrated Performance Primitives
Deliberate problems developing high-performance vision, signal, security, and storage applications.

FFT Open MP

simek__adam
Beginner
1,112 Views
Greetings,

I have problems running the IPP 2019 FFT with internal OpenMP parallelisation. I have already tried the approach described at https://software.intel.com/en-us/ipp-dev-guide-using-intel-integrated-performance-primitives-threading-layer-tl-functions, but I can't find the right combination of libraries to link to make this work. I am using ippsFFTFwd_CToC_64fc and calling ippSetNumThreads with 2. (I noticed on the forums that for internal parallelisation you cannot use more than 2 threads; is this true?)

For compiling I use gcc or icc, including the headers with the _tl suffix and linking the _tl libraries from /lib/intel64/tl/openmp/. I noticed you still have to mix in some non-_tl files, but I cannot make it work. Could you please provide the list of include and library files needed to get the FFT working with OpenMP?

My environment: Ubuntu 16.04 LTS, gcc 6+, icc from Parallel Studio XE 2020.

I am also attaching a small sample (C++11) I use; I check parallelism with vtune-gui.

Thank you for your response,
Adam Simek
7 Replies
Gennady_F_Intel
Moderator

That's true: this function is not threaded internally, and it is not part of the threading layer (aka TL) yet.

simek__adam
Beginner

Gennady F. (Blackbelt) wrote:

That's true: this function is not threaded internally, and it is not part of the threading layer (aka TL) yet.

Thank you for the reply. So, in the case of computing multiple larger FFTs (2^20 to 2^24), where external parallelisation would most likely be limited by cache size, is it better to use Intel MKL or FFTW3? (I assume MKL uses FFTW, or am I wrong?)

Adriaan_van_Os
New Contributor I

It is absurd that IPP doesn't have an internally threaded FFT. Here is how to build one (see the source code of https://github.com/nickoneill/MatrixFFT):

1. Do all the 1D row FFTs threaded. For optimal speed, use a vectorized 1D FFT, such as the one in vDSP https://developer.apple.com/documentation/accelerate/vdsp?language=objc or MKL https://software.intel.com/en-us/mkl

2. Call IPP to transpose the entire image

3. Do all the 1D column (now row) FFTs threaded again.

4. Call IPP to transpose the entire image (or do further work on the result image as if it were transposed).

This way, the 2D FFT is threaded and not memory-bound.

Regards,

Adriaan van Os
simek__adam
Beginner

Adriaan van Os wrote:

It is absurd that IPP doesn't have an internally threaded FFT. Here is how to build one (see the source code of https://github.com/nickoneill/MatrixFFT):

1. Do all the 1D row FFTs threaded. For optimal speed, use a vectorized 1D FFT, such as the one in vDSP https://developer.apple.com/documentation/accelerate/vdsp?language=objc or MKL https://software.intel.com/en-us/mkl

2. Call IPP to transpose the entire image

3. Do all the 1D column (now row) FFTs threaded again.

4. Call IPP to transpose the entire image (or do further work on the result image as if it were transposed).

This way, the 2D FFT is threaded and not memory-bound.

Regards,

Adriaan van Os

Thank you for the answer. I am actually writing a paper on internally threaded FFTs, so I am looking for comparison material on threaded 1D FFTs. Which one do you think is faster, MKL or vDSP?

Adriaan_van_Os
New Contributor I

I haven't tried the MKL 1D FFT so far. The vDSP 1D FFT is not internally threaded, and that is exactly what we need here, because the most efficient threading is per row.

The following paper is quite interesting https://github.com/nickoneill/MatrixFFT/raw/master/FFTapps.pdf.

Regards,

Adriaan van Os
Adriaan_van_Os
New Contributor I

because the most efficient threading is per row here.

Clarification: I mean subdividing the rows into a chunk of rows for each thread to chew on. In general, that is faster than interleaving rows.

Regards,

Adriaan van Os
simek__adam
Beginner

I just read the article and I see how it is done; I will test it.

Thank you,

Adam Simek
