- Tags:
- Development Tools
- General Support
- Intel® Integrated Performance Primitives
- Parallel Computing
- Vectorization
That's true: this function is not threaded internally, and it is not part of the threading layer (aka TL) yet.
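Since it is not threaded internally, one workaround is to run several independent IPP FFT calls from the application's own threads. Below is a minimal sketch (not from this thread), assuming the function in question is a 1D single-precision complex FFT (ippsFFTFwd_CToC_32fc_I) and using OpenMP: the initialized FFT spec is shared, and only the work buffer is allocated per thread, since the spec is not modified during the transform.

```cpp
// Minimal sketch: threading independent IPP 1D FFTs externally with OpenMP.
// Assumption: numSignals contiguous signals of length 2^order, transformed
// in place. Error checking of the IppStatus return codes is omitted.
#include <ipp.h>
#include <omp.h>

void fft_batch(Ipp32fc* data, int order, int numSignals)
{
    const size_t len = (size_t)1 << order;

    int specSize = 0, initSize = 0, workSize = 0;
    ippsFFTGetSize_C_32fc(order, IPP_FFT_NODIV_BY_ANY, ippAlgHintNone,
                          &specSize, &initSize, &workSize);

    Ipp8u* specMem = ippsMalloc_8u(specSize);
    Ipp8u* initMem = ippsMalloc_8u(initSize);
    IppsFFTSpec_C_32fc* spec = nullptr;
    ippsFFTInit_C_32fc(&spec, order, IPP_FFT_NODIV_BY_ANY, ippAlgHintNone,
                       specMem, initMem);
    ippsFree(initMem);                           // only needed during init

    #pragma omp parallel
    {
        Ipp8u* work = ippsMalloc_8u(workSize);   // per-thread work buffer
        #pragma omp for schedule(static)
        for (int i = 0; i < numSignals; ++i)
            ippsFFTFwd_CToC_32fc_I(data + (size_t)i * len, spec, work);
        ippsFree(work);
    }
    ippsFree(specMem);
}
```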
Gennady F. (Blackbelt) wrote: That's true: this function is not threaded internally, and it is not part of the threading layer (aka TL) yet.
Thank you for the reply. So, for computing multiple larger FFTs (2^20 - 2^24), where external parallelisation would most likely be limited by cache size, is it better to use Intel MKL or FFTW3? (I assume MKL uses FFTW, or am I wrong?)
It is absurd that IPP doesn't have an internally threaded FFT. Here is how to make it (see the source code of https://github.com/nickoneill/MatrixFFT):
1. Do all the 1D row FFTs threaded. For optimal speed, use a vectorized 1D FFT, such as the one in vDSP (https://developer.apple.com/documentation/accelerate/vdsp?language=objc) or MKL (https://software.intel.com/en-us/mkl).
2. Call IPP to transpose the entire image.
3. Do all the 1D column (now row) FFTs threaded again.
4. Call IPP to transpose the entire image (or do further work on the result image as if it were transposed).
This way, the 2D FFT is threaded and not memory-bound.
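A minimal sketch of these four steps, assuming single-precision complex data, OpenMP for the threading over chunks of rows, and MKL's DFTI interface for the unthreaded 1D row FFTs; the plain transpose loops are only stand-ins for an optimized transpose such as IPP's:

```cpp
// Sketch of the row-FFT / transpose / row-FFT / transpose scheme above.
// Error checking of the DFTI return codes is omitted for brevity.
#include <mkl_dfti.h>
#include <omp.h>
#include <complex>

// Transpose a rows x cols row-major matrix of complex floats into dst (cols x rows).
// Stand-in for an optimized transpose (e.g. IPP's image transpose).
static void transpose(const std::complex<float>* src, std::complex<float>* dst,
                      int rows, int cols)
{
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            dst[(size_t)c * rows + r] = src[(size_t)r * cols + c];
}

// In-place 1D FFT of every row, threaded over contiguous chunks of rows.
// Each thread owns its own DFTI descriptor; MKL runs each 1D FFT sequentially
// when called from inside an active OpenMP parallel region.
static void fft_rows(std::complex<float>* data, int rows, int cols)
{
    #pragma omp parallel
    {
        DFTI_DESCRIPTOR_HANDLE h = nullptr;
        DftiCreateDescriptor(&h, DFTI_SINGLE, DFTI_COMPLEX, 1, (MKL_LONG)cols);
        DftiCommitDescriptor(h);

        #pragma omp for schedule(static)      // static: contiguous chunk per thread
        for (int r = 0; r < rows; ++r)
            DftiComputeForward(h, data + (size_t)r * cols);

        DftiFreeDescriptor(&h);
    }
}

// 2D FFT of a rows x cols image built from the four steps in the post.
void fft2d(std::complex<float>* img, std::complex<float>* scratch, int rows, int cols)
{
    fft_rows(img, rows, cols);               // 1. threaded row FFTs
    transpose(img, scratch, rows, cols);     // 2. transpose
    fft_rows(scratch, cols, rows);           // 3. threaded column (now row) FFTs
    transpose(scratch, img, cols, rows);     // 4. transpose back
}
```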
Regards,
Adriaan van Os
Adriaan van Os wrote: It is absurd that IPP doesn't have an internally threaded FFT. Here is how to make it (see the source code of https://github.com/nickoneill/MatrixFFT)
Thank you for the answer. I am actually writing a paper on an internally threaded FFT, so I am looking for some comparison material on threaded 1D FFTs. Which one do you think is faster, MKL or vDSP?
I haven't tried the MKL 1D FFT so far. The vDSP 1D FFT is not internally threaded, and that is what we need here, because the most efficient threading is per row.
The following paper is quite interesting: https://github.com/nickoneill/MatrixFFT/raw/master/FFTapps.pdf
Regards,
Adriaan van Os
"because the most efficient threading is per row here."
Clarification: I mean subdividing the rows into a contiguous chunk of rows for each thread to chew on. In general, that is faster than interleaving rows.
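In OpenMP terms, for example, schedule(static) with no chunk size gives each thread one contiguous block of rows, while schedule(static, 1) interleaves rows across threads. A small illustration, with a hypothetical fft_row() standing in for any unthreaded 1D FFT call:

```cpp
#include <omp.h>
#include <complex>

// Hypothetical per-row transform; stands in for any unthreaded 1D FFT call.
void fft_row(std::complex<float>* row, int n);

// Contiguous chunk of rows per thread (the faster split described above).
void fft_rows_chunked(std::complex<float>* data, int rows, int cols)
{
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < rows; ++r)
        fft_row(data + (size_t)r * cols, cols);
}

// Interleaved rows: thread t gets rows t, t + nthreads, ..., so each thread
// touches rows scattered across the image, which is generally less cache-friendly.
void fft_rows_interleaved(std::complex<float>* data, int rows, int cols)
{
    #pragma omp parallel for schedule(static, 1)
    for (int r = 0; r < rows; ++r)
        fft_row(data + (size_t)r * cols, cols);
}
```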
Regards,
Adriaan van Os
I just read the article and I see how it is done; I will test it.
Thank you,
Adam Simek