I have an application that performs many small digital image correlations in Fourier space in subareas of a bigger frame. So far, I've been using FFTW or MKL to do the FFTs in parallel, i.e. one FFT per thread. I recently got a Xeon Phi x200 to play with, and I wonder, how SIMD-parallel FFTs would perform in comparisson, i.e. each thread shall compute 16 single precission FFTs (or even 32 because of two AVX-512 units per core?).
Has anybody any experiecence or feeling about this, or maybe even knows a library that can do this? The FFT'd arrays are fixed in size for a given frame but can range anywhere between 16x16 and maybe say 64x64 depending on external parameters, with 20x20 and 24x24 being most relevant and most frequent sizes. Could data parallel FFT be significantly faster than task parellel computation?
In the olden days (TM), data parallel was the best implementation approach for FFTs on vector supercomputers. For recent and current systems with multi-level cache hierarchies, the "best" answer will depend rather sensitively on the size of the arrays relative to the required vector width and the cache size(s) of the systems of interest.
As you have probably already surmised, the trouble with doing 8 data parallel FFTs is that each of the series only gets 1/8th of the cache.
I have not done much work with FFT optimization on KNL (yet), but on the mainstream Xeon processors there is a huge boost in performance if the data can be L1-contained. If your data is single-precision real, then the arrays are between 2KiB (for 16x16) and 32KiB (64x64), with the more common sizes being 4.5KiB (24x24). For most of these cases the input array, the output array, and the twiddle factor vector will fit in the L1 Data Cache, and there are enough independent rows/columns to perform data parallel FFTs for the column/row transforms, respectively. The ugly problem is with the transpositions. None of the Intel architectures are really good at transposing data at word-level -- a good discussion of code optimization techniques for the transposition of very small arrays is contained in Section 11.11 of the Intel Optimization Reference Manual (document 248966-033, June 2016). This does not include AVX-512, but the issues are largely the same -- the optimum transpose implementation for vectorized code will require some combination of "swizzle" instructions and re-loading the data from the L1 Data Cache. Unfortunately the discussion in Section 11.11 only includes relative speedup and not absolute performance, so extra homework is required to understand the relative costs of the transpose and the transforms.
Once the data gets bigger than the L2 cache, the time required to perform the memory transfers can dominate the execution time, and vectorization of any kind becomes less important. I often run experiments to compare the execution time for the tiny FFTs contained in the cache with the execution time of copying the data in and out from further out in the memory hierarchy. As an example, suppose I decided that the processor could perform small FFTs at a rate of 4 CPU cycles per element, then the 24x24 array would take 2304 cycles to perform the row transforms and 2304 cycles to perform the column transforms, plus some unknown time to perform the transposition(s). If I assume that I need to load a new 4.5KiB input array, load a new 4.5KiB output array, write back the current 4.5KiB input array, and write back the current 4.5 KiB output array, then I need 18KiB of memory traffic for every new 24x24 block. With data in MCDRAM, read+write bandwidth (using all cores) can be slightly over 400 GB/s, so the 18KiB of data transfers should only take about 46 ns. BUT, this bandwidth is shared by all the cores, so each core will take 46ns * 68 cores = 3128 cycles per core for the data motion. This is smaller than the compute estimate above, so in this (hypothetical) case there is still some benefit to be obtained from improving the FFT compute performance. It should be clear that if the FFT gets much faster (say a factor of 2), then the overall performance will be limited by MCDRAM bandwidth, and further improvements in compute performance will have minimal impact on throughput.
There are, of course, lots more special cases to consider, including in-place vs out-of-place transforms, stored vs computed twiddle factors, and L2-contained vs L1-contained data. Because of the importance of data motion in FFTs, this is a case where an asynchronous threading model may may sense -- on each physical core you may have one thread doing the computation of the FFTs and another thread moving "old" data out of the cache and prefetching the next block into the cache. It would take a pretty big investigation to decide whether this was likely to be beneficial, and a larger effort to instantiate such a code.
Gregg S. (Intel) wrote:
My understanding is FFTW has the feature. I am expecting it soon in MKL also, but not yet.
Are you referring to "Efficient handling of multiple, strided transforms. (This lets you do things like transform multiple arrays at once, transform one dimension of a multi-dimensional array, or transform one field of a multi-component array.)" on http://fftw.org/?
From http://www.fftw.org/fftw3_doc/Advanced-Complex-DFTs.html, it is not clear to me that FFTW does the transforms data parallel. Do you know if this is the case? Will MKL feature multiple, data parallel FFTs on AVX-512?