Does anyone have data on the number of FLOPS achieved for single-precision complex FFTs on Ivy Bridge or Haswell processors? I've tried downloading the benchFFT program and linking in MKL.. but benchFFT won't build with the GNU or Intel compilers. I'm looking at FFT sizes of 1024, 2048, and 4096, plus some non-powers of 2 like 1536 and 2560.
I remember that in the past I tested some 1D FFT code based on the Numerical Recipes source, but my tests measured timing performance rather than GFLOP/s. If you are interested, I can profile that code with VTune on my Core i5 Haswell CPU.
Benchfft requires a fair amount of hacking to compile -- particularly for MKL -- but I have gotten it working for quite a few of the supported libraries. For most problem sizes only MKL and FFTW are competitive.
Ivy Bridge performance should be almost the same as Sandy Bridge at the same frequency -- the hardware differences are quite small.
Using MKL with the Intel 13 compilers, single-precision complex forward transforms on a 3.1 GHz Sandy Bridge give:
N=1024, in-place 32.6 GFLOPS, out-of-place 32.2 GFLOPS
N=2048, in-place 29.0 GFLOPS, out-of-place 30.0 GFLOPS
N=4096, in-place 26.6 GFLOPS, out-of-place 27.7 GFLOPS
(These values are for repeated transforms of the same data (using benchfft3-1), so they do not include the time to bring the data into the L1 cache or to store it back to memory.)
My Haswell systems are busy running benchmarks, but I have seen good speedups on the larger FFT sizes (relative to Sandy Bridge). Part of this is the extra memory bandwidth, part is the larger L3 cache, and part is the improved core.
FMA operations typically only provide a modest benefit for FFT algorithms. The "traditional" implementations have twice as many addition operations as multiply operations, and only some of these can be combined into FMA form. For example, according to http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.8814, a standard radix-4 butterfly calculation has 34 (real) floating-point operations that can be implemented as 6 FMAs, 6 multiplications, and 16 additions -- for a total of 28 FP instructions. Specialized FMA-based implementations have been created that can perform the same calculation with 48 floating-point operations, but using all FMAs, for a total of only 24 instructions. The FMA-based implementation also has slightly fewer loads (12 vs 14 for the original), but of course all approaches require the same 8 stores (4 complex elements).
Combining FMA with SIMD increases the complexity rather significantly. FFTW3 uses a simple trick to allow vectorization over 2 elements, but I don't know if this can be repeated for 4-element and 8-element vectors. The straightforward approach of performing transforms on multiple vectors at the same time (in parallel using SIMD instructions) certainly works, but increasing the number of vectors means that each vector must be shorter for the data to fit into the same level of cache. Using shorter vectors is a natural way to decompose larger FFTs, but it leads to more data motion -- and FFTs that don't fit in the L1 cache are almost always limited in performance by data motion.
Still, even those 6 FMA instructions per innermost-loop iteration can be executed in 5 cycles on Haswell, thus freeing the FADD and FMUL circuitry to execute non-interdependent instructions issued, for example, by the second thread running on an HT core.
Certainly FMA helps, it just does not help the standard algorithms as much as one might hope -- for the standard radix-4 computation it reduces the FP instruction count by less than 18% (28 vs 34), while for the FMA-optimized radix-4 computation it reduces the instruction count by almost 30% (24 vs 34). For L1-contained data, either of these is probably enough to be helpful.
Once the data is too large to fit in L1 (either because the transform is long or because you are trying to do multiple transforms in parallel to exploit the SIMD architecture), the performance is almost always limited by the cache bandwidth. Based on the efficiency of MKL for L1-contained FFTs and the measured bandwidth of the L2 cache, it looks like it is theoretically possible to stay compute-limited for L2-contained data on a Sandy Bridge or Ivy Bridge processor, but I have not seen any implementations that are able to maintain full speed once the data overflows the L1 cache.
Thanks for the data. You know the thing which bothers me also.. is the work done in the routine called. It used to be there was a cfft1d routine.. now there's some preparatory setup and then the computation routine. Maybe the twiddle factors are set in the new preparatory routine.. but that was all a part of the timing. I'm getting 31.1 GFLOPS on a 3.4 GHz Haswell.. but that's with all that extra work, which is about 10% of the instructions from what I'm reading from SDE. I'll see if I can get a better understanding of what's really being run and get an apples to apples comparison.. but your performance on SB is helpful. I think I saw something to the effect it was 35 GFLOPS.. but maybe someone from MKL could tell me on this forum.
Here are my observations. The FFT opportunity for FMA is quite limited. You can't use it to replace all the muls (it doesn't do ADD+MUL, only MUL+ADD), and the latency of the FMA is 5 clks vs the shorter ADD latency. There can also be micro-arch issues as to the implementation of the FMA, which I won't digress upon (pipes, scheduling, collisions, etc.). FMA only delivers a very modest speedup. The setup code in my FFT is all Fortran.. so it's not optimal, which I suppose MKL's is. The reason is that in the butterfly you have an immense amount of register pressure. Most of an FFT butterfly or radix is ADD/SUB, and you compute 3 levels of temporaries or so.. so you need the results quickly. I'm getting about a 60% speedup in performance.. going from AVX128 -> AVX256. SW prefetch surprisingly is very helpful.. for some reason on FFTs not the size of the L1.. the HSW and IVB hardware prefetchers are not getting it done. Maybe because they only go 1 stride ahead.. from my tests. I'm getting a 30-60% speedup just adding sw prefetching to the butterfly on say a 6144-size FFT. Another factor at play here.. I believe.. is unaligned performance. If you go to 256 bits and you're not doing a power-of-2 size FFT, which is VERY common in practice in industry, then you will have unaligned requests.. and the unaligned LD and ST of 256 bits is pretty awful on HSW and worse on previous parts. FMA bought a 1-2% performance uplift over the sw-prefetch-optimized results.
Thanks again for the feedback.. and keep posting results.. I haven't found much online about the FFT performance.. and thought kickstarting a discussion about it.. would be fruitful and informative.
According to the Intel Optimization Reference Manual, the L1 streaming prefetcher on recent processors only works on contiguous addresses in ascending order, so it is not at all surprising that software prefetching is a big help for L2-contained data. The L1 IP-based prefetcher is likely to get confused because the stride is different for each pass through the data. You might be able to help with this by unrolling the loops so that each stride through the data gets a separate loop. (I have not tested this, and I don't know if the loops are going to be long enough for the IP-based prefetcher to be helpful.)
The biggest problem with FFTs is power-of-two strided memory accesses and the myriad of resulting conflicts. An FFT of length 2^P will access memory with power-of-two strides from 2^0 to 2^(P-1) in both the radix calculations and in the bit-reversal re-ordering. At least some of these power of two strides will conflict in the L1 and L2 TLBs (and in the caches on TLB misses), in the L1 data cache banks (pre-Haswell), the L2 sets, the L3 sets, and in the DRAM channels/banks/ranks. The large number of subsystems where conflicts can occur, combined with the complexity (and lack of detailed documentation) of many of these subsystems makes a comprehensive analysis extraordinarily difficult. I have been working on FFTs for the last few years, and in all the literature I have reviewed I have never seen an accurate, comprehensive analysis of FFT interactions with a realistic hierarchical memory system. In my publications on specialized FFT engines we have assumed explicitly controlled SRAM memories rather than caches, so the associativity conflict problems do not arise (e.g., http://dx.doi.org/10.1109/ASAP.2013.6567572).
For transforms of arrays that fit into an L1 cache, many of the problems above either don't arise or can be worked around with a finite amount of effort. The need to use SIMD increases the complexity again, but MKL does a very good job. On my Xeon E5-2680 (Sandy Bridge EP) systems (running at 3.1 GHz), single-precision complex-to-complex transforms of length N=1024 run at 32.5 to 32.6 GFLOPS. This is the "nominal" execution rate, based on the 5*N*log(N,2) work estimate -- *not* based on the actual arithmetic performed. The best commonly used algorithm for transforms in this size range is the "split-radix" implementation, which requires about 30% fewer actual operations. Most of the avoided operations are multiplies, but the execution time is limited by the larger number of adds, which is reduced by almost 23% relative to the 3.5*N*log(N,2) "nominal" FP addition count used in the reported GFLOPS. If I did the arithmetic correctly, the 32.6 GFLOPS corresponds to an actual rate of 5.68 FP Add operations/cycle, which is 71% of the peak single-precision FP add rate for AVX-256.
For arrays larger than the L1 cache, the generalized Cooley-Tukey transformation is typically used to treat the larger FFT as a composite of smaller FFTs. A good overview is at: https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm . Optimizing the transpose operations is relatively easy, and optimizing the L1-contained FFTs has already been discussed, but overlapping the transposition with the L1-contained FFTs adds yet another level of complexity.
The data accesses in an FFT are contiguous.. so I would expect the Intel L1 HW prefetcher to latch on to these.. but I'm observing some very large benefits to sw pref.. which is telling me it's not working that well. In a radix or butterfly, there are multiple streams of accesses.. but the accesses themselves stride through the matrix in order, at least in those I've coded up, which is where I'm observing this anomaly. TLB is not a performance problem.. you really aren't having that many different pages being accessed simultaneously in a 2, 3, 4, 5, or 8 radix/butterfly.. besides, I can measure the PMCs and verify it. What is a problem is data alignment.. which I've observed about a 10% performance effect from, possibly 20% when the FFT is a power of 2 and you don't have the vector you pass in aligned to 32B.
I measured the FFT performance in Intel MKL 11.1 upon a myriad of different problem sizes of interest. Power of 2 is not the typical request in this arena.. esp.. in Oil and Gas from past experience. I hacked upon the example provided with my MKL and specifically timed:
which performed an in-place (not copied to another vector) forward transform. My results are below; the last column is Intel MKL 11.1.0 on a Haswell locked to 3.4 GHz, with no boost or lowering of frequency due to throttling:
Size,avx256,avx256+swPf,avx256+swPf+fma3,Intel MKL 11.1.0
The variance of performance in MKL is surprising.. but their 1024-size FFT performance is quite remarkable. That's the only size I've not yet beaten in performance.. but the bar is so high I can't fathom at present what to do to get there. Very interesting.. and you can do very large FFTs without being throttled for bandwidth, as I've illustrated on the last one.. an FFT of a vector with 64K elements.
I'm getting 11.2 FLOPS per clock peak.. on a few of the FFTs above.. and I'm using the definition of the number of FLOPS in a complex-to-complex FFT as 5*N*LOG(N)/LOG(2).