- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Does anyone have data as to the # of FLOPs achieved in FFT complex single precision performance for Ivybridge or Haswell processors. I've tried downloading the benchFFT program and linking in MKL.. but benchFFT won't build with GNU or Intel compilers. I'm looking at sizes of FFTs that are 1024, 2048, 4096 and some non-powers of 2 like 1536 and 2560.

perfwise

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I remember that in past I was testing 1D FFT code which was based on Numerical Recipes source, but my tests were related to timing performance and not to #num of GFLOP/s. If you are interested I can profile that code with VTune on my Core i5 Haswell CPU.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Benchfft requires a fair amount of hacking to compile -- particularly for MKL -- but I have gotten it working for quite a few of the supported libraries. For most problem sizes only MKL and FFTW are competitive.

Ivy Bridge performance should be almost the same as Sandy Bridge at the same frequency -- the hardware differences are quite small.

Using the MKL with the Intel 13 compilers: single precision, complex forward transforms on a 3.1 GHz Sandy Bridge give:

N=1024, in-place 32.6 GFLOPS, out-of-place 32.2 GFLOPS

N=2048, in-place 29.0 GFLOPS, out-of-place 30.0 GFLOPS

N=4096, in-place 26.6 GFLOPS, out-of-place 27.7 GFLOPS

(These values are for repeated transforms of the same data (using benchfft3-1), so it does not include time to bring the data into the L1 cache or to store it back to memory.)

My Haswell systems are busy running benchmarks, but I have seen good speedups on the larger FFT sizes (relative to Sandy Bridge). Part of this is the extra memory bandwidth, part is the larger L3 cache, and part is the improved core.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Probably Haswell doubled FMADD units can speed up FFT execution. I need to check disassembly.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

FMA operations typically only provide a modest benefit for FFT algorithms. The "traditional" implementations have twice as many addition operations as multiply operations, and only some of these can be combined into FMA form. For example. according to http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.8814, a standard radix-4 butterfly calculation has 34 (real) floating-point operations that can be implemented as 6 FMAs, 6 multiplications, and 16 additions -- for a total of 28 FP instructions. Specialized FMA-based implementations have been created that can perform the same calculation with 48 floating-point operations, but using all FMAs, for a total of only 24 instructions. The FMA-based implementation also has slightly fewer loads (12 vs 14 for the original), but of course all approaches require the same 8 stores (4 complex elements).

Combining FMA with SIMD increases the complexity rather significantly. FFTW3 uses a simple trick to allow vectorization over 2 elements, but I don't know if this can be repeated for 4-element and 8-element vectors. The straightforward approach of performing transforms on multiple vectors at the same time (in parallel using SIMD instructions) certainly works, but increasing the number of vectors means that each vector must be shorter for the data to fit into the same level of cache. Using shorter vectors is a natural way to decompose larger FFTs, but it leads to more data motion -- and FFTs that don't fit in the L1 cache are almost always limited in performance by data motion.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Still even those 6 FMAD instructions per innermost for loop cycle can be executed in 5 cycles on Haswell thus freeing FADD and FMUL ciruitry for the execution of non interdependent instruction issued for example by second thread running on HT core.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Certainly FMA helps, it just does not help the standard algorithms as much as one might hope -- for the standard radix-4 computation it reduces the FP instruction count by less than 18% (28/34), while for the FMA-optimized radix-4 computation it reduces the FP instruction count by almost 30%. For L1-contained data either of these is probably enough to be helpful.

Once the data is too large to fit in L1 (either because the transform is long or because you are trying to do multiple transforms in parallel to exploit the SIMD architecture) the performance is almost always limited by the cache bandwidth. Using the efficiency of MKL for L1-contained FFTs and the measured bandwidth of the L2 cache, It looks like it is theoretically possible to stay compute-limited for L2-contained data on a Sandy Bridge or Ivy Bridge processor, but I have not seen any implementations that are able to maintain full speed once the data overflows the L1 cache.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Thanks John for your comment.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

John,

Thanks for the data. You know the thing which bothers me also.. is the work done in the routine called. It used to be there was a cfft1d routine.. now there's some preparatory setup and then the computation routine. Maybe the twiddle factors are set in the new preparatory routine.. but that was all a part of the timing. I'm getting 31.1 GFLOPS on a 3.4 GHz Haswell.. but that's with all that extra work, which is about 10% of the instructions from what I'm reading from SDE. I'll see if I can get a better understanding of what's really being run and get an apples ot apples comparison.. but your performance on SB is helpful. I think I saw something to the effect it was 35 GFLOPs.. but maybe someone from MKL could tell me on this forum.

Here's my observations. The FFT opportunity for FMA is quite limited. You can't use it to replace all the muls (it doesn't do ADD+MUL, only MUL+ADD), and the latency of the FMA is 5 clks vs the shorter ADD latency. There can also be micro-arch issues as to the implementation of the FMA, which I don't digress upon (pipes, scheduling, collisions, etc.) FMA only delivers a very modest speedup. The setup code in my FFT is all fortran.. so it's not optimal which I suppose MKL's is. The reason is that in the butterfly you have an immense amount of register pressure. Most of an FFT butterfly or radix is ADD/SUB, and you compute 3 levels of temporaries or so.. so you need the results quickly. I'm getting about a 60% speedup in performance.. going from AVX128 -> AVX256. SW prefetch surprisingly is very helpful.. for some reason on FFTs not the size of the L1.. the HSW and IB hardware prefetcher are not getting it done. Maybe because they only go 1 stride ahead.. from my tests. I'm getting a 30-60% speedup just adding sw prefetching to the butterfly on say a 6144 size FFT. Another factor at play here.. I believe.. is unaligned performance. If you go to 256-bits and you're not doing a power of 2 size FFT, which is VERY common in practice in industry, then you will have unaligned requests.. and the unaligned LD and ST of 256-bit is pretty awlful on HSW and worse upon previous parts. FMA bought a 1-2% performance uplift over the sw prefetch optimized results.

Thanks again for the feedback.. and keep posting results.. I haven't found much online about the FFT performance.. and thought kickstarting a discussion about it.. would be fruitful and informative.

perfwise

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

According to the Intel Optimization Reference Manual, the L1 streaming prefetcher on recent processors only works on contiguous addresses in ascending order, so it is not at all surprising that software prefetching is a big help for L2-contained data. The L1 IP-based prefetcher is likely to get confused because the stride is different for each pass through the data. You might be able to help with this by unrolling the loops so that each stride through the data gets a separate loop. (I have not tested this, and I don't know if the loops are going to be long enough for the IP-based prefetcher to be helpful.)

The biggest problem with FFTs is power-of-two strided memory accesses and the myriad of resulting conflicts. An FFT of length 2^P will access memory with power-of-two strides from 2^0 to 2^(P-1) in both the radix calculations and in the bit-reversal re-ordering. At least some of these power of two strides will conflict in the L1 and L2 TLBs (and in the caches on TLB misses), in the L1 data cache banks (pre-Haswell), the L2 sets, the L3 sets, and in the DRAM channels/banks/ranks. The large number of subsystems where conflicts can occur, combined with the complexity (and lack of detailed documentation) of many of these subsystems makes a comprehensive analysis extraordinarily difficult. I have been working on FFTs for the last few years, and in all the literature I have reviewed I have never seen an accurate, comprehensive analysis of FFT interactions with a realistic hierarchical memory system. In my publications on specialized FFT engines we have assumed explicitly controlled SRAM memories rather than caches, so the associativity conflict problems do not arise (e.g., http://dx.doi.org/10.1109/ASAP.2013.6567572).

For transforms of arrays that fit into an L1 cache, many of the problems above either don't arise or can be worked around with a finite amount of effort. The need to use SIMD increases the complexity again, but MKL does a very good job. On my Xeon E5-2680 (Sandy Bridge EP) systems (running at 3.1 GHz), single-precision complex-to-complex transforms of length N=1024 run at 32.5 to 32.6 GFLOPS. This is the "nominal" execution rate, based on the 5*N*log(N,2) work estimate -- *not* based on the actual arithmetic performed. The best commonly used algorithm for transforms in this size range is the "split-radix" implementation, which requires about 30% fewer actual operations. Most of the avoided operations are multiplies, but the execution time is limited by the larger number of adds, which is reduced by almost 23% relative to the 3.5*N*log(N,2) "nominal" FP addition count used in the reported GFLOPS. If I did the arithmetic correctly, the 32.6 GFLOPS corresponds to an actual rate of 5.68 FP Add operations/cycle, which is 71% of the peak single-precision FP add rate for AVX-256.

For arrays larger than the L1 cache, the generalized Cooley-Tukey transformation is typically used to treat the larger FFT as a composite of smaller FFTs. A good overview is at: https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm . Optimizing the transpose operations is relatively easy, and optimizing the L1-contained FFTs has already been discussed, but overlapping the transposition with the L1-contained FFTs adds yet another level of complexity.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

John,

The data accesses in an FFT are contiguous.. so I would expect the Intel L1 HW prefetcher to latch on to these.. but I'm observing some very large benefits to sw pref.. which is telling me it's not working that well. In a radix or butterfly, there are multiple streams of accesses.. but the accesses themselves stride through the matrix in order, at least to those I've coded up, which I'm observing this anomaly upon. TLB is not a performance problem.. you really aren't having that many different pages being accessed simultaneously in a 2, 3,4,5,8 radix/butterfly.. besides I can measure the PMCs and verify it. What is a problem is data alignment.. which I've observed about a 10% performance effect upon, possibly 20% when the FFT is a power of 2 and you don't have the vector aligned to 32B that you pass in.

I measured the FFT performance in Intel MKL 11.1 upon a myriad of different problem sizes of interest. Power of 2 is not the typical request in this arena.. esp.. in Oil and Gas from past experience. I hacked upon the example provided with my MKL and specifically timed:

DftiComputeForward(hand, y);

which performed an inplace (not copied to another vector) and forward transform. Results are below upon mine and the last column is Intel 11.1.0 upon a Haswell locked to 3.4 GHz, no boost or lowering of freq due to throttling:

Size,avx256,avx256+swPf,avx256+swPf+fma3,Intel MKL 11.1.0

400,23050,21528,23236,25826

800,29973,28177,30255,26405

1000,29311,28041,30364,26753

1024,34201,32590,34178,45796

1080,26300,24824,27316,25060

1152,29289,27894,30009,26406

1200,28024,26533,28452,27887

1250,20412,19790,20937,18893

1280,34263,32800,35402,30688

1296,28140,26747,28487,25746

1350,20464,19839,21263,20979

1440,29245,28051,30670,26363

1458,16028,16259,17136,18993

1500,23171,24362,26141,25408

1536,34904,34113,35843,27623

1600,33275,32953,35449,29284

1620,23975,24492,26409,23562

1728,29716,28938,31171,26740

1800,28308,27743,30328,25515

1920,30642,29552,32608,26254

1944,27297,27646,30313,22903

2000,29257,28811,30656,27184

2048,35770,36678,38404,34304

2160,27775,29043,31656,24659

2250,18581,19910,20543,21095

2304,27828,29365,31036,25282

2400,26088,28469,30451,24686

2430,15057,17746,18351,18889

2500,20776,23279,24500,25412

2560,34861,35729,37680,26914

2592,27125,29061,30471,23093

2700,19427,23780,24867,23412

2880,24847,27962,29011,25047

2916,16095,19620,20155,22362

3000,23404,26783,27900,24972

3072,23450,27991,28714,26093

3200,23624,27964,29096,25480

3240,24398,28117,29078,23633

3456,24367,29105,29609,24625

3600,26527,28039,29213,24469

3750,17194,20002,20735,20142

3840,24552,28147,29026,24604

3888,18412,22572,22875,22247

4000,22947,28287,28944,25795

4050,14710,17986,18480,19525

4096,31362,33541,34439,29146

4320,24950,28977,29582,22494

4374,12335,16459,16796,17416

4500,21634,24925,25806,23382

4608,27805,29527,29873,24701

4800,24060,28513,29559,24382

4860,15238,20196,20540,21804

5000,22712,27880,28972,23343

5120,21769,28338,29035,22944

5184,28232,29308,29817,22127

5400,25179,29064,29879,22536

5760,28210,29925,30329,23693

5832,17726,23128,23285,19950

6000,27859,29239,30269,23986

6144,22121,29253,29757,22901

6250,17378,20712,21365,17988

6400,26127,29395,30135,24063

6480,20261,23555,24004,22521

6750,14237,18290,18942,18428

6912,29177,30040,30731,21110

7200,27861,29991,30880,22058

7290,13704,18553,18824,18031

7500,19031,24864,26084,21158

7680,28763,30537,31083,23930

7776,20454,24025,24245,21222

8000,26429,30093,30677,26225

8100,17401,21420,21879,21458

8192,21595,28189,28397,29238

8640,29409,30799,31197,22104

8748,15018,20919,21163,17990

9000,26132,30184,31034,22117

9216,28345,30140,30537,23063

9600,28022,30809,31649,23358

9720,18315,23977,24482,20153

64000,25210,26667,26781,17539

Surprising the variance of performance in MKL.. but their 1024 size FFT performance is quite remarkable. That's the only size I've not yet beaten yet in performance.. but the bar is so high I can't fathom at present what to do to get there. Very interesting.. and you can do very large FFTs without being throttled for bandwidth, as I've illustrated on the last one.. a FFT of a vector with 64K elements.

I'm getting 11.2 FLOP per clock peak.. on a few of the FFTs above.. and i'm using the definition of the # of FLOPS in a complex-complex FFT as 5*N*LOG(N)/LOG(2).

Perfwise

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page