You can improve the performance of the Intel Math Kernel Library (Intel MKL) FFT if the length of your data vector permits factorization into powers of optimized radices.
In Intel MKL, the optimized radices are 2, 3, 5, 7, and 11.
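To make the factorization condition concrete, here is a minimal sketch of a check for whether a given length is a product of powers of the optimized radices. It is plain integer arithmetic and does not call MKL. The function name and the `OPTIMIZED_RADICES` constant are made up for illustration, and the radix list is the one quoted above.

```python
# Radices quoted in the text above; adjust if your MKL version documents others.
OPTIMIZED_RADICES = (2, 3, 5, 7, 11)


def factors_into_radices(n, radices=OPTIMIZED_RADICES):
    """Return True if n is a product of powers of the given radices only."""
    if n < 1:
        return False
    for r in radices:
        # Divide out every power of this radix.
        while n % r == 0:
            n //= r
    # Anything left over is a prime factor outside the radix list.
    return n == 1


if __name__ == "__main__":
    for length in (1000, 1001, 1024, 1155):
        print(length, factors_into_radices(length))
```

For example, 1000 = 2^3 * 5^3 and 1155 = 3 * 5 * 7 * 11 pass the check, while 1001 = 7 * 11 * 13 does not, because 13 is not in the list quoted above.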
Leading dimension values (n*element_size) of two-dimensional arrays are divisible by 16.
For two-dimensional arrays, leading dimension values divisible by 2048 are avoided.
For the C-style FFT, the distance L between arrays that represent real and imaginary parts is not divisible by 64. The best case is when L=k*64 + 16.
Leading dimension values, in bytes (n*element_size), of two-dimensional arrays are not a power of two.
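The guidelines above are simple arithmetic on byte counts, so they can be checked before allocating anything. The following is a rough sketch, not an MKL API. All function names are hypothetical, and it assumes the leading dimension is given as an element count `n` with `element_size` in bytes. The guidelines do not say what unit L is measured in, so the distance helpers only do the modular arithmetic.

```python
def leading_dimension_issues(n, element_size):
    """List which of the quoted guidelines a leading dimension of n elements misses."""
    ld_bytes = n * element_size
    issues = []
    if ld_bytes % 16 != 0:
        issues.append("not divisible by 16")
    if ld_bytes % 2048 == 0:
        issues.append("divisible by 2048")
    # A positive integer is a power of two when it has exactly one bit set.
    if ld_bytes > 0 and (ld_bytes & (ld_bytes - 1)) == 0:
        issues.append("power of two")
    return issues


def split_distance_ok(distance):
    """True if the real/imaginary distance L is not divisible by 64."""
    return distance % 64 != 0


def best_split_distance(min_distance):
    """Smallest L >= min_distance of the form k*64 + 16 (the quoted best case)."""
    k = max(0, -(-(min_distance - 16) // 64))  # ceiling division, clamped at 0
    return k * 64 + 16


if __name__ == "__main__":
    print(leading_dimension_issues(1024, 8))  # 8192 bytes
    print(leading_dimension_issues(1026, 8))  # 8208 bytes
    print(best_split_distance(1024))
```

With 8-byte elements, a leading dimension of 1024 elements gives 8192 bytes, which is both divisible by 2048 and a power of two. Padding each row to 1026 elements gives 8208 bytes, which meets all three leading dimension guidelines.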
Could you give some explanation of the leading dimension here? If I have an array laid out like Real_1, Imag_1, Real_2, Imag_2, ..., is that data layout optimized for the FFT?
[fortran]
FUNCTION FGET_BUFFER_SIZE(datalen)
  IMPLICIT NONE
  INTEGER :: FGET_BUFFER_SIZE
  INTEGER, INTENT(IN) :: datalen
  REAL(8) :: ln2, ln3, ln5, ln7, ln11, ln13
  REAL(8) :: lndata

  ln2 = DLOG(2.0_8)
  ln3 = DLOG(3.0_8)
  ln5 = DLOG(5.0_8)
  ln7 = DLOG(7.0_8)
  ln11 = DLOG(11.0_8)
  ln13 = DLOG(13.0_8)
  lndata = DLOG(DBLE(datalen))

  IF (MOD(datalen,2) >= 1) THEN
    FGET_BUFFER_SIZE = ODD_BUFFER()
  ELSE
    FGET_BUFFER_SIZE = EVEN_BUFFER()
  END IF

CONTAINS

  FUNCTION ODD_BUFFER()
    INTEGER :: ODD_BUFFER
    INTEGER :: buffsize, buffest
    REAL(8) :: cnt3 ,cnt5, cnt7, cnt11, cnt13, N
    REAL(8) :: tmp3, tmp5, tmp7, tmp11, tmp13

    buffsize = HUGE(buffsize)
    N = DLOG(DBLE(buffsize))
    cnt3 = 0.0_8
    cnt5 = 0.0_8
    cnt7 = 0.0_8
    cnt11 = 0.0_8
    cnt13 = 0.0_8
    DO WHILE(cnt13*ln13
Here is how I find the optimal buffer size for the MKL FFT. The number returned is your data length plus symmetric padding. On the surface the code does some rather unusual things. It starts its search at the small end of the scale, because there is no known way to tell that you have found the best solution other than checking all the values between your best estimate and the length of your data. I did some testing on the original version (before MKL added 13 as an optimized prime) and found that it took, on average, fewer than 100 iterations of the inner loop to find an odd buffer size and about 300 to find an even one. There are most likely faster methods, or improvements that could be made to this code. If you find them, please post.
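For comparison, the same goal can be sketched as a plain linear scan. This is not the log-based enumeration used in the Fortran routine, only a brute-force stand-in that is easy to check by hand. It rests on one assumption read from the odd/even branch in that routine: the padded length keeps the parity of the data length, so the padding can be split evenly between both ends. The function names are hypothetical, and the default radix list includes 13 as mentioned above.

```python
def is_radix_product(n, radices):
    """True if n has no prime factors outside the given radices."""
    for r in radices:
        while n % r == 0:
            n //= r
    return n == 1


def padded_length(datalen, radices=(2, 3, 5, 7, 11, 13)):
    """Smallest length >= datalen with the same parity that factors into radices.

    Assumes radices contains 2 and at least one odd radix, so the scan
    always terminates (powers of 2 or of the odd radix are unbounded).
    """
    if datalen < 1:
        raise ValueError("datalen must be positive")
    candidate = datalen
    # Step by 2 so the total padding stays even and can be split symmetrically.
    while not is_radix_product(candidate, radices):
        candidate += 2
    return candidate


if __name__ == "__main__":
    n = 1001
    total = padded_length(n, radices=(2, 3, 5, 7, 11))
    pad_each_side = (total - n) // 2
    print(n, total, pad_each_side)
```

With 13 in the radix list, a length of 1001 = 7 * 11 * 13 needs no padding. With only 2, 3, 5, 7, and 11 it grows to 1029 = 3 * 7^3, that is, 14 extra samples on each side. A scan like this costs one trial factorization per candidate, so for very large lengths the enumeration approach in the Fortran routine may well be faster.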