- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I would like to pad my array/Matrix to get best FFT performance.

There is some guideline in the manual saying the dimension should not be multiple of 2048 (for 2D FFT) and aligned.

I can align my 1D and 2D arrays, but I wonder how do I really get the optmized size. If I would like to generate an optmized size table, how it will look like?

In FFTW, they usually use multiple of small prime numbers. Does this apply to MKL?

Please provide some guideline on how to generate a table of optimized dimension number. This apply to both 1D and 2D.

Also for 2D, the row and col will follow the same rule, right?

To summarize, if I have 1D array, say 457 numbers, and I run FFT on it. What is the optimized dimension I should pad my array to?

for 2D, if it is 457x671, what is the optmized dimension?

Thank you.

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

For the best performance please provide 16-byte alignment of data.

As to optimized radices: MKL User Guide states (see

**Tips and Techniques to Improve Performance**)

# FFT Optimized Radices

You can improve the performance of the Intel Math Kernel Library (Intel MKL) FFT if the length of your data vector permits factorization into powers of optimized radices.

In Intel MKL, the optimized radices are 2, 3, 5, 7, and 11.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I also read in the MKL manual which says:

leading dimension values (n*element_size) of two-dimensional arrays are divisible by 16

for two-dimensional arrays, leading dimension values divisible by 2048 are avoided

For the C-style FFT, the distance L between arrays that represent real and imaginary

parts is not divisible by 64. The best case is when L=k*64 + 16

Leading dimension values, in bytes (n*element_size), of two-dimensional arrays are

not power of two.

could you give some explanation here on leading dimension? If I have array like Real_1, Imag_1, Real_2, Imag_2, ... is this an optimized FFT data?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

[fortran]FUNCTION FGET_BUFFER_SIZE(datalen) IMPLICIT NONE INTEGER :: FGET_BUFFER_SIZE INTEGER, INTENT(IN) :: datalen REAL(8) :: ln2, ln3, ln5, ln7, ln11, ln13 REAL(8) :: lndata ln2 = DLOG(2.0_8) ln3 = DLOG(3.0_8) ln5 = DLOG(5.0_8) ln7 = DLOG(7.0_8) ln11 = DLOG(11.0_8) ln13 = DLOG(13.0_8) lndata = DLOG(DBLE(datalen)) IF (MOD(datalen,2) >= 1) THEN FGET_BUFFER_SIZE = ODD_BUFFER() ELSE FGET_BUFFER_SIZE = EVEN_BUFFER() END IF CONTAINS FUNCTION ODD_BUFFER() INTEGER :: ODD_BUFFER INTEGER :: buffsize, buffest REAL(8) :: cnt3 ,cnt5, cnt7, cnt11, cnt13, N REAL(8) :: tmp3, tmp5, tmp7, tmp11, tmp13 buffsize = HUGE(buffsize) N = DLOG(DBLE(buffsize)) cnt3 = 0.0_8 cnt5 = 0.0_8 cnt7 = 0.0_8 cnt11 = 0.0_8 cnt13 = 0.0_8 DO WHILE(cnt13*ln13Here is how I find the optimal buffer size for MKL fft. The number returned is your data length + symetric padding. The code does some rather unusual things on the surface. It starts its search on the small end of the scale. This is because there is no "known" way to determine if you have found the best solution except to check all the values between your best estimate and thelength of your data. I did some testing on the original (before MKL added 13 as an optimized prime) and found it took on average less then 100 iterations of the inner loop to find an odd buffer size and about 300 inner loop runs to find an even buffer size. Most likely there are faster methods or improvements that could be made to this code. If you find them please post.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- FUNCTIONFGET_BUFFER_SIZE(datalen)
- IMPLICITNONE
- INTEGER::FGET_BUFFER_SIZE
- INTEGER,INTENT(IN)::datalen
- REAL(8)::ln2,ln3,ln5,ln7,ln11,ln13
- REAL(8)::lndata
- ln2=DLOG(2.0_8)
- ln3=DLOG(3.0_8)
- ln5=DLOG(5.0_8)
- ln7=DLOG(7.0_8)
- ln11=DLOG(11.0_8)
- ln13=DLOG(13.0_8)
- lndata=DLOG(DBLE(datalen))
- IF(MOD(datalen,2)>=1)THEN
- FGET_BUFFER_SIZE=ODD_BUFFER()
- ELSE
- FGET_BUFFER_SIZE=EVEN_BUFFER()
- ENDIF
- CONTAINS
- FUNCTIONODD_BUFFER()
- INTEGER::ODD_BUFFER
- INTEGER::buffsize,buffest
- REAL(8)::cnt3,cnt5,cnt7,cnt11,cnt13,N
- REAL(8)::tmp3,tmp5,tmp7,tmp11,tmp13
- buffsize=HUGE(buffsize)
- N=DLOG(DBLE(buffsize))
- cnt3=0.0_8
- cnt5=0.0_8
- cnt7=0.0_8
- cnt11=0.0_8
- cnt13=0.0_8
- DOWHILE(cnt13*ln13
- tmp13=cnt13*ln13
- DOWHILE(cnt11*ln11+tmp13
- tmp11=cnt11*ln11+tmp13
- DOWHILE(cnt7*ln7+tmp11
- tmp7=cnt7*ln7+tmp11
- DOWHILE(cnt5*ln5+tmp7
- tmp5=cnt5*ln5+tmp7
- cnt3=DNINT((lndata-tmp5)/ln3+0.45_8)
- DOWHILE(cnt3*ln3+tmp5
- tmp3=cnt3*ln3+tmp5
- buffest=NINT(DEXP(tmp3))
- IF(buffest.GE.datalen)THEN
- N=tmp3
- buffsize=buffest
- IF(buffsize==datalen)GOTO200
- ENDIF
- cnt3=cnt3+1.0_8
- ENDDO
- cnt3=0.0_8
- cnt5=cnt5+1.0_8
- ENDDO
- cnt5=0.0_8
- cnt7=cnt7+1.0_8
- ENDDO
- cnt7=0.0_8
- cnt11=cnt11+1.0_8
- ENDDO
- cnt11=0.0_8
- cnt13=cnt13+1.0_8
- ENDDO
- 200CONTINUE
- ODD_BUFFER=buffsize
- ENDFUNCTIONODD_BUFFER
- FUNCTIONEVEN_BUFFER()
- INTEGER::EVEN_BUFFER
- INTEGER::buffsize,buffest
- REAL(8)::cnt2,cnt3,cnt5,cnt7,cnt11,cnt13,N
- REAL(8)::tmp2,tmp3,tmp5,tmp7,tmp11,tmp13
- buffsize=HUGE(buffsize)
- N=DLOG(DBLE(buffsize))
- cnt3=0.0_8
- cnt5=0.0_8
- cnt7=0.0_8
- cnt11=0.0_8
- cnt13=0.0_8
- DOWHILE(cnt13*ln13
- tmp13=cnt13*ln13
- DOWHILE(ln2+cnt11*ln11+tmp13
- tmp11=ln2+cnt11*ln11+tmp13
- DOWHILE(cnt7*ln7+tmp11
- tmp7=cnt7*ln7+tmp11
- DOWHILE(cnt5*ln5+tmp7
- tmp5=cnt5*ln5+tmp7
- DOWHILE(cnt3*ln3+tmp5
- tmp3=cnt3*ln3+tmp5-ln2
- cnt2=DNINT((lndata-tmp3)/ln2+0.45_8)
- If(cnt2<1.0_8)cnt2=1.0_8
- DOWHILE(cnt2*ln2+tmp3
- tmp2=cnt2*ln2+tmp3
- buffest=NINT(DEXP(tmp2))
- IF(buffest.GE.datalen)THEN
- N=tmp2
- buffsize=buffest
- IF(buffsize==datalen)GOTO300
- ENDIF
- cnt2=cnt2+1.0_8
- ENDDO
- cnt2=1.0_8
- cnt3=cnt3+1.0_8
- ENDDO
- cnt3=0.0_8
- cnt5=cnt5+1.0_8
- ENDDO
- cnt5=0.0_8
- cnt7=cnt7+1.0_8
- ENDDO
- cnt7=0.0_8
- cnt11=cnt11+1.0_8
- ENDDO
- cnt11=0.0_8
- cnt13=cnt13+1.0_8
- ENDDO
- 300CONTINUE
- EVEN_BUFFER=buffsize
- ENDFUNCTIONEVEN_BUFFER
- ENDFUNCTIONFGET_BUFFER_SIZE

[fortran]FUNCTION FGET_BUFFER_SIZE(datalen) IMPLICIT NONE INTEGER :: FGET_BUFFER_SIZE INTEGER, INTENT(IN) :: datalen REAL(8) :: ln2, ln3, ln5, ln7, ln11, ln13 REAL(8) :: lndata ln2 = DLOG(2.0_8) ln3 = DLOG(3.0_8) ln5 = DLOG(5.0_8) ln7 = DLOG(7.0_8) ln11 = DLOG(11.0_8) ln13 = DLOG(13.0_8) lndata = DLOG(DBLE(datalen)) IF (MOD(datalen,2) >= 1) THEN FGET_BUFFER_SIZE = ODD_BUFFER() ELSE FGET_BUFFER_SIZE = EVEN_BUFFER() END IF CONTAINS FUNCTION ODD_BUFFER() INTEGER :: ODD_BUFFER INTEGER :: buffsize, buffest REAL(8) :: cnt3 ,cnt5, cnt7, cnt11, cnt13, N REAL(8) :: tmp3, tmp5, tmp7, tmp11, tmp13 buffsize = HUGE(buffsize) N = DLOG(DBLE(buffsize)) cnt3 = 0.0_8 cnt5 = 0.0_8 cnt7 = 0.0_8 cnt11 = 0.0_8 cnt13 = 0.0_8 DO WHILE(cnt13*ln13Here is how I find the optimal buffer size for MKL fft. The number returned is your data length + symetric padding. The code does some rather unusual things on the surface. It starts its search on the small end of the scale. This is because there is no "known" way to determine if you have found the best solution except to check all the values between your best estimate and thelength of your data. I did some testing on the original (before MKL added 13 as an optimized prime) and found it took on average less then 100 iterations of the inner loop to find an odd buffer size and about 300 inner loop runs to find an even buffer size. Most likely there are faster methods or improvements that could be made to this code. If you find them please post.

Thanks much,

Ozzer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

dated 12 August 2010

Intel Visual Fortran Compiler Professional Edition 11.1 for Windows

Installation Guide and Release Notes

Performance improvements

FFTS

Enhanced performance for transforms which are multiple of 8 of 13

Optimized 1d complex cluster FFTs for non-power-of-2 vector lengths

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

**MKL Release 10.2, Update 4.**

-Ozzer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

There are some considerations about selecting optimized dimensions and data layout for multidimensional transforms recently published in MKL Knowledge Base article FFT length and layout advisor. This may not be exactly what you are looking for, but it may help.

Thanks

Dima

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page