1D mkl FFT Multithread Use

dfishman · 03-28-2011

Hi, March 28, 2011

I want to use the 1D MKL FFT (w_mkl_10.3.2.154, w_ccompxe_2011.2.154) in a multi-threaded application, but I noticed that the FFT does not run multithreaded.

For example, in my timing tests a 2^20-point FFT takes about 28 milliseconds for a forward or backward transform.

I get the same timing with 1 CPU or with 8 CPUs.

Does anyone have experience with 1D FFTs, and could you share your FFT code with me? Perhaps I am not calling the primitives correctly.

My calls are shown below, where n = 2^20 and Exy is the complex double-precision array.

type(DFTI_descriptor), pointer :: desc_handle

integer :: status

complex*16 Exy(N_Bitpnt),Exy2(N_Bitpnt)

status = DftiFreeDescriptor(Desc_Handle)
status = dfticreatedescriptor(desc_handle, 36, 32, 1, n)
status = dfticommitdescriptor(desc_handle)
status = DftiComputeForward(Desc_Handle,Exy)

Thanks,

barragan_villanueva_ · 03-28-2011

Hi,

Did you link your test with the Intel threading layer together with the OpenMP library?
Please check your link line with the Intel MKL Link Line Advisor: http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/

For an out-of-place double-precision 2^20 transform, I see the following performance
on my machine using MKL 10.3.3 (intel64):

- with env MKL_NUM_THREADS=1 (or OMP_NUM_THREADS=1)
Problem: 1048576, setup: 13.15 ms, time: 21.97 ms, ``gflops'': 4.7736

- with env MKL_NUM_THREADS=8 (or OMP_NUM_THREADS=8)
Problem: 1048576, setup: 46.12 ms, time: 11.44 ms, ``gflops'': 9.1675

For an in-place double-precision 2^20 transform, I see the following performance:

- with env MKL_NUM_THREADS=1 (or OMP_NUM_THREADS=1)
Problem: i1048576, setup: 10.60 ms, time: 20.71 ms, ``gflops'': 5.0619

- with env MKL_NUM_THREADS=8 (or OMP_NUM_THREADS=8)
Problem: i1048576, setup: 347.00 us, time: 6.64 ms, ``gflops'': 15.788
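
To make the comparison with other machines easier, the speedups implied by the timings above can be worked out directly. The gflops check assumes the column uses the common 5 N log2(N) operation-count convention for complex FFTs; the post does not state this, but the reported figures are consistent with it.

```latex
% Speedup = single-thread time / 8-thread time, setup excluded
S_{\text{out-of-place}} = \frac{21.97\ \text{ms}}{11.44\ \text{ms}} \approx 1.92
\qquad
S_{\text{in-place}} = \frac{20.71\ \text{ms}}{6.64\ \text{ms}} \approx 3.12

% Assumed flop-count convention, N = 2^{20}
\frac{5 N \log_2 N}{t}
  = \frac{5 \cdot 2^{20} \cdot 20}{21.97 \times 10^{-3}\ \text{s}}
  \approx 4.77 \times 10^{9}\ \text{flop/s}
```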

dfishman · 03-29-2011

Hi Victor,

The timing that I cited excludes setup time, so I guess your PC is a bit faster than mine.

I did link with the libraries as suggested. My compile and link lines are shown below.

(Is mkl_dfti.f90 suitable for multi-threaded use?)

ifort -c modules.f mkl_dfti.f90
ifort -extend_source -nowarn -align -Qzero -QxSSE2 -Qsave -Qopenmp -MT -Qmkl -c *.f
ifort -MT *.obj mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib /Qopenmp

Would it be possible to show me how you linked your code?

My code is written in Fortran.

I wonder if there is an issue in the way I am calling the MKL FFTs.

Is there any specific way of setting up the MKL calls?

I call and time the forward and backward FFTs with:

etime_in1 = etime(rtm)

call zzfft(-1,n,1.d0,mctime,mctime,Table,Wsave,ISYS)

call zzfft(1,n,1.d0,mctime,mctime,Table,Wsave,ISYS)

etime_out1 = etime(rtm)

I set up the 1D FFT with the following:

if(ndir.eq.0)then
status = DftiFreeDescriptor(Desc_Handle)
status = dfticreatedescriptor(desc_handle, 36, 32, 1, n)
status = dfticommitdescriptor(desc_handle)
end if

c the forward FFT is called with Exy, which is complex*16

if(ndir.eq.-1)then
status = DftiComputeForward(Desc_Handle,Exy)
end if

if(ndir.eq.1)then
status = DftiComputeBackward(Desc_Handle,Exy)
end if
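
For reference, the setup and compute steps above could be written with the named constants from the MKL_DFTI module and a status check after each call. This is only a minimal sketch, not the zzfft routine from the post: the program name, the `check` helper, and the unit-impulse test input are made up for illustration, and it assumes that the literals 36 and 32 in the post stand for DFTI_DOUBLE and DFTI_COMPLEX. It also frees the descriptor after use rather than before DftiCreateDescriptor.

```fortran
! Sketch only: assumes mkl_dfti.f90 has been compiled so that
! module MKL_DFTI is available, as in the compile line above.
program fft_roundtrip
  use MKL_DFTI
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2**20
  type(DFTI_DESCRIPTOR), pointer :: desc
  complex(dp), allocatable :: x(:)
  integer :: status

  ! Hypothetical test input: a unit impulse.
  allocate(x(n))
  x = (0.0_dp, 0.0_dp)
  x(1) = (1.0_dp, 0.0_dp)

  ! 1D, double-precision, complex-to-complex descriptor (in-place by default).
  status = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  call check(status, 'create')

  ! Scale the backward transform by 1/n so forward + backward returns the input.
  status = DftiSetValue(desc, DFTI_BACKWARD_SCALE, 1.0_dp / real(n, dp))
  call check(status, 'set backward scale')

  status = DftiCommitDescriptor(desc)
  call check(status, 'commit')

  status = DftiComputeForward(desc, x)
  call check(status, 'forward')
  status = DftiComputeBackward(desc, x)
  call check(status, 'backward')

  ! Subtract the original impulse; what is left is the round-trip error.
  x(1) = x(1) - (1.0_dp, 0.0_dp)
  print *, 'max round-trip error: ', maxval(abs(x))

  ! Free the descriptor once it is no longer needed.
  status = DftiFreeDescriptor(desc)
  call check(status, 'free')

contains

  ! Hypothetical helper: stop with the MKL error text on any failure.
  subroutine check(st, what)
    integer, intent(in) :: st
    character(*), intent(in) :: what
    if (st /= DFTI_NO_ERROR) then
      print *, 'DFTI error in ', what, ': ', trim(DftiErrorMessage(st))
      stop 1
    end if
  end subroutine check

end program fft_roundtrip
```

Checking every status value makes it easier to tell a configuration problem apart from a performance problem.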

Thanks

dfishman · 03-30-2011

Hi Victor,

Please note that I am running ia32 on a Windows XP x64 OS.

dfishman · 04-06-2011

Hi Victor,

I think I found the problem. I wasn't setting MKL_NUM_THREADS to the number of CPUs when initializing the FFT; i.e. MKL_NUM_THREADS was always 1 during the initialization step,

even though I was setting MKL_NUM_THREADS > 1 for the actual forward or backward FFT operation.
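
The ordering described above might look like the following in code. This is a sketch under stated assumptions, not a verified fix: it uses the MKL service routine mkl_set_num_threads in place of the MKL_NUM_THREADS environment variable, sets the thread count before DftiCommitDescriptor as the finding in this post suggests, and times the transforms with the standard system_clock intrinsic. The program name, the thread count of 8, and the repetition count nrep are illustrative choices.

```fortran
! Sketch only: assumes module MKL_DFTI is available and the program is
! linked against the threaded MKL layer and the OpenMP runtime.
program fft_threads
  use MKL_DFTI
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2**20
  integer, parameter :: nthreads = 8   ! illustrative: number of cores
  integer, parameter :: nrep = 10      ! illustrative: timed repetitions
  type(DFTI_DESCRIPTOR), pointer :: desc
  complex(dp), allocatable :: x(:)
  integer :: status, i
  integer(kind=8) :: t0, t1, rate

  allocate(x(n))
  x = (1.0_dp, 0.0_dp)

  ! Set the thread count BEFORE the descriptor is committed, so the
  ! initialization step already sees the intended number of threads.
  call mkl_set_num_threads(nthreads)

  status = DftiCreateDescriptor(desc, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  if (status /= DFTI_NO_ERROR) stop 'create failed'
  ! Scale the backward transform so repeated round trips stay bounded.
  status = DftiSetValue(desc, DFTI_BACKWARD_SCALE, 1.0_dp / real(n, dp))
  if (status /= DFTI_NO_ERROR) stop 'set value failed'
  status = DftiCommitDescriptor(desc)
  if (status /= DFTI_NO_ERROR) stop 'commit failed'

  ! Untimed warm-up pass.
  status = DftiComputeForward(desc, x)
  status = DftiComputeBackward(desc, x)

  ! Wall-clock timing of nrep forward + backward pairs, setup excluded.
  call system_clock(t0, rate)
  do i = 1, nrep
    status = DftiComputeForward(desc, x)
    status = DftiComputeBackward(desc, x)
  end do
  call system_clock(t1)

  print *, 'ms per forward+backward pair: ', &
           1.0d3 * real(t1 - t0, dp) / real(rate, dp) / real(nrep, dp)

  status = DftiFreeDescriptor(desc)
end program fft_threads
```

Running it once with nthreads = 1 and once with nthreads = 8 gives numbers comparable to the single-thread and 8-thread timings quoted earlier in the thread.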

However, the FFT speedup is only 2x for a 2^18 FFT and only 33% for 2^20, while your in-place 2^20 timings show roughly a 3x speedup.

Do you know why this might happen? Is it CPU or cache dependent?

I am using an ia32 machine with two quad-core 5590 3.3 GHz CPUs. The L2 cache in my machine is 12 MB.

Thanks.