Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

1D MKL FFT Multithread Use

dfishman
Beginner
1,445 Views

Hi, March 28, 2011

I want to use the 1D MKL FFT (w_mkl_10.3.2.154, w_ccompxe_2011.2.154) in a multithreaded application. I noticed that the FFT does not run multithreaded.

For example, I am running timing tests with a 2^20-point FFT, and I found that it takes about 28 milliseconds for a forward or backward FFT.

I get this timing with 1 CPU or with 8 CPUs.

Does anyone have experience with 1D FFTs, and could they share their FFT code with me? Perhaps I am not calling the primitives correctly.

My calls are shown below, where n = 2^20 and Exy is the complex double-precision array.

type(DFTI_descriptor), pointer :: desc_handle

integer :: status

complex*16 Exy(N_Bitpnt),Exy2(N_Bitpnt)

status = DftiFreeDescriptor(Desc_Handle)
status = dfticreatedescriptor(desc_handle, 36, 32, 1, n)
status = dfticommitdescriptor(desc_handle)
status = DftiComputeForward(Desc_Handle,Exy)
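
For comparison, here is a minimal sketch of the same call sequence as a standalone program. It assumes the literals 36 and 32 above stand for the named constants DFTI_DOUBLE and DFTI_COMPLEX from the MKL_DFTI module, which matches the double-precision complex array described. The program name, the placeholder input data, and the error handling are illustrative only and are not taken from the original code.

```fortran
! Sketch only: create, commit, compute, then free, using named constants.
! Assumes mkl_dfti.f90 has been compiled so that "use MKL_DFTI" resolves.
program fft_1d_sketch                      ! hypothetical program name
  use MKL_DFTI
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2**20          ! transform length from the post
  type(DFTI_DESCRIPTOR), pointer :: desc_handle
  complex(kind=dp), allocatable :: exy(:)
  integer :: status

  allocate(exy(n))
  exy = (1.0_dp, 0.0_dp)                   ! placeholder input data

  ! Create and commit the descriptor once, before any compute call.
  status = DftiCreateDescriptor(desc_handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  if (status /= DFTI_NO_ERROR) stop 'DftiCreateDescriptor failed'
  status = DftiCommitDescriptor(desc_handle)
  if (status /= DFTI_NO_ERROR) stop 'DftiCommitDescriptor failed'

  ! In-place forward transform (in-place is the default placement).
  status = DftiComputeForward(desc_handle, exy)
  if (status /= DFTI_NO_ERROR) stop 'DftiComputeForward failed'

  ! Free the descriptor only after the last transform that uses it.
  status = DftiFreeDescriptor(desc_handle)
end program fft_1d_sketch
```

In this sketch the descriptor is freed at the end rather than before it is created, so the handle is always defined when it is passed to a DFTI routine.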

Thanks,

4 Replies
barragan_villanueva_
Valued Contributor I
Hi,

Did you link your test with the Intel threading layer together with the OpenMP library?
Please check your link line with the Intel MKL Link Line Advisor: http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/

For out-of-place double precision 2^20, I see the following performance
on my machine using MKL 10.3.3 (intel64):

- with env MKL_NUM_THREADS=1 (or OMP_NUM_THREADS=1)
Problem: 1048576, setup: 13.15 ms, time: 21.97 ms, ``gflops'': 4.7736

- with env MKL_NUM_THREADS=8 (or OMP_NUM_THREADS=8)
Problem: 1048576, setup: 46.12 ms, time: 11.44 ms, ``gflops'': 9.1675

For in-place double precision 2^20, I see the following performance:

- with env MKL_NUM_THREADS=1 (or OMP_NUM_THREADS=1)
Problem: i1048576, setup: 10.60 ms, time: 20.71 ms, ``gflops'': 5.0619

- with env MKL_NUM_THREADS=8 (or OMP_NUM_THREADS=8)
Problem: i1048576, setup: 347.00 us, time: 6.64 ms, ``gflops'': 15.788
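
As a rough cross-check of the numbers above: the ``gflops'' values appear consistent with the common benchmark convention of counting 5 N log2 N nominal operations for a complex transform of length N. That convention is an assumption here, since the benchmark program itself is not shown. Under it, the reported times give approximately the reported rates, and the speedups from 1 to 8 threads follow directly from the time ratios:

```latex
\[
N = 2^{20}: \qquad 5\,N\log_2 N = 5 \cdot 1048576 \cdot 20 = 104\,857\,600 \ \text{nominal flops}
\]
\[
\frac{104\,857\,600}{21.97 \times 10^{-3}\ \text{s}} \approx 4.77 \times 10^{9}\ \text{flops/s},
\qquad
\frac{104\,857\,600}{6.64 \times 10^{-3}\ \text{s}} \approx 15.79 \times 10^{9}\ \text{flops/s}
\]
\[
\text{speedup (out-of-place)} = \frac{21.97}{11.44} \approx 1.92,
\qquad
\text{speedup (in-place)} = \frac{20.71}{6.64} \approx 3.12
\]
```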
dfishman

Hi Victor,

The timing that I cited excludes setup time, so I guess your PC is a bit faster than mine.

I did link with the libraries as suggested. My compile and link lines are shown below.

(Is mkl_dfti.f90 meant for multithreaded use?)

ifort -c modules.f mkl_dfti.f90
ifort -extend_source -nowarn -align -Qzero -QxSSE2 -Qsave -Qopenmp -MT -Qmkl -c *.f
ifort -MT *.obj mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib /Qopenmp

Would it be possible to show me how you linked your code?

My code is written in Fortran.

I wonder if there is an issue in the way I am calling the MKL FFTs.

Is there any specific way of setting up the MKL calls?

I call and time the forward and backward FFT with:

etime_in1 = etime(rtm)

call zzfft(-1,n,1.d0,mctime,mctime,Table,Wsave,ISYS)

call zzfft(1,n,1.d0,mctime,mctime,Table,Wsave,ISYS)

etime_out1 = etime(rtm)

I set up the 1D FFT with the following:

if(ndir.eq.0)then
status = DftiFreeDescriptor(Desc_Handle)
status = dfticreatedescriptor(desc_handle, 36, 32, 1, n)
status = dfticommitdescriptor(desc_handle)
end if

c the forward FFT is called with Exy, which is complex*16

if(ndir.eq.-1)then
status = DftiComputeForward(Desc_Handle,Exy)
end if

if(ndir.eq.1)then
status = DftiComputeBackward(Desc_Handle,Exy)
end if

Thanks

dfishman
Hi Victor,

Please note that I am running ia32 on a Windows XP x64 OS.
dfishman
Hi Victor,

I think I found the problem. I wasn't initializing the FFT with MKL_NUM_THREADS set to the number of CPUs; i.e., I was always setting MKL_NUM_THREADS=1 for the initialization step.

This was the case even though I was setting MKL_NUM_THREADS > 1 for the actual forward or backward FFT operation.
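
A minimal sketch of the corrected ordering, with the thread count set before the descriptor is created and committed. It uses the MKL service routine mkl_set_num_threads in place of the MKL_NUM_THREADS environment variable, and the standard system_clock intrinsic for wall-clock timing. The program name, the thread count of 8, the placeholder data, and the warm-up call are assumptions for illustration and are not from my actual code.

```fortran
! Sketch only: set the thread count BEFORE commit, then time the transforms.
program fft_threads_sketch                 ! hypothetical program name
  use MKL_DFTI
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 2**20
  integer, parameter :: nthreads = 8       ! assumed value; match your machine
  type(DFTI_DESCRIPTOR), pointer :: desc_handle
  complex(kind=dp), allocatable :: exy(:)
  integer :: status
  integer(kind=i8) :: t0, t1, rate

  allocate(exy(n))
  exy = (1.0_dp, 0.0_dp)                   ! placeholder input data

  ! Thread count is in effect during initialization, not only at compute time.
  call mkl_set_num_threads(nthreads)

  status = DftiCreateDescriptor(desc_handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  if (status /= DFTI_NO_ERROR) stop 'DftiCreateDescriptor failed'
  status = DftiCommitDescriptor(desc_handle)
  if (status /= DFTI_NO_ERROR) stop 'DftiCommitDescriptor failed'

  ! Untimed warm-up call so one-time costs are excluded from the measurement.
  status = DftiComputeForward(desc_handle, exy)

  call system_clock(t0, rate)
  status = DftiComputeForward(desc_handle, exy)
  status = DftiComputeBackward(desc_handle, exy)   ! unscaled by default
  call system_clock(t1)

  print *, 'forward + backward (ms): ', &
           1.0e3_dp * real(t1 - t0, dp) / real(rate, dp)

  status = DftiFreeDescriptor(desc_handle)
end program fft_threads_sketch
```

Running the same program with nthreads = 1 and nthreads = 8 gives a like-for-like comparison, since setup is excluded from the timed region in both cases.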

However, the FFT speedup is only 2x for a 2^18 FFT and only 33% for 2^20, while you are showing a 300% improvement for 2^20.

Do you know why this might happen? Is it CPU or cache dependent?

I am using an ia32 machine with 2 Quad 5590 3.3 GHz CPUs. The L2 cache in my machine is 12 MB.

Thanks.