Hi, March 28, 2011
I want to use the 1D MKL FFT (w_mkl_10.3.2.154, w_ccompxe_2011.2.154) in a multi-threaded application. I noticed that the FFT does not run multi-threaded.
For example, in timing tests with a 2^20-point FFT, I found that a forward or backward transform takes about 28 milliseconds.
I get the same timing with 1 CPU and with 8 CPUs.
Does anyone have experience with 1D FFTs and could share their FFT code with me? Perhaps I am not calling the primitives correctly.
My calling sequence is shown below, where n = 2^20 and Exy is the complex double-precision array.
type(DFTI_descriptor), pointer :: desc_handle
integer :: status
complex*16 Exy(N_Bitpnt),Exy2(N_Bitpnt)
status = DftiFreeDescriptor(Desc_Handle)
status = dfticreatedescriptor(desc_handle, 36, 32, 1, n)
status = dfticommitdescriptor(desc_handle)
status = DftiComputeForward(Desc_Handle,Exy)
Thanks,
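For readers following the calling sequence above: the literals 36 and 32 correspond to the named constants DFTI_DOUBLE and DFTI_COMPLEX in the MKL_DFTI module. Below is a minimal sketch of the same in-place transform written with those names. It assumes mkl_dfti.f90 has been compiled so that the module is available; the program name, the all-ones test data, and the error handling are illustrative and not taken from the original post.

```fortran
program fft1d_sketch
  use MKL_DFTI                      ! module provided by mkl_dfti.f90
  implicit none
  integer, parameter :: n = 2**20   ! transform length used in the post
  type(DFTI_DESCRIPTOR), pointer :: desc_handle
  complex(kind=kind(1.0d0)), allocatable :: exy(:)
  integer :: status

  allocate(exy(n))
  exy = (1.0d0, 0.0d0)              ! illustrative test data (assumption)

  ! Create a 1D double-precision complex descriptor (36 -> DFTI_DOUBLE, 32 -> DFTI_COMPLEX)
  status = DftiCreateDescriptor(desc_handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  if (status /= DFTI_NO_ERROR) stop 'DftiCreateDescriptor failed'

  ! Commit once; the descriptor can then be reused for many transforms
  status = DftiCommitDescriptor(desc_handle)
  if (status /= DFTI_NO_ERROR) stop 'DftiCommitDescriptor failed'

  ! In-place forward transform (in-place is the default placement)
  status = DftiComputeForward(desc_handle, exy)
  if (status /= DFTI_NO_ERROR) stop 'DftiComputeForward failed'

  ! For all-ones input, the first output element should be close to (n, 0)
  print *, 'exy(1) = ', exy(1)

  ! Free the descriptor only after it is no longer needed
  status = DftiFreeDescriptor(desc_handle)
  deallocate(exy)
end program fft1d_sketch
```

In this sketch the descriptor is freed at the end rather than before it is created, and every status value is checked so that a failed create or commit is not silently ignored.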
Did you link your test with the Intel threading layer together with the OpenMP library?
Please check your linking line with http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/
For an out-of-place double-precision 2^20 FFT, I see the following performance
on my machine using MKL 10.3.3 (intel64):
- with env MKL_NUM_THREADS=1 (or OMP_NUM_THREADS=1)
Problem: 1048576, setup: 13.15 ms, time: 21.97 ms, ``gflops'': 4.7736
- with env MKL_NUM_THREADS=8 (or OMP_NUM_THREADS=8)
Problem: 1048576, setup: 46.12 ms, time: 11.44 ms, ``gflops'': 9.1675
For in-place double precision 2^20 I can see the following performance:
- with env MKL_NUM_THREADS=1 (or OMP_NUM_THREADS=1)
Problem: i1048576, setup: 10.60 ms, time: 20.71 ms, ``gflops'': 5.0619
- with env MKL_NUM_THREADS=8 (or OMP_NUM_THREADS=8)
Problem: i1048576, setup: 347.00 us, time: 6.64 ms, ``gflops'': 15.788
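The gflops figures above appear consistent with the common FFT benchmarking convention of counting 5 N log2(N) floating-point operations per complex transform; that convention is an assumption here, not something stated in the post. The sketch below applies it to the reported times and also computes the 1-thread to 8-thread speed-up. The program and function names are illustrative.

```fortran
program fft_gflops
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2**20
  real(dp) :: flops

  ! Assumed convention: 5 * N * log2(N) operations per complex FFT
  flops = 5.0_dp * real(n, dp) * log(real(n, dp)) / log(2.0_dp)

  ! Times in milliseconds as reported in the post
  print '(a,f8.4)', 'out-of-place, 1 thread : ', gflops(flops, 21.97_dp)
  print '(a,f8.4)', 'out-of-place, 8 threads: ', gflops(flops, 11.44_dp)
  print '(a,f8.4)', 'in-place,     1 thread : ', gflops(flops, 20.71_dp)
  print '(a,f8.4)', 'in-place,     8 threads: ', gflops(flops, 6.64_dp)

  ! Speed-up is the ratio of single-threaded to multi-threaded time
  print '(a,f6.2)', 'out-of-place speed-up  : ', 21.97_dp / 11.44_dp
  print '(a,f6.2)', 'in-place speed-up      : ', 20.71_dp / 6.64_dp

contains

  ! Convert an operation count and a time in milliseconds to gflops
  pure function gflops(nflops, time_ms) result(g)
    real(dp), intent(in) :: nflops, time_ms
    real(dp) :: g
    g = nflops / (time_ms * 1.0e-3_dp) / 1.0e9_dp
  end function gflops

end program fft_gflops
```

Under this assumption the computed values agree with the reported gflops to about three significant digits, and the speed-ups come out at roughly 1.9x out-of-place and 3.1x in-place.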
Hi Victor,
The timing that I cited excludes set-up time, so I guess your PC is a bit faster than mine.
I did link with the libraries as suggested. My compile and link lines are shown below
(is mkl_dfti.f90 suitable for multi-threaded use?):
ifort -c modules.f mkl_dfti.f90
ifort -extend_source -nowarn -align -Qzero -QxSSE2 -Qsave -Qopenmp -MT -Qmkl -c *.f
ifort -MT *.obj mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib /Qopenmp
Would it be possible to show me how you linked your code?
My code is written in Fortran.
I wonder if there is an issue in the way I am calling the MKL FFTs.
Is there any specific way of setting up the MKL calls?
I call and time the forward and backward FFT with:
etime_in1 = etime(rtm)
call zzfft(-1,n,1.d0,mctime,mctime,Table,Wsave,ISYS)
call zzfft(1,n,1.d0,mctime,mctime,Table,Wsave,ISYS)
etime_out1 = etime(rtm)
I set up the 1D FFT with the following:
if(ndir.eq.0)then
status = DftiFreeDescriptor(Desc_Handle)
status = dfticreatedescriptor(desc_handle, 36, 32, 1, n)
status = dfticommitdescriptor(desc_handle)
end if
c the forward FFT is called with Exy, which is complex*16
if(ndir.eq.-1)then
status = DftiComputeForward(Desc_Handle,Exy)
end if
if(ndir.eq.1)then
status = DftiComputeBackward(Desc_Handle,Exy)
end if
Thanks
Please note that I am running the ia32 build on a Windows XP x64 OS.
I think I found the problem. I was not initializing the FFT with MKL_NUM_THREADS set to the number of CPUs; i.e. I was always setting MKL_NUM_THREADS=1 for the initialization step,
even though I was setting MKL_NUM_THREADS > 1 for the actual forward or backward FFT.
However, the FFT speed-up is only 2x for a 2^18 FFT and only 33% for 2^20, while your in-place numbers show roughly a 3x speed-up for 2^20.
Do you know why this might happen? Is it CPU or cache dependent?
I am using an ia32 machine with two quad-core 5590 3.3 GHz CPUs. The L2 cache in my machine is 12 MB.
Thanks.
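The finding above suggests that the thread count must already be in effect when the descriptor is committed. Below is a minimal sketch of that ordering, using the MKL service routines mkl_set_num_threads and mkl_get_max_threads in place of the environment variable. The thread count of 8, the test data, and the use of system_clock for timing are assumptions for illustration; the external declarations assume default 4-byte integers.

```fortran
program fft_threads_sketch
  use MKL_DFTI
  implicit none
  integer, parameter :: n = 2**20
  integer, parameter :: nthreads = 8          ! assumption: 8 cores available
  type(DFTI_DESCRIPTOR), pointer :: desc_handle
  complex(kind=kind(1.0d0)), allocatable :: exy(:)
  integer :: status, t0, t1, rate
  integer, external :: mkl_get_max_threads    ! MKL service function
  external :: mkl_set_num_threads             ! MKL service subroutine

  ! Set the thread count BEFORE creating and committing the descriptor
  call mkl_set_num_threads(nthreads)
  print *, 'MKL max threads: ', mkl_get_max_threads()

  allocate(exy(n))
  exy = (1.0d0, 0.0d0)                        ! illustrative test data

  status = DftiCreateDescriptor(desc_handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, n)
  if (status /= DFTI_NO_ERROR) stop 'DftiCreateDescriptor failed'
  status = DftiCommitDescriptor(desc_handle)
  if (status /= DFTI_NO_ERROR) stop 'DftiCommitDescriptor failed'

  ! Time one forward plus one backward transform, excluding set-up
  call system_clock(t0, rate)
  status = DftiComputeForward(desc_handle, exy)
  status = DftiComputeBackward(desc_handle, exy)
  call system_clock(t1)
  print *, 'forward+backward time (ms): ', 1.0d3 * real(t1 - t0, kind(1.0d0)) / real(rate, kind(1.0d0))

  status = DftiFreeDescriptor(desc_handle)
  deallocate(exy)
end program fft_threads_sketch
```

Running the same program with nthreads set to 1 and then to 8 gives a like-for-like comparison, since the descriptor is committed under the same thread setting that is used for the compute calls.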