Hello,
I would like to share my short experience with the Xeon Phi, in the hope that the performance can be improved.
My program solves a problem similar to the heat equation. Each iteration involves 3 Fourier transforms and 2 simple matrix operations (O(N) operations). I wrote an OpenMP program using the FFTW library; in order to run it with the MKL library, I compiled the FFTW wrappers provided with the compiler using the libmic option.
I ran the code on our cards using Intel 15.0.2 and MKL 5.0.2. I only had time to test 2 sizes (512x512 and 1024x1024). My main conclusion is that the MIC is slower than 2 or 4 host cores.
I might be doing something wrong, and I would like your opinions on this. We have 90 cards and they are seldom used; they would be a useful resource.
I do not know how to attach a file, so here is my code:
      program threadstest
      implicit none
      include "fftw3.f"

      INTEGER :: TID, OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
      integer, parameter :: lcell=16, ncell=32
      integer, parameter :: lx=ncell*2*lcell, ly=ncell*2*lcell
      integer, parameter :: lxhp=lx/2+1
      integer, parameter :: nout=1000, nsteps=10
      integer :: nthreads = 1
      double precision, allocatable :: psi(:,:), nt(:,:), bffr(:,:)
      double complex, allocatable :: ckpsi(:,:), bffk(:,:), cknt(:,:)
      integer*8 fppsi, bppsi
      integer iret, i, j, n, sss
      double precision :: t1, t2
      double precision :: at, qt, dx, dy, dkx, dky, rr, pm, dt, iV
      REAL :: rate
      INTEGER :: c1, c2, cr, cm, nn

      allocate(psi(lx,ly), nt(lx,ly), bffr(lx,ly))
      allocate(ckpsi(lx/2+1,ly), bffk(lx/2+1,ly), cknt(lx/2+1,ly))
      write(*,*) lx, ly

      ! First initialize the system_clock
      CALL system_clock(count_rate=cr)
      CALL system_clock(count_max=cm)
      rate = REAL(cr)
      ! WRITE(*,*) "system_clock rate ",rate

      ! call testthreads()
      ! nthreads=omp_get_num_threads()
      write(*,*) omp_get_num_threads()
      CALL OMP_SET_NUM_THREADS(nthreads)
      write(*,*) nthreads
      ! write(*,*) omp_get_num_threads()

      call dfftw_init_threads(iret)
      call dfftw_plan_with_nthreads(nthreads)
      write(*,*) iret
      call dfftw_plan_dft_r2c_2d(fppsi, lx, ly, psi, ckpsi,
     *                           FFTW_ESTIMATE)
      call dfftw_plan_dft_c2r_2d(bppsi, lx, ly, ckpsi, psi,
     *                           FFTW_ESTIMATE)

      at  = 2.0*acos(-1.0)/(sqrt(3.0)/2.0)
      qt  = 2*acos(-1.0)/at
      dx  = at/16.0
      dy  = (at*sqrt(3.0)/2.0)/16.0
      dkx = 2.0*acos(-1.0)/(Lx*dx)
      dky = 2.0*acos(-1.0)/(Ly*dy)
      dt  = 0.25
      pm  = -0.25
      rr  = -0.25
      iV  = 1.0/(lx*ly)

      call random_number(psi(1:lx,1:ly))
      do j = 0, ly-1
        do i = 0, lx-1
          psi(i+1,j+1) = 0.524*(cos(i*dx*qt)*cos(j*dy*qt/sqrt(3.0))
     *                 + 0.5*cos(j*dy*qt*2.0/sqrt(3.0))) + pm
     *                 + (psi(i+1,j+1)*2.0-1.0)*0.05
        enddo
      enddo
      write(*,*) sum(psi)/(lx*ly)
      psi = psi + pm - sum(psi)/(lx*ly)

      call cpu_time(t1)
      CALL SYSTEM_CLOCK(c1)

      call energy()
      bffr = psi
      call dfftw_execute_dft_r2c(fppsi, bffr, bffk)
      bffk = bffk*iV
      do sss = 1, nsteps
        do n = 1, nout
          call update()
        enddo
        call energy()
        bffr = psi
        call dfftw_execute_dft_r2c(fppsi, bffr, bffk)
        bffk = bffk*iV
      enddo

      call cpu_time(t2)
      CALL SYSTEM_CLOCK(c2)
      write(*,*) (t2-t1), 'seconds, total cpu time'
      write(*,*) (c2-c1)/rate, 'seconds. closer to run time'

      contains

      subroutine update()
      double precision :: clin, cnon, qx, qy, ksq
      double complex :: psitmp
      integer :: loci, locj, omp_get_num_threads
!$OMP PARALLEL private(loci,locj)
      ! write(*,*) omp_get_num_threads()
!$OMP DO
      ! write(*,*) omp_get_num_threads()
      do locj = 1, ly
        do loci = 1, lx
          nt(loci,locj) = psi(loci,locj)**3
        enddo
      enddo
!$OMP END PARALLEL
      call dfftw_execute_dft_r2c(fppsi, nt, cknt)
      ! cknt=cknt*iV
!$OMP PARALLEL PRIVATE(LOCI,LOCJ,QX,QY,KSQ,CLIN,CNON,PSITMP)
!$OMP DO
      do locj = 1, ly
        if (locj .le. ly/2+1) then
          qy = (locj-1)*dky
        else
          qy = (locj-1-ly)*dky
        endif
        do loci = 1, lx/2+1
          qx   = (loci-1)*dkx
          ksq  = -(qx*qx+qy*qy)
          clin = 1.0/(1.0-dt*ksq*(rr+1.0+2.0*ksq+ksq*ksq))
          cnon = ksq*dt/(1.0-dt*ksq*(rr+1.0+2.0*ksq+ksq*ksq))
          psitmp = bffk(loci,locj)*clin + (cknt(loci,locj)*iV)*cnon
          bffk(loci,locj)  = psitmp
          ckpsi(loci,locj) = psitmp
        enddo
      enddo
!$OMP END PARALLEL
      call dfftw_execute_dft_c2r(bppsi, ckpsi, psi)
      end subroutine

      subroutine energy()
      double precision :: ccc, qx, qy, ksq
      integer :: loci, locj
!$OMP PARALLEL private(loci,locj)
!$OMP DO
      do locj = 1, ly
        do loci = 1, lx
          bffr(loci,locj) = psi(loci,locj)
        enddo
      enddo
!$OMP END PARALLEL
      call dfftw_execute_dft_r2c(fppsi, bffr, bffk)
      ! bffk=bffk*iV
!$OMP PARALLEL PRIVATE(LOCI,LOCJ,QX,QY,KSQ,CCC)
!$OMP DO
      do locj = 1, ly
        if (locj .le. ly/2+1) then
          qy = (locj-1)*dky
        else
          qy = (locj-1-ly)*dky
        endif
        do loci = 1, lx/2+1
          qx  = (loci-1)*dkx
          ksq = -(qx*qx+qy*qy)
          ccc = (rr+1.0+2.0*ksq+ksq*ksq)
          bffk(loci,locj) = (bffk(loci,locj)*iV)*ccc
        enddo
      enddo
!$OMP END PARALLEL
      call dfftw_execute_dft_c2r(bppsi, bffk, bffr)
      write(*,*) sum(0.5*psi*bffr + 0.25*psi**4)*iV
      end subroutine

      end program
I have not done a lot of FFT work on the MIC, but I have tested 1 million (2^20) element single precision complex FFTs and gotten good performance. I started with the code at the link above and modified it to do a 1D FFT of size 2^20 instead of a 2D FFT of size 2^10 x 2^10.
Performance was good, with 80% of the runs delivering 112 GFLOPS to 119 GFLOPS using 60 threads on 60 cores, increasing to 122-126 GFLOPS using 120 threads on 60 cores. (Performance dropped a little bit using 3 or 4 threads per core.)
Transparent large pages improve the performance significantly, with 180 threads giving a median of 163 GFLOPS on this test case.
On an 8-core Xeon E5-2680 (Sandy Bridge, 2.7 GHz nominal, 3.1 GHz max all-core Turbo), I got about 75 GFLOPS on the same problem with MKL and using 8 cores on 1 socket, increasing to just over 120 GFLOPS using all 16 cores (both sockets). Using FFTW3 and "PATIENT" planning, I got better results on 1 socket -- 93 GFLOPS with 8 threads -- but did not see any improvement going to more threads. (I might not have set up the memory affinity correctly, but since I was not interested in the 2-socket case I did not follow up on this.)
So the Xeon Phi should be capable of running problems in this size range at good speeds.
You might want to separate the FFTs for testing purposes so that you can time them in isolation, then add them back in and make sure that the timing for the FFT step does not change significantly.
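For example, a standalone timing loop along these lines would do it (an untested sketch that reuses the FFTW wrapper calls and system_clock timing from the code above; the 1024x1024 size, the repetition count, the thread handling, and the nominal 2.5*N*log2(N) operation count are assumptions for illustration, not anything measured in this thread):

program ffttime
  implicit none
  include "fftw3.f"
  integer, parameter :: lx = 1024, ly = 1024   ! placeholder problem size
  integer, parameter :: nrep = 100             ! placeholder repetition count
  double precision, allocatable :: r(:,:)
  double complex,   allocatable :: c(:,:)
  integer*8 :: fplan
  integer :: iret, n, c1, c2, cr
  integer, external :: omp_get_max_threads
  double precision :: secs, ops

  allocate(r(lx,ly), c(lx/2+1,ly))

  call dfftw_init_threads(iret)
  call dfftw_plan_with_nthreads(omp_get_max_threads())
  call dfftw_plan_dft_r2c_2d(fplan, lx, ly, r, c, FFTW_MEASURE)
  call random_number(r)                        ! fill after planning (MEASURE may clobber arrays)

  call system_clock(count_rate=cr)
  call system_clock(c1)
  do n = 1, nrep
     call dfftw_execute_dft_r2c(fplan, r, c)   ! out-of-place r2c leaves r intact
  enddo
  call system_clock(c2)
  secs = dble(c2 - c1)/dble(cr)

  ! nominal ~2.5*N*log2(N) operations per real-to-complex transform of N points
  ops = 2.5d0*dble(lx)*dble(ly)*log(dble(lx)*dble(ly))/log(2.0d0)
  write(*,*) secs/nrep, ' s per transform, ~', nrep*ops/secs/1.0d9, ' nominal GFLOPS'
end program ffttime

Running something like this on the host and on the card, with the same size as the production code, would show whether the FFT step itself is the problem or whether the slowdown comes from the rest of the iteration.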
Hello,
Thank you for your replies. I will try to follow the tutorial. I suppose I am not the first one trying to do FFTs on the Xeon Phi, so I hope people who have some experience with this will share their results. I know performance can vary for different sizes, but can a Xeon Phi beat 2 Haswell CPUs? We have quite good access to a cluster with nodes containing either 2 Haswell CPUs or 2 Sandy Bridge CPUs, and it would be difficult to justify spending money on the cards, or time on programming them, if we cannot beat those. The FFTs are more than 99% of the computation, so I think the code I have is a good test case.
The tutorial assumes C. Is it doable in Fortran?
MKL calls are at least as easy from Fortran. With ifort, you also have the advantage of the -align array64byte option, in case you don't want to embed the alignment directives.
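For example, the directive form looks like this (a minimal sketch of the ifort extension; the array name and size are placeholders, and for allocatable arrays the compiler option or an aligned allocation may still be the simpler route, depending on compiler version):

  ! per-array alternative to building everything with -align array64byte (ifort directive)
  double precision :: work(1024, 1024)
  !DIR$ ATTRIBUTES ALIGN : 64 :: work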
If you're not being charged for time on the Haswell servers but are being charged for development on or use of the MIC, it may not seem worth the effort to run your FFT on both the host and/or multiple coprocessors so as to improve your elapsed time. I suppose you may need to run the FFTs in parallel (multiple cases at once) to see an advantage from running many host or coprocessor cores on relatively small problems.
I don't expect that a single Xeon Phi could beat 2 Haswells -- maybe if you got lucky for a few carefully chosen problem sizes...
The 2^20 single-precision complex FFT that I reported above shows better performance on Haswell than on Sandy Bridge. I don't have MKL results, but FFTW3 using 8 threads in one socket delivered up to 146 GFLOPS on a Xeon E5-2690 v3. The Xeon E5-2680 (Sandy Bridge EP) delivered ~93 GFLOPS (8 cores in 1 socket) in systems with one dual-rank DIMM per channel and ~125 GFLOPS (8 cores in 1 socket) in systems with two dual-rank DIMMs per channel. This suggests (very strongly) that this test case is falling out of the L3 cache and running into lots of DRAM bank conflicts in the single-DIMM Sandy Bridge system. The Haswell system has more L3 cache (30 MiB vs 25 MiB), and each rank of DDR4 has twice as many banks as each rank of DDR3 (16 vs 8) (though the DDR4 sub-banking does not provide as much generality as 2 ranks of DDR3), so these results are not surprising.
The ~163 GFLOPS that I got on Xeon Phi is still higher than the ~146 GFLOPS on a single socket of Haswell, but not by very much.
Hello,
Maybe "beating" is too strong a word; I hope the performance is at least on the same level. I have tested 1 node (with Haswell CPUs) against CUDA (K40), and these are the running times.
The CPU runs were done with 24 cores on the Haswell node.
Lx      Ly      MPI time (s)   CUDA time (s)
768     768          5.9            8
1563    1536        40             34.5
3072    3072       156            140
4608    4608       454            296
6144    6144       873            626
Since the K40's theoretical double-precision performance is close to that of the Xeon Phi (MIC), I was hoping for similar performance.
I did not have time to try the tips above yet, but I will get to them in the next few days.
One reason to use accelerators is also efficiency: one node with 2 CPUs consumes 24 CPU hours, while a run using accelerators consumes only 2. The ability to use accelerators with so few changes to the OpenMP code is also very attractive, but performance is important too.
According to specs I found online, a single Haswell is capable of ~300 GFLOPS using AVX2, or even ~600 GFLOPS using AVX2+FMA. A Xeon Phi 5000-series card should be able to achieve 1 TFLOPS. These are, of course, theoretical maxima, but yes, in theory a dual-socket Haswell will beat a single Xeon Phi.
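(For reference, those peak figures follow from cores x DP FLOP/cycle x clock. Assuming, say, a 14-core Haswell at ~2.6 GHz and a 60-core Phi at ~1.05 GHz: 14 x 8 x 2.6 is roughly 290 GFLOPS with plain AVX2, 14 x 16 x 2.6 is roughly 580 GFLOPS with FMA, and 60 x 16 x 1.05 is roughly 1010 GFLOPS for the Phi. The core counts and clocks here are assumptions for illustration only.)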
FWIW: I have a dual-socket E5-2697 v3 for testing and ran a very simple OpenMP-based pi calculation example on it, comparing the results to the Xeon Phi. The Xeon Phi is slightly faster, needing 0.157 seconds for 2 billion iterations using 239 threads.
The dual-socket Haswell (2x14 cores + HT) needed 0.180 seconds using 54 threads (the optimal thread count).
So for this particular (artificial) test, the Phi does beat the 2 Haswells.
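For reference, the kind of OpenMP pi kernel meant here looks roughly like this (a sketch using the usual midpoint-rule formulation, not the exact code behind the numbers above; only the 2 billion iteration count is taken from the post, everything else is a placeholder):

program pi_omp
  use omp_lib
  implicit none
  integer(8), parameter :: n = 2000000000_8     ! 2 billion iterations, as above
  integer(8) :: i
  double precision :: h, x, s, pi_est, t0, t1

  h = 1.0d0 / dble(n)
  s = 0.0d0
  t0 = omp_get_wtime()
!$omp parallel do private(x) reduction(+:s)
  do i = 1, n
     x = (dble(i) - 0.5d0) * h
     s = s + 4.0d0 / (1.0d0 + x*x)               ! integrate 4/(1+x^2) on [0,1]
  enddo
!$omp end parallel do
  pi_est = s * h
  t1 = omp_get_wtime()
  write(*,*) 'pi ~', pi_est, '  wall time (s):', t1 - t0
end program pi_omp

A kernel like this is compute-bound with almost no memory traffic, which is the friendliest possible case for the Phi; an FFT-dominated code stresses memory bandwidth and will behave quite differently.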
Hello,
I apologize for the confusion. I have several codes doing the same thing (parallel FFT), and the OpenMP version gives almost the same performance as the MPI version when I use only one node of our cluster.
I tried an OpenMP version of the FFT code because it requires no changes to run on the MIC in native mode.
Cristian
Evgueni Petrov aka espetrov (Intel) wrote:
Specific problems with Intel MKL performance on the Xeon Phi can be raised on the Intel MKL forum:
https://software.intel.com/en-us/forums/intel-math-kernel-library
Just post a reproducer there. BTW, the code that you posted does not contain any MPI calls. How are you using MPI? Do you compare against cuFFT-XT?
Evgueni.