Software Archive
Read-only legacy content

Poor performance of FFT on the MIC

Cristian_Vasile_A_

Hello,

I would like to share my short experience with Xeon Phi and hope that the performance can be improved.

My program solves a problem similar to the heat equation. Each iteration contains 3 Fourier transforms and 2 simple matrix operations (N operations). I wrote an OpenMP program using the FFTW library. In order to run it with the MKL library, I compiled the FFTW wrappers provided with the compiler using the libmic option.

I ran the code on our cards using Intel 15.0.2 and MKL 5.0.2. I had time to test only 2 sizes (512x512 and 1024x1024). My main conclusion is that the MIC is slower than 2 or 4 cores.

 

I might be doing something wrong and I would like your opinions on this. We have 90 cards and they are seldom used. They would be a useful resource.

 

I do not know how to attach a file, so here is my code:

 


      program threadstest
      implicit none
      include "fftw3.f"
      INTEGER :: TID, OMP_GET_NUM_THREADS,OMP_GET_THREAD_NUM
      integer, parameter :: lcell=16,ncell=32
      integer, parameter :: lx=ncell*2*lcell,ly=ncell*2*lcell
      integer, parameter :: lxhp=lx/2+1
      integer, parameter :: nout=1000,nsteps=10
      integer :: nthreads =1
      double precision,allocatable :: psi(:,:),nt(:,:),bffr(:,:)
      double complex,allocatable :: ckpsi(:,:),bffk(:,:),cknt(:,:)
      integer*8 fppsi,bppsi
      integer iret,i,j,n,sss
      double precision :: t1,t2
      double precision :: at,qt,dx,dy,dkx,dky,rr,pm,dt,iV
      REAL :: rate
      INTEGER :: c1,c2,cr,cm,nn
      allocate(psi(lx,ly),nt(lx,ly),bffr(lx,ly))
      allocate(ckpsi(lx/2+1,ly),bffk(lx/2+1,ly),cknt(lx/2+1,ly))
      write(*,*) lx,ly
      ! First initialize the system_clock
      CALL system_clock(count_rate=cr)
      CALL system_clock(count_max=cm)
      rate = REAL(cr)
!      WRITE(*,*) "system_clock rate ",rate
!      call testthreads()

!      nthreads=omp_get_num_threads()
      write(*,*) omp_get_num_threads()     
      CALL OMP_SET_NUM_THREADS(nthreads)
      write (*,*)nthreads
!      write(*,*) omp_get_num_threads()
      call dfftw_init_threads(iret)
      call dfftw_plan_with_nthreads(nthreads)
      write(*,*) iret
      call dfftw_plan_dft_r2c_2d(fppsi,lx,ly,psi,ckpsi,
     *                           FFTW_ESTIMATE)
      call dfftw_plan_dft_c2r_2d(bppsi,lx,ly,ckpsi,psi,
     *                           FFTW_ESTIMATE)
      
      at=2.0*acos(-1.0)/(sqrt(3.0)/2.0)
      qt=2*acos(-1.0)/at
      dx=at/16.0
      dy=(at*sqrt(3.0)/2.0)/16.0
      dkx=2.0*acos(-1.0)/(Lx*dx)
      dky=2.0*acos(-1.0)/(Ly*dy)
      dt=0.25
      pm=-0.25
      rr=-0.25
      iV=1.0/(lx*ly)
      call random_number(psi(1:lx,1:ly))
      do j=0,ly-1
      do i=0,lx-1
      psi(i+1,j+1)=0.524*(cos(i*dx*qt)*cos(j*dy*qt/sqrt(3.0))
     *            +0.5*cos(j*dy*qt*2.0/sqrt(3.0)))+pm
     *            +(psi(i+1,j+1)*2.0-1.0)*0.05
      enddo
      enddo
      write(*,*) sum(psi)/(lx*ly)
      psi=psi+pm-sum(psi)/(lx*ly)
      
      call cpu_time(t1)
      CALL SYSTEM_CLOCK(c1)
      
      call energy()
      bffr=psi
      call dfftw_execute_dft_r2c(fppsi,bffr,bffk)
      
      bffk=bffk*iV
      
      do sss=1,nsteps
      
      do n=1,nout      
      call update()
      enddo
      
      call energy()
      
      bffr=psi
      call dfftw_execute_dft_r2c(fppsi,bffr,bffk)
      bffk=bffk*iV
      enddo
 
      call cpu_time(t2)
      CALL SYSTEM_CLOCK(c2)
      write(*,*) (t2-t1), 'seconds, total cpu time'
      write(*,*) (c2-c1)/rate, 'seconds. closer to run time'

      contains
      
      subroutine update()
      double precision :: clin,cnon,qx,qy,ksq
      double complex :: psitmp
      integer :: loci,locj, omp_get_num_threads

!$OMP PARALLEL private(loci,locj)
!      write(*,*) omp_get_num_threads()
!$OMP DO
!      write(*,*) omp_get_num_threads()
      do locj=1,ly
      do loci=1,lx
      nt(loci,locj)=psi(loci,locj)**3
      enddo
      enddo
!$OMP END PARALLEL
      
      call dfftw_execute_dft_r2c(fppsi,nt,cknt)
!      cknt=cknt*iV
!$OMP PARALLEL PRIVATE(LOCI,LOCJ,QX,QY,KSQ,CLIN,CNON,PSITMP)
!$OMP DO
      do locj=1,ly
      
      if(locj.le.ly/2+1) then
      qy=(locj-1)*dky
      else
      qy=(locj-1-ly)*dky
      endif
      
      do loci=1,lx/2+1
      qx=(loci-1)*dkx
      ksq=-(qx*qx+qy*qy)
      
      clin=1.0/(1.0-dt*ksq*(rr+1.0+2.0*ksq+ksq*ksq))
      cnon=ksq*dt/(1.0-dt*ksq*(rr+1.0+2.0*ksq+ksq*ksq))
      
      psitmp=bffk(loci,locj)*clin+(cknt(loci,locj)*iV)*cnon
      
      
      bffk(loci,locj)=psitmp
      ckpsi(loci,locj)=psitmp
      enddo
      enddo
!$OMP END PARALLEL
      
      call dfftw_execute_dft_c2r(bppsi,ckpsi,psi)
      
      end subroutine
      subroutine energy()
      double precision :: ccc,qx,qy,ksq
      integer :: loci,locj
      
!$OMP PARALLEL private(loci,locj)
!$OMP DO
      do locj=1,ly
      do loci=1,lx
      bffr(loci,locj)=psi(loci,locj)
      enddo
      enddo
!$OMP END PARALLEL
      call dfftw_execute_dft_r2c(fppsi,bffr,bffk)
      
!      bffk=bffk*iV
!$OMP PARALLEL PRIVATE(LOCI,LOCJ,QX,QY,KSQ,CCC)
!$OMP DO
      do locj=1,ly
      
      if(locj.le.ly/2+1) then
      qy=(locj-1)*dky
      else
      qy=(locj-1-ly)*dky
      endif
      
      do loci=1,lx/2+1
      qx=(loci-1)*dkx
      ksq=-(qx*qx+qy*qy)
      
      ccc=(rr+1.0+2.0*ksq+ksq*ksq)
      
      bffk(loci,locj)=(bffk(loci,locj)*iV)*ccc
      enddo
      enddo
!$OMP END PARALLEL
      
      call dfftw_execute_dft_c2r(bppsi,bffk,bffr)
      
      write(*,*) sum(0.5*psi*bffr+0.25*psi**4)*iV
      end subroutine
      
      end program

Evgueni_P_Intel
Employee

Hi Cristian Vasile A.

 

https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors

 

McCalpinJohn
Honored Contributor III

I have not done a lot of FFT work on the MIC, but I have tested 1 million (2^20) element single precision complex FFTs and gotten good performance.   I started with the code at the link above and modified it to do a 1D FFT of size 2^20 instead of a 2D FFT of size 2^10 x 2^10. 
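(For anyone wanting to reproduce a test along these lines, here is a rough Fortran sketch, not the benchmark code referenced above, of a 2^20 single-precision complex 1D transform through MKL's DFTI interface. mkl_dfti.f90 from the MKL include directory must be compiled first so that the mkl_dfti module is available; threading and timing are left out.)

      program fft1d
      use mkl_dfti
      implicit none
      integer, parameter :: n = 2**20
      complex, allocatable :: x(:)
      type(DFTI_DESCRIPTOR), pointer :: h
      integer :: st
      allocate(x(n))
      x = (1.0, 0.0)
! single-precision complex 1D descriptor of length 2**20
      st = DftiCreateDescriptor(h, DFTI_SINGLE, DFTI_COMPLEX, 1, n)
      st = DftiCommitDescriptor(h)
      st = DftiComputeForward(h, x)   ! in-place forward transform
      st = DftiFreeDescriptor(h)
      write(*,*) 'last status ', st
      end program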

Performance was good, with 80% of the runs delivering 112 GFLOPS to 119 GFLOPS using 60 threads on 60 cores, increasing to 122-126 GFLOPS using 120 threads on 60 cores.  (Performance dropped a little bit using 3 or 4 threads per core.)   

Transparent large pages improve the performance significantly, with 180 threads giving a median of 163 GFLOPS on this test case.

On an 8-core Xeon E5-2680 (Sandy Bridge, 2.7 GHz nominal, 3.1 GHz max all-core Turbo), I got about 75 GFLOPS on the same problem with MKL and using 8 cores on 1 socket, increasing to just over 120 GFLOPS using all 16 cores (both sockets).   Using FFTW3 and "PATIENT" planning, I got better results on 1 socket -- 93 GFLOPS with 8 threads -- but did not see any improvement going to more threads.  (I might not have set up the memory affinity correctly, but since I was not interested in the 2-socket case I did not follow up on this.)
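(Side note on the planner flags: with the FFTW3 library they are set where the plans are created. A sketch against the plan calls in the posted code, using the standard FFTW_PATIENT flag from fftw3.f; planning takes longer, execution is usually faster, and since the plans are created before psi is filled, it does not matter that measuring overwrites the arrays.)

      call dfftw_plan_dft_r2c_2d(fppsi,lx,ly,psi,ckpsi,
     *                           FFTW_PATIENT)
      call dfftw_plan_dft_c2r_2d(bppsi,lx,ly,ckpsi,psi,
     *                           FFTW_PATIENT)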

So the Xeon Phi should be capable of running problems in this size range at good speeds. 

You might want to separate the FFTs for testing purposes so that you can time them in isolation, then add them back in and make sure that the timing for the FFT step does not change significantly.
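(A standalone sketch of that kind of isolated timing, reusing the same FFTW calls as the posted code; lx, ly, nthr, and nrep here are placeholders to adjust. The r2c/c2r pair is rescaled each pass so the values stay bounded.)

      program ffttime
      implicit none
      include "fftw3.f"
      integer, parameter :: lx=1024, ly=1024, nrep=100, nthr=120
      double precision, allocatable :: r(:,:)
      double complex, allocatable :: c(:,:)
      integer*8 pf, pb
      integer :: iret, n, cr, c1, c2
      allocate(r(lx,ly), c(lx/2+1,ly))
      call system_clock(count_rate=cr)
      call dfftw_init_threads(iret)
      call dfftw_plan_with_nthreads(nthr)
      call dfftw_plan_dft_r2c_2d(pf,lx,ly,r,c,FFTW_ESTIMATE)
      call dfftw_plan_dft_c2r_2d(pb,lx,ly,c,r,FFTW_ESTIMATE)
      call random_number(r)
      call system_clock(c1)
      do n=1,nrep
        call dfftw_execute_dft_r2c(pf,r,c)
        call dfftw_execute_dft_c2r(pb,c,r)
! rescale: FFTW's c2r leaves an lx*ly factor on the data
        r = r*(1.0d0/dble(lx*ly))
      enddo
      call system_clock(c2)
      write(*,*) dble(c2-c1)/dble(cr)/nrep, 's per r2c+c2r pair'
      end program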

Cristian_Vasile_A_

Hello,

Thank you for your replies. I will try to follow the tutorial. I suppose I am not the first one trying to do FFTs on the Xeon Phi, so I hope people who have some experience with this will share their successes. I know performance can vary for different sizes, but can a Xeon Phi beat 2 Haswell CPUs? We have quite good access to a cluster with nodes containing 2 Haswell CPUs or 2 Sandy Bridge CPUs, and it would be difficult to justify spending money on cards, or time on programming, if we cannot beat those. The FFTs are more than 99% of the computation, so I think the code I have is a good test case.

 

The tutorial assumes C. Is it doable in Fortran?

TimP
Honored Contributor III

MKL calls are at least as easy from Fortran.  With ifort, you have the advantage of the option -align array64byte, in case you don't want to embed the alignment directives.
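(For reference, the directive form looks like this -- a sketch for a statically sized array named work, which is only an illustration; the -align array64byte option covers the allocatable arrays in the posted code without any source change.)

      subroutine aligned_work
      implicit none
      double precision :: work(256,256)
!DIR$ ATTRIBUTES ALIGN : 64 :: work
      work = 0.0d0
      end subroutine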

If you're not being charged for time on the Haswell servers but are being charged for development or use of MIC, it seems you might not want to make the effort to run your FFT on both the host and/or multiple coprocessors so as to improve your elapsed time.  I suppose you may need to run parallel FFTs (multiple cases in parallel) to see an advantage from running many host or coprocessor cores on relatively small problems.

McCalpinJohn
Honored Contributor III

I don't expect that a single Xeon Phi could beat 2 Haswells -- maybe if you got lucky for a few carefully chosen problem sizes...

The 2^20 single-precision complex FFT that I reported above shows better performance on Haswell than on Sandy Bridge.   I don't have MKL results, but FFTW3 using 8 threads in one socket delivered up to 146 GFLOPS on a Xeon E5-2690 v3.      The Xeon E5-2680 (Sandy Bridge EP) delivered ~93 GFLOPS (8 cores in 1 socket) in systems with one dual-rank DIMM per channel and ~125 GFLOPS (8 cores in 1 socket) in systems with two dual-rank DIMMs per channel.  This suggests (very strongly) that this test case is falling out of the L3 cache and running into lots of DRAM bank conflicts in the single-DIMM Sandy Bridge system.  The Haswell system has more L3 cache (30 MiB vs 25 MiB), and each rank of DDR4 has twice as many banks as each rank of DDR3 (16 vs 8) (though the DDR4 sub-banking does not provide as much generality as 2 ranks of DDR3), so these results are not surprising.

The ~163 GFLOPS that I got on Xeon Phi is still higher than the ~146 GFLOPS on a single socket of Haswell, but not by very much.

Cristian_Vasile_A_

Hello,

 

Maybe 'beat' is too strong a word; I hope the performance is at least on the same level. I have tested 1 node (with Haswell CPUs) vs. CUDA (K40), and these are the running times:

These runs were done with 24 cores using Haswell CPUs.

                      time (s)
   Lx      Ly        MPI       CUDA
  768     768        5.9        8
 1563    1536       40         34.5
 3072    3072      156        140
 4608    4608      454        296
 6144    6144      873        626

Since the K40 has a theoretical (double-precision) performance close to that of the Xeon Phi (MIC), I was hoping for similar performance.

I did not have time to try the tips from the tutorial yet, but I will get to it in the next few days.

One reason to use accelerators is also efficiency: one node with 2 CPUs consumes 24 CPU hours for every hour of run time, while a run using accelerators consumes only 2. The ability to use accelerators with so few changes to the OpenMP codes is also very attractive, but performance is important as well.

John D. McCalpin wrote:

I don't expect that a single Xeon Phi could beat 2 Haswells -- maybe if you got lucky for a few carefully chosen problem sizes...

The 2^20 single-precision complex FFT that I reported above shows better performance on Haswell than on Sandy Bridge.   I don't have MKL results, but FFTW3 using 8 threads in one socket delivered up to 146 GFLOPS on a Xeon E5-2690 v3.      The Xeon E5-2680 (Sandy Bridge EP) delivered ~93 GFLOPS (8 cores in 1 socket) in systems with one dual-rank DIMM per channel and ~125 GFLOPS (8 cores in 1 socket) in systems with two dual-rank DIMMs per channel.  This suggests (very strongly) that this test case is falling out of the L3 cache and running into lots of DRAM bank conflicts in the single-DIMM Sandy Bridge system.  The Haswell system has more L3 cache (30 MiB vs 25 MiB), and each rank of DDR4 has twice as many banks as each rank of DDR3 (16 vs 8) (though the DDR4 sub-banking does not provide as much generality as 2 ranks of DDR3), so these results are not surprising.

The ~163 GFLOPS that I got on Xeon Phi is still higher than the ~146 GFLOPS on a single socket of Haswell, but not by very much.

Evgueni_P_Intel
Employee

Hi Cristian Vasile A.,

Specific problems with Intel MKL performance on Xeon Phi can be addressed to the Intel MKL Forum:
https://software.intel.com/en-us/forums/intel-math-kernel-library
Just post a reproducer there.

BTW, the code that you posted does not contain any MPI calls. How are you using MPI? Are you comparing against cuFFT-XT?

Evgueni.

JJK
New Contributor III

According to specs I found online, a single Haswell is capable of ~300 GFLOPS using AVX2, or even ~600 GFLOPS using AVX2+FMA. A Xeon Phi 5000 series should be able to reach 1 TFLOPS. These are, of course, theoretical maxima, but yes, in theory a dual-socket Haswell will beat a single Xeon Phi.

FWIW: I've got a dual-socket E5-2697 v3 for testing, ran a very simple OpenMP-based pi calculation example on it, and compared the results to the Xeon Phi. The Xeon Phi is slightly faster, needing 0.157 seconds to do 2 billion iterations using 239 threads.

The dual-socket Haswell (2x14 cores + HT) needed 0.180 seconds using 54 threads (the optimal number of threads).

So for this particular (artificial) test, the Phi does beat the 2 Haswells.
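(For reference, a minimal sketch of the kind of OpenMP pi test described -- not the actual code used above: midpoint rule for the integral of 4/(1+x^2) on [0,1] with 2 billion steps; the thread count comes from OMP_NUM_THREADS.)

      program pitest
      implicit none
      integer*8, parameter :: nsteps = 2000000000_8
      integer*8 :: i
      double precision :: x, s, step
      step = 1.0d0/dble(nsteps)
      s = 0.0d0
!$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:s)
      do i = 1, nsteps
        x = (dble(i)-0.5d0)*step
        s = s + 4.0d0/(1.0d0+x*x)
      enddo
!$OMP END PARALLEL DO
      write(*,*) 'pi ~', s*step
      end program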

 

Cristian_Vasile_A_

Hello,

 

I apologize for the confusion. I have several codes doing the same thing (a parallel FFT), and the OpenMP version gives almost the same performance as MPI when I use only one node of our clusters.

I tried an OpenMP version of the FFT code because it requires no changes to run it on the MIC in native mode.

 

Cristian

 

Evgueni Petrov aka espetrov (Intel) wrote:

Hi Cristian Vasile A.,

 

Specific problems with Intel MKL performance on Xeon Phi can be addressed to the Intel MKL Forum:

https://software.intel.com/en-us/forums/intel-math-kernel-library
Just post a reproducer there.

BTW, the code that you posted does not contain any MPI calls. How are you using MPI? Are you comparing against cuFFT-XT?

 

Evgueni.
        

 

Cristian_Vasile_A_
Hello,

Some updates on my trials. I tried to follow the tutorial mentioned before, and I guess things got better with some of the settings. These are the run times:

                    time (s)
   Lx      Ly      2 HSW      MIC
  768     768        6         92
 1563    1536       40        202
 3072    3072      156        565

The scaling looks good, so it might catch up with the Xeon for very large matrices, but otherwise it is still slow.