Poor DGEMM performance on MIC

PKM · ‎08-04-2015

Hi

I have implemented an embarrassingly parallel part of a code for Xeon Phi offload, and the performance ends up being bottlenecked by matrix-matrix multiplications. On regular Xeons I get decent performance, but on the Xeon Phi the result is horrible. The small benchmark program below captures the essence of the problem quite well.

Basically, I end up doing DGEMMs at 76 GFlops on a single Xeon E5-2650 v2 but only 51 GFlops on a Xeon Phi 3120. In other words, around 50% of peak performance on the Xeon but only 5% on the Phi. I thought the Phi was supposed to shine in DGEMM?
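
For reference, the peak percentages quoted above can be reproduced with a back-of-the-envelope calculation. This is only a sketch: the core counts, clock rates, and flops per cycle are assumptions taken from the published specifications of the two parts, not values measured in this thread.

```fortran
program peak_fraction
  implicit none
  integer, parameter :: dp = kind(1d0)
  ! Assumed hardware figures (published specs, not measured here):
  !   Xeon E5-2650 v2: 8 cores, 2.6 GHz base clock, 8 DP flops/cycle/core (AVX)
  !   Xeon Phi 3120:   57 cores, 1.1 GHz, 16 DP flops/cycle/core (512-bit FMA)
  real(dp), parameter :: cpu_peak = 8d0 * 2.6d0 * 8d0      ! GFlops
  real(dp), parameter :: mic_peak = 57d0 * 1.1d0 * 16d0    ! GFlops
  ! Measured DGEMM rates reported in the post
  real(dp), parameter :: cpu_meas = 76d0, mic_meas = 51d0

  print '(a,f7.1,a,f5.1,a)', 'CPU: peak ', cpu_peak, ' GFlops, achieved ', &
        100d0 * cpu_meas / cpu_peak, ' % of peak'
  print '(a,f7.1,a,f5.1,a)', 'MIC: peak ', mic_peak, ' GFlops, achieved ', &
        100d0 * mic_meas / mic_peak, ' % of peak'
end program peak_fraction
```

Under those assumptions the Xeon peak is about 166 GFlops (so 76 GFlops is roughly 46%) and the Phi peak is about 1003 GFlops (so 51 GFlops is roughly 5%).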


    program Console1
    use omp_lib
    use ifport
    implicit none
    integer :: NReps,I,M,K,N
    parameter (M=21,K=88,N=12)
    real*8 :: AMat(M,K)
    real*8 :: BMat(K,N)
    real*8 :: CMat(M,N)
    save AMat,BMat,CMat
    real*8  :: TBegin,TEnd,Sum,Alpha,Beta
    logical :: Success
    integer :: NThreads   ! thread count is an integer, not a logical
!$OMP THREADPRIVATE(AMat,BMat,CMat)  
!dir$ attributes offload : mic :: AMat,BMat,CMat,NReps,I,TBegin,TEnd,Sum,Alpha,Beta
!DIR$ ATTRIBUTES ALIGN : 64    :: AMat,BMat,CMat   
    
    !Common settings
    success=SETENVQQ("KMP_AFFINITY=scatter")
    Alpha=1.0
    Beta=0.0
    
    !CPU benchmark
    NThreads=8
    NReps=1000000
    call OMP_SET_NUM_THREADS(NThreads)
    TBegin=OMP_GET_WTIME()
    Sum=0.0
!$OMP PARALLEL Default(PRIVATE) SHARED(NReps,Alpha,Beta) REDUCTION(+:Sum)
    AMat(:,:)=0.01
    BMat(:,:)=0.02
    do i=1,NReps
      !CMat=matmul(AMat,BMat)
      CALL DGEMM('N','N',M,N,K,Alpha,AMat,M,BMat,K,Beta,CMat,M)  
      AMat(1,1)=CMat(1,1)*1.01
      BMat(1,1)=CMat(1,2)*1.02
    end do 
    Sum=Sum+CMat(1,1)+AMat(1,1)+BMat(1,1)
!$OMP END PARALLEL
    TEnd=OMP_GET_WTIME()  
    print *,'Sum=',Sum
    print *,'CPU GFlops=',2.0*M*N*K*NReps*NThreads/((TEnd-TBegin)*1d9)
    
    
    !MIC benchmark
    NThreads=224
    NReps=100000
!DIR$ OFFLOAD BEGIN TARGET(mic:0) INOUT(AMat,BMat,CMat,NReps,TBegin,TEnd,Sum,NThreads)
    call OMP_SET_NUM_THREADS(NThreads)
    TBegin=OMP_GET_WTIME()
    Sum=0.0
!$OMP PARALLEL Default(PRIVATE) SHARED(NReps,Alpha,Beta) REDUCTION(+:Sum)
    AMat(:,:)=0.01
    BMat(:,:)=0.02
    do i=1,NReps
      !CMat=matmul(AMat,BMat)    
      CALL DGEMM('N','N',M,N,K,Alpha,AMat,M,BMat,K,Beta,CMat,M) 
      AMat(1,1)=CMat(1,1)*1.01
      BMat(1,1)=CMat(1,2)*1.02
    end do
    Sum=Sum+CMat(1,1)+AMat(1,1)+BMat(1,1)
!$OMP END PARALLEL
    TEnd=OMP_GET_WTIME()   
!DIR$ END OFFLOAD 
    print *,'Sum=',Sum
    print *,'MIC GFlops=',2.0*M*N*K*NReps*NThreads/((TEnd-TBegin)*1d9)
    
    end program Console1

TimP · ‎08-04-2015

MKL for the coprocessor gets good performance only for large matrices, where it is able to interleave threads for data movement and computation. If you allow just one thread per multiplication, this can't happen. It would not be surprising if you got less performance than you would with one thread per core.

Offload is particularly inefficient for small problems. The Colfax documents discuss the ratio of operations per byte transferred that is needed to approach peak in-register performance. Note that the thresholds for automatic offload are orders of magnitude greater than the sizes you have set.
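
To put numbers on how small these multiplications are, here is a rough sketch using the M, K, N from the benchmark. Counting each matrix exactly once per call is a simplifying assumption, and the program name is just illustrative.

```fortran
program dgemm_size
  implicit none
  integer, parameter :: dp = kind(1d0)
  integer, parameter :: M = 21, K = 88, N = 12          ! sizes from the benchmark
  integer, parameter :: flops = 2 * M * N * K           ! one multiply + one add per term
  integer, parameter :: bytes = 8 * (M*K + K*N + M*N)   ! A, B and C in double precision

  print *, 'flops per DGEMM call   =', flops
  print *, 'bytes touched per call =', bytes
  print *, 'flops per byte         =', real(flops, dp) / real(bytes, dp)
end program dgemm_size
```

Each call does 44352 flops on about 25 KB of data, i.e. fewer than 2 flops per byte, so there is very little work over which to amortize any per-call overhead.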

PKM · ‎08-04-2015

Data transfer time is 100% negligible for what I am doing. I always spend several minutes inside the offload section, doing computations on a single data transfer.

Unfortunately, it is not possible for me to change the core parallelization. I am stuck with using a single thread for each DGEMM. So I guess you are saying the only way I can improve performance is to somehow crank up the matrix size?

VipinKumar_E_Intel · ‎08-04-2015

As Tim pointed out, the matrix sizes are too small for Phi.

Also, the Automatic Offload threshold is around 1280 or more for M and N, and greater than 256 for K.
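
As a quick illustration of the gap, the sketch below compares the benchmark sizes with those thresholds. The threshold values are the approximate ones quoted in this reply and are treated as assumptions here, not as exact MKL constants.

```fortran
program ao_threshold_check
  implicit none
  integer, parameter :: dp = kind(1d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: M = 21, K = 88, N = 12        ! benchmark sizes
  ! Approximate Automatic Offload thresholds as quoted in this thread (assumed)
  integer, parameter :: MN_min = 1280, K_min = 256
  integer(i8) :: flops_small, flops_ao

  flops_small = 2_i8 * M * N * K
  flops_ao    = 2_i8 * MN_min * MN_min * K_min

  print *, 'M, N above threshold: ', (M > MN_min), (N > MN_min)
  print *, 'K above threshold:    ', (K > K_min)
  print *, 'flops at threshold / flops in benchmark =', &
           real(flops_ao, dp) / real(flops_small, dp)
end program ao_threshold_check
```

All three dimensions fall below the thresholds, and a DGEMM at the threshold sizes does roughly 19,000 times as many flops as one 21x88x12 multiplication.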