Hi,
I have implemented an embarrassingly parallel part of a code for Xeon Phi offload, and its performance ends up bottlenecked by matrix-matrix multiplications. On regular Xeons I get decent performance, but on the Xeon Phi the result is horrible. I have written the small benchmark program below, which captures the essence of the problem quite well.
Basically, I end up running DGEMMs at 76 GFlops on a single Xeon E5-2650 v2 but only 51 GFlops on a Xeon Phi 3120. In other words, around 50% of peak performance on the Xeon, but only about 5% on the Phi... I thought the Phi was supposed to shine at DGEMM?
program Console1
  use omp_lib
  use ifport
  implicit none
  integer :: NReps, I, M, K, N, NThreads
  parameter (M=21, K=88, N=12)
  real*8 :: AMat(M,K)
  real*8 :: BMat(K,N)
  real*8 :: CMat(M,N)
  save AMat, BMat, CMat
  real*8 :: TBegin, TEnd, Sum, Alpha, Beta
  logical :: Success
  !$OMP THREADPRIVATE(AMat,BMat,CMat)
  !dir$ attributes offload : mic :: AMat,BMat,CMat,NReps,I,TBegin,TEnd,Sum,Alpha,Beta
  !DIR$ ATTRIBUTES ALIGN : 64 :: AMat,BMat,CMat

  ! Common settings
  Success = SETENVQQ("KMP_AFFINITY=scatter")
  Alpha = 1.0
  Beta  = 0.0

  ! CPU benchmark: one independent DGEMM stream per thread
  NThreads = 8
  NReps = 1000000
  call OMP_SET_NUM_THREADS(NThreads)
  TBegin = OMP_GET_WTIME()
  Sum = 0.0
  !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(NReps,Alpha,Beta) REDUCTION(+:Sum)
  AMat(:,:) = 0.01
  BMat(:,:) = 0.02
  do i = 1, NReps
    !CMat = matmul(AMat,BMat)
    CALL DGEMM('N','N',M,N,K,Alpha,AMat,M,BMat,K,Beta,CMat,M)
    ! Feed the result back into the inputs so the loop cannot be optimized away
    AMat(1,1) = CMat(1,1)*1.01
    BMat(1,1) = CMat(1,2)*1.02
  end do
  Sum = Sum + CMat(1,1) + AMat(1,1) + BMat(1,1)
  !$OMP END PARALLEL
  TEnd = OMP_GET_WTIME()
  print *, 'Sum=', Sum
  print *, 'CPU GFlops=', 2.0*M*N*K*NReps*NThreads/((TEnd-TBegin)*1d9)

  ! MIC benchmark: same loop, offloaded to the coprocessor
  NThreads = 224
  NReps = 100000
  !DIR$ OFFLOAD BEGIN TARGET(mic:0) INOUT(AMat,BMat,CMat,NReps,TBegin,TEnd,Sum,NThreads)
  call OMP_SET_NUM_THREADS(NThreads)
  TBegin = OMP_GET_WTIME()
  Sum = 0.0
  !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(NReps,Alpha,Beta) REDUCTION(+:Sum)
  AMat(:,:) = 0.01
  BMat(:,:) = 0.02
  do i = 1, NReps
    !CMat = matmul(AMat,BMat)
    CALL DGEMM('N','N',M,N,K,Alpha,AMat,M,BMat,K,Beta,CMat,M)
    AMat(1,1) = CMat(1,1)*1.01
    BMat(1,1) = CMat(1,2)*1.02
  end do
  Sum = Sum + CMat(1,1) + AMat(1,1) + BMat(1,1)
  !$OMP END PARALLEL
  TEnd = OMP_GET_WTIME()
  !DIR$ END OFFLOAD
  print *, 'Sum=', Sum
  print *, 'MIC GFlops=', 2.0*M*N*K*NReps*NThreads/((TEnd-TBegin)*1d9)
end program Console1
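For reference, each DGEMM call at these sizes performs 2*M*N*K = 2*21*12*88 = 44,352 floating-point operations, which is exactly what the GFlops expressions in the two print statements count.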
MKL on the coprocessor gets good performance only for large matrices, where it can interleave threads for data movement and computation. Since you allot just one thread to each multiplication, that can't happen. It would not be surprising to get less performance than you would with one thread per core.
Offload is particularly inefficient for small problems. The Colfax documents discuss the ratio of operations per byte transferred that is needed to approach peak in-register performance. You will note that the thresholds for Automatic Offload are orders of magnitude greater than the sizes you have set.
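For completeness, if you did want MKL to decide about offload itself, here is a minimal host-side sketch of enabling Automatic Offload. It assumes the standard MKL_MIC_ENABLE and MKL_MIC_WORKDIVISION environment variables, which MKL reads at initialization, and uses the same SETENVQQ style as the benchmark above:

! Sketch only: enable MKL Automatic Offload before the first MKL call.
! AO kicks in only above the size thresholds discussed in this thread,
! so it would not help at M=21, N=12, K=88.
Success = SETENVQQ("MKL_MIC_ENABLE=1")          ! let MKL offload on its own
Success = SETENVQQ("MKL_MIC_WORKDIVISION=1.0")  ! optionally push all AO work to the card
! Plain host-side call; MKL decides whether to run it on the coprocessor:
CALL DGEMM('N','N',M,N,K,Alpha,AMat,M,BMat,K,Beta,CMat,M)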
Data transfer time is completely negligible for what I am doing: I always spend several minutes inside the offload section, doing computations on a single data transfer.
Unfortunately, it is not possible for me to change the core parallelization; I am stuck with using a single thread for each DGEMM. So I guess you are saying the only way I can improve performance is if I can somehow crank up the matrix size?
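One way to crank up the effective size without changing the math exists when several of the small multiplications share the same BMat: stack the A matrices row-wise and replace many tiny DGEMMs with one larger call. A hypothetical sketch follows; NBatch, ABig, and CBig are illustrative names, not from the program above:

! Sketch: fuse NBatch small (M x K)*(K x N) products that share BMat
! into a single (NBatch*M x K)*(K x N) product.
integer, parameter :: NBatch = 64
real*8 :: ABig(NBatch*M, K)   ! the NBatch small A-matrices stacked row-wise
real*8 :: CBig(NBatch*M, N)   ! stacked results; row block j is the j-th small C
! ... copy the j-th small A into ABig(1+(j-1)*M : j*M, :) ...
CALL DGEMM('N','N', NBatch*M, N, K, Alpha, ABig, NBatch*M, BMat, K, Beta, CBig, NBatch*M)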
As Tim pointed out, the matrix sizes are too small for the Phi.
Also, the Automatic Offload thresholds are around 1280 or more for M and N, and greater than 256 for K.
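For comparison, the benchmark's M=21 and N=12 are roughly 60x and 100x below that 1280 threshold, and K=88 is well below 256, so Automatic Offload would never trigger at these sizes.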