Hi,
I was running two consecutive dgemm operations: T=AB and C=A'T, with A=(56,000x400,000), B=(400,000x30), T=(56,000x30), and C having the same shape as B.
Depending on the CPU, I measured these wall clock times (for the dgemm operations only):
Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz with 36 (real) cores, 46080 KB cache, 250GB of RAM
T=AB: 3.73 seconds,
C=A'T: 4.17 seconds
Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 56 (real) cores, 19712 KB cache, 2TB of RAM
T=AB: 91.47 seconds
C=A'T: 232.78 seconds
What was particularly striking was that T=AB used all 56 cores, whereas C=A'T used only half of them.
The KMP setting was: KMP_AFFINITY=compact,1,0,granularity=fine
I am wondering whether the poor performance of the Gold 5117 is solely attributable to its architecture and therefore set in stone, or whether I can tune MKL/KMP environment variables to increase performance.
Thanks
Hi,
56k x 30 x 400k in 3.73 seconds corresponds to 3603GFlop/s. Since the expected maximum DGEMM performance would be around 1400 - 1500 GFlop/s, that performance isn't realistic. Can you check the system condition and the computed results on the E5-2697 v4?
Xeon Gold performance is around 146GFlop/s, but that would be reasonable when taking the problem shape into account.
Thanks,
could you export MKL_VERBOSE=1 and give the output?
Hi Gennady
Thanks for the response.
Gennady F. (Intel) wrote:
> could you export MKL_VERBOSE=1 and give the output?
I'll post results once I have done it.
Hi Kazushige
Kazushige G. (Intel) wrote:
> Hi,
> 56k x 30 x 400k with 3.73 seconds corresponds to 3603GFlop/s. ...
Just to get the numbers right: 56k x 30 x 400k * 2 / 10^9 = 1344 GFLOP. Divided by 3.73 seconds, that is 360.3 GFlop/s, which is about 24% of the peak performance of 1500 GFlop/s.
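The same arithmetic as a minimal Python sketch, in case anyone wants to plug in other timings. The function names are made up for illustration, and the count uses the usual 2*m*n*k convention for a matrix product.

```python
def gemm_flop(m: int, n: int, k: int) -> int:
    """Floating-point operations for C = A*B with A (m x k) and B (k x n)."""
    return 2 * m * n * k


def gflops_per_second(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved rate in GFlop/s for one dgemm call that took `seconds`."""
    return gemm_flop(m, n, k) / seconds / 1e9


# T = A*B from this thread: A is 56,000 x 400,000 and B is 400,000 x 30.
total = gemm_flop(56_000, 30, 400_000)
rate = gflops_per_second(56_000, 30, 400_000, 3.73)
print(f"{total / 1e9:.0f} GFLOP, {rate:.1f} GFlop/s")  # 1344 GFLOP, 360.3 GFlop/s
```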
The computed results are correct.
What do you mean by system conditions?
The operating system is Arch Linux, kernel version 4.20.6.
Thanks
MKL_VERBOSE output for E5-2697 v4:
MKL_VERBOSE Intel(R) MKL 2017.0 Update 4 Product build 20170811 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.30GHz ilp64 intel_thread NMICDev:0
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffd287b7be8,0x148e2a60a280,56000,0x1469e63ff300,400000,0x7ffd287b7bf0,0x148cae0fd340,56000) 4.19s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:36 WDiv:HOST:+0.000
1: 5.03400000000000
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffd287b7be8,0x148e2a60a280,56000,0x148cae0fd340,56000,0x7ffd287b7bf0,0x1469e07fe380,400000) 4.81s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:36 WDiv:HOST:+0.000
2: 4.81500000000000
The output shows MKL_VERBOSE Intel(R) MKL 2017.
Could you try the latest MKL 2019 u1?
Hi, Thanks for the response.
I could not try 2019 u1 because the Fortran compiler is (still) super buggy (the same goes for 2018). I am just downloading 2019 update 2 and will try that. I'll get back to you then.
Cheers
OK, please try this. FYI, regarding the Fortran compiler: the latest 2019 u2 contains only some security updates. You can find more details in the release notes: https://software.intel.com/en-us/articles/intel-fortran-compiler-190-for-linux-release-notes-for-intel-parallel-studio-xe-2019
Hi,
so I made a stand-alone program which the ifort 19 compiler can cope with:
Program Test
  use blas95
  USE IFPORT
  implicit none
  integer(kind=8) :: nM,nG,nE
  integer(kind=4) :: istat, i
  character(len=200) :: msg
  Real(kind=8), allocatable :: A(:,:), B(:,:), C(:,:),T(:,:)
  real(kind=8) :: r1=0.0D0, r2=0.0D0
  outer:block
    nM=56000;nG=400000;nE=30;istat=0
    write(*,"(*(g0"",""))") nM,nG,nE
    r1=dclock()
    allocate(&
      &A(nM,nG),&
      &B(nG,nE),&
      &T(nM,nE),&
      &C(nG,nE),&
      &stat=istat,errmsg=msg)
    if(istat/=0) Then
      write(*,*) msg;exit outer
    end if
    !$OMP PARALLEL
    !$OMP DO
    Do i=1,nG
      A(:,i)=0.0D0
    end Do
    !$OMP END DO NOWAIT
    !$OMP DO
    Do i=1,nE
      B(:,i)=0.0D0
      T(:,i)=0.0D0
      C(:,i)=0.0D0
    end Do
    !$OMP END DO
    !$OMP END PARALLEL
    r2=Dclock()
    write(*,*) "alloc: ", r2-r1
    Do i=1,5
      r2=Dclock()
      call gemm(A=A,B=B,C=T)
      r1=dclock()
      write(*,*) "1: ",r1-r2
      call gemm(A=A,B=T,C=C,transa="T")
      r2=dclock()
      write(*,*) "2: ",r2-r1
    End Do
  End block outer
End Program Test
Further settings are:
MKL_VERBOSE=1
KMP_AFFINITY=compact,1,0,granularity=fine
The bash output from E5-2697 v4 is:
56000,400000,30,
OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
OPENMP DISPLAY ENVIRONMENT BEGIN
_OPENMP='201611'
[host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %T thread %n bound to OS proc set {%a}'
[host] OMP_ALLOCATOR='omp_default_mem_alloc'
[host] OMP_CANCELLATION='FALSE'
[host] OMP_DEBUG='disabled'
[host] OMP_DEFAULT_DEVICE='-10'
[host] OMP_DISPLAY_AFFINITY='FALSE'
[host] OMP_DISPLAY_ENV='TRUE'
[host] OMP_DYNAMIC='FALSE'
[host] OMP_MAX_ACTIVE_LEVELS='2147483647'
[host] OMP_MAX_TASK_PRIORITY='0'
[host] OMP_NESTED='TRUE'
[host] OMP_NUM_THREADS='72'
[host] OMP_PLACES: value is not defined
[host] OMP_PROC_BIND='intel'
[host] OMP_SCHEDULE='static'
[host] OMP_STACKSIZE='2000M'
OMP_TARGET_OFFLOAD=DEFAULT
[host] OMP_THREAD_LIMIT='2147483647'
[host] OMP_TOOL='enabled'
[host] OMP_TOOL_LIBRARIES: value is not defined
[host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END
alloc: 12.6054401397705
MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.30GHz ilp64 intel_thread
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d61d6240,400000,0x7ffc8ef496d0,0x14c8d54d5280,56000) 4.12s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:36
1: 4.77215790748596
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d54d5280,56000,0x7ffc8ef496d0,0x14c8cf8d42c0,400000) 4.64s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:36
2: 4.63893508911133
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d61d6240,400000,0x7ffc8ef496d0,0x14c8d54d5280,56000) 4.07s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:36
1: 4.07306599617004
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d54d5280,56000,0x7ffc8ef496d0,0x14c8cf8d42c0,400000) 4.63s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:36
2: 4.63029694557190
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d61d6240,400000,0x7ffc8ef496d0,0x14c8d54d5280,56000) 4.07s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:36
1: 4.07025289535522
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d54d5280,56000,0x7ffc8ef496d0,0x14c8cf8d42c0,400000) 4.63s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:36
2: 4.63470506668091
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d61d6240,400000,0x7ffc8ef496d0,0x14c8d54d5280,56000) 4.07s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:36
1: 4.07217407226562
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d54d5280,56000,0x7ffc8ef496d0,0x14c8cf8d42c0,400000) 4.63s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:36
2: 4.63222002983093
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d61d6240,400000,0x7ffc8ef496d0,0x14c8d54d5280,56000) 4.07s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:36
1: 4.06569814682007
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d54d5280,56000,0x7ffc8ef496d0,0x14c8cf8d42c0,400000) 4.63s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:36
2: 4.63313007354736
The bash output from Gold 5117 was:
56000,400000,30,
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined
OPENMP DISPLAY ENVIRONMENT BEGIN
_OPENMP='201611'
[host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %T thread %n bound to OS proc set {%a}'
[host] OMP_ALLOCATOR='omp_default_mem_alloc'
[host] OMP_CANCELLATION='FALSE'
[host] OMP_DEBUG='disabled'
[host] OMP_DEFAULT_DEVICE='-10'
[host] OMP_DISPLAY_AFFINITY='FALSE'
[host] OMP_DISPLAY_ENV='TRUE'
[host] OMP_DYNAMIC='FALSE'
[host] OMP_MAX_ACTIVE_LEVELS='2147483647'
[host] OMP_MAX_TASK_PRIORITY='0'
[host] OMP_NESTED='TRUE'
[host] OMP_NUM_THREADS='112'
[host] OMP_PLACES: value is not defined
[host] OMP_PROC_BIND='intel'
[host] OMP_SCHEDULE='static'
[host] OMP_STACKSIZE='2000M'
OMP_TARGET_OFFLOAD=DEFAULT
[host] OMP_THREAD_LIMIT='2147483647'
[host] OMP_TOOL='enabled'
[host] OMP_TOOL_LIBRARIES: value is not defined
[host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END
alloc: 18.7253189086914
MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.00GHz ilp64 intel_thread
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3190b43240,400000,0x7ffda0750ed0,0x2b3196744280,56000) 98.58s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
1: 98.8509511947632
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3196744280,56000,0x7ffda0750ed0,0x2b31974452c0,400000) 222.52s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
2: 222.525346994400
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3190b43240,400000,0x7ffda0750ed0,0x2b3196744280,56000) 28.31s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
1: 28.3082640171051
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3196744280,56000,0x7ffda0750ed0,0x2b31974452c0,400000) 73.63s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
2: 73.6357510089874
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3190b43240,400000,0x7ffda0750ed0,0x2b3196744280,56000) 112.45s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
1: 112.451642990112
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3196744280,56000,0x7ffda0750ed0,0x2b31974452c0,400000) 152.85s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
2: 152.847929000854
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3190b43240,400000,0x7ffda0750ed0,0x2b3196744280,56000) 23.87s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
1: 23.8693110942841
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3196744280,56000,0x7ffda0750ed0,0x2b31974452c0,400000) 53.01s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
2: 53.0122749805450
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3190b43240,400000,0x7ffda0750ed0,0x2b3196744280,56000) 43.80s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
1: 43.8020231723785
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3196744280,56000,0x7ffda0750ed0,0x2b31974452c0,400000) 97.87s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
2: 97.8659319877625
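For anyone comparing such logs, here is a rough Python sketch that pulls the timing and thread count out of the DGEMM lines above and converts them to GFlop/s. The regular expression only reflects the line layout seen in this thread. The handling of units other than `s` and the helper name `parse_dgemm` are my assumptions.

```python
import re

# Layout taken from the MKL_VERBOSE DGEMM lines in this thread; other
# routines or MKL versions may print something different (assumption).
DGEMM_RE = re.compile(
    r"MKL_VERBOSE DGEMM\((?P<transa>\w),(?P<transb>\w),"
    r"(?P<m>\d+),(?P<n>\d+),(?P<k>\d+),[^)]*\)\s+"
    r"(?P<time>[\d.]+)(?P<unit>ns|us|ms|s)\b.*?NThr:(?P<nthr>\d+)"
)
UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}


def parse_dgemm(line: str):
    """Return timing info for one MKL_VERBOSE DGEMM line, or None if no match."""
    match = DGEMM_RE.search(line)
    if match is None:
        return None
    m, n, k = (int(match[name]) for name in ("m", "n", "k"))
    seconds = float(match["time"]) * UNIT_SECONDS[match["unit"]]
    return {
        "transa": match["transa"],
        "seconds": seconds,
        "threads": int(match["nthr"]),
        "gflops_per_s": 2 * m * n * k / seconds / 1e9,
    }


# One line copied from the Gold 5117 output above.
sample = (
    "MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffda0750ec8,0x2b07d78c2200,"
    "56000,0x2b3190b43240,400000,0x7ffda0750ed0,0x2b3196744280,56000) "
    "28.31s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56"
)
result = parse_dgemm(sample)
print(result)
```

Feeding every line of a log through `parse_dgemm` and skipping the `None` results makes the run-to-run spread on the Gold 5117 (23.87 s to 112.45 s for the same N,N call) easy to tabulate.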
The compiler flags for the program were:
ifort -i8 -warn nounused -warn declarations -O3 -static -mkl=parallel -qopenmp -parallel -c -o OMP_MKLPARA_ifort_4.20.6-arch1-1-ARCH/Test.o Test.f90 -I /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/include/
ifort -i8 -warn nounused -warn declarations -O3 -static -mkl=parallel -qopenmp -parallel -o Test_OMP_MKLPARA_4.20.6-arch1-1-ARCH OMP_MKLPARA_ifort_4.20.6-arch1-1-ARCH/Test.o /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64/libmkl_blas95_ilp64.a /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64/libmkl_lapack95_ilp64.a -Wl,--start-group /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64/libmkl_intel_ilp64.a /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64/libmkl_core.a /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -ldl
The Gold 5117 runs CentOS 7, kernel version 3.10.0-862.el7.x86_64.
The E5-2697 v4 runs Arch Linux, kernel version 4.20.6-arch1-1-ARCH.
If you need more information let me know.
cheers
> Just to get the numbers right: 56k x 30 x 400k * 2 / 10^9=1344 GFLOPS. Divided by 3.73 seconds equals 360.3 GFlops per
Sorry, I made a mistake. Yes, 360GFlop/s is the correct number. Can you then check the system condition of the Xeon Gold? This operation requires high bandwidth to main memory. The STREAM benchmark (https://www.cs.virginia.edu/stream/) is useful for checking this:
icc -qopenmp -o stream stream.c
export KMP_AFFINITY=compact,1,0,granularity=fine
numactl --cpunodebind=0 --membind=0 ./stream
numactl --cpunodebind=0 --membind=1 ./stream
numactl --cpunodebind=1 --membind=0 ./stream
numactl --cpunodebind=1 --membind=1 ./stream
Please check "Copy" performance. Expected performance is around 80 - 95GB/sec for local memory access, and 40 - 50GB/sec for remote access.
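To illustrate why memory bandwidth matters for this problem shape, here is a back-of-the-envelope Python sketch. It assumes each dgemm has to read A from main memory exactly once and ignores the traffic for B, T and C, so it gives only a crude lower bound for a single memory domain. The helper names are made up.

```python
BYTES_PER_DOUBLE = 8


def matrix_bytes(rows: int, cols: int) -> int:
    """Size of a dense double-precision matrix in bytes."""
    return rows * cols * BYTES_PER_DOUBLE


def min_stream_time(rows: int, cols: int, bandwidth_gb_per_s: float) -> float:
    """Lower bound (seconds) to read the matrix once at the given bandwidth."""
    return matrix_bytes(rows, cols) / (bandwidth_gb_per_s * 1e9)


# A is 56,000 x 400,000 in this thread; n = 30 columns in B.
a_bytes = matrix_bytes(56_000, 400_000)
intensity = (2 * 56_000 * 30 * 400_000) / a_bytes
print(f"A occupies {a_bytes / 1e9:.1f} GB, about {intensity:.1f} flop per byte of A")

# Bandwidths taken from the expected STREAM ranges quoted above.
for bw in (40, 80, 95):
    t = min_stream_time(56_000, 400_000, bw)
    print(f"{bw} GB/s -> at least {t:.2f} s per dgemm")
```

With only 30 columns in B there are few floating-point operations per byte of A, which is why the achievable rate depends so strongly on how fast A can be streamed from memory.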
> numactl --cpunodebind=0 --membind=0 ./stream
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           54365.6     0.002975     0.002943     0.002999
> numactl --cpunodebind=0 --membind=1 ./stream
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           23422.1     0.007045     0.006831     0.007223
> numactl --cpunodebind=1 --membind=0 ./stream
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           23025.1     0.007101     0.006949     0.007346
> numactl --cpunodebind=1 --membind=1 ./stream
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           49765.6     0.003226     0.003215     0.003239
> Please check "Copy" performance. Expected performance is around 80 - 95GB/sec for local memory access, and 40 - 50GB/sec for remote access.
About half of what you expect, possibly because not all RAM slots are populated. The machine runs 32 x 64 GB modules. dmidecode lists a repeating pattern of modules in set1(A1,A2,A4,A5), set2(A7,A8,A10,A11), set3(B1,B2,B4,B5), set4(B7,B8,B10,B11), set5(C1,C2,C4,C5), set6(C7,C8,C10,C11), set7(D1,D2,D4,D5), set8(D7,D8,D10,D11). This is the configuration as delivered by Dell.
Also interesting:
> numactl --cpunodebind=1 --membind=3 ./stream_c
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           17097.4     0.009394     0.009358     0.009473
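To cover every CPU-node/memory-node pair instead of a few hand-picked ones, the command lines can be generated. This is a small Python sketch assuming the machine exposes 4 NUMA nodes numbered 0 to 3 (inferred from `--membind=3` working here; `numactl --hardware` shows the real layout). The helper name and default binary path are placeholders.

```python
from itertools import product


def stream_commands(num_nodes: int, binary: str = "./stream") -> list:
    """Build one numactl command line per (cpu node, memory node) pair."""
    return [
        f"numactl --cpunodebind={cpu} --membind={mem} {binary}"
        for cpu, mem in product(range(num_nodes), repeat=2)
    ]


# Assumed node count of 4; adjust after checking `numactl --hardware`.
commands = stream_commands(4)
for cmd in commands:
    print(cmd)
```

Running the 16 resulting commands and collecting the "Copy" rate from each gives a full local/remote bandwidth matrix for the box.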
How many UPI links does the 5117 really have? ark.intel.com (it's not on the Xeon Scalable processor list, but Google can find its page) says 2, wikichip.org says 3. In a quad-socket configuration, does it have to work like a ring?