Benchmark With Broadwell

Atul_Y_ · 03-13-2018

Hi Team,

I need help getting the best HPL (Linpack) result on the following system:

Processor: E5-2697 v4 @ 2.30 GHz, AVX frequency 2.00 GHz

Theoretical peak per node (36 cores, 16 FLOP/cycle per core):
2.3 * 36 * 16 = 1324.8 GFLOPS (base frequency)
2.0 * 36 * 16 = 1152 GFLOPS (AVX frequency)
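
The peak figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the post: it assumes two 18-core sockets per node (the 36 in the formula) and 16 double-precision FLOP per cycle per core, which on Broadwell corresponds to two AVX2 FMA units, each operating on 4 doubles and counting a multiply and an add. The function name `peak_gflops` is made up for illustration.

```python
# Theoretical double-precision peak for one dual-socket E5-2697 v4 node.
# Inputs are taken from the post; the helper name is illustrative only.

def peak_gflops(ghz: float, cores: int, flops_per_cycle: int = 16) -> float:
    """Peak GFLOPS = clock (GHz) x cores x FLOP per cycle per core."""
    return ghz * cores * flops_per_cycle

CORES_PER_NODE = 2 * 18  # assumption: two 18-core sockets per node

base_peak = peak_gflops(2.3, CORES_PER_NODE)  # nominal (base) frequency
avx_peak = peak_gflops(2.0, CORES_PER_NODE)   # AVX frequency quoted in the post

print(f"base-frequency peak per node: {base_peak:.1f} GFLOPS")
print(f"AVX-frequency peak per node:  {avx_peak:.1f} GFLOPS")
```

Because HPL spends almost all of its time in AVX2 code, the AVX-frequency figure is the more realistic ceiling to compare measured results against.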

Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)
Linux master.local 3.10.0-693.5.2.el7.x86_64
CentOS Linux release 7.4.1708 (Core)

Two Node Result

================================================================================
T/V                N    NB     P     Q               Time             Gflops
--------------------------------------------------------------------------------
WR11C2R4      231168   192     8     9            7936.44        1.03770e+03

mpirun -print-rank-map -np 72 -genv I_MPI_DEBUG 5 -genv I_MPI_FALLBACK_DEVICE 0 -genv I_MPI_FABRICS shm:dapl --machinefile $PBS_NODEFILE /opt/apps/intel/mkl/benchmarks/mp_linpack/xhpl_intel64_static

Single node Performance
================================================================================
T/V                N    NB     P     Q               Time             Gflops
--------------------------------------------------------------------------------
WR11C2R4      163200   192     6     6            4123.17        7.02820e+02
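
To put the two results in context, the sketch below computes HPL efficiency as measured GFLOPS divided by theoretical peak, using the figures from this post (36 cores per node, 16 FLOP/cycle per core, 2.3 GHz base and 2.0 GHz AVX frequency). The helper names are hypothetical and the script does nothing beyond this arithmetic.

```python
# Efficiency of the reported HPL runs relative to theoretical peak.
# All inputs are copied from the post; helper names are illustrative.

FLOPS_PER_CYCLE = 16  # double-precision FLOP per cycle per core (AVX2 FMA)
CORES_PER_NODE = 36   # two 18-core sockets


def peak_gflops(ghz: float, nodes: int) -> float:
    """Theoretical peak in GFLOPS for `nodes` identical nodes."""
    return ghz * CORES_PER_NODE * FLOPS_PER_CYCLE * nodes


def efficiency(measured_gflops: float, ghz: float, nodes: int) -> float:
    """Fraction of theoretical peak that the run achieved."""
    return measured_gflops / peak_gflops(ghz, nodes)


# (label, node count, measured GFLOPS from the HPL output)
runs = [("two nodes", 2, 1037.70), ("single node", 1, 702.82)]

for label, nodes, gflops in runs:
    base = efficiency(gflops, 2.3, nodes)
    avx = efficiency(gflops, 2.0, nodes)
    print(f"{label}: {base:.1%} of base-frequency peak, "
          f"{avx:.1%} of AVX-frequency peak")
```

By this measure the single-node run reaches roughly 53% of the base-frequency peak and the two-node run roughly 39%, and the second node adds only about 1.48 times the single-node throughput.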

Need your support

Thank You

McCalpinJohn · 03-13-2018

Out of curiosity, where did you find the AVX frequency specification for this processor? I can find the information for Xeon E5 v3 and for Xeon Scalable processors, but I can't find it for Xeon E5 v4 processors.

My guess would be that your performance is low because you are using too many MPI ranks. Using the hybrid mode with one MPI task per socket typically gives the best performance. I can't recall whether or not you need to specify the thread count for this version of the code, but I would go ahead and set MKL_NUM_THREADS=18 to explicitly request one thread per core. (The code looks at a number of variables for thread count, but I have not seen any documentation of the precedence/priority of these variables, or whether there are differences in semantics.)
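
As a rough illustration of the hybrid layout suggested above, the following sketch derives the rank count, thread count, and HPL process grid for one MPI rank per socket and assembles a matching `mpirun` argument list. Several things here are assumptions rather than anything stated in this thread: the helper names are invented, the `-ppn` option and the `I_MPI_PIN_DOMAIN=socket` setting are Intel MPI features added for illustration, and the near-square grid with P <= Q is only the commonly recommended starting point. The P and Q values still have to be set to match in HPL.dat, and actual rank placement can depend on the batch scheduler and machine file.

```python
import math

# Sketch: one MPI rank per socket, one MKL thread per core.
# Node geometry (2 sockets x 18 cores) is taken from the thread.


def hybrid_layout(nodes: int, sockets_per_node: int = 2,
                  cores_per_socket: int = 18) -> dict:
    """Return rank/thread counts and a near-square P x Q grid with P <= Q."""
    ranks = nodes * sockets_per_node
    # Largest divisor of `ranks` not exceeding sqrt(ranks) gives P <= Q.
    p = max(d for d in range(1, math.isqrt(ranks) + 1) if ranks % d == 0)
    return {
        "ranks": ranks,
        "ppn": sockets_per_node,      # MPI ranks per node
        "threads": cores_per_socket,  # value for MKL_NUM_THREADS
        "P": p,
        "Q": ranks // p,
    }


def mpirun_args(layout: dict, binary: str) -> list:
    """Build an Intel MPI command line for the layout (illustrative only)."""
    return [
        "mpirun", "-np", str(layout["ranks"]), "-ppn", str(layout["ppn"]),
        "-genv", "MKL_NUM_THREADS", str(layout["threads"]),
        "-genv", "I_MPI_PIN_DOMAIN", "socket",  # assumption: pin each rank to a socket
        binary,
    ]


# Binary path as used in the original launch command.
XHPL = "/opt/apps/intel/mkl/benchmarks/mp_linpack/xhpl_intel64_static"

for nodes in (1, 2):
    layout = hybrid_layout(nodes)
    print(f"{nodes} node(s): P={layout['P']} Q={layout['Q']}")
    print(" ".join(mpirun_args(layout, XHPL)))
```

For the two-node case this gives 4 ranks in a 2 x 2 grid with 18 threads each, instead of the 72 single-threaded ranks in an 8 x 9 grid used in the original run.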