Intel® MPI Library

Benchmark With Broadwell

Atul_Y_

Hi Team,

I need help getting closer to peak HPL (MP LINPACK) performance on the following system:

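CPU: Xeon E5-2697 v4 @ 2.30 GHz (base), 2.00 GHz (AVX), 36 cores per node

Theoretical peak per node (GHz * cores * FLOP/cycle):
2.3 * 36 * 16 = 1324.8 GFLOPS  ( base frequency )
2.0 * 36 * 16 = 1152.0 GFLOPS  ( AVX frequency )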

Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)
Linux master.local 3.10.0-693.5.2.el7.x86_64
CentOS Linux release 7.4.1708 (Core)

Two-node result:

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      231168   192     8     9            7936.44            1.03770e+03

Command used for the two-node run (72 MPI ranks):

mpirun -print-rank-map -np 72 -genv I_MPI_DEBUG 5 -genv I_MPI_FALLBACK_DEVICE 0 -genv I_MPI_FABRICS shm:dapl --machinefile $PBS_NODEFILE /opt/apps/intel/mkl/benchmarks/mp_linpack/xhpl_intel64_static


Single-node result:
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      163200   192     6     6            4123.17            7.02820e+02
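
For reference, the efficiency implied by these numbers can be worked out with a short script. This is only a sketch: the function names are illustrative, the 16 FLOP/cycle/core and 36 cores per node come from the peak formulas in this post, and the 2.0 GHz AVX frequency is the figure quoted here, not a confirmed specification.

```python
# Sketch: HPL efficiency relative to theoretical peak.
# Assumptions (taken from this post, not from a datasheet):
#   36 cores per node, 16 double-precision FLOP/cycle/core,
#   2.3 GHz base frequency, 2.0 GHz AVX frequency.

def peak_gflops(freq_ghz, cores_per_node=36, flops_per_cycle=16, nodes=1):
    """Theoretical peak in GFLOPS: GHz * cores * FLOP/cycle * nodes."""
    return freq_ghz * cores_per_node * flops_per_cycle * nodes


def efficiency(measured_gflops, peak):
    """Fraction of theoretical peak that the benchmark reached."""
    return measured_gflops / peak


# Measured HPL results from the two runs: (node count, GFLOPS).
runs = {"single node": (1, 702.820), "two nodes": (2, 1037.70)}

for label, (nodes, measured) in runs.items():
    for name, ghz in (("base 2.3 GHz", 2.3), ("AVX 2.0 GHz", 2.0)):
        peak = peak_gflops(ghz, nodes=nodes)
        pct = 100.0 * efficiency(measured, peak)
        print(f"{label:12s} {name:13s} peak {peak:7.1f} GFLOPS  efficiency {pct:5.1f}%")
```

Against the AVX-frequency peak this comes to roughly 61% on one node and 45% on two nodes, so the two-node run loses noticeably more than the single-node run.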

 

Need your support.

Thank you.

 

McCalpinJohn

Out of curiosity, where did you find the AVX frequency specification for this processor? I can find the information for Xeon E5 v3 and for Xeon Scalable processors, but I can't find it for Xeon E5 v4 processors.

My guess is that your performance is low because you are using too many MPI ranks: 72 ranks on two nodes is one rank per core. Hybrid mode with one MPI task per socket typically gives the best performance. I can't recall whether you need to specify the thread count for this version of the code, but I would set MKL_NUM_THREADS=18 to explicitly request one thread per core. (The code looks at a number of variables for thread count, but I have not seen any documentation of the precedence/priority of these variables, or of whether their semantics differ.)
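
A rough sketch of what that hybrid layout could look like, written as a small helper that prints a candidate launch line and a matching process grid. The assumptions are mine: two sockets per node with 18 cores each, helper names that are purely illustrative, and `-ppn` plus `I_MPI_PIN_DOMAIN=socket` as the way to place one rank per socket. I have not checked these against this 2017 Update 3 build or a PBS node file. The P x Q grid in the HPL input has to multiply out to the new rank count.

```python
import math
import shlex


def process_grid(ranks):
    """Return (P, Q) with P * Q == ranks and P <= Q, as close to square as possible."""
    p = math.isqrt(ranks)
    while ranks % p:
        p -= 1
    return p, ranks // p


def hybrid_command(nodes, binary, sockets_per_node=2, cores_per_socket=18):
    """Build a candidate mpirun line: one MPI rank per socket, one MKL thread per core.

    Assumed layout: 2 sockets x 18 cores per node. Options such as
    --machinefile from the original command are left out of this sketch.
    """
    ranks = nodes * sockets_per_node
    args = [
        "mpirun", "-np", str(ranks), "-ppn", str(sockets_per_node),
        "-genv", "I_MPI_PIN_DOMAIN", "socket",               # assumed pinning choice
        "-genv", "MKL_NUM_THREADS", str(cores_per_socket),   # one thread per core
        "-genv", "I_MPI_FABRICS", "shm:dapl",                # as in the original command
        binary,
    ]
    return " ".join(shlex.quote(a) for a in args)


binary = "/opt/apps/intel/mkl/benchmarks/mp_linpack/xhpl_intel64_static"
for nodes in (1, 2):
    ranks = nodes * 2
    p, q = process_grid(ranks)
    print(f"{nodes} node(s): {ranks} ranks, P x Q = {p} x {q}")
    print(hybrid_command(nodes, binary))
```

The same grid helper reproduces the grids from the original runs (6 x 6 for 36 ranks, 8 x 9 for 72 ranks), so only the rank count changes, not the way the grid is chosen.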
