Beginner
94 Views

How to run Intel® Optimized MP LINPACK Benchmark on KNL platform?

My KNL platform is based on the Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz: 1 node, 64 cores, 64 GB memory. I have some problems with the LINPACK benchmark.

Before trying the Intel® Optimized MP LINPACK Benchmark for Clusters, I used HPL 2.2 and the Intel Optimized MP LINPACK Benchmark, and the results were poor: the best I got was 486 GFlops with HPL 2.2 and 683.6404 GFlops with the Intel Optimized MP LINPACK Benchmark. However, the theoretical peak performance is 1*64*1.3*32 = 2662.4 GFlops.
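That peak figure is just cores × frequency × DP FLOPs per cycle (on KNL: 2 VPUs × 8 DP lanes × 2 ops per FMA = 32 FLOPs/cycle). A one-liner to reproduce the arithmetic:

```shell
# Theoretical peak = nodes x cores x GHz x DP FLOPs/cycle
# (KNL: 2 VPUs x 8 DP lanes x 2 ops per FMA = 32 FLOPs/cycle)
awk 'BEGIN { printf "%.1f GFlops\n", 1 * 64 * 1.3 * 32 }'
```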

So I am confused. It looks as if AVX-512 is not being used. What am I doing wrong?

In the HPL 2.2 test I set N=82800, NB=336, P=4, Q=16 and ran "mpiexec -n 64 ./xhpl"; this gave my best HPL 2.2 result (486 GFlops). I also tried N=82800, NB=336, P=8, Q=32 with "mpiexec -n 256 ./xhpl", but there was not enough memory and the result was low.
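For what it's worth, N=82800 matches the usual sizing rule of thumb: make the matrix fill roughly 80% of memory, N ≈ sqrt(0.8 × mem_bytes / 8). The 80% factor here is a conventional assumption, not something from this thread:

```shell
# N ~ sqrt(0.80 * memory_bytes / 8 bytes per double); 64 GiB of RAM
awk 'BEGIN { mem = 64 * 1024^3; printf "N ~ %d\n", sqrt(0.80 * mem / 8) }'
```

which lands very close to the N=82800 used above.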

I am now trying the Intel® Optimized MP LINPACK Benchmark for Clusters, but I am having trouble running it. If I run a small test, such as

mpiexec -np 8 ./xhpl -n 10000 -b 336 -p 2 -q 4

I can get a result.

Even if I enlarge N and NB, such as

mpiexec -np 32 ./xhpl -n 83000 -b 336 -p 4 -q 8

I can get a result too.

But when I set p*q to 64 or more, a problem occurs.

[root@knl mp_linpack]$ mpiexec -np 64 ./xhpl -n 83000 -b 336 -p 4 -q 16
Number of Intel(R) Xeon Phi(TM) coprocessors : 0
Rank 0: First 5 column_factors=1 1 1 1 1
HPL[knl] pthread_create Error in HPL_pdupdate.

The test exits immediately.

So what should I do to get a higher LINPACK result?

Thanks!

0 Kudos
16 Replies
Employee

Could you try running the benchmark without the mpiexec? On a single node, we do not need to use multiple MPI processes to get the best performance. You could try something like this:

./xhpl -n 83000 -b 336

Then, when you go to multi-node, please use 1 MPI process per node for the KNL systems.

Thank you.

 

Beginner

Dear all,

I have a problem with the results of MKL MP_LINPACK. My system has 24 compute nodes, each with an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz plus a Xeon Phi Q7200, and 256 GB of RAM. When I run ./runme_intel64 on each node, the performance is good, ~700-900 GFlops (Xeon CPU only).

But when I run HPL on 4 nodes, 8 nodes or more, the results are very bad, and sometimes it fails to return a result, with the error "MPI TERMINATED, ...". Afterwards, if I run the test (runme_intel64) on each node again, the performance is very low:

~ 11,243 GFLops,

~ 10,845 GFlops,

....

I don't know the reason why. My guess is the cluster's power supply (it is not enough for the whole system), combined with the HPE BIOS being configured in Balanced Mode for the cluster (it automatically switches to a lower power mode when the system cannot draw enough power). But even when I run on just some of the nodes with the power configured to maximum, the problem is still not solved.

Please help me with this problem. Thank you all!

Minh Thanh Chung
Beginner

Thank you for your answer.

I followed your advice and tried (in the mp_linpack working directory)

./xhpl -n 83000 -b 336

and the result I got was 716.506 GFlops.

This is the best result I have ever had, but the theoretical peak of a single Intel(R) Xeon Phi(TM) CPU 7210 node is 2662.4 GFlops. There is still a big gap between this result and the theoretical peak.

While the LINPACK test was running, I watched it with monitoring software and found the CPU utilization was only about 25%. I read in some material that when running HPL on this platform I can use all 256 hardware threads, but the "top" command shows the "xhpl" process using 6400% CPU (i.e. 64 threads). I tried setting the environment with

export OMP_NUM_THREADS=256
export MKL_NUM_THREADS=256

to change the number of threads, but it failed; the result did not change.

I remember that when I used "mpiexec -np 64 ./xhpl" in HPL 2.2, the program created 64 processes, each using 100% CPU, and with "mpiexec -np 256 ./xhpl" it created 256 processes, each using 100% CPU, although neither gave the ideal result.

It seems that this process used only 64 threads. How can I use all of the threads? Or what should I do to get a higher LINPACK result?

And another question: is the error below caused by using multiple MPI processes?

[root@knl mp_linpack]$ mpiexec -np 64 ./xhpl -n 83000 -b 336 -p 4 -q 16
Number of Intel(R) Xeon Phi(TM) coprocessors : 0
Rank 0: First 5 column_factors=1 1 1 1 1
HPL[knl] pthread_create Error in HPL_pdupdate.

Thanks.

Valued Contributor II

>>... I learned from a material that When I running HPL on this platform, I can use all 256 threads...

Top performance on your KNL system will be reached when only 64 cores and 64 OpenMP threads are used (spread across all cores). That is,

export OMP_NUM_THREADS=64
export MKL_NUM_THREADS=64

need to be executed instead.

Also, try setting the KMP_AFFINITY environment variable:

export KMP_AFFINITY=scatter
or
export KMP_AFFINITY=scatter,verbose

With KMP_AFFINITY set to compact or balanced mode, performance can be worse than in scatter mode. I recommend testing all of these modes.
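A minimal script collecting these settings in one place (the xhpl arguments are taken from earlier in the thread; the echo is only a sanity check, and the actual run line is left commented out):

```shell
# One OpenMP/MKL thread per physical core, spread across all cores
export OMP_NUM_THREADS=64
export MKL_NUM_THREADS=64
export KMP_AFFINITY=scatter    # also compare compact and balanced
echo "threads=$OMP_NUM_THREADS affinity=$KMP_AFFINITY"
# ./xhpl -n 83000 -b 336
```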
Beginner

Thank you for your answer.

In a training course I learned that "KNL supports 4 threads per core; in other words, this 7210 KNL can run up to 256 MPI threads" and that "this hardware setting is currently the best for the HPL test."

21223.jpg

That is why I said I can use all 256 threads.

And I tried to set 

export OMP_NUM_THREADS=64
export MKL_NUM_THREADS=64

and ran the Intel® Optimized MP LINPACK Benchmark for Clusters (mp_linpack), but the result is still bad.

I googled KMP_AFFINITY and found that it is an environment variable for OpenMP, not MPI. My software environment is Intel Composer, Intel MPI and Intel MKL.

I also tried it, but it still does not seem to work. (The results fluctuate somewhat, but there is still a big gap between them and the theoretical peak.)

 

 

I have some questions.

1. As you say,

"A top performance on your KNL system will be when only 64 cores and 64 OpenMP threads are used ( spread across all cores )"

So how should I run mp_linpack (the Intel® Optimized MP LINPACK Benchmark for Clusters, https://software.intel.com/en-us/node/528619)? "./xhpl -n 83000 -b 336", or "mpiexec -np 64 ./xhpl -n 83000 -b 336 -p 4 -q 16", or something else?

2. I also tested HPL 2.2 and the Intel® Optimized LINPACK Benchmark for Linux* (which runs on a single platform, https://software.intel.com/en-us/node/528615), but the results are still not good.

For HPL 2.2, do you know how it should be run?

"mpiexec -np 64 ./xhpl" and in HPL.dat, N=83000 Nb=336 P=4 Q=16

or

 "mpiexec -np 256 ./xhpl" and in HPL.dat, N=83000 Nb=336 P=8 Q=32

or other?

Also, the Intel® Optimized LINPACK Benchmark for Linux* Developer Guide gives only a brief introduction, with no description of the input files. How should I test it?

3. The theoretical peak of a single Intel(R) Xeon Phi(TM) CPU 7210 node is 2662.4 GFlops, but the best result I got is 716.506 GFlops. This confuses me a lot. How can I get close to the theoretical peak?
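On the P and Q question above: a common heuristic is to make the process grid as close to square as possible, with P ≤ Q. A small sketch that picks such a grid for a given rank count (this is only a heuristic, not a guarantee; network topology can favor other shapes):

```shell
# Pick P,Q with P*Q = np, P <= Q, and P as close to sqrt(np) as possible
np=64
best_p=1
for p in $(seq 1 "$np"); do
  if [ $((np % p)) -eq 0 ] && [ "$p" -le $((np / p)) ]; then
    best_p=$p    # largest divisor of np not exceeding sqrt(np)
  fi
done
echo "P=$best_p Q=$((np / best_p))"
```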

Thanks.

Beginner

I used the official version of hpl-2.2 on a dual-node Xeon Phi 7230 HPC system three months ago, and I reached 3888.69 GFlops (2123 GFlops on a single node), so I think I have some knowledge of configuring and optimizing hpl-2.2 on the KNL platform. I'm sorry to see you got bad performance (486 GFlops), but I suspect it was caused by poor compile-time optimization. Show me your specific Make.intel64 configuration; maybe I can help you. Who knows?

Beginner

Thank you for your answer.

My KNL platform is based on the Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz: 1 node, 64 cores and 64 GB memory (plus perhaps the extra 16 GB of MCDRAM in KNL?).

My software environment is Intel composer, Intel MPI and Intel MKL. 

My best HPL 2.2 result is 486 GFlops, with N=82800, NB=336, P=4, Q=16 in HPL.dat.
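For reference, those numbers correspond to an HPL.dat along these lines (the header and output lines follow the stock HPL template; the output-file settings are assumptions, and the remaining algorithmic lines are left at their defaults):

```
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
82800        Ns
1            # of NBs
336          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
16           Qs
16.0         threshold
```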

Here is my Make.intel64 file:

#  
#  -- High Performance Computing Linpack Benchmark (HPL)                
#     HPL - 2.2 - February 24, 2016                          
#     Antoine P. Petitet                                                
#     University of Tennessee, Knoxville                                
#     Innovative Computing Laboratory                                 
#     (C) Copyright 2000-2008 All Rights Reserved                       
#                                                                       
#  -- Copyright notice and Licensing terms:                             
#                                                                       
#  Redistribution  and  use in  source and binary forms, with or without
#  modification, are  permitted provided  that the following  conditions
#  are met:                                                             
#                                                                       
#  1. Redistributions  of  source  code  must retain the above copyright
#  notice, this list of conditions and the following disclaimer.        
#                                                                       
#  2. Redistributions in binary form must reproduce  the above copyright
#  notice, this list of conditions,  and the following disclaimer in the
#  documentation and/or other materials provided with the distribution. 
#                                                                       
#  3. All  advertising  materials  mentioning  features  or  use of this
#  software must display the following acknowledgement:                 
#  This  product  includes  software  developed  at  the  University  of
#  Tennessee, Knoxville, Innovative Computing Laboratory.             
#                                                                       
#  4. The name of the  University,  the name of the  Laboratory,  or the
#  names  of  its  contributors  may  not  be used to endorse or promote
#  products  derived   from   this  software  without  specific  written
#  permission.                                                          
#                                                                       
#  -- Disclaimer:                                                       
#                                                                       
#  THIS  SOFTWARE  IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
#  ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,  INCLUDING,  BUT NOT
#  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
#  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY
#  OR  CONTRIBUTORS  BE  LIABLE FOR ANY  DIRECT,  INDIRECT,  INCIDENTAL,
#  SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL DAMAGES  (INCLUDING,  BUT NOT
#  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
#  DATA OR PROFITS; OR BUSINESS INTERRUPTION)  HOWEVER CAUSED AND ON ANY
#  THEORY OF LIABILITY, WHETHER IN CONTRACT,  STRICT LIABILITY,  OR TORT
#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
#  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
# ######################################################################
#  
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -fs
MKDIR        = mkdir -p
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = Linux_Intel64
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir       = /home/user002/benchmark/hpl-2.2
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        = /opt/intel/compilers_and_libraries_2017.1.132/linux/mpi
MPinc        = -I$(MPdir)/include64
MPlib        = $(MPdir)/lib64/libmpi.a
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir        = /opt/intel/compilers_and_libraries_2017.1.132/linux/mkl
ifndef  LAinc
LAinc        = $(LAdir)/include
endif
ifndef  LAlib
LAlib        = -L$(LAdir)/lib/intel64 \
               -Wl,--start-group \
               $(LAdir)/lib/intel64/libmkl_intel_lp64.a \
               $(LAdir)/lib/intel64/libmkl_intel_thread.a \
               $(LAdir)/lib/intel64/libmkl_core.a \
               -Wl,--end-group -lpthread -ldl
endif
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section  if and only if  you are not planning to use
# a  BLAS  library featuring a Fortran 77 interface.  Otherwise,  it  is
# necessary  to  fill out the  F2CDEFS  variable  with  the  appropriate
# options.  **One and only one**  option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_              : all lower case and a suffixed underscore  (Suns,
#                       Intel, ...),                           [default]
# -DNoChange          : all lower case (IBM RS6000),
# -DUpCase            : all upper case (Cray),
# -DAdd__             : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int,         [default]
# -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle    : The string address is passed at the string loca-
#                       tion on the stack, and the string length is then
#                       passed as  an  F77_INTEGER  after  all  explicit
#                       stack arguments,                       [default]
# -DStringStructPtr   : The address  of  a  structure  is  passed  by  a
#                       Fortran 77  string,  and the structure is of the
#                       form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal   : A structure is passed by value for each  Fortran
#                       77 string,  and  the  structure is  of the form:
#                       struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle   : Special option for  Cray  machines,  which  uses
#                       Cray  fcd  (fortran  character  descriptor)  for
#                       interoperation.
#
F2CDEFS      = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) -I$(LAinc) $(MPinc)
HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L           force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS       call the cblas interface;
# -DHPL_CALL_VSIPL       call the vsip  library;
# -DHPL_DETAILED_TIMING  enable detailed timers;
#
# By default HPL will:
#    *) not copy L before broadcast,
#    *) call the BLAS Fortran 77 interface,
#    *) not display detailed timing information.
#
#HPL_OPTS     = -DHPL_DETAILED_TIMING -DHPL_PROGRESS_REPORT
HPL_OPTS     = -DASYOUGO -DHYBRID
#
# ----------------------------------------------------------------------
#
HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC       = mpiicc
CCNOOPT  = $(HPL_DEFS) -O0 -w -nocompchk
#OMP_DEFS = -openmp
#CCFLAGS  = $(HPL_DEFS) -O3 -w -z noexecstack -z relro -z now -nocompchk -Wall
CCFLAGS = $(HPL_DEFS) -O3  -w -ansi-alias -i-static -z noexecstack -z relro -z now -openmp -nocompchk
#
# On some platforms,  it is necessary  to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER       = $(CC)
#LINKFLAGS    = $(CCFLAGS) $(OMP_DEFS) -mt_mpi
LINKFLAGS    = $(CCFLAGS) -openmp -mt_mpi $(STATICFLAG) -nocompchk
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------

I am hoping for your answer. Thanks!

Beginner

...I've checked, and nothing seems wrong with your Make.intel64 file; I'm afraid I can't figure out why you get such a poor score. You can refer to this page; it may help you. As for the 16 GB HBM, you should restart the server and check your BIOS to make sure it is set as cache to get the best performance.

Beginner

Thanks for your answer.

You are very kind!

I will refer to this page and run the HPL test again.

And I hope you can give me some more help with running HPL.

Could you show me the HPL.dat and HPL.out contents from your best result? And your run command, such as "mpirun -np 64 ./xhpl" or similar? Or any other settings, such as environment variables?

I think this may help me solve the problem.

Thanks!

 

Valued Contributor II

>>...My top result tested by HPL2.2 is 486Gflops, with N=82800, NB=336, P=4, Q=16 in HPL.dat.

1. You're using NB=336, which is the recommended default for a KNL system with 72 cores. Could you try NB=256 instead?

2. I've executed a micprun benchmark test; take a look at my attached report. Here are some numbers:

[ DESCRIPTION ] 7680 x 7680 MKL DGEMM with 0 threads and 3 iterations
[ PERFORMANCE ] Task.Computation.Avg 1874.40 GFlops R
...
[ DESCRIPTION ] hpcg Local Dimensions nx=160, ny=160, nz=160, MPI ranks 4, threads per rank 32
[ PERFORMANCE ] Computation.Avg 42.2244 GFlops R
...
[ DESCRIPTION ] HPLinpack problem size 100000 block size 336
[ PERFORMANCE ] Computation.Avg 1709.2 GFlops R
...
Performance Summary (GFlops)
Size   LDA    Align.  Average    Maximal
40960  41024  4       1682.3623  1684.5587
...
[ DESCRIPTION ] 40960 x 40960 MKL DP LINPACK with 64 threads and 3 iterations
[ PERFORMANCE ] Computation.Avg 1682.3623 GFlops R
...
testing XGEMM( 'N', 'N', n, n, ... )
n        min      avg      max      stddev
15872    3944.35  3950.08  3960.58  7.421e+00
* 15872  3944.35  3950.08  3960.58  7.421e+00
[ DESCRIPTION ] 15872 x 15872 MKL SGEMM with 0 threads and 3 iterations
[ PERFORMANCE ] Task.Computation.Avg 3950.08 GFlops R
...
Valued Contributor II

>>The highest result is 486 Gflops when I use HPL 2.2 and 683.6404 Gflops when I use Intel Optimized MP LINPACK Benchmark.
>>However, the theoretical peak performance is 1*64*1.3*32=2662.4 Gflops...

I'm not surprised that real GFlops numbers for a KNL system are lower (nothing is wrong with that), and it could be due to many reasons. That simple calculation, 1*64*1.3*32=2662.4 GFlops, doesn't take into account performance overheads from internal sources (some time is spent executing non-FPU instructions) and external ones (different OS services, etc., running at the same time).

Here is a set of results for a KNL server for matrix multiplication using MKL's sgemm function:

[ 16384 x 16384 ] Peak: 1442.51 GFlops
[ 32768 x 32768 ] Peak: 1455.22 GFlops
[ 65536 x 65536 ] Peak: 1477.93 GFlops
[ 81920 x 81920 ] Peak: 1347.65 GFlops
[ 98304 x 98304 ] Peak: 1287.24 GFlops
[ 114688 x 114688 ] Peak: 1332.90 GFlops

Tests completed on: Intel Xeon Phi Processor 7210 (16GB, 1.30 GHz, 64 core)
Processor name: Intel(R) Xeon Phi(TM) 7210
Packages (sockets): 1
Cores: 64
Processors (CPUs): 256
Cores per package: 64
Threads per core: 4
RAM: 96 GB
MCDRAM: 16 GB
Cluster mode: All2All
MCDRAM mode: Flat
Environment variables: KMP_AFFINITY=scatter
Operating system: Red Hat Enterprise Linux 3.10.0-327.13.1
Valued Contributor II

Duo S wrote:
>>...and I reached 3888.69 GFlops

It looks too high, and I recommend completing an MKL-based verification. You know that Intel's MKL API is highly optimized to get peak performance from a system.
Valued Contributor II

>>That simple calculation, I mean 1*64*1.3*32=2662.4 Gflops, doesn't take into account performance overheads from internal (some time is spent to execute non-FPU instructions) and external sources (different OS services, etc., running at the same time).

On the Windows operating system, I've done some performance evaluations in Safe Mode, when the number of running OS services is minimal (fewer than 10).
New Contributor III

Hi Sergey,

I am on the same platform as yours, the Intel Xeon Phi Processor 7210 (16GB, 1.30 GHz, 64 cores). What I want to observe is the thread-based performance impact, starting from 1 thread mapped to 1 core (the rest turned off), then 2 threads mapped to 2 different cores (the rest turned off), and so on, up to 256 threads mapped across all 64 cores. For the initial analysis I can do without mapping threads to cores, but I want to control the number of threads based on the number of active cores.

For such a test, which benchmark would you suggest, and what should I be aware of? I tried DeepBench, but I need to figure out how to make use of threading in it.

Thanks.

Chetan Arvind Patil
Black Belt

Estimating "peak" performance on KNL is a bit tricky...   For my Xeon Phi 7250 processors (68-core, 1.4 GHz nominal), the guaranteed frequency running AVX-512-heavy code is 1.2 GHz.  The Xeon Phi core is also a 2-instruction-issue core, but peak performance requires 2 FMA's per cycle -- so any instruction that is not an FMA is a direct subtraction from the maximum available performance.   It is difficult to be precise, but it is hard to imagine any encoding of the *DGEMM kernel that does not contain about 20% non-FMA instructions.

So a ballpark double-precision "adjusted peak" for the Xeon Phi 7250 is

68 cores * 1.2 GHz * 32 DP FP Ops/Hz * 80% FMA density = 2089 GFLOPS
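A quick numeric check of the line above (the 80% FMA density is the estimate stated earlier, not a measured value):

```shell
# cores x guaranteed AVX-512 GHz x DP FLOPs/cycle x assumed FMA density
awk 'BEGIN { printf "%.0f GFLOPS\n", 68 * 1.2 * 32 * 0.80 }'
```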

For DGEMM problems that can fit all three arrays into MCDRAM (in flat-quadrant mode), I have seen performance of just over 2000 GFLOPS. I don't understand why, but these runs maintain an average frequency that is significantly higher than 1.2 GHz -- close to 1.4 GHz.   The observed performance is ~85% of the "adjusted peak" performance at the observed frequency, which seems pretty reasonable.

HPL execution is dominated by DGEMM, but the overall algorithm is much more complex.  Unlike DGEMM, when I run HPL on KNL I do see frequencies close to the expected power-limited 1.2 GHz.   Also unlike DGEMM, when I run HPL I find that the KNL does not reach asymptotic performance for problem sizes that fit into the MCDRAM memory.  To get asymptotic performance for larger problems, you need to either run with the MCDRAM in cached mode, or you need an implementation that explicitly stages the data (in large blocks) through MCDRAM.   If I recall correctly, asymptotic HPL performance on KNL requires array sizes of at least 50-60 GiB.  On clusters, even larger sizes (per KNL) are needed to minimize overhead due to inter-node MPI communication.
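For problems small enough to fit entirely in MCDRAM on a flat-mode machine, the simplest explicit approach is numactl memory binding. A sketch, with assumptions flagged: the MCDRAM node number varies by configuration (verify it with `numactl --hardware`), the xhpl path is assumed, and N=40000 is chosen so the ~12.8 GB matrix fits in the 16 GB MCDRAM:

```shell
# In flat mode, MCDRAM appears as a separate, CPU-less NUMA node (often node 1
# on single-socket KNL -- an assumption; check with `numactl --hardware`).
MCDRAM_NODE=1
CMD="numactl --membind=$MCDRAM_NODE ./xhpl -n 40000 -b 336"
echo "$CMD"    # printed here as a sketch; run it on the KNL node itself
```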

 

"Dr. Bandwidth"
Beginner

Dear McCalpin, John (Blackbelt),

Your information is really helpful. My architecture is also the Xeon Phi 7250 (68-core, 1.4 GHz), but the performance I got when executing HPL is just 804 GFlops on 68 cores.

Could you explain in more detail how to use MCDRAM? Do I have to set environment variables, or do I need to modify the source code to use MCDRAM memory?

Or could you guide me in tuning some parameters in HPL.dat to get good performance?

I hope to hear from you soon.

Thanks a lot.

 

 

 