Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

HPC Cluster HPL test error



I am testing my HPC system, which has 33 nodes with 24 cores each (792 cores total) and 370 GB RAM per node, but I get the following error on the second run of the command mpiexec -f hosts2 -n 792 ./xhpl. I had run this command before smoothly and got output. Do you have any idea what the problem is?

The first execution of mpiexec -f hosts2 -n 792 ./xhpl produced a value of 3.096e+04 Gflops.

By the way, mpiexec -f hosts2 -n 480 ./xhpl works properly and produces a value of 2.136e+04 Gflops.

Thank you.

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  300288 
NB     :     224 
PMAP   : Row-major process mapping
P      :      24 
Q      :      33 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words


- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

Abort(205610511) on node 516 (rank 516 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(507)................: MPI_Comm_split(MPI_COMM_WORLD, color=0, key=516, new_comm=0x7ffc61c8d818) failed
MPIR_Get_contextid_sparse_group(498): Failure during collective
Abort(876699151) on node 575 (rank 575 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:
PMPI_Comm_split(507)................: MPI_Comm_split(MPI_COMM_WORLD, color=0, key=575, new_comm=0x7ffec32d2c18) failed
MPIR_Get_contextid_sparse_group(498): Failure during collective


Hi Gurcan,

Thanks for connecting. We are working on this and will get back to you at the earliest.

Meanwhile, could you please share your MPI version details with us?




Hi Kumar,

I have overcome this problem by changing the following .bashrc parameters. The run now produces 45 Tflops.

But when I run mpiexec -f hosts2 -n 792 ./xhpl, I don't know what the output value should be. How many Tflops should it be in theory, and how is the theoretical Tflops value calculated?

I look forward to hearing from you soon.


Processor: 2 x Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz 

Number of Cores: 2 sockets x 12 cores

Number of Nodes: 33 


$ mpirun --version 

Intel(R) MPI Library for Linux* OS, Version 2019 Update 3 Build 20190214 (id: b645a4a54)
Copyright 2003-2019, Intel Corporation.


My .bashrc parameters are:

export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs
export FI_VERBS_IFACE=ib
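
For reference, these variables could also be set per job rather than globally in .bashrc. A minimal sketch (I_MPI_DEBUG is a standard Intel MPI Library variable that prints the selected fabric/provider at startup; the interface name should be adjusted to match your actual IB device):

```shell
export I_MPI_FABRICS=shm:ofi
export FI_PROVIDER=verbs
export FI_VERBS_IFACE=ib
export I_MPI_DEBUG=5        # prints fabric and provider selection at launch
mpiexec -f hosts2 -n 792 ./xhpl
```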




It appears you are using InfiniBand* on this system.  I highly recommend updating to Intel® MPI Library 2019 Update 6 and using FI_PROVIDER=mlx.


The Xeon Gold 6136 processor has a "base" all-core AVX-512 frequency of 2.1 GHz and a maximum all-core AVX-512 Turbo frequency of 2.7 GHz. The average sustained frequency when running xHPL will be somewhere in this range. The average value will vary across your 66 processors, depending on the leakage current of each processor and on variations in the effectiveness of the cooling system at the different processor locations. On a set of 1736 2-socket Xeon Platinum 8160 processors, I saw a range of almost 13% in single-node xHPL performance. The default configuration of xHPL divides the work as uniformly as it can across the nodes, so aggregate performance is typically very close to "worst-node performance times number of nodes".

The efficiency of the xHPL code is very high -- typically 91%-92% of peak relative to the *actual* average frequency.   So at the "base" AVX512 frequency of 2.1 GHz, the peak performance of the system is 

33 nodes * 2 sockets/node * 12 cores/socket * 32 FP ops/cycle/core * 2.1 GHz = 53,222 GFLOPS

At 91% efficiency, this would correspond to an xHPL performance of 48,432 GFLOPS.  Your 45 TFLOPS number is about 7% lower than this estimate.
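
The arithmetic above can be reproduced with a short calculation (a sketch; the 32 FP ops/cycle/core figure assumes two AVX-512 FMA units per core, as on the Xeon Gold 6136):

```python
# Theoretical peak and estimated xHPL performance for the cluster described above.
nodes = 33
sockets_per_node = 2
cores_per_socket = 12
flops_per_cycle = 32     # 2 AVX-512 FMA units x 8 doubles x 2 ops (multiply + add)
base_avx512_ghz = 2.1    # all-core AVX-512 base frequency of the Xeon Gold 6136

peak_gflops = nodes * sockets_per_node * cores_per_socket * flops_per_cycle * base_avx512_ghz
hpl_estimate = peak_gflops * 0.91    # ~91% typical xHPL efficiency

print(f"Peak: {peak_gflops:.0f} GFLOPS, xHPL estimate: {hpl_estimate:.0f} GFLOPS")
# Peak: 53222 GFLOPS, xHPL estimate: 48432 GFLOPS
```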

Performance will depend on many of the parameters in the input file, but the most important one is probably problem size.  For the N=300288 problem size, your run only uses about 10 GiB of memory per socket and only takes about 400 seconds to run.  My projection model says that for this decomposition, performance should be about 45 TFLOPS.  Larger problem sizes will help, but my projection model suggests that you will also have to launch xHPL with only one or two MPI tasks per node (and allow it to parallelize across cores within each MPI task).   For 1 MPI task per socket you will have 66 MPI tasks and P=6, Q=11 looks best in my performance model.
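
The ~10 GiB-per-socket figure follows from the HPL memory model: the N x N coefficient matrix is stored in double precision and spread roughly evenly across all sockets (a sketch under that assumption):

```python
# HPL memory footprint for the N=300288 run on 33 nodes x 2 sockets.
N = 300288
sockets = 33 * 2
bytes_total = N * N * 8                      # double-precision matrix A
gib_per_socket = bytes_total / sockets / 2**30
print(f"{gib_per_socket:.1f} GiB per socket")
# 10.2 GiB per socket
```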


This thread has been addressed by the community. I am marking it as resolved for Intel support. If you need further Intel support on this issue, please open a new thread. Any further discussion on this thread will be considered community only.
