serttas__gurcan

Beginner

12-31-2019
12:19 AM

308 Views

HPC Cluster HPL test error

Hello,

I am testing my HPC system, which has 33 nodes with 24 cores each (792 cores in total) and 370 GB of RAM per node. When I run the **mpiexec -f hosts2 -n 792 ./xhpl** command a second time, I get the error below. I ran this command once before without problems and got output. Do you have any idea what is causing this?

The first execution of the **mpiexec -f hosts2 -n 792 ./xhpl** command produced 3.096e+04 Gflops.

By the way, the **mpiexec -f hosts2 -n 480 ./xhpl** command works properly and produces 2.136e+04 Gflops.

Thank you.

An explanation of the input/output parameters follows:

T/V : Wall time / encoded variant.

N : The order of the coefficient matrix A.

NB : The partitioning blocking factor.

P : The number of process rows.

Q : The number of process columns.

Time : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 300288

NB : 224

PMAP : Row-major process mapping

P : 24

Q : 33

PFACT : Right

NBMIN : 4

NDIV : 2

RFACT : Crout

BCAST : 1ringM

DEPTH : 1

SWAP : Mix (threshold = 64)

L1 : transposed form

U : transposed form

EQUIL : yes

ALIGN : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be 1.110223e-16

- Computational tests pass if scaled residuals are less than 16.0

Abort(205610511) on node 516 (rank 516 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:

PMPI_Comm_split(507)................: MPI_Comm_split(MPI_COMM_WORLD, color=0, key=516, new_comm=0x7ffc61c8d818) failed

PMPI_Comm_split(489)................:

MPIR_Comm_split_impl(253)...........:

MPIR_Get_contextid_sparse_group(498): Failure during collective

Abort(876699151) on node 575 (rank 575 in comm 0): Fatal error in PMPI_Comm_split: Other MPI error, error stack:

PMPI_Comm_split(507)................: MPI_Comm_split(MPI_COMM_WORLD, color=0, key=575, new_comm=0x7ffec32d2c18) failed

PMPI_Comm_split(489)................:

MPIR_Comm_split_impl(253)...........:

MPIR_Get_contextid_sparse_group(498): Failure during collective

5 Replies

Neeraj_K_Intel

Employee

01-02-2020
10:50 PM

Hi Gurcan,

Thanks for reaching out. We are working on this and will get back to you at the earliest.

Meanwhile, could you please share your MPI version details with us?

Thanks,

Neeraj

serttas__gurcan

Beginner

01-03-2020
06:28 AM

Hi Kumar,

I have overcome this problem by changing the following bashrc parameters. The run now produces 45 Tflops.

But when I run **mpiexec -f hosts2 -n 792 ./xhpl**, I don't know what the output value should be. How many Tflops should it be in theory, and how is the Tflops value calculated?

I look forward to hearing from you soon.

Thanks.

Processor: 2 x Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz

Number of Cores: 2 sockets x 12 cores

Number of Nodes: 33

**$ mpirun --version**

Intel(R) MPI Library for Linux* OS, Version 2019 Update 3 Build 20190214 (id: b645a4a54)

Copyright 2003-2019, Intel Corporation.

**My .bashrc parameters are:**

export I_MPI_FABRICS=shm:ofi

export FI_PROVIDER=verbs

export FI_VERBS_IFACE=ib

James_T_Intel

Moderator

01-06-2020
07:16 AM

McCalpinJohn

Black Belt

01-08-2020
12:11 PM

The Xeon Gold 6136 processor has a "base" all-core AVX512 frequency of 2.1 GHz and a maximum all-core AVX512 Turbo frequency of 2.7 GHz. The average sustained frequency when running xHPL will be somewhere in this range. The average value will vary across your 66 processors, depending on the leakage current of each processor and on variations in the effectiveness of the cooling system at the different processor locations. On a set of 1736 2-socket Xeon Platinum 8160 nodes, I saw a range of almost 13% in single-node xHPL performance. The default configuration of xHPL divides the work as uniformly as it can across the nodes, so aggregate performance is typically very close to "worst-node performance times number of nodes".

The efficiency of the xHPL code is very high -- typically 91%-92% of peak relative to the *actual* average frequency. So at the "base" AVX512 frequency of 2.1 GHz, the peak performance of the system is

33 nodes * 2 sockets/node * 12 cores/socket * 32 FP ops/cycle/core * 2.1 GHz = 53,222 GFLOPS

At 91% efficiency, this would correspond to an xHPL performance of 48,432 GFLOPS. Your 45 TFLOPS number is about 7% lower than this estimate.
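The arithmetic above can be reproduced with a short Python sketch. The function and variable names are illustrative only; the inputs (33 nodes, 2 sockets, 12 cores, 32 FP ops/cycle/core, 2.1 GHz, 91% efficiency, 45 TFLOPS measured) are the figures quoted in this thread, and the result is only as good as the assumed sustained frequency.

```python
# Theoretical peak and expected xHPL rate, using the figures quoted in this thread.
# All names here are illustrative, not part of HPL or Intel MPI.

def peak_gflops(nodes, sockets_per_node, cores_per_socket, flops_per_cycle, ghz):
    """Peak double-precision GFLOPS = total cores * FP ops/cycle/core * GHz."""
    return nodes * sockets_per_node * cores_per_socket * flops_per_cycle * ghz

peak = peak_gflops(nodes=33, sockets_per_node=2, cores_per_socket=12,
                   flops_per_cycle=32, ghz=2.1)   # 2.1 GHz = base AVX512 frequency
expected = 0.91 * peak                            # 91% xHPL efficiency (low end of 91-92%)
measured = 45_000.0                               # the 45 TFLOPS run, in GFLOPS
shortfall = 1.0 - measured / expected             # fraction below the estimate

print(f"peak      = {peak:,.0f} GFLOPS")
print(f"expected  = {expected:,.0f} GFLOPS")
print(f"shortfall = {shortfall:.1%}")
```

Swapping in 2.7 GHz for `ghz` gives the upper bound for the case where every processor sustains the maximum all-core AVX512 Turbo frequency, which is unlikely in practice.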

Performance will depend on many of the parameters in the input file, but the most important one is probably problem size. For the N=300288 problem size, your run only uses about 10 GiB of memory per socket and only takes about 400 seconds to run. My projection model says that for this decomposition, performance should be about 45 TFLOPS. Larger problem sizes will help, but my projection model suggests that you will also have to launch xHPL with only one or two MPI tasks per node (and allow it to parallelize across cores within each MPI task). For 1 MPI task per socket you will have 66 MPI tasks and P=6, Q=11 looks best in my performance model.
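As a rough check on the memory and run-time figures, and as a starting point for picking a larger N, here is a minimal Python sketch. It assumes the matrix dominates memory use at 8 bytes per element and uses the operation count conventionally reported by HPL, (2/3)N^3 + (3/2)N^2. The 80% memory fraction, the reading of "370GB" as decimal gigabytes, and all function names are my assumptions, not something stated in the thread.

```python
import math

N = 300_288              # problem size from the posted run
NB = 224                 # blocking factor from the posted run
SOCKETS = 33 * 2         # 33 nodes, 2 sockets each
MEASURED_FLOPS = 45e12   # the 45 TFLOPS result

def matrix_bytes(n):
    """Storage for the N x N double-precision matrix (8 bytes per element)."""
    return 8 * n * n

def hpl_flop_count(n):
    """Operation count conventionally used for HPL: (2/3)N^3 + (3/2)N^2."""
    return (2.0 / 3.0) * n**3 + 1.5 * n**2

def largest_n(total_bytes, fraction, nb):
    """Largest N (a multiple of nb) whose matrix fits in fraction * total_bytes."""
    n = math.isqrt(int(fraction * total_bytes / 8))
    return n - n % nb

def process_grids(tasks):
    """All P x Q factorizations with P <= Q, ordered from flattest to most square."""
    return [(p, tasks // p) for p in range(1, math.isqrt(tasks) + 1) if tasks % p == 0]

gib_per_socket = matrix_bytes(N) / SOCKETS / 2**30
seconds = hpl_flop_count(N) / MEASURED_FLOPS

total_ram = 33 * 370e9                    # assumption: 370 GB per node, decimal GB
n_big = largest_n(total_ram, 0.80, NB)    # assumption: fill about 80% of memory

print(f"N={N}: {gib_per_socket:.1f} GiB per socket, about {seconds:.0f} s at 45 TFLOPS")
print(f"candidate larger N: {n_big}")
print(f"grids for 66 tasks:  {process_grids(66)}")
print(f"grids for 792 tasks: {process_grids(792)[-3:]}")
```

The most square grid for 66 tasks is P=6, Q=11, which matches the suggestion above, and for 792 tasks it is the P=24, Q=33 grid used in the original run. Treat the candidate N as an upper bound to tune downward, since the operating system and MPI buffers also need memory.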

James_T_Intel

Moderator

08-04-2020
11:28 AM

250 Views
