Solved: unable to run linpack with intel2020u0 binary (xhpl_intel64_*)

psing51 · ‎04-08-2020

Hi,
I have installed intel2020u0 on a RHEL 7.6 based system with an Intel 8280M processor.
While running a quick test with the linpack binary provided under compilers_and_libraries_2020.0.166/linux/mkl/benchmarks/mp_linpack, MPI initialization fails with an OFI error. Here is how I set up and ran the linpack binary (on a single node):

[user@node1 BASELINE]$ cp /home/user/COMPILER/MPI/INTELMPI/2020u0/compilers_and_libraries_2020.0.166/linux/mkl/benchmarks/mp_linpack/xhpl_intel64_static /home/user/COMPILER/MPI/INTELMPI/2020u0/compilers_and_libraries_2020.0.166/linux/mkl/benchmarks/mp_linpack/runme_intel64_static /home/user/COMPILER/MPI/INTELMPI/2020u0/compilers_and_libraries_2020.0.166/linux/mkl/benchmarks/mp_linpack/runme_intel64_prv /home/user/COMPILER/MPI/INTELMPI/2020u0/compilers_and_libraries_2020.0.166/linux/mkl/benchmarks/mp_linpack/HPL.dat .
[puneet@node61 BASELINE]$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2019 Update 6 Build 20191024 (id: 082ae5608)
Copyright 2003-2019, Intel Corporation.
[user@node1 BASELINE]$ ls
HPL.dat  runme_intel64_prv  runme_intel64_static  xhpl_intel64_static
[user@node1 BASELINE]$ ./runme_intel64_static
This is a SAMPLE run script.  Change it to reflect the correct number
of CPUs/threads, number of nodes, MPI processes per node, etc..
This run was done on: Wed Apr  8 22:36:04 IST 2020
RANK=1, NODE=1
RANK=0, NODE=0
Abort(1094543) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(649)......:
MPID_Init(861).............:
MPIDI_NM_mpi_init_hook(953): OFI fi_open domain failed (ofi_init.h:953:MPIDI_NM_mpi_init_hook:No data available)
Abort(1094543) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(649)......:
MPID_Init(861).............:
MPIDI_NM_mpi_init_hook(953): OFI fi_open domain failed (ofi_init.h:953:MPIDI_NM_mpi_init_hook:No data available)
Done: Wed Apr  8 22:36:05 IST 2020
[user@node1 BASELINE]$

Now, in the 2020u0 environment, if I remove the xhpl_intel64_static binary and use the one supplied with 2019u5 (HPL 2.3), HPL works fine:

[user@node1 BASELINE]$ cp /home/user/COMPILER/MPI/INTELMPI/2019_U5/compilers_and_libraries_2019.5.281/linux/mkl/benchmarks/mp_linpack/xhpl_intel64_static .
[user@node1 BASELINE]$ ./runme_intel64_static
This is a SAMPLE run script.  Change it to reflect the correct number
of CPUs/threads, number of nodes, MPI processes per node, etc..
This run was done on: Wed Apr  8 22:36:40 IST 2020
RANK=0, NODE=0
RANK=1, NODE=1
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N        :    1000
NB       :     192
PMAP     : Column-major process mapping
P        :       1
Q        :       1
PFACT    :   Right
NBMIN    :       2
NDIV     :       2
RFACT    :   Crout
BCAST    :   1ring
DEPTH    :       0
SWAP     : Binary-exchange
L1       : no-transposed form
U        : no-transposed form
EQUIL    : no
ALIGN    :    8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

node1          : Column=000192 Fraction=0.005 Kernel=    0.00 Mflops=100316.35
node1          : Column=000384 Fraction=0.195 Kernel=65085.04 Mflops=83075.67
node1          : Column=000576 Fraction=0.385 Kernel=39885.67 Mflops=70127.11
node1          : Column=000768 Fraction=0.595 Kernel=17659.92 Mflops=58843.41
node1          : Column=000960 Fraction=0.795 Kernel= 4894.70 Mflops=51756.17
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00C2R2        1000   192     1     1               0.01            4.64944e+01
HPL_pdgesv() start time Wed Apr  8 22:36:41 2020

HPL_pdgesv() end time   Wed Apr  8 22:36:41 2020

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0059446 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
Done: Wed Apr  8 22:36:41 IST 2020

The same holds for the xhpl binary supplied with Intel 2018u4 (HPL 2.1):

[user@node1 BASELINE]$ cp /home/user/COMPILER/MPI/INTELMPI/2018_U4/compilers_and_libraries_2018.5.274/linux/mkl/benchmarks/mp_linpack/xhpl_intel64_static .
[user@node1 BASELINE]$ ./runme_intel64_static
This is a SAMPLE run script.  Change it to reflect the correct number
of CPUs/threads, number of nodes, MPI processes per node, etc..
This run was done on: Wed Apr  8 22:37:48 IST 2020
RANK=0, NODE=0
RANK=1, NODE=1
================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N        :    1000
NB       :     192
PMAP     : Column-major process mapping
P        :       1
Q        :       1
PFACT    :   Right
NBMIN    :       2
NDIV     :       2
RFACT    :   Crout
BCAST    :   1ring
DEPTH    :       0
SWAP     : Binary-exchange
L1       : no-transposed form
U        : no-transposed form
EQUIL    : no
ALIGN    :    8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

node1          : Column=000192 Fraction=0.005 Kernel=    0.00 Mflops=99748.31
node1          : Column=000384 Fraction=0.195 Kernel=67904.30 Mflops=84547.57
node1          : Column=000576 Fraction=0.385 Kernel=39287.97 Mflops=70666.21
node1          : Column=000768 Fraction=0.595 Kernel=18197.26 Mflops=59578.53
node1          : Column=000960 Fraction=0.795 Kernel= 4634.78 Mflops=51930.16
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00C2R2        1000   192     1     1               0.01            4.96887e+01
HPL_pdgesv() start time Wed Apr  8 22:37:49 2020

HPL_pdgesv() end time   Wed Apr  8 22:37:49 2020

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0059446 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
Done: Wed Apr  8 22:37:49 IST 2020

Here is the fi_info output:

[user@node1 BASELINE]$ fi_info
provider: mlx
    fabric: mlx
    domain: mlx
    version: 1.5
    type: FI_EP_UNSPEC
    protocol: FI_PROTO_MLX
provider: mlx;ofi_rxm
    fabric: mlx
    domain: mlx
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXM
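
To see only the provider names that libfabric reports, the listing above can be filtered. This is a minimal sketch that assumes the `provider:` line format shown in the output; the helper name `list_providers` is hypothetical and not part of libfabric.

```shell
#!/bin/sh
# list_providers: read fi_info-style text on stdin and print each unique
# provider name, one per line (hypothetical helper, not part of libfabric).
list_providers() {
    awk '/^provider:/ { print $2 }' | sort -u
}

# Usage on a live system (assumes fi_info is on PATH):
#   fi_info | list_providers
```

On the system in this thread, this would show only mlx-based providers, which fits the later finding that the failure is tied to the mlx provider.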

I also tested an MPI hello world program:

[user@node1 BASELINE]$ mpiicc hello.c
[user@node1 BASELINE]$ mpirun -np 2 ./a.out
Hello world from processor node61, rank 0 out of 2 processors
Hello world from processor node61, rank 1 out of 2 processors

Please advise.

Michael_Intel · ‎08-04-2020

The product fix is part of MKL 2020 update 2.

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only

PrasanthD_intel · ‎04-09-2020

Hi Puneet,

We tried this and observed the same behavior. The benchmark fails to run when the provider is set to mlx, but works if the provider is tcp or verbs.

Thanks for reporting it to us.

We will be forwarding this to the respective team.

Thanks

Prasanth

psing51 · ‎04-09-2020

Hi,
Could you please share the settings/environment variables needed to make it work?
I only need to run HPL on a single node (so the fabric doesn't matter to me for now).
Here is what I get when I use tcp:

[user@node1 test]$ export I_MPI_FABRICS=tcp
[user@node1 test]$ ./runme_intel64_static
This is a SAMPLE run script.  Change it to reflect the correct number
of CPUs/threads, number of nodes, MPI processes per node, etc..
This run was done on: Thu Apr  9 16:40:03 IST 2020
RANK=1, NODE=1
RANK=0, NODE=0
MPI startup(): tcp fabric is unknown or has been removed from the product, please use ofi or shm:ofi instead

Abort(1094543) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(649)......:
MPID_Init(861).............:
MPIDI_NM_mpi_init_hook(953): OFI fi_open domain failed (ofi_init.h:953:MPIDI_NM_mpi_init_hook:No data available)
Done: Thu Apr  9 16:40:03 IST 2020

UPDATE: with export FI_PROVIDER=tcp, I am able to run HPL:

[user@node1 test]$ export FI_PROVIDER=tcp
[user@node1 test]$ ./runme_intel64_static
This is a SAMPLE run script.  Change it to reflect the correct number
of CPUs/threads, number of nodes, MPI processes per node, etc..
This run was done on: Thu Apr  9 16:48:44 IST 2020
RANK=1, NODE=1
RANK=0, NODE=0
[0] MPI startup(): I_MPI_DAPL_DIRECT_COPY_THRESHOLD variable has been removed from the product, its value is ignored

[0] MPI startup(): I_MPI_DAPL_DIRECT_COPY_THRESHOLD environment variable is not supported.
[0] MPI startup(): Similar variables:
         I_MPI_SHM_SEND_TINY_MEMCPY_THRESHOLD
[0] MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N        :    1000
NB       :     192
PMAP     : Column-major process mapping
P        :       1
Q        :       1
PFACT    :   Right
NBMIN    :       2
NDIV     :       2
RFACT    :   Crout
BCAST    :   1ring
DEPTH    :       0
SWAP     : Binary-exchange
L1       : no-transposed form
U        : no-transposed form
EQUIL    : no
ALIGN    :    8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

node1          : Column=000192 Fraction=0.005 Kernel=    0.00 Mflops=102934.66
node1          : Column=000384 Fraction=0.195 Kernel=67954.85 Mflops=85968.97
node1          : Column=000576 Fraction=0.385 Kernel=40114.53 Mflops=71945.58
node1          : Column=000768 Fraction=0.595 Kernel=19454.64 Mflops=61274.77
node1          : Column=000960 Fraction=0.795 Kernel= 5232.37 Mflops=54078.56
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00C2R2        1000   192     1     1               0.01            4.86455e+01
HPL_pdgesv() start time Thu Apr  9 16:48:45 2020

HPL_pdgesv() end time   Thu Apr  9 16:48:45 2020

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   5.94455135e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
Done: Thu Apr  9 16:48:45 IST 2020

Michael_Intel · ‎04-09-2020

Hello,

The workaround in your case would be either to use Ethernet, as you already mentioned, or to leverage the InfiniBand* fabric via verbs (I_MPI_OFI_PROVIDER=verbs). Since you are running on a single node only, you may also use the shared memory transport layer from Intel MPI (I_MPI_FABRICS=shm:ofi).
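
The candidate providers can be tried one after another. The sketch below uses the libfabric variable `FI_PROVIDER`, which worked earlier in this thread, rather than `I_MPI_OFI_PROVIDER`. The helper name `try_providers` is hypothetical, and the check for the word PASSED is an assumption based on the HPL output shown above: the sample run script prints "Done" even after an MPI abort, so its exit status may not reflect the failure.

```shell
#!/bin/sh
# try_providers CMD PROVIDER...: run CMD once per provider with FI_PROVIDER
# set in its environment. Print the first provider whose output contains
# HPL's "PASSED" residual line and return 0; return 1 if none does.
# (Hypothetical helper, not part of Intel MPI or MKL.)
try_providers() {
    cmd=$1
    shift
    for p in "$@"; do
        # Capture stdout and stderr; ignore the exit status on purpose.
        out=$(FI_PROVIDER=$p sh -c "$cmd" 2>&1) || true
        case $out in
            *PASSED*) echo "$p"; return 0 ;;
        esac
    done
    return 1
}

# Usage (assumes the run script is in the current directory):
#   try_providers ./runme_intel64_static verbs tcp
```

Which providers are worth listing depends on what is available on the node; verbs and tcp are simply the two named in this thread.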

However, the actual issue is that the new (default) mlx provider does not work for you. Therefore, please refer to the requirements and limitations of the mlx provider at the following link: https://software.intel.com/en-us/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband

Please let us know if the issue is resolved.

Best regards,

Michael

psing51 · ‎04-09-2020

Here's what I get using shm:ofi:

[user@node1 test]$ export I_MPI_FABRICS=shm:ofi
[user@node1 test]$ ./runme_intel64_static
This is a SAMPLE run script.  Change it to reflect the correct number
of CPUs/threads, number of nodes, MPI processes per node, etc..
This run was done on: Fri Apr 10 09:34:18 IST 2020
RANK=0, NODE=0
RANK=1, NODE=1
Abort(1094543) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(649)......:
MPID_Init(861).............:
MPIDI_NM_mpi_init_hook(953): OFI fi_open domain failed (ofi_init.h:953:MPIDI_NM_mpi_init_hook:No data available)
Done: Fri Apr 10 09:34:18 IST 2020

And with shm:

[user@node1 test]$ export I_MPI_FABRICS=shm
[user@node1 test]$ ./runme_intel64_static
This is a SAMPLE run script.  Change it to reflect the correct number
of CPUs/threads, number of nodes, MPI processes per node, etc..
This run was done on: Fri Apr 10 09:34:58 IST 2020
RANK=0, NODE=0
RANK=1, NODE=1
MPI startup(): shm fabric is unknown or has been removed from the product, please use ofi or shm:ofi instead

Abort(1094543) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(649)......:
MPID_Init(861).............:
MPIDI_NM_mpi_init_hook(953): OFI fi_open domain failed (ofi_init.h:953:MPIDI_NM_mpi_init_hook:No data available)
Abort(1094543) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(649)......:
MPID_Init(861).............:
MPIDI_NM_mpi_init_hook(953): OFI fi_open domain failed (ofi_init.h:953:MPIDI_NM_mpi_init_hook:No data available)
Done: Fri Apr 10 09:34:58 IST 2020

Michael_Intel · ‎04-14-2020

Hello,

This is an issue with the statically linked XHPL benchmark from MKL that links against an Intel MPI library version that is not aware of the new mlx provider.

Therefore, as a workaround, you might instead use the dynamically linked XHPL or alternatively use a different fabric provider, e.g. FI_PROVIDER=verbs.
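
One way to check whether a given XHPL binary picks up the MPI library at run time (and therefore the installed, mlx-aware version) is to inspect its shared library dependencies. This is a sketch under assumptions: the helper name `uses_shared_mpi` is hypothetical, the binary name `xhpl_intel64_dynamic` is assumed to be the dynamically linked variant shipped next to the static one, and the shared MPI library is assumed to have a name starting with `libmpi`.

```shell
#!/bin/sh
# uses_shared_mpi: read ldd-style output on stdin and succeed if a shared
# MPI library (a name starting with "libmpi") is listed.
# (Hypothetical helper; the library name pattern is an assumption.)
uses_shared_mpi() {
    grep 'libmpi[^ ]*\.so' >/dev/null
}

# Usage (binary name is an assumption):
#   ldd ./xhpl_intel64_dynamic | uses_shared_mpi && echo "MPI is loaded at run time"
```

If no shared MPI library is listed, the binary most likely carries its own embedded copy of MPI, which matches the explanation given in this post for the static binary.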

This is not an issue in Intel MPI, but I am doing an internal follow up with the MKL team.

Best regards,

Michael

Michael_Intel · ‎04-14-2020

Hello,

We can confirm that this is a bug in MKL and will track it accordingly.

Best regards,

Michael
