Intel® MPI Library

Strange MPI errors after rebuilding node in a cluster

segmentation_fault

I replaced a bad hard drive in a blade of a 16-blade Dell enclosure with an InfiniBand network. I installed CentOS 7.9, which is the same OS all the other blades are running. I then installed all the InfiniBand drivers and verified that the InfiniBand network is working on this node (lustwzb3).

When I run the Intel MPI Benchmarks, I receive this error:

 

[143]hussaif1@lustwzb3:~/mpi-benchmarks $  mpirun -np 2 -ppn 1 -hosts lustwzb3,lustwzb4 IMB-P2P
Abort(1615503) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1183)..............:
MPIDI_OFI_mpi_init_hook(1968):
MPIDU_bc_table_create(370)...: Missing hostname or invalid host/port description in business card
[1649275191.657810] [lustwzb4:34963:0]         select.c:406  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[143]hussaif1@lustwzb3:~/mpi-benchmarks $
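
Note that the UCX error line in this output is tagged with lustwzb4 rather than the rebuilt node, so it can help to list which hosts actually emit UCX errors. The following is a minimal sketch and not part of the original post: the function name ucx_error_hosts is hypothetical, and it assumes the "[timestamp] [host:pid:n]" line prefix seen in the output above.

```shell
# Hypothetical helper: print the unique hostnames that appear in UCX ERROR
# lines read from stdin. Assumes lines shaped like the one in the post:
#   [1649275191.657810] [lustwzb4:34963:0]  select.c:406  UCX  ERROR ...
ucx_error_hosts() {
    # Capture the host part of "[host:pid:n]" on lines that mention UCX ERROR,
    # then de-duplicate the result.
    sed -n 's/^\[[0-9.]*] \[\([^:]*\):[0-9]*:[0-9]*].*UCX  *ERROR.*/\1/p' | sort -u
}
```

It could be used as "mpirun -np 2 -ppn 1 -hosts lustwzb3,lustwzb4 IMB-P2P 2>&1 | ucx_error_hosts"; for the output shown above it would print lustwzb4.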

 

All other nodes in this enclosure can run this benchmark without problems, so I suspect a configuration issue on the rebuilt node (lustwzb3). I next ran Intel Cluster Checker, which reported this:

 

[0]hussaif1@lustwzb3:~/mpi/osu-micro-benchmarks-5.3.2/mpi/pt2pt $ clck -f ./hosts -F mpi_prereq_user
Intel(R) Cluster Checker 2021 Update 4 (build 20210910)

Running Collect

................................................................
Running Analyze

SUMMARY
  Command-line:   clck -f ./hosts -F mpi_prereq_user
  Tests Run:      mpi_prereq_user
  **WARNING**:    3 tests failed to run. Information may be incomplete. See clck_execution_warnings.log for more information.
  Overall Result: Could not run all tests.
--------------------------------------------------------------------------------------------------------------------------------------------------------------
2 nodes tested:         lustwzb[3-4]
2 nodes with no issues: lustwzb[3-4]
0 nodes with issues:
--------------------------------------------------------------------------------------------------------------------------------------------------------------
FUNCTIONALITY
No issues detected.

HARDWARE UNIFORMITY
No issues detected.

PERFORMANCE
No issues detected.

SOFTWARE UNIFORMITY
No issues detected.

See the following files for more information: clck_results.log, clck_execution_warnings.log

 

Here are the contents of the warnings file:

 

[0]hussaif1@lustwzb3:~/mpi/osu-micro-benchmarks-5.3.2/mpi/pt2pt $ cat clck_execution_warnings.log
Intel(R) Cluster Checker 2021 Update 4 (build 20210910)
Command-line: clck -f ./hosts -F mpi_prereq_user
RUNTIME ERRORS
Intel(R) Cluster Checker encountered the following errors during execution:
  1. ethtool-data-error
       Message: The 'ethtool' provider was executed but did not run successfully
                due to an unknown reason. The 'ethtool' data is either not
                parsable or the provider did not run correctly. Some ethernet
                related analysis may not execute successfully because of this.
       Remedy:  Please ensure that the 'ethtool' command is installed and
                available via $PATH. For more details, use the following
                command: 'clckdb --provider ethtool [--db filename]'. If the
                data is missing, please collect it by executing clck or
                clck-collect with the '-F ethernet' command line option.
       2 nodes: lustwzb[3-4]
       Test:    ethernet
  2. infiniband-data-missing
       Message: InfiniBand data is not available for analysis.
       Remedy:  For more details, use the following command: 'clckdb --provider
                ibstat --provider lspci --provider ofedinfo [--db filename]'. If
                the data is missing, please collect it by executing clck or
                clck-collect with the '-F infiniband_user' or '-F
                infiniband_admin' command line option. If InfiniBand is not
                present, adding an entry to the Intel Cluster Checker
                configuration file allows to disable the 'infiniband_user' or
                'infiniband_admin' check or to suppress this message.
       2 nodes: lustwzb[3-4]
       Test:    infiniband_base
  3. no-data
       Message: Data for one or more checks are not available or could not be
                parsed correctly. "Test(s):" listed below where affected. Search
                the output for the listed test name(s) to verify if they ran
                correctly.
       Remedy:  Verify the correct database is being used and contains valid
                data, the 'clckdb' tool can be used to query the database. If
                necessary, recollect the missing data or ignore if you know it
                is not needed.
       2 nodes: lustwzb[3-4]
       Test:    infiniband_base

--------------------------------------------------------------------------------
Intel(R) Cluster Checker 2021 Update 4
19:17:44 April 6 2022 UTC
Nodefile used: ./hosts
Databases used: $HOME/.clck/2021.4.0/clck.db
[0]hussaif1@lustwzb3:~/mpi/osu-micro-benchmarks-5.3.2/mpi/pt2pt $
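
The warnings file is long, but the useful part is which warning maps to which test. As a rough sketch (the function name clck_warning_tests is hypothetical, and it assumes the numbered "N. warning-id" and "Test:" layout shown above), the pairs can be pulled out with awk:

```shell
# Hypothetical helper: summarize a clck_execution_warnings.log as
# "warning-id: test" pairs. Reads the named files, or stdin if none given.
clck_warning_tests() {
    awk '
        # A numbered entry such as "  1. ethtool-data-error" starts a warning.
        /^ *[0-9]+\. / { id = $2 }
        # The "Test:" line names the test affected by the current warning.
        /^ *Test: /    { print id ": " $2 }
    ' "$@"
}
```

Running "clck_warning_tests clck_execution_warnings.log" on the file above should list ethernet once and infiniband_base twice. As the Remedy text in the log itself suggests, the InfiniBand data can then be collected by rerunning clck with the '-F infiniband_user' option.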

 

 

 

1 Solution
segmentation_fault

I solved it by running "yum install -y openmpi3 openmpi3-devel". This pulled in the following packages, including libfabric and ucx as dependencies, and the error went away:

 

==============================================================================================================================================================
 Package                                    Arch                          Version                                           Repository                   Size
==============================================================================================================================================================
Installing:
 openmpi3                                   x86_64                        3.1.3-2.el7                                       base                        2.9 M
 openmpi3-devel                             x86_64                        3.1.3-2.el7                                       base                        853 k
Installing for dependencies:
 environment-modules                        x86_64                        3.2.10-10.el7                                     base                        107 k
 infinipath-psm                             x86_64                        3.3-26_g604758e_open.2.el7                        base                        186 k
 libfabric                                  x86_64                        1.7.2-1.el7                                       base                        536 k
 libpsm2                                    x86_64                        11.2.78-1.el7                                     base                        189 k
 ucx                                        x86_64                        1.5.2-1.el7                                       base                        443 k

Transaction Summary
==============================================================================================================================================================
Install  2 Packages (+5 Dependent packages)
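
The transaction above suggests the rebuilt node was missing fabric libraries that the other blades presumably already had. A minimal sketch for checking this on any node follows. The function name missing_pkgs and the QUERY override are hypothetical, and the package list is simply the dependency list from the yum output above.

```shell
# Fabric-related packages taken from the yum dependency list above.
FABRIC_PKGS="libfabric ucx libpsm2 infinipath-psm"

# Hypothetical helper: print each named package that the query command
# reports as not installed. QUERY defaults to "rpm -q" and can be
# overridden, which lets the function be tried on a machine without rpm.
missing_pkgs() {
    local query="${QUERY:-rpm -q}"
    local pkg
    for pkg in "$@"; do
        # rpm -q exits non-zero when the package is not installed.
        if ! $query "$pkg" >/dev/null 2>&1; then
            echo "$pkg"
        fi
    done
}

# Usage on the rebuilt node, and a comparison against a known-good node:
#   missing_pkgs $FABRIC_PKGS
#   ssh lustwzb4 rpm -q $FABRIC_PKGS
```

Comparing the installed versions on a working blade against the rebuilt one is a quick way to confirm whether a package difference, rather than the InfiniBand hardware, is the cause.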

 

VarshaS_Intel
Moderator

Hi,


Glad to know that your issue is resolved. Thanks for sharing the solution with us. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks & Regards,

Varsha

