I replaced a bad hard drive in a 16-blade Dell enclosure with an InfiniBand network. I installed CentOS 7.9, which is the same OS all the other blades are running. I then installed all the InfiniBand drivers and verified that the InfiniBand network is working on this node (lustwzb3).
When I run the Intel MPI benchmark, I receive this error:
[143]hussaif1@lustwzb3:~/mpi-benchmarks $ mpirun -np 2 -ppn 1 -hosts lustwzb3,lustwzb4 IMB-P2P
Abort(1615503) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1183)..............:
MPIDI_OFI_mpi_init_hook(1968):
MPIDU_bc_table_create(370)...: Missing hostname or invalid host/port description in business card
[1649275191.657810] [lustwzb4:34963:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[143]hussaif1@lustwzb3:~/mpi-benchmarks $
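The error stack above goes through the OFI (libfabric) layer and ends in a UCX transport error, so one reasonable first step is to see what libfabric and UCX report on each node. The following is only a sketch under assumptions: the helper names `run_if_present` and `diagnose_fabric` are made up for illustration, and it assumes the `fi_info`, `ucx_info`, and `ibstat` utilities may or may not be installed on the node.

```shell
#!/bin/sh
# Hypothetical helper: run a diagnostic command only if the tool exists,
# otherwise print a note instead of failing.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 not found in PATH"
    fi
}

# Hypothetical wrapper that collects the fabric-related views in one place.
diagnose_fabric() {
    echo "== libfabric providers =="
    run_if_present fi_info -l       # list the providers libfabric can see
    echo "== UCX devices and transports =="
    run_if_present ucx_info -d      # list devices and transports known to UCX
    echo "== InfiniBand port state =="
    run_if_present ibstat           # show HCA and port status
}

# To get more detail from Intel MPI at startup, the failing command from this
# thread can be rerun with debug output enabled:
#   I_MPI_DEBUG=5 mpirun -np 2 -ppn 1 -hosts lustwzb3,lustwzb4 IMB-P2P
```

Running `diagnose_fabric` on both lustwzb3 and a working node such as lustwzb4 and comparing the output should show whether a tool, provider, or transport is missing on one side.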
All other nodes in this enclosure can run this benchmark without problems. I suspect there is a configuration issue with this rebuilt node (lustwzb3). I next tried running Intel Cluster Checker, and it gave this:
[0]hussaif1@lustwzb3:~/mpi/osu-micro-benchmarks-5.3.2/mpi/pt2pt $ clck -f ./hosts -F mpi_prereq_user
Intel(R) Cluster Checker 2021 Update 4 (build 20210910)
Running Collect
................................................................
Running Analyze
SUMMARY
Command-line: clck -f ./hosts -F mpi_prereq_user
Tests Run: mpi_prereq_user
**WARNING**: 3 tests failed to run. Information may be incomplete. See clck_execution_warnings.log for more information.
Overall Result: Could not run all tests.
--------------------------------------------------------------------------------------------------------------------------------------------------------------
2 nodes tested: lustwzb[3-4]
2 nodes with no issues: lustwzb[3-4]
0 nodes with issues:
--------------------------------------------------------------------------------------------------------------------------------------------------------------
FUNCTIONALITY
No issues detected.
HARDWARE UNIFORMITY
No issues detected.
PERFORMANCE
No issues detected.
SOFTWARE UNIFORMITY
No issues detected.
See the following files for more information: clck_results.log, clck_execution_warnings.log
And here are the contents of the warning file:
[0]hussaif1@lustwzb3:~/mpi/osu-micro-benchmarks-5.3.2/mpi/pt2pt $ cat clck_execution_warnings.log
Intel(R) Cluster Checker 2021 Update 4 (build 20210910)
Command-line: clck -f ./hosts -F mpi_prereq_user
RUNTIME ERRORS
Intel(R) Cluster Checker encountered the following errors during execution:
1. ethtool-data-error
Message: The 'ethtool' provider was executed but did not run successfully
due to an unknown reason. The 'ethtool' data is either not
parsable or the provider did not run correctly. Some ethernet
related analysis may not execute successfully because of this.
Remedy: Please ensure that the 'ethtool' command is installed and
available via $PATH. For more details, use the following
command: 'clckdb --provider ethtool [--db filename]'. If the
data is missing, please collect it by executing clck or
clck-collect with the '-F ethernet' command line option.
2 nodes: lustwzb[3-4]
Test: ethernet
2. infiniband-data-missing
Message: InfiniBand data is not available for analysis.
Remedy: For more details, use the following command: 'clckdb --provider
ibstat --provider lspci --provider ofedinfo [--db filename]'. If
the data is missing, please collect it by executing clck or
clck-collect with the '-F infiniband_user' or '-F
infiniband_admin' command line option. If InfiniBand is not
present, adding an entry to the Intel Cluster Checker
configuration file allows to disable the 'infiniband_user' or
'infiniband_admin' check or to suppress this message.
2 nodes: lustwzb[3-4]
Test: infiniband_base
3. no-data
Message: Data for one or more checks are not available or could not be
parsed correctly. "Test(s):" listed below where affected. Search
the output for the listed test name(s) to verify if they ran
correctly.
Remedy: Verify the correct database is being used and contains valid
data, the 'clckdb' tool can be used to query the database. If
necessary, recollect the missing data or ignore if you know it
is not needed.
2 nodes: lustwzb[3-4]
Test: infiniband_base
--------------------------------------------------------------------------------
Intel(R) Cluster Checker 2021 Update 4
19:17:44 April 6 2022 UTC
Nodefile used: ./hosts
Databases used: $HOME/.clck/2021.4.0/clck.db
[0]hussaif1@lustwzb3:~/mpi/osu-micro-benchmarks-5.3.2/mpi/pt2pt $
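The remedies in the warning file boil down to two steps: confirm the underlying tools are installed, then recollect and inspect the data. A minimal sketch of that follows. The function name `missing_tools` is hypothetical, and it assumes the `ethtool`, `ibstat`, and `lspci` data providers named in the log rely on command-line tools of the same names.

```shell
#!/bin/sh
# Hypothetical helper: print each named tool that is NOT available in PATH,
# one per line. Prints nothing if all of them are present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools named in the warnings above (assumed to back the providers).
missing_tools ethtool ibstat lspci

# Once nothing is reported missing, recollect the InfiniBand data and query
# the database, using the commands quoted in the remedy text:
#   clck -f ./hosts -F infiniband_user
#   clckdb --provider ibstat --provider lspci --provider ofedinfo
```

This would need to be run on each node listed in ./hosts, since the warnings were reported for both lustwzb3 and lustwzb4.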
I solved it by running "yum install -y openmpi3 openmpi3-devel". This pulled in the following packages, including libfabric and ucx as dependencies, and the error went away:
================================================================================
 Package               Arch     Version                      Repository   Size
================================================================================
Installing:
 openmpi3              x86_64   3.1.3-2.el7                  base        2.9 M
 openmpi3-devel        x86_64   3.1.3-2.el7                  base        853 k
Installing for dependencies:
 environment-modules   x86_64   3.2.10-10.el7                base        107 k
 infinipath-psm        x86_64   3.3-26_g604758e_open.2.el7   base        186 k
 libfabric             x86_64   1.7.2-1.el7                  base        536 k
 libpsm2               x86_64   11.2.78-1.el7                base        189 k
 ucx                   x86_64   1.5.2-1.el7                  base        443 k

Transaction Summary
================================================================================
Install  2 Packages (+5 Dependent packages)
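Since the fix turned out to be a set of missing packages, a quicker way to spot this kind of drift on a rebuilt node is to compare its installed package list with that of a working node. The sketch below assumes passwordless ssh and `rpm` on both nodes. The function name `missing_packages` and the file names are hypothetical, and the thread does not confirm which packages the working nodes actually had installed.

```shell
#!/bin/sh
# Hypothetical helper: print packages listed in the first file but absent
# from the second. Both files hold one package name per line.
missing_packages() {
    good_sorted=$(mktemp) || return 1
    rebuilt_sorted=$(mktemp) || { rm -f "$good_sorted"; return 1; }
    sort -u "$1" > "$good_sorted"
    sort -u "$2" > "$rebuilt_sorted"
    # comm -23 keeps only the lines unique to the first (sorted) file
    comm -23 "$good_sorted" "$rebuilt_sorted"
    rm -f "$good_sorted" "$rebuilt_sorted"
}

# Example collection step, using the hostnames from this thread:
#   ssh lustwzb4 "rpm -qa --qf '%{NAME}\n'" > good.txt
#   ssh lustwzb3 "rpm -qa --qf '%{NAME}\n'" > rebuilt.txt
#   missing_packages good.txt rebuilt.txt
```

Any fabric-related name in the output (for example libfabric or ucx) would be a candidate for installation on the rebuilt node.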
Hi,
Glad to know that your issue is resolved. Thanks for sharing the solution with us. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Thanks & Regards,
Varsha