Intel® oneAPI HPC Toolkit

Interpreting Intel cluster checker results

Amit1
Beginner

Hi,

We are trying to track down an issue with the machine “host-e8” when launching MPI jobs on it.

 

A simple MPI ring application is failing with the following error when host-e8 is included in the hostfile.

 

Abort(1615503) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(904)...............:
MPIDI_OFI_mpi_init_hook(1421):
MPIDU_bc_table_create(338)...: Missing hostname or invalid host/port description in business card

 

This application works fine when host-e8 is excluded from the hostfile (machine list).
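For context, the "business card" in this error is the connection-info blob each rank publishes during MPI_Init and exchanges with its peers; the message usually means host-e8 published an address the other ranks could not parse or resolve. A minimal first sanity check is name resolution across the machine list. This is only a sketch; `check_resolve` is a hypothetical helper, and the host list must be replaced with your own:

```shell
# Hedged sketch: check that each machine-list entry resolves from this host.
# check_resolve is a hypothetical helper name, not from the post.
check_resolve() {
  # getent consults the same NSS sources (hosts file, DNS) the runtime uses
  getent hosts "$1" > /dev/null 2>&1 && echo "$1 resolves" || echo "$1 UNRESOLVED"
}
# Example: for h in $(cat hostfile); do check_resolve "$h"; done
```

Running this on every node (not just the launch node) matters, since the business card is published by the remote rank.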

 

To analyze this issue, we tried Intel Cluster Checker, which was recommended in another post on this forum.

I have attached the corresponding cluster checker log with this post.

 

Can you please help us interpret this log? It mostly lists differences between the various hosts specified with “-f (machinelist)”, without really highlighting any issue with host-e8 that would explain this error.

It would also be helpful if you could recommend potential remedies.
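Independent of the Cluster Checker log, one way to localize a failure like this is to bisect the machine list: pair each healthy host with the suspect node and rerun a tiny MPI job per pair. A minimal sketch, assuming a machine list file named `hostfile` and a ring binary `./ring_app` (both placeholder names); the mpirun line is left commented so the loop itself is a dry run:

```shell
# Hedged sketch: pair every healthy host with the suspect node.
bisect_pairs() {   # $1 = machine list file, $2 = suspect host
  grep -v "^$2\$" "$1" | while read -r h; do
    printf '%s\n%s\n' "$h" "$2" > pair_hosts   # two-host file for this trial
    echo "testing pair: $h + $2"
    # mpirun -n 2 -ppn 1 -f pair_hosts ./ring_app || echo "pair $h FAILED"
  done
}
# Example: bisect_pairs hostfile host-e8
```

If every pairing with host-e8 fails, the problem is local to that node; if only some pairings fail, it points at a network path or fabric mismatch between specific hosts.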

 

Thanks,

_Amit

 

22 Replies
SantoshY_Intel
Moderator

Hi,


We are working on your issue and we will get back to you soon.


Thanks & Regards,

Santosh


segmentation_fault
New Contributor I

I am getting the same error:

MPIDU_bc_table_create(370)...: Missing hostname or invalid host/port description in business card
I think it may have to do with the two hosts not running the same OS: one is RHEL 7.6, the other RHEL 7.5. The odd thing is that running a basic hostname works fine:

[me@lustwzb34 pt2pt]$  mpirun -np 2 -ppn 1 -hosts lustwzb34,lustwzb33 hostname
lustwzb34
lustwzb33
[me@lustwzb34 pt2pt]$

Here is my entire output:

[me@lustwzb33 pt2pt]$ mpirun -np 2 -ppn 1 -hosts lustwzb33,lustwzb34 ./osu_latency
[0] MPI startup(): Intel(R) MPI Library, Version 2021.4  Build 20210831 (id: 758087adf)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.0-impi
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: verbs (113.0)
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: tcp (113.0)
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: sockets (113.0)
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: shm (113.0)
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.0)
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: psm2 (113.0)
libfabric:23435:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.0)
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;psm3 layering
libfabric:23435:core:core:ofi_layering_ok():1001<info> Need core provider, skipping ofi_rxm
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;sockets layering
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;shm layering
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;psm3 layering
libfabric:23435:core:core:ofi_layering_ok():1001<info> Need core provider, skipping ofi_rxm
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;sockets layering
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;shm layering
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;psm3 layering
libfabric:23435:core:core:ofi_layering_ok():1001<info> Need core provider, skipping ofi_rxm
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;sockets layering
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;shm layering
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;psm3 layering
libfabric:23435:core:core:ofi_layering_ok():1001<info> Need core provider, skipping ofi_rxm
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;sockets layering
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;shm layering
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority verbs, must_use_util_prov = 1
libfabric:23435:core:core:fi_getinfo_():1161<info> Since verbs can be used, psm2 has been skipped. To use psm2, please, set FI_PROVIDER=psm2
libfabric:23435:core:core:fi_getinfo_():1161<info> Since verbs can be used, mlx has been skipped. To use mlx, please, set FI_PROVIDER=mlx
libfabric:23435:core:core:fi_getinfo_():1161<info> Since verbs can be used, psm3 has been skipped. To use psm3, please, set FI_PROVIDER=psm3
libfabric:23435:core:core:fi_getinfo_():1161<info> Since verbs can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:23435:core:core:fi_getinfo_():1161<info> Since verbs can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:23435:core:core:fi_getinfo_():1161<info> Since verbs can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority tcp, must_use_util_prov = 1
libfabric:23435:core:core:fi_getinfo_():1161<info> Since tcp can be used, psm2 has been skipped. To use psm2, please, set FI_PROVIDER=psm2
libfabric:23435:core:core:fi_getinfo_():1161<info> Since tcp can be used, mlx has been skipped. To use mlx, please, set FI_PROVIDER=mlx
libfabric:23435:core:core:fi_getinfo_():1161<info> Since tcp can be used, psm3 has been skipped. To use psm3, please, set FI_PROVIDER=psm3
libfabric:23435:core:core:fi_getinfo_():1161<info> Since tcp can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:23435:core:core:fi_getinfo_():1161<info> Since tcp can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:23435:core:core:fi_getinfo_():1161<info> Since tcp can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;psm3 layering
libfabric:23435:core:core:ofi_layering_ok():1001<info> Need core provider, skipping ofi_rxm
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;sockets layering
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;shm layering
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;psm3 layering
libfabric:23435:core:core:ofi_layering_ok():1001<info> Need core provider, skipping ofi_rxm
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;sockets layering
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;shm layering
[0] MPI startup(): libfabric provider: mlx
libfabric:23435:core:core:fi_fabric_():1423<info> Opened fabric: mlx
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
Abort(1615503) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1183)..............:
MPIDI_OFI_mpi_init_hook(1968):
MPIDU_bc_table_create(370)...: Missing hostname or invalid host/port description in business card
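One experiment this log suggests: startup settled on the mlx libfabric provider ("[0] MPI startup(): libfabric provider: mlx"), while the earlier fi_getinfo messages show tcp and verbs were also viable. Forcing a simpler provider via FI_PROVIDER (a standard libfabric variable honored by Intel MPI, as the log's own hints note) can show whether the business-card failure follows the fabric or the host pair. A sketch that only builds and prints the candidate commands, since they need both cluster nodes from the post:

```shell
# Hedged sketch: assemble the commands to try; they are printed, not run,
# because they require the two nodes (lustwzb33/lustwzb34) from the post.
cmds=""
for prov in tcp verbs; do
  cmds="${cmds}FI_PROVIDER=$prov I_MPI_DEBUG=10 mpirun -np 2 -ppn 1 -hosts lustwzb33,lustwzb34 ./osu_latency
"
done
printf '%s' "$cmds"
```

If the run succeeds with tcp but still fails with mlx, the mismatch likely sits in the high-speed fabric stack on the RHEL 7.5 vs 7.6 nodes rather than in basic connectivity.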

 

 
