Hi,
We are trying to diagnose a problem with the machine "host-e8" when launching MPI jobs on it.
A simple MPI ring application fails with the following error when host-e8 is included in the hostfile:
Abort(1615503) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(904)...............:
MPIDI_OFI_mpi_init_hook(1421):
MPIDU_bc_table_create(338)...: Missing hostname or invalid host/port description in business card
The application works fine when host-e8 is excluded from the hostfile (machine list).
To analyze the issue, we ran Cluster Checker, which was recommended in another post on this forum.
I have attached the corresponding Cluster Checker log to this post.
Can you please help us interpret this log? It seems to contain mostly the differences between the hosts that were specified with "-f (machinelist)", without highlighting any issue with host-e8 that would explain this error.
It would also be helpful if you could recommend potential remedies.
Thanks,
_Amit
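One cheap check that fits the "Missing hostname" wording of the error is to confirm that every name in the machine list resolves, and that none of them resolves to a loopback address. The sketch below is only an illustration under assumptions not stated in the thread: the machine list is assumed to hold one host per line, optionally followed by ":N", and the helper names (parse_hostfile, check_hosts, suspicious) are hypothetical. Name resolution is only one possible cause, so a clean result does not rule out other problems.

```python
import socket


def parse_hostfile(text):
    """Return host names from machine-list text.

    Assumes one host per line, an optional ':N' suffix, and '#' comments.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name = line.split(":", 1)[0].strip()
        if name:
            hosts.append(name)
    return hosts


def check_hosts(hosts, resolver=socket.gethostbyname):
    """Map each host to its resolved IPv4 address, or None if lookup fails."""
    result = {}
    for host in hosts:
        try:
            result[host] = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            result[host] = None
    return result


def suspicious(resolved):
    """List hosts that did not resolve or that resolved to a loopback address."""
    return [
        host
        for host, addr in resolved.items()
        if addr is None or addr.startswith("127.")
    ]


if __name__ == "__main__":
    import sys

    # Usage: python check_hosts.py <machinelist>
    with open(sys.argv[1]) as fh:
        resolved = check_hosts(parse_hostfile(fh.read()))
    for host, addr in resolved.items():
        print(f"{host}: {addr}")
    print("needs a closer look:", suspicious(resolved))
```

Running it on each node in turn, not just the launch node, would show whether host-e8 sees the other hosts (and itself) differently from the rest of the cluster.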
Hi,
We are working on your issue and we will get back to you soon.
Thanks & Regards,
Santosh
I am getting the same error:
MPIDU_bc_table_create(370)...: Missing hostname or invalid host/port description in business card
I think it may be because the two hosts are not running the same OS release: one is RH 7.6, the other is RH 7.5. Oddly, running a plain hostname across both hosts works fine:
[me@lustwzb34 pt2pt]$ mpirun -np 2 -ppn 1 -hosts lustwzb34,lustwzb33 hostname
lustwzb34
lustwzb33
[me@lustwzb34 pt2pt]$
Here is my entire output:
[me@lustwzb33 pt2pt]$ mpirun -np 2 -ppn 1 -hosts lustwzb33,lustwzb34 ./osu_latency
[0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.0-impi
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: verbs (113.0)
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: tcp (113.0)
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: sockets (113.0)
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:23435:core:core:ofi_hmem_init():209<info> Hmem iface FI_HMEM_ZE not supported
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: shm (113.0)
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: ofi_rxm (113.0)
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: psm2 (113.0)
libfabric:23435:psm3:core:fi_prov_ini():680<info> build options: VERSION=1101.0=11.1.0.0, HAVE_PSM3_src=1, PSM3_CUDA=0
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: psm3 (1101.0)
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: mlx (1.4)
libfabric:23435:core:core:ofi_register_provider():474<info> registering provider: ofi_hook_noop (113.0)
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;psm3 layering
libfabric:23435:core:core:ofi_layering_ok():1001<info> Need core provider, skipping ofi_rxm
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;sockets layering
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;shm layering
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;psm3 layering
libfabric:23435:core:core:ofi_layering_ok():1001<info> Need core provider, skipping ofi_rxm
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;sockets layering
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;shm layering
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;psm3 layering
libfabric:23435:core:core:ofi_layering_ok():1001<info> Need core provider, skipping ofi_rxm
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;sockets layering
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;shm layering
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;psm3 layering
libfabric:23435:core:core:ofi_layering_ok():1001<info> Need core provider, skipping ofi_rxm
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;sockets layering
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;shm layering
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority verbs, must_use_util_prov = 1
libfabric:23435:core:core:fi_getinfo_():1161<info> Since verbs can be used, psm2 has been skipped. To use psm2, please, set FI_PROVIDER=psm2
libfabric:23435:core:core:fi_getinfo_():1161<info> Since verbs can be used, mlx has been skipped. To use mlx, please, set FI_PROVIDER=mlx
libfabric:23435:core:core:fi_getinfo_():1161<info> Since verbs can be used, psm3 has been skipped. To use psm3, please, set FI_PROVIDER=psm3
libfabric:23435:core:core:fi_getinfo_():1161<info> Since verbs can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:23435:core:core:fi_getinfo_():1161<info> Since verbs can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:23435:core:core:fi_getinfo_():1161<info> Since verbs can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority tcp, must_use_util_prov = 1
libfabric:23435:core:core:fi_getinfo_():1161<info> Since tcp can be used, psm2 has been skipped. To use psm2, please, set FI_PROVIDER=psm2
libfabric:23435:core:core:fi_getinfo_():1161<info> Since tcp can be used, mlx has been skipped. To use mlx, please, set FI_PROVIDER=mlx
libfabric:23435:core:core:fi_getinfo_():1161<info> Since tcp can be used, psm3 has been skipped. To use psm3, please, set FI_PROVIDER=psm3
libfabric:23435:core:core:fi_getinfo_():1161<info> Since tcp can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:23435:core:core:fi_getinfo_():1161<info> Since tcp can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:23435:core:core:fi_getinfo_():1161<info> Since tcp can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;psm3 layering
libfabric:23435:core:core:ofi_layering_ok():1001<info> Need core provider, skipping ofi_rxm
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;sockets layering
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;shm layering
libfabric:23435:core:core:fi_getinfo_():1138<info> Found provider with the highest priority psm2, must_use_util_prov = 0
libfabric:23435:core:core:fi_getinfo_():1201<info> Start regular provider search because provider with the highest priority psm2 can not be initialized
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;psm3 layering
libfabric:23435:core:core:ofi_layering_ok():1001<info> Need core provider, skipping ofi_rxm
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;sockets layering
libfabric:23435:core:core:ofi_layering_ok():1007<info> Skipping util;shm layering
[0] MPI startup(): libfabric provider: mlx
libfabric:23435:core:core:fi_fabric_():1423<info> Opened fabric: mlx
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 64, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
Abort(1615503) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1183)..............:
MPIDI_OFI_mpi_init_hook(1968):
MPIDU_bc_table_create(370)...: Missing hostname or invalid host/port description in business card
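Debug output like the above is long, so a small script can help pull out the two facts that matter most when comparing nodes: which libfabric providers were registered and which one the MPI startup banner reports as selected. The following is a minimal sketch that relies only on the line formats visible in the log above; the function name summarize is hypothetical, and other Intel MPI or libfabric versions may format these lines differently.

```python
import re

# Patterns taken from the line formats shown in the log above.
REGISTERED = re.compile(r"registering provider: (\S+)")
SELECTED = re.compile(
    r"^\[(\d+)\] MPI startup\(\): libfabric provider: (\S+)", re.MULTILINE
)


def summarize(log_text):
    """Return (registered providers in first-seen order, {rank: selected provider})."""
    registered = []
    for name in REGISTERED.findall(log_text):
        if name not in registered:
            registered.append(name)
    selected = {int(rank): prov for rank, prov in SELECTED.findall(log_text)}
    return registered, selected


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py <captured_output.txt>
    with open(sys.argv[1]) as fh:
        registered, selected = summarize(fh.read())
    print("registered:", ", ".join(registered))
    for rank, prov in sorted(selected.items()):
        print(f"rank {rank} selected: {prov}")
```

Capturing the output of the same command launched from each host and comparing the summaries would show whether the two nodes end up with different provider lists or a different selected provider, which is one thing worth ruling out when the nodes differ in OS release.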