I have trouble running a hello world program on a cluster. Here is my program:
#include"mpi.h"
#include<iostream>
int main(int argc, char *argv[])
{
int myid,numprocs;
MPI_Status status;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
std::cout<<"process: "<<myid<<" of "<<numprocs<<" hello world"<<std::endl;
MPI_Finalize();
return 0;
}
I compiled it with the GNU C++ wrapper:
mpigxx main.cpp
It runs fine on both host1 and host2 individually when I use
mpirun -n 4 ./a.out
But when I try to run it across both nodes:
mpirun -n 4 -ppn 2 -hosts host1,host2 ./a.out
I get this error:
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(193)........:
MPID_Init(1715)..............:
MPIDI_OFI_mpi_init_hook(1724):
MPIDU_bc_table_create(340)...: Missing hostname or invalid host/port description in business card
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(193)............:
MPID_Init(1715)..................:
MPIDI_OFI_mpi_init_hook(1739)....:
insert_addr_table_roots_only(492): OFI get address vector map failed
Here is some more information with I_MPI_DEBUG=10:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.14 Build 20241121 (id: e7829d6)
[0] MPI startup(): Copyright (C) 2003-2024 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.21.0-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: shm
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(193)........:
MPID_Init(1715)..............:
MPIDI_OFI_mpi_init_hook(1724):
MPIDU_bc_table_create(340)...: Missing hostname or invalid host/port description in business card
Could anyone help me with this problem?
[0] MPI startup(): libfabric provider: shm
Do you know what set this? Can you please post the output of
"export"?
The libfabric provider was selected automatically by MPI. I did not set any MPI-related environment variables other than I_MPI_DEBUG.
host1:
OS: CentOS Linux release 7.5.1804
CPU: Intel Xeon Gold 6248R @ 3.00GHz, 32 cores
GCC: 4.8.5
Intel MPI: Version 2021.14 Build 20241121
host2:
OS: Ubuntu 20.04.6 LTS
CPU: Intel Xeon Gold 6248R @ 3.00GHz, 32 cores
GCC: 9.4.0
Intel MPI: Version 2021.14 Build 20241121
Both are Huawei Cloud servers.
@WangWJ CentOS 7.5 is not supported anymore. Additionally, please use the same OS/SW stack on all nodes.
Thanks. This helps me a lot. I changed the OS of all nodes to Ubuntu 20.04 and set FI_PROVIDER=tcp. It now runs well.
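For anyone who hits the same error, this is roughly the launch that works for me now (a sketch, using the same hosts and binary as in my first post):
export FI_PROVIDER=tcp
mpirun -n 4 -ppn 2 -hosts host1,host2 ./a.out
The provider can also be set per launch with mpirun -genv FI_PROVIDER tcp instead of exporting it.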
1. Make sure that the hostnames host1 and host2 are correctly configured in your /etc/hosts file or DNS, and that the nodes can reach each other.
2. Verify that you can SSH from one node to the other (host1 to host2, and vice versa) without a password; a quick check is sketched after this list. If passwordless SSH isn't set up, MPI won't be able to launch processes across nodes.
3. The error mentions "OFI", which is the network fabric layer (libfabric). Ensure that your cluster nodes have a proper network configuration and can communicate over the correct interfaces.
4. The error could also be caused by a mismatch in MPI versions or configuration. Ensure both nodes are using the same MPI library and version.
5. Your mpirun command looks fine, but you can try a hostfile instead to rule out a syntax issue:
mpirun -n 4 -ppn 2 -hostfile hosts.txt ./a.out
In hosts.txt, list one host per line:
host1
host2
6. Double-check your environment variables (I_MPI_DEBUG, etc.) for any misconfiguration, as they can cause initialization errors.
Try these steps, and if it still doesn't work, more details about your network setup or MPI installation would help narrow down the issue.
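As a quick check of point 2, something like this should run from each node (host names taken from the original post; just a sketch):
ssh host1 hostname
ssh host2 hostname
Both commands should print the remote hostname immediately, without asking for a password.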
