Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.
2245 Discussions

Hello World program cannot run on cluster

WangWJ
Beginner
914 Views

I have trouble running a Hello World program on a cluster. Here is my program:

#include"mpi.h"
#include<iostream>
int main(int argc, char *argv[])
{
    int myid,numprocs;
    MPI_Status status;
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    std::cout<<"process: "<<myid<<" of "<<numprocs<<" hello world"<<std::endl;
    MPI_Finalize();
    return 0;
}

I compiled it with mpigxx:

mpigxx main.cpp

It runs fine on both host1 and host2 individually when I use

mpirun -n 4 ./a.out

But when I try to run on the cluster:

mpirun -n 4 -ppn 2 -hosts host1,host2 ./a.out

it fails with the following error:

Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(193)........:
MPID_Init(1715)..............:
MPIDI_OFI_mpi_init_hook(1724):
MPIDU_bc_table_create(340)...: Missing hostname or invalid host/port description in business card
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(193)............:
MPID_Init(1715)..................:
MPIDI_OFI_mpi_init_hook(1739)....:
insert_addr_table_roots_only(492): OFI get address vector map failed

Here is some more information, with I_MPI_DEBUG=10:

[0] MPI startup(): Intel(R) MPI Library, Version 2021.14  Build 20241121 (id: e7829d6)
[0] MPI startup(): Copyright (C) 2003-2024 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric loaded: libfabric.so.1
[0] MPI startup(): libfabric version: 1.21.0-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: shm
Abort(1615247) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Unknown error class, error stack:
MPIR_Init_thread(193)........:
MPID_Init(1715)..............:
MPIDI_OFI_mpi_init_hook(1724):
MPIDU_bc_table_create(340)...: Missing hostname or invalid host/port description in business card
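
One way to check which providers libfabric can actually use on each node is the fi_info utility that ships with libfabric (a quick check, assuming fi_info is on the PATH from the Intel MPI installation):

fi_info -l        # list every provider libfabric detects on this node
fi_info -p tcp    # show whether the tcp provider is usable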

Could anyone help me with this problem?

Labels (1)
  • MPI

0 Points
1 Solution
WangWJ
Beginner
535 Views

Thanks. This helped me a lot. I changed the OS of all nodes to Ubuntu 20.04 and set FI_PROVIDER=tcp. It now runs well.

View solution in original post

0 Points
7 Replies
TobiasK
Moderator
820 Views

@WangWJ 

[0] MPI startup(): libfabric provider: shm

 

The shm provider is shared-memory only, so it cannot reach a second node. Do you know what set this? Can you please post the output of "export"?

0 Points
WangWJ
Beginner
763 Views

The libfabric provider was selected automatically by MPI. I did not set any other MPI-related environment variables except I_MPI_DEBUG.

0 Points
TobiasK
Moderator
699 Views

@WangWJ 
Can you provide more details on your environment, like OS/HW/SW?

0 Points
WangWJ
초보자
642 조회수

host1:
OS: CentOS Linux release 7.5.1804
CPU: Intel Xeon Gold 6248R @ 3.00 GHz, 32 cores
GCC: 4.8.5
Intel MPI: Version 2021.14 Build 20241121

host2:
OS: Ubuntu 20.04.6 LTS
CPU: Intel Xeon Gold 6248R @ 3.00 GHz, 32 cores
GCC: 9.4.0
Intel MPI: Version 2021.14 Build 20241121

 

Both are Huawei Cloud servers.

0 Points
TobiasK
Moderator
620 Views

@WangWJ CentOS 7.5 is not supported anymore. Additionally, please use the same OS/SW stack on all nodes.

0 Points
WangWJ
Beginner
536 Views

Thanks. This helped me a lot. I changed the OS of all nodes to Ubuntu 20.04 and set FI_PROVIDER=tcp. It now runs well.
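
For reference, the provider can also be forced per run instead of exported globally (a sketch using mpirun's -genv option, with the hostnames from the original post):

mpirun -genv FI_PROVIDER tcp -n 4 -ppn 2 -hosts host1,host2 ./a.out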

0 Points
dusktilldawn
New Contributor I
602 Views

Make sure that the hostnames host1 and host2 are correctly configured in your /etc/hosts file or DNS system, and that they can communicate with each other.

Verify that you can SSH from one node to the other (host1 to host2, and vice versa) without requiring a password. If passwordless SSH isn't set up, MPI won't be able to launch processes across nodes.
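
A quick way to verify this from host1 (hostnames as in the original post):

ssh host2 hostname    # should print host2's name with no password prompt

If a password is still requested, key-based login can be set up with:

ssh-keygen -t rsa     # accept the defaults
ssh-copy-id host2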

The error mentions "OFI" (which is part of the network fabric layer). Ensure that your cluster nodes have proper network configuration and are able to communicate via the correct interfaces.
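
If the nodes have more than one network interface, the tcp provider can also be pinned to a specific one (a sketch; eth0 is an assumed interface name, check yours with ip addr):

export FI_PROVIDER=tcp
export FI_TCP_IFACE=eth0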

The error could also be related to a mismatch in MPI versions or configuration. Ensure both nodes are using the same MPI library and version.

Your mpirun command looks fine, but you can try simplifying it with a machine file to rule out a syntax issue:

mpirun -n 4 -machinefile hosts.txt ./a.out

In hosts.txt, list your hosts with the per-host process count (Intel MPI machine file format):

host1:2
host2:2


Double-check your environment variables (I_MPI_DEBUG, etc.) for any misconfiguration, as they can cause initialization errors.

Try these steps, and if it still doesn't work, providing more details about your network setup or MPI installation might help narrow down the issue.

0 Points