Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI machine file and -host option do not work (No route to host, Network is unreachable)

lmh
Beginner

Hi.

I recently built a cluster for my laboratory and ran into trouble when trying to use Intel MPI on it.

The cluster has 8 nodes, and each node has an Intel Core i9 (Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz).

The nodes are connected by a wired network.

To check that MPI works, I wrote this test program:

 

// hello_mpi.c

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    // Initialize the MPI environment.
    MPI_Init(&argc, &argv);

    // Total number of processes in the job.
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Rank of this process.
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Name of the node this process runs on.
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}

 

 

I compiled the code with:

mpiicc ./hello_mpi.c -o hello_mpi

It compiles without errors.

When I run the program on a single node with:

mpirun ./hello_mpi

it prints:

Hello world from processor node0, rank 2 out of 10 processors
Hello world from processor node0, rank 7 out of 10 processors
Hello world from processor node0, rank 1 out of 10 processors
Hello world from processor node0, rank 3 out of 10 processors
Hello world from processor node0, rank 4 out of 10 processors
Hello world from processor node0, rank 5 out of 10 processors
Hello world from processor node0, rank 9 out of 10 processors
Hello world from processor node0, rank 6 out of 10 processors
Hello world from processor node0, rank 8 out of 10 processors
Hello world from processor node0, rank 0 out of 10 processors

 

 

However, when I try to run it across multiple nodes with:

mpirun -host node0,node1,node2,node3,node4,node5,node6,node7 ./hello_mpi

it prints error messages like these:

Hello world from processor node0, rank 2 out of 80 processors

....

....

 

Other MPI error, error stack:
PMPI_Finalize(214)...............: MPI_Finalize failed
PMPI_Finalize(159)...............:
MPID_Finalize(1280)..............:
MPIDI_OFI_mpi_finalize_hook(1807):
MPIR_Reduce_intra_binomial(142)..:
MPIC_Send(131)...................:
MPID_Send(771)...................:
MPIDI_send_unsafe(220)...........:
MPIDI_OFI_send_normal(398).......:
MPIDI_OFI_send_handler_vci(647)..: OFI tagged send failed (ofi_impl.h:647:MPIDI_OFI_send_handler_vci:No route to host)
Hello world from processor node5, rank 59 out of 80 processors
Abort(810115343) on node 10 (rank 10 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
PMPI_Finalize(214)...............: MPI_Finalize failed
PMPI_Finalize(159)...............:
MPID_Finalize(1280)..............:
MPIDI_OFI_mpi_finalize_hook(1807):
MPIR_Reduce_intra_binomial(142)..:
MPIC_Send(131)...................:
MPID_Send(771)...................:
MPIDI_send_unsafe(220)...........:
MPIDI_OFI_send_normal(398).......:
MPIDI_OFI_send_handler_vci(647)..: OFI tagged send failed (ofi_impl.h:647:MPIDI_OFI_send_handler_vci:Network is unreachable)
Hello world from processor node5, rank 56 out of 80 processors
Abort(810115343) on node 30 (rank 30 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
PMPI_Finalize(214)...............: MPI_Finalize failed
PMPI_Finalize(159)...............:
MPID_Finalize(1280)..............:
MPIDI_OFI_mpi_finalize_hook(1807):
MPIR_Reduce_intra_binomial(142)..:
MPIC_Send(131)...................:
MPID_Send(771)...................:
MPIDI_send_unsafe(220)...........:
MPIDI_OFI_send_normal(398).......:
MPIDI_OFI_send_handler_vci(647)..: OFI tagged send failed (ofi_impl.h:647:MPIDI_OFI_send_handler_vci:No route to host)
Abort(810115343) on node 50 (rank 50 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
PMPI_Finalize(214)...............: MPI_Finalize failed
PMPI_Finalize(159)...............:
MPID_Finalize(1280)..............:
MPIDI_OFI_mpi_finalize_hook(1807):
MPIR_Reduce_intra_binomial(142)..:
MPIC_Send(131)...................:
MPID_Send(771)...................:
MPIDI_send_unsafe(220)...........:
MPIDI_OFI_send_normal(398).......:
MPIDI_OFI_send_handler_vci(647)..: OFI tagged send failed (ofi_impl.h:647:MPIDI_OFI_send_handler_vci:No route to host)

 

 

I suspect a hardware problem, possibly the network switch, but I am not sure.

Can anyone help or suggest what to check?

P.S. Sorry for my poor English.

PrasanthD_intel
Moderator

Hi Lee,


Sorry for the delay in response.

From your output, it looks like MPI is failing because it cannot reach any of the nodes other than node5.

Are you able to ssh to all nodes in your cluster?

As mentioned in the prerequisite steps on this page (Prerequisite Steps (intel.com)): "For communication between cluster nodes, in most cases the Intel MPI Library uses the SSH protocol. You need to establish a passwordless SSH connection to ensure proper communication of MPI processes."
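One way to confirm the passwordless SSH requirement quoted above is sketched below. It assumes a host file with one hostname per line. The function name check_passwordless_ssh and the SSH_CMD override are hypothetical helpers introduced only for this illustration; BatchMode and ConnectTimeout are standard OpenSSH options.

```shell
#!/bin/sh
# Sketch: check that every host in a host file accepts non-interactive SSH.
# SSH_CMD is a hypothetical override so the check can be dry-run with a stub.

check_passwordless_ssh() {
    hostfile=$1
    failed=0
    while IFS= read -r host || [ -n "$host" ]; do
        # Skip blank lines and comment lines.
        case $host in ''|'#'*) continue ;; esac
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        # stdin is redirected so ssh does not swallow the rest of the host file.
        if "${SSH_CMD:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            printf 'OK   %s\n' "$host"
        else
            printf 'FAIL %s\n' "$host"
            failed=1
        fi
    done < "$hostfile"
    return "$failed"
}

# Example (run on the node that launches mpirun):
#   check_passwordless_ssh ~/machine
```

Any host reported as FAIL either prompts for a password or is not reachable, and would need to be fixed before a multi-node mpirun can be expected to work.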


You can use Intel Cluster Checker to look for hardware issues in your cluster. Source the Cluster Checker environment and run clck -f <hostfile> to check for errors. For complete steps on how to load and run it, please refer to Getting Started (intel.com).


Let us know if you face any issues.


Regards

Prasanth


lmh
Beginner

Are you able to ssh to all nodes in your cluster?

>>Yes. I can.

To check this again, I ran this command:

--------------------------------------------------------------------------------------------------------

[minho@node0 Documents]$ for i in 1 2 3 4 5 6 7
> do
> ssh node$i "which mpiicc"
> done
/opt/intel/oneapi/mpi/2021.1.1/bin/mpiicc
/opt/intel/oneapi/mpi/2021.1.1/bin/mpiicc
/opt/intel/oneapi/mpi/2021.1.1/bin/mpiicc
/opt/intel/oneapi/mpi/2021.1.1/bin/mpiicc
/opt/intel/oneapi/mpi/2021.1.1/bin/mpiicc
/opt/intel/oneapi/mpi/2021.1.1/bin/mpiicc
/opt/intel/oneapi/mpi/2021.1.1/bin/mpiicc
[minho@node0 Documents]$

--------------------------------------------------------------------------------------------------------

 

You can use the Intel cluster checker tool to check for any hardware issues in your cluster.

>> Thank you for your comment. I didn't know that kind of tool before!

I wrote my hostfile like this:

--------------------------------------------------------------------------------------------------------

[minho@node0 Documents]$ vim ~/machine

// contents of the file:

node0
node1
node2
node3
node4
node5
node6
node7

--------------------------------------------------------------------------------------------------------

 

Then I ran Cluster Checker and got this output:

 

--------------------------------------------------------------------------------------------------------

[minho@node0 Documents]$ clck -f ~/machine
Intel(R) Cluster Checker 2021 Update 1 (build 20201104)

Running Collect

................................................................................................................................................................................................................
Running Analyze

SUMMARY
Command-line: clck -f /home/minho/machine
Tests Run: health_base
**WARNING**: 1 test failed to run. Information may be incomplete.
See clck_execution_warnings.log for more information.
Overall Result: 2 issues found - PERFORMANCE (1), SOFTWARE UNIFORMITY
(1)
-------------------------------------------------------------------------
8 nodes tested: node[0-7]
0 nodes with no issues:
8 nodes with issues: node[0-7]
-------------------------------------------------------------------------
FUNCTIONALITY
No issues detected.

HARDWARE UNIFORMITY
No issues detected.

PERFORMANCE
The following performance issues were detected:
1. Processes using high CPU.
8 nodes: node[0-7]

SOFTWARE UNIFORMITY
The following software uniformity issues were detected:
1. Environment variables are not uniform across the nodes.
8 nodes: node[0-7]

See the following files for more information: clck_results.log, clck_execution_warnings.log
[minho@node0 Documents]$

 

--------------------------------------------------------------------------------------------------------

When I ran this command, all of the nodes were busy running another job of mine.

I think that is why the "1. Processes using high CPU." message appeared.

However, I do not understand the second one: "1. Environment variables are not uniform across the nodes."

If you could explain this message, it would be a useful clue for solving my problem.

Thank you for your kind reply.

 

Minho.

lmh
Beginner

As I mentioned in my previous message, the 'PERFORMANCE - The following performance issues were detected: 1. Processes using high CPU.' issue was caused by another job that was running at the time.

That calculation has now finished, so I re-checked the cluster with Intel Cluster Checker:

 

[minho@node0 Documents]$ clck -f ~/machine
Intel(R) Cluster Checker 2021 Update 1 (build 20201104)

Running Collect

................................................................................................................................................................................................................
Running Analyze

SUMMARY
Command-line: clck -f /home/minho/machine
Tests Run: health_base
**WARNING**: 1 test failed to run. Information may be incomplete. See clck_execution_warnings.log for more information.
Overall Result: 1 issue found - SOFTWARE UNIFORMITY (1)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
8 nodes tested: node[0-7]
0 nodes with no issues:
8 nodes with issues: node[0-7]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
FUNCTIONALITY
No issues detected.

HARDWARE UNIFORMITY
No issues detected.

PERFORMANCE
No issues detected.

SOFTWARE UNIFORMITY
The following software uniformity issues were detected:
1. Environment variables are not uniform across the nodes.
8 nodes: node[0-7]

See the following files for more information: clck_results.log, clck_execution_warnings.log

 

 

 

As you can see, the performance issue is now gone.

PrasanthD_intel
Moderator

Hi Lee,


There are some inaccuracies in the execution command you have provided:

mpirun -host node0,node1,node2,node3,node4,node5,node6,node7 ./hello_mpi


Here, -host should be the plural -hosts. There are also options for specifying the total process count and the number of processes per node: -n (or -np) and -ppn.

Please use this command instead:

mpirun -hosts node0,node1,node2,node3,node4,node5,node6,node7 -np 80 -ppn 10 ./hello_mpi
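To show how the values in the command above relate to each other (the -np value is the number of hosts multiplied by -ppn), here is a small sketch that assembles the command line from a host file. The function name build_mpirun_cmd is hypothetical, and the host file is assumed to contain one hostname per line. The function only prints the command and does not run it.

```shell
#!/bin/sh
# Sketch: build an "mpirun -hosts ... -np ... -ppn ..." command line
# from a host file and a per-node process count.

build_mpirun_cmd() {
    hostfile=$1
    ppn=$2
    program=$3
    hostlist=''
    count=0
    while IFS= read -r host || [ -n "$host" ]; do
        # Skip blank lines and comment lines.
        case $host in ''|'#'*) continue ;; esac
        # Append the host, comma-separated.
        hostlist=${hostlist:+$hostlist,}$host
        count=$((count + 1))
    done < "$hostfile"
    # Total process count = number of hosts * processes per node.
    printf 'mpirun -hosts %s -np %s -ppn %s %s\n' \
        "$hostlist" "$((count * ppn))" "$ppn" "$program"
}

# Example:
#   build_mpirun_cmd ~/machine 10 ./hello_mpi
```

With the eight-node host file from this thread and 10 processes per node, this prints the same command as the one given above.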


Cluster Checker is reporting "Environment variables are not uniform across the nodes." Please make sure that all MPI-related environment variables are the same across the nodes.
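A rough way to look for such differences by hand is to dump the environment on each node and compare only the MPI-related variables. The sketch below assumes that "MPI-related" means variables starting with I_MPI_ or FI_, which is a simplification chosen for this example, and the function names mpi_env_of and compare_mpi_env are hypothetical.

```shell
#!/bin/sh
# Sketch: compare the MPI-related variables in two saved `env` dumps.

# Print the I_MPI_* and FI_* variables from an env dump, sorted.
mpi_env_of() {
    { grep -E '^(I_MPI_|FI_)[A-Za-z0-9_]*=' "$1" || true; } | sort
}

# Compare two env dumps; returns non-zero if the filtered variables differ.
compare_mpi_env() {
    left=$(mpi_env_of "$1")
    right=$(mpi_env_of "$2")
    if [ "$left" = "$right" ]; then
        echo "MPI-related variables match"
    else
        echo "MPI-related variables differ:"
        printf '%s\n' "--- $1" "$left" "--- $2" "$right"
        return 1
    fi
}

# Example (file names are placeholders):
#   ssh node0 env > node0.env
#   ssh node1 env > node1.env
#   compare_mpi_env node0.env node1.env
```

Note that the environment seen by a non-interactive ssh command can differ from the one in a login shell, so the dumps are only meaningful if they are collected the same way on every node.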

Since you built the cluster yourself, there are a few things to keep in mind before running MPI:

i) All nodes should have MPI installed, in the same location.

ii) All nodes should have the binary in the same location.

iii) A shared filesystem (common storage) across the nodes is preferred.


For complete information, we have an article (Micro-Cluster Setup with Intel® MPI Library for Windows*) on how to set up a mini-cluster. It is written for Windows, but the process is almost the same, so you can refer to it.


Regards

Prasanth


PrasanthD_intel
Moderator

Hi Lee,


Is your problem resolved? Did the given article about setting up micro-clusters help?


Let us know.

Regards

Prasanth


PrasanthD_intel
Moderator

Hi Lee,


We are closing this thread assuming your issue has been resolved.

We will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


Regards

Prasanth


lmh
Beginner

I have finally solved my problem.

I found that MPI works correctly when node0 is left out of the machine file:

 

[minho@node0 Documents]$ vim ~/machine

// contents of the file:

node1
node2
node3
node4
node5
node6
node7

 

From that, I realized that node0 was the cause of the problem.

I solved it with this command: "export FI_TCP_IFACE=enp4s0"

node0 has two Ethernet ports: enp3s0 and enp4s0. I use enp3s0 for the Internet connection and enp4s0 for the internal connection to the other nodes.

I think that setting FI_TCP_IFACE=enp4s0 makes MPI use only enp4s0 for communication with the other nodes.
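For readers with a similar two-port head node, the interface name can be looked up instead of typed by hand. The sketch below assumes the internal network uses addresses starting with 10.0.0. (a made-up prefix, since the thread does not give the real addresses) and parses the one-line-per-address output of "ip -o -4 addr show". The function name iface_for_prefix is hypothetical.

```shell
#!/bin/sh
# Sketch: find the interface that carries the cluster-internal IPv4 network.
# Reads `ip -o -4 addr show` output on stdin, where each line looks like:
#   3: enp4s0    inet 10.0.0.1/24 brd 10.0.0.255 scope global enp4s0

iface_for_prefix() {
    # $1: leading part of the internal IPv4 address, e.g. "10.0.0."
    # Print the name of the first interface whose address starts with it.
    awk -v p="$1" '$3 == "inet" && index($4, p) == 1 { print $2; exit }'
}

# Example (the prefix 10.0.0. is an assumption for illustration):
#   FI_TCP_IFACE=$(ip -o -4 addr show | iface_for_prefix 10.0.0.)
#   export FI_TCP_IFACE
#   mpirun -hosts node0,node1 -np 20 -ppn 10 ./hello_mpi
```

As an alternative to export, the variable can probably also be passed for a single run with mpirun's -genv option, for example "-genv FI_TCP_IFACE enp4s0".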

 

Sorry for my late reply. 

 
