Analyzing Hanging MPI Applications

Amit1 · 03-04-2021

Hello,

I am seeing an issue on a specific network where our Intel MPI-based applications (including an MPI ring application) hang when trying to send messages between certain hosts in that network.

For most of the other host combinations in this network, our applications are running fine.

Even in the hanging runs, the ring application sends messages across the non-problematic hosts without trouble before it hits the hang.

I need help figuring out what could be blocking MPI message passing between these problematic host pairs.
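One way to narrow down a pair-specific hang is to test plain TCP reachability between the two hosts outside of MPI. The following is only a minimal sketch under assumptions: it presumes the MPI traffic runs over TCP (the interconnect is not stated here), and the helper names `serve_once` and `probe` and any port number passed to them are hypothetical, not part of Intel MPI.

```python
import socket


def serve_once(port: int, bind_addr: str = "0.0.0.0", timeout: float = 60.0) -> None:
    """Accept a single TCP connection on `port`, report the peer, then return.

    Intended to run on the target host (for example BrokenPairHost-B).
    Raises an OSError subclass if nobody connects within `timeout` seconds.
    """
    with socket.create_server((bind_addr, port)) as srv:
        srv.settimeout(timeout)
        conn, peer = srv.accept()
        with conn:
            print(f"connection from {peer[0]}")


def probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Intended to run on the other host of the pair (for example BrokenPairHost-A).
    Any failure (refused, unreachable, timed out) is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, running `serve_once(50000)` on one host and `probe("BrokenPairHost-B", 50000)` on the other (50000 is an arbitrary, hypothetical port) shows whether a connection on that port gets through. If `probe` returns False for a broken pair but True for a working pair, that would point toward filtering or routing between the subnets rather than toward MPI itself.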

I have included the “lsb_release”, “netstat -nr” and “traceroute” outputs for these problematic hosts below.

Host : BrokenPairHost-A (lsb_release and netstat)

LSB Version:    core-5.0-amd64:core-5.0-noarch:desktop-5.0-amd64:desktop-5.0-noarch:imaging-5.0-amd64:imaging-5.0-noarch:languages-5.0-amd64:languages-5.0-noarch
Distributor ID: SUSE
Description:    SUSE Linux Enterprise Server 12 SP4
Release:        12.4
Codename:       n/a

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.43.16.1      0.0.0.0         UG        0 0          0 bond0
10.43.16.0      0.0.0.0         255.255.252.0   U         0 0          0 bond0

----------

Host : BrokenPairHost-B (lsb_release and netstat)

LSB Version:    core-5.0-amd64:core-5.0-noarch:desktop-5.0-amd64:desktop-5.0-noarch:imaging-5.0-amd64:imaging-5.0-noarch:languages-5.0-amd64:languages-5.0-noarch
Distributor ID: SUSE
Description:    SUSE Linux Enterprise Server 12 SP4
Release:        12.4
Codename:       n/a

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.23.182.1     0.0.0.0         UG        0 0          0 bond0
10.23.182.0     0.0.0.0         255.255.254.0   U         0 0          0 bond0

----------

Host : BrokenPairHost-A

traceroute to BrokenPairHost-B (10.23.182.60), 30 hops max, 60 byte packets
 1  host1.XXX.com (10.43.16.3) 0.199 ms host2.XXX.com (10.43.16.2) 0.104 ms host1.XXX.com (10.43.16.3) 0.189 ms
 2  host3.XXX.com (10.23.3.22) 0.115 ms 0.202 ms host4.XXX.com (10.23.3.21) 0.182 ms
 3  host5.XXX.com (10.23.253.81) 0.160 ms host6.XXX.com (10.23.253.77) 0.194 ms host7.XXX.com (10.23.253.89) 0.235 ms
 4  host8.XXX.com (10.23.253.106) 0.187 ms host9.XXX.com (10.23.253.122) 0.163 ms host10.XXX.com (10.23.253.126) 0.122 ms
 5  10.29.255.43 (10.29.255.43) 0.206 ms 10.29.255.51 (10.29.255.51) 0.267 ms 10.29.255.41 (10.29.255.41) 0.250 ms
 6  BrokenPairHost-B.XXX.com (10.23.182.60) 0.162 ms 0.246 ms 0.198 ms
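The routing tables above place the two hosts on different subnets, so traffic between them has to go through the default gateways, which is consistent with the six hops in the traceroute. As a minimal sketch, the check below confirms this with Python's standard `ipaddress` module. The network, netmask, gateway and host values are copied from the output above; the variable and function names are only illustrative.

```python
import ipaddress

# Values copied from the "netstat -nr" and traceroute output above.
NET_A = ipaddress.ip_network("10.43.16.0/255.255.252.0")   # bond0 on BrokenPairHost-A
NET_B = ipaddress.ip_network("10.23.182.0/255.255.254.0")  # bond0 on BrokenPairHost-B
GW_A = ipaddress.ip_address("10.43.16.1")                  # default gateway on host A
HOST_B = ipaddress.ip_address("10.23.182.60")              # traceroute destination


def is_directly_reachable(addr, local_net) -> bool:
    """True if `addr` lies inside `local_net`, i.e. no gateway hop is needed."""
    return addr in local_net


if __name__ == "__main__":
    print(NET_A, NET_B)                          # 10.43.16.0/22 10.23.182.0/23
    print(NET_A.overlaps(NET_B))                 # False: the subnets are disjoint
    print(is_directly_reachable(HOST_B, NET_A))  # False: A must route via GW_A
    print(is_directly_reachable(GW_A, NET_A))    # True: the gateway is on A's subnet
```

Since host B is not on host A's local subnet, every MPI connection between the pair crosses the intermediate routers listed in the traceroute, so anything on that path that treats the traffic differently from the traceroute probes would only affect such cross-subnet pairs.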

Any help on this matter would be greatly appreciated.

Thanks,

_Amit

AbhishekD_Intel · 03-08-2021

Hi Amit,

Thanks for reaching out to us.

Please share with us the details of the Intel MPI version and the interconnect you are using.

It appears to be a hardware issue, since the same sample works on other machines in your environment.

Please check whether passwordless SSH is enabled for the nodes on which the run hangs. Also, send us the logs from the Intel Cluster Checker tool, which will help us identify whether the problem lies with any of the nodes in your cluster. Please refer to the link below to collect the Cluster Checker logs.

https://software.intel.com/content/www/us/en/develop/documentation/cluster-checker-user-guide/top/getting-started.html

Please provide the above details so that we can get more insight into your issue.

Warm Regards,

Abhishek

AbhishekD_Intel · 03-15-2021

Hi Amit,

Please give us an update on the requested details.

Amit1 · 03-15-2021

Hi Abhishek,

Thanks a lot for your reply.

This issue is occurring for us in a network that has restricted access.

We will get back to you once we have more information to share with you.

Thanks,
_Amit

AbhishekD_Intel · 03-19-2021

Hi Amit,

Thanks for the update. Do let us know your findings as soon as you have the corresponding details.

Warm Regards,

Abhishek

Parviz · 03-24-2021

Hi Abhishek,

I am following up on Amit's earlier query.

I have installed the Cluster Checker tool and tried to run it on a cluster that does not support passwordless ssh/rsh. I followed the directions in the installation docs, modified the config file to use mpirun instead, and then ran the clck command with the modified config file. It failed. Below is the tail of the output from running the command with the -l debug switch:

The command 'mpirun' has timed out and will be killed

sending terminate signal to process 22294
process 22294 has exited

Any ideas on what I am doing wrong? Is there a way to launch clck through LSF? If so, please provide some details.

One other question: to diagnose the issue that Amit described, what would be the correct data collection option for clck? The documentation recommends:

clck <options> -F mpi_prereq_user

Are there other options?

Thanks in advance for your help.

-Parviz

AbhishekD_Intel · 03-30-2021

Hi Parviz,

Thanks for the details.

We recommend establishing a passwordless SSH connection between the cluster nodes when using Intel MPI. If you are using passwordless SSH, please configure it as described in the prerequisites before using Intel® Cluster Checker.

If you don't have admin privileges, try running simple tests with clck using the flags below:

-Fhealth_base
-Fhealth_user
-Fhealth_extended_user

For debug output, use the -l debug option with clck. To supply the list of nodes from a nodefile, use the -f nodefile option.

The nodefile should list all the nodes on which you want to run the Cluster Checker tests.

Try the command below and check whether there are any issues with the nodes in your cluster.

$ clck -f NODEFILE -l debug -Fhealth_base

Note: NODEFILE contains the list of nodes in your cluster to check.
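As a rough sketch, the small Python helper below assembles the nodefile contents and the clck command line from the options mentioned in this reply (-f, -l debug and the -F framework flags). It only builds the text and the argument list and does not run clck. The function names are hypothetical, and the one-hostname-per-line nodefile layout is an assumption, so please check the user guide for the exact format.

```python
import shlex


def nodefile_text(hosts):
    """Return nodefile contents, assuming one hostname per line."""
    return "".join(f"{host}\n" for host in hosts)


def clck_command(nodefile, framework="health_base", debug=True):
    """Build a clck argument list using only the flags quoted in this thread."""
    cmd = ["clck", "-f", str(nodefile)]
    if debug:
        cmd += ["-l", "debug"]
    cmd.append(f"-F{framework}")
    return cmd


if __name__ == "__main__":
    # Host names taken from the original post; replace with your own nodes.
    print(nodefile_text(["BrokenPairHost-A", "BrokenPairHost-B"]), end="")
    for fw in ("health_base", "health_user", "health_extended_user"):
        print(shlex.join(clck_command("NODEFILE", framework=fw)))
```

The first command printed matches the one shown above: clck -f NODEFILE -l debug -Fhealth_base.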

If you are using a custom config file that is not at the default location, you may need to pass its path to clck with the -c FILE option.

For more details on the options, please refer to the link below.

https://software.intel.com/content/www/us/en/develop/documentation/cluster-checker-user-guide/top/configuring-intel-cluster-checker.html

Try running clck with the flags and options above, check whether it reports any problem with the cluster nodes, and send us the resulting clck logs so that we can get more insight into your issue.

Warm Regards,

Abhishek

AbhishekD_Intel · 04-07-2021

Hi,

Please give us an update on the requested details.

Warm Regards,

Abhishek

AbhishekD_Intel · 04-16-2021

Hi,

As we haven't heard back from you, we are assuming that your issue has been resolved.

We will no longer monitor this thread. If you require any additional assistance from Intel, please start a new thread.

Any further interaction in this thread will be considered community only.

Warm Regards,

Abhishek

Amit1 · 04-16-2021

Hi Abhishek,

Just to let you know, this issue has not been resolved.

We are still trying to get Cluster Checker working in the absence of passwordless SSH.

It would be helpful if you could point us to the relevant documentation on running Cluster Checker when jobs on remote machines can only be launched through grid or LSF.

Thanks & Regards,

__AMIT