Hello,
I am seeing an issue on a specific machine network where our Intel MPI-based applications (including an MPI ring application) hang when trying to send messages between certain hosts in that network.
For most other host combinations in this network, our applications run fine.
Even in the hanging runs, the ring application sends messages fine across the non-problematic hosts before it hits the hang.
I need help figuring out what could possibly be blocking MPI message exchange between these problematic host pairings.
I have included "lsb_release", "netstat -nr", and "traceroute" outputs for the problematic hosts below.
Host : BrokenPairHost-A (lsb_release and netstat)
LSB Version: core-5.0-amd64:core-5.0-noarch:desktop-5.0-amd64:desktop-5.0-noarch:imaging-5.0-amd64:imaging-5.0-noarch:languages-5.0-amd64:languages-5.0-noarch
Distributor ID: SUSE
Description: SUSE Linux Enterprise Server 12 SP4
Release: 12.4
Codename: n/a
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.43.16.1 0.0.0.0 UG 0 0 0 bond0
10.43.16.0 0.0.0.0 255.255.252.0 U 0 0 0 bond0
----------
Host : BrokenPairHost-B (lsb_release and netstat)
LSB Version: core-5.0-amd64:core-5.0-noarch:desktop-5.0-amd64:desktop-5.0-noarch:imaging-5.0-amd64:imaging-5.0-noarch:languages-5.0-amd64:languages-5.0-noarch
Distributor ID: SUSE
Description: SUSE Linux Enterprise Server 12 SP4
Release: 12.4
Codename: n/a
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.23.182.1 0.0.0.0 UG 0 0 0 bond0
10.23.182.0 0.0.0.0 255.255.254.0 U 0 0 0 bond0
----------
Host : BrokenPairHost-A
traceroute to BrokenPairHost-B (10.23.182.60), 30 hops max, 60 byte packets
1 host1.XXX.com (10.43.16.3) 0.199 ms host2.XXX.com (10.43.16.2) 0.104 ms host1.XXX.com (10.43.16.3) 0.189 ms
2 host3.XXX.com (10.23.3.22) 0.115 ms 0.202 ms host4.XXX.com (10.23.3.21) 0.182 ms
3 host5.XXX.com (10.23.253.81) 0.160 ms host6.XXX.com (10.23.253.77) 0.194 ms host7.XXX.com (10.23.253.89) 0.235 ms
4 host8.XXX.com (10.23.253.106) 0.187 ms host9.XXX.com (10.23.253.122) 0.163 ms host10.XXX.com (10.23.253.126) 0.122 ms
5 10.29.255.43 (10.29.255.43) 0.206 ms 10.29.255.51 (10.29.255.51) 0.267 ms 10.29.255.41 (10.29.255.41) 0.250 ms
6 BrokenPairHost-B.XXX.com (10.23.182.60) 0.162 ms 0.246 ms 0.198 ms
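As an additional data point, I can run a small TCP probe between a problematic host pair to rule out basic reachability outside of MPI. This is only a rough check (the hostname and port below are placeholders, and MPI typically uses ephemeral ports), but it at least distinguishes a fully blocked path from something protocol-specific:

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hostname/port are placeholders for a problematic peer):
#   tcp_reachable("BrokenPairHost-B", 22)
```

If such a probe succeeds but MPI still hangs, the next suspects would be firewalling of the ephemeral port range or a path-MTU problem along the multi-hop route shown in the traceroute above.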
Any help on this matter would be greatly appreciated.
Thanks,
_Amit
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Amit,
Thanks for reaching out to us.
Please share with us the details of the Intel MPI version and the interconnect you are using.
This may be a hardware issue, since the same sample works between other machines in your setup.
Please cross-check that passwordless SSH is enabled on the nodes that hang. Also, send us the logs from the Intel® Cluster Checker tool, which will help us identify whether the problem lies with any of the nodes in your cluster. Please refer to the link below to collect the logs with the cluster checker.
Please provide the above details so that we can get more insight into your issue.
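As a quick check for the passwordless SSH prerequisite (a sketch; NODE is a placeholder for each compute node's hostname), batch-mode SSH fails immediately instead of prompting for a password:

```shell
# BatchMode=yes makes ssh fail fast rather than prompt if key-based
# (passwordless) login is not set up for NODE.
ssh -o BatchMode=yes -o ConnectTimeout=5 NODE hostname \
  && echo "passwordless SSH OK" \
  || echo "passwordless SSH NOT working"
```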
Warm Regards,
Abhishek
Hi Amit,
Please give us an update on the requested details.
Hi Abhishek,
Thanks a lot for your reply.
This issue is occurring for us in a network that has restricted access.
We will get back to you once we have more information to share with you.
Thanks,
_Amit
Hi Amit,
Thanks for the update. Do let us know your findings as soon as you have the corresponding details.
Warm Regards,
Abhishek
Hi Abhishek,
I am following up on Amit's earlier query.
I have installed the Cluster Checker tool and tried to run it on a cluster that does not support passwordless ssh/rsh. Following the directions in the installation docs, I modified the config file to use mpirun instead, then ran the clck command with the modified config file. It failed. Below is the tail of the output when running the command with the -l debug switch:
The command 'mpirun' has timed out and will be killed
sending terminate signal to process 22294
process 22294 has exited
Any ideas on what I am doing wrong? Is there a way to launch clck through LSF? If so, please provide some details.
One other question: to diagnose the issue that Amit mentioned, what would be the correct data collection option for clck? The documentation recommends:
Thanks in advance for your help.
-Parviz
Hi Parviz,
Thanks for the details.
It is recommended to establish passwordless SSH connections between the cluster nodes when using Intel MPI. If you are using passwordless SSH, please make sure it is configured correctly as described in the prerequisites before using Intel® Cluster Checker.
If you don't have admin privileges then try running simple tests with the clck using below flags:
- -Fhealth_base
- -Fhealth_user
- -Fhealth_extended_user
For debug info, use the -l debug option with clck; to provide the list of nodes, use the -f nodefile option.
The nodefile should list all the nodes on which the cluster checker tests should run.
You may try the below command and check if there are any issues with the nodes in your cluster.
$ clck -f NODEFILE -l debug -Fhealth_base
Note: NODEFILE contains the list of nodes to check from your cluster
If you are using a custom config file that is not in the default location, pass its path to clck with the -c FILE option.
For more details regarding options please refer to the below link.
Try the options above to check whether there is any problem with the cluster nodes, and send us the resulting clck logs so we can get more insight into your issue.
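If the nodes come from an LSF allocation (as mentioned earlier in the thread), one way to build the NODEFILE is from LSF's LSB_HOSTS environment variable, which lists the allocated hosts once per slot. A small sketch (the environment variable is LSF's; the deduplication and file name are ours, and whether clck's own launcher works without passwordless SSH inside the allocation is a separate question):

```python
import os

def unique_hosts(lsb_hosts):
    """Collapse LSF's LSB_HOSTS value (one entry per slot) into unique
    hostnames, preserving first-seen order."""
    seen = []
    for host in lsb_hosts.split():
        if host not in seen:
            seen.append(host)
    return seen

# Inside a bsub job, LSB_HOSTS holds the allocated hosts; write a nodefile
# that clck can consume via its -f option.
hosts = unique_hosts(os.environ.get("LSB_HOSTS", ""))
with open("nodefile", "w") as f:
    f.write("\n".join(hosts) + "\n")
```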
Warm Regards,
Abhishek
Hi,
Please give us an update on the requested details.
Warm Regards,
Abhishek
Hi,
As we haven't heard back from you, we assume that your issue has been resolved.
We will no longer monitor this thread. If you require any additional assistance from Intel, please start a new thread.
Any further interaction in this thread will be considered community only.
Warm Regards,
Abhishek
Hi Abhishek,
Just to let you know, this issue has not been resolved.
We are still trying to get the cluster checker working in the absence of passwordless SSH.
It would be helpful if you could point us to documentation on running the cluster checker when jobs on remote machines can only be launched through a grid engine or LSF.
Thanks & Regards,
__AMIT