Hi,
We are running Intel Cluster Checker on a cluster with a head node and 28 compute nodes. One of the tests, named "intel_mpi", fails on our cluster. It executes all commands successfully, yet Cluster Checker reports that the Hello World subtest failed with the message 'No one returned Hello World!'.
I have copied the log output for this test case:
1> intel_mpi.debug.1881
output returned by node 2:
Hello world: rank 0 of 8 running on compute-0-2.local
Hello world: rank 1 of 8 running on compute-0-2.local
Hello world: rank 2 of 8 running on compute-0-2.local
Hello world: rank 3 of 8 running on compute-0-2.local
Hello world: rank 4 of 8 running on compute-0-2.local
Hello world: rank 5 of 8 running on compute-0-2.local
Hello world: rank 6 of 8 running on compute-0-2.local
Hello world: rank 7 of 8 running on compute-0-2.local
Same output is received in log for all compute nodes.
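As a rough cross-check of the debug log, a small script can confirm that every rank on each host printed its line. This is only a minimal sketch and not part of Intel Cluster Checker; the function name `missing_ranks` and the regular expression are assumptions based solely on the line format shown above.

```python
import re
from collections import defaultdict

# Matches lines such as
# "Hello world: rank 3 of 8 running on compute-0-2.local"
HELLO_RE = re.compile(r"Hello world: rank (\d+) of (\d+) running on (\S+)")


def missing_ranks(log_text):
    """Return {host: sorted ranks that never reported} for every host seen."""
    seen = defaultdict(set)   # host -> ranks that printed a line
    expected = {}             # host -> total rank count from "of N"
    for match in HELLO_RE.finditer(log_text):
        rank = int(match.group(1))
        size = int(match.group(2))
        host = match.group(3)
        seen[host].add(rank)
        expected[host] = size
    return {
        host: sorted(set(range(expected[host])) - seen[host])
        for host in seen
    }


if __name__ == "__main__":
    # Rebuild the eight lines reported for compute-0-2.local above.
    sample = "\n".join(
        f"Hello world: rank {r} of 8 running on compute-0-2.local"
        for r in range(8)
    )
    print(missing_ranks(sample))
```

An empty list for a host means all of its ranks reported. A host that is absent from the result never printed anything in the log text that was passed in.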
2> intel_mpi-20090226.065009.out
Intel MPI Library (Single-node), (intel_mpi, 1.6)................................................................FAILED
subtest 'MPI Hello World! (I_MPI_DEVICE = rdssm)' failed
- failing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,
compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,
compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,
compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,
compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G returned: 'No one returned Hello World!'
subtest 'MPI Hello World Compilation' passed
- passing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,
compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,
compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,
compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,
compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G returned: 'no compiler warnings or errors'
subtest 'Permissions on $HOME/.mpd.conf' passed
- passing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,
compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,
compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,
compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,
compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G returned: 'permissions = 600'
subtest 'mpd shutdown' passed
- passing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,
compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,
compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,
compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,
compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G
subtest 'mpd startup' passed
- passing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,
compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,
compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,
compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,
compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G
Let me know if more information is required.
Awaiting a response.
Thanks,
Jolly Shah
3 Replies
Hi Jolly,
That particular module simply tests the Intel MPI installation on the cluster by running a "Hello World" program across all nodes. Unfortunately, the information in the output file is not enough to determine the root cause of the problem. All the error says is that none of the other nodes returned "Hello World". This could be because the MPD daemons weren't started on the nodes, or because Intel MPI wasn't able to connect to them, etc.
Could you rerun only this module with debug enabled and provide the output? To do so, you can use:
$ cluster-check --debug --include_only intel_mpi
Thanks,
~Gergana
Hi Jolly,
Just a quick amendment to my earlier post. The Intel Cluster Checker output can be quite verbose, so feel free to attach an output file instead of copying and pasting, or send me a direct e-mail with the file.
Regards,
~Gergana
Quoting - Gergana Slavova (Intel)
Hi Jolly,
Just a quick amendment to my earlier post. The Intel Cluster Checker output can be quite verbose so feel free to attach an outputfile instead of copy and paste, or send me a direct e-mail with the file.
Regards,
~Gergana
Hi Gergana,
Thanks for your reply.
Please find debug logs and output files attached herewith.
1> test.c - Hello World Application used by intel_mpi test
2> intel_mpi.debug.1881 - Debug log which shows all commands are executed successfully
3> intel_mpi-20090226.065009.out - Cluster checker log which shows test failed
It seems that the test executes successfully, but the checker still shows a failed status.
I would like to mention two things:
1> Each node has two NICs, an Intel 1G card and another 10G card. This behavior is observed when running over the 10G interface. With the Intel interface, the test passes.
2> Sometimes (very rarely) the test passes too. Also, a similar test, intel_mpi_rt, passes for me, but it sometimes fails when run consecutively. So it seems to be a timing issue.
Let me know if you need more information.
Thanks in advance,
Jolly Shah