jollyshah
Beginner
53 Views

Issue with Intel cluster checker test intel_mpi

Hi,

We are running Intel Cluster Checker on a cluster with a head node and 28 compute nodes. One of the tests, named "intel_mpi", fails on our cluster. It executes all commands successfully, yet Cluster Checker reports that the Hello World subtest failed with the message 'No one returned Hello World!'.

I have copied the log output for this test case:

1> intel_mpi.debug.1881
output returned by node 2:

Hello world: rank 0 of 8 running on compute-0-2.local
Hello world: rank 1 of 8 running on compute-0-2.local
Hello world: rank 2 of 8 running on compute-0-2.local
Hello world: rank 3 of 8 running on compute-0-2.local
Hello world: rank 4 of 8 running on compute-0-2.local
Hello world: rank 5 of 8 running on compute-0-2.local
Hello world: rank 6 of 8 running on compute-0-2.local
Hello world: rank 7 of 8 running on compute-0-2.local

Same output is received in log for all compute nodes.
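
To double-check that every rank really did report on every node, the debug log can be scanned mechanically. The following is only a minimal sketch, assuming the log lines have exactly the "Hello world: rank N of M running on HOST" form shown above; the function names `summarize_hello` and `missing_ranks` are made up for illustration, and this is not how Cluster Checker itself evaluates the output.

```python
import re
from collections import defaultdict

# Pattern for the lines shown in the debug log above (assumed format).
HELLO_RE = re.compile(r"^Hello world: rank (\d+) of (\d+) running on (\S+)$")


def summarize_hello(log_text):
    """Group reported ranks by host: {host: (expected_size, sorted_ranks)}."""
    ranks = defaultdict(set)
    expected = {}
    for line in log_text.splitlines():
        match = HELLO_RE.match(line.strip())
        if not match:
            continue  # ignore anything that is not a Hello world line
        rank, size, host = int(match.group(1)), int(match.group(2)), match.group(3)
        ranks[host].add(rank)
        expected[host] = size
    return {host: (expected[host], sorted(seen)) for host, seen in ranks.items()}


def missing_ranks(summary):
    """Return {host: [ranks that never reported]} for incomplete hosts only."""
    incomplete = {}
    for host, (size, seen) in summary.items():
        gaps = [rank for rank in range(size) if rank not in seen]
        if gaps:
            incomplete[host] = gaps
    return incomplete


if __name__ == "__main__":
    # Hypothetical sample mirroring the node 2 output quoted above.
    sample = "\n".join(
        f"Hello world: rank {i} of 8 running on compute-0-2.local" for i in range(8)
    )
    summary = summarize_hello(sample)
    print(summary)
    print(missing_ranks(summary))
```

An empty result from `missing_ranks` would only confirm that the ranks printed their greeting; it says nothing about why the checker still marks the subtest as failed. The summary keys are the hostnames as the ranks report them, which makes it easy to compare them against the node names listed in the checker output.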

2> intel_mpi-20090226.065009.out

Intel MPI Library (Single-node), (intel_mpi, 1.6)................................................................FAILED
subtest 'MPI Hello World! (I_MPI_DEVICE = rdssm)' failed
- failing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,
compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,
compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,
compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,
compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G returned: 'No one returned Hello World!'
subtest 'MPI Hello World Compilation' passed
- passing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,
compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,
compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,
compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,
compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G returned: 'no compiler warnings or errors'
subtest 'Permissions on $HOME/.mpd.conf' passed
- passing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,
compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,
compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,
compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,
compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G returned: 'permissions = 600'
subtest 'mpd shutdown' passed
- passing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,
compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,
compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,
compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,
compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G
subtest 'mpd startup' passed
- passing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,
compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,
compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,
compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,
compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G


Let me know if more information is required.

Awaiting a response.

Thanks,
Jolly Shah
3 Replies
Gergana_S_Intel
Employee

Hi Jolly,

That particular module simply tests the Intel MPI installation on the cluster by running a "Hello World" program across all nodes. Unfortunately, the information in the output file is not enough to determine the root cause of the problem. All the error says is that none of the nodes returned "Hello World". This could be because the MPD daemons weren't started on the nodes, or because Intel MPI wasn't able to connect to them, among other possibilities.

Could you rerun only this module with debug enabled and provide the output? To do so, you can use:

$ cluster-check --debug --include_only intel_mpi

Thanks,
~Gergana
Gergana_S_Intel
Employee

Hi Jolly,

Just a quick amendment to my earlier post. The Intel Cluster Checker output can be quite verbose, so feel free to attach an output file instead of copying and pasting, or send me a direct e-mail with the file.

Regards,
~Gergana
jollyshah
Beginner

Hi Gergana,

Thanks for your reply.
Please find the debug logs and output files attached.

1> test.c - Hello World Application used by intel_mpi test

2> intel_mpi.debug.1881 - Debug log which shows all commands are executed successfully

3> intel_mpi-20090226.065009.out - Cluster checker log which shows test failed

It seems that the test executes successfully, but the checker still shows a failed status.
I would like to mention two things:

1> Each node has two NICs, an Intel 1G card and another 10G card. This behavior is observed when running over the 10G interface. With the Intel interface, the test passes.

2> Sometimes (very rarely) the test passes too. Also, a similar test, intel_mpi_rt, passes for me, but it sometimes fails when run consecutively. So it seems to be a timing issue.

Let me know if you need more information.

Thanks in advance,

Jolly Shah