We are running Intel Cluster Checker on a cluster with a head node and 28 compute nodes. One of the tests, named "intel_mpi", fails on our cluster. It executes all commands successfully, yet Cluster Checker reports that the Hello World subtest failed with the message "No one returned Hello World!".

I have copied the log output for this test case:

1> intel_mpi.debug.1881

output returned by node 2:

Hello world: rank 0 of 8 running on compute-0-2.local

Hello world: rank 1 of 8 running on compute-0-2.local

Hello world: rank 2 of 8 running on compute-0-2.local

Hello world: rank 3 of 8 running on compute-0-2.local

Hello world: rank 4 of 8 running on compute-0-2.local

Hello world: rank 5 of 8 running on compute-0-2.local

Hello world: rank 6 of 8 running on compute-0-2.local

Hello world: rank 7 of 8 running on compute-0-2.local

The same output appears in the log for all compute nodes.
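To confirm that every rank really reported on every node, the debug log can be scanned mechanically. The following is a minimal Python sketch, assuming only the line format shown in the excerpt above. The function names `summarize_hello` and `incomplete_hosts` are hypothetical helpers, not part of Intel Cluster Checker.

```python
import re

# Matches the line format shown in the debug log excerpt above.
HELLO_RE = re.compile(r"Hello world: rank (\d+) of (\d+) running on (\S+)")


def summarize_hello(log_text):
    """Return {host: {"expected": size, "ranks": set of ranks seen}}."""
    summary = {}
    for match in HELLO_RE.finditer(log_text):
        rank, size, host = int(match.group(1)), int(match.group(2)), match.group(3)
        entry = summary.setdefault(host, {"expected": size, "ranks": set()})
        entry["ranks"].add(rank)
    return summary


def incomplete_hosts(summary):
    """List hosts where not every rank 0..size-1 printed its line."""
    return sorted(
        host
        for host, entry in summary.items()
        if entry["ranks"] != set(range(entry["expected"]))
    )


# Sample input rebuilt from the eight lines quoted in the post.
sample = "\n".join(
    f"Hello world: rank {r} of 8 running on compute-0-2.local" for r in range(8)
)
summary = summarize_hello(sample)
print(sorted(summary), incomplete_hosts(summary))
```

In practice the text would come from reading the debug file (for example `open("intel_mpi.debug.1881").read()`). An empty list from `incomplete_hosts` would support the observation that the MPI program itself ran to completion.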

2> intel_mpi-20090226.065009.out

Intel MPI Library (Single-node), (intel_mpi, 1.6)................................................................FAILED

subtest 'MPI Hello World! (I_MPI_DEVICE = rdssm)' failed

- failing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,

compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,

compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,

compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,

compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G returned: 'No one returned Hello World!'

subtest 'MPI Hello World Compilation' passed

- passing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,

compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,

compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,

compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,

compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G returned: 'no compiler warnings or errors'

subtest 'Permissions on $HOME/.mpd.conf' passed

- passing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,

compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,

compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,

compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,

compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G returned: 'permissions = 600'

subtest 'mpd shutdown' passed

- passing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,

compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,

compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,

compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,

compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G

subtest 'mpd startup' passed

- passing hosts compute-0-0-10G, compute-0-1-10G, compute-0-10-10G, compute-0-11-10G, compute-0-12-10G,

compute-0-13-10G, compute-0-14-10G, compute-0-15-10G, compute-0-16-10G, compute-0-17-10G, compute-0-18-10G,

compute-0-19-10G, compute-0-2-10G, compute-0-20-10G, compute-0-21-10G, compute-0-22-10G, compute-0-23-10G,

compute-0-25-10G, compute-0-26-10G, compute-0-27-10G, compute-0-3-10G, compute-0-4-10G, compute-0-5-10G,

compute-0-6-10G, compute-0-7-10G, compute-0-8-10G, compute-0-9-10G
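With 27 hosts listed under each subtest, it can be easier to reduce the report to a per-subtest summary. Below is a small Python sketch that does this, assuming the report layout matches the excerpt above (a `subtest '...' passed|failed` line followed by a comma-separated host list and an optional `returned:` message). `parse_report` is a hypothetical helper name, and the layout of other Cluster Checker versions may differ.

```python
import re

# Status lines look like: subtest 'mpd startup' passed
SUBTEST_RE = re.compile(r"subtest '(.+)' (passed|failed)")


def parse_report(report_text):
    """Return {subtest name: {"status": ..., "hosts": [...]}}."""
    results = {}
    # re.split with two groups yields [preamble, name, status, body, name, ...]
    parts = SUBTEST_RE.split(report_text)
    for name, status, body in zip(parts[1::3], parts[2::3], parts[3::3]):
        body = re.sub(r"-\s*(failing|passing) hosts", "", body)
        body = body.split(" returned:")[0]  # drop the trailing message, if any
        hosts = [h.strip() for h in body.replace("\n", " ").split(",") if h.strip()]
        results[name] = {"status": status, "hosts": hosts}
    return results


# Shortened sample in the same layout as the report above.
sample = """subtest 'MPI Hello World! (I_MPI_DEVICE = rdssm)' failed

- failing hosts compute-0-0-10G, compute-0-1-10G,

compute-0-2-10G returned: 'No one returned Hello World!'

subtest 'mpd startup' passed

- passing hosts compute-0-0-10G, compute-0-1-10G,

compute-0-2-10G
"""

report = parse_report(sample)
failed = [name for name, info in report.items() if info["status"] == "failed"]
print(failed, len(report[failed[0]]["hosts"]))
```

Run against the full `.out` file, this gives a quick view of which subtests failed and on how many hosts, which is handy when comparing runs.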

Let me know if more information is required.

Awaiting a response.

Thanks,

Jolly Shah

3 Replies

That particular module simply tests the Intel MPI installation on the cluster by running a "Hello World" program across all nodes. Unfortunately, the information in the output file is not enough to determine the root cause of the problem. All the error says is that none of the nodes returned "Hello World". This could be because the MPD daemons weren't started on the nodes, or because Intel MPI wasn't able to connect to them, etc.

Could you rerun only this module with debug enabled and provide the output? To do so, you can use:

`$ cluster-check --debug --include_only intel_mpi`

Thanks,

~Gergana

Just a quick amendment to my earlier post. The Intel Cluster Checker output can be quite verbose, so feel free to attach an output file instead of copying and pasting, or send me a direct e-mail with the file.

Regards,

~Gergana

Quoting - Gergana Slavova (Intel)

*Hi Jolly,*

Just a quick amendment to my earlier post. The Intel Cluster Checker output can be quite verbose so feel free to attach an outputfile instead of copy and paste, or send me a direct e-mail with the file.

Regards,

~Gergana

Hi Gergana,

Thanks for your reply.

Please find debug logs and output files attached herewith.

1> test.c - Hello World Application used by intel_mpi test

2> intel_mpi.debug.1881 - Debug log which shows all commands are executed successfully

3> intel_mpi-20090226.065009.out - Cluster checker log which shows test failed

It seems that the test executes successfully, but the checker still reports a failure.

I would like to mention two things:

1> Each node has two NICs: an Intel 1G card and another 10G card. This behavior is observed when running over the 10G interface. With the Intel interface, the test passes.

2> Sometimes (very rarely) the test passes too. Also, a similar test, intel_mpi_rt, passes for me, but when run consecutively it sometimes fails as well. So it seems to be a timing issue.
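Since the failure is intermittent, it may help to measure how often it occurs by repeating the single-module run suggested earlier in the thread. The sketch below is one way to do that in Python. It assumes that the string `FAILED` (as seen in the `.out` report) also appears in the checker's console output when the module fails, which this thread does not confirm. `run_check` and `count_failures` are hypothetical helper names.

```python
import subprocess

# Command taken from the reply above; adjust if your setup needs extra options.
CHECK_CMD = ["cluster-check", "--debug", "--include_only", "intel_mpi"]


def run_check(cmd=CHECK_CMD):
    """Run the checker once and return its combined stdout/stderr as text."""
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return proc.stdout


def count_failures(runs, runner=run_check):
    """Run the check `runs` times; return (failure count, per-run outcomes).

    ASSUMPTION: a run counts as failed when its output contains "FAILED".
    """
    outcomes = ["FAILED" in runner() for _ in range(runs)]
    return sum(outcomes), outcomes


# Demonstration with a stand-in runner, so no cluster is needed to try it.
fake_outputs = iter(["intel_mpi ... FAILED", "intel_mpi ... ok", "intel_mpi ... FAILED"])
failures, outcomes = count_failures(3, runner=lambda: next(fake_outputs))
print(failures, outcomes)
```

On the head node, `count_failures(10)` would run the real command ten times. Comparing the failure rate between the 1G and 10G interfaces would give firmer evidence for or against a timing problem.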

Let me know if you need more information.

Thanks in advance,

Jolly Shah