Intel® MPI Library

Alternate format in "-machinefile" causes problems

Geoff_Hall
Beginner
1,719 Views

I was using -machinefile to direct the number of processes to run on different hosts. It works when I use the compact form, but fails when I use the expanded form. Both formats (I believed) are equally valid.

The command to run the processes is identical for the two cases:

mpiexec -l -genvnone -machinefile mpimach.txt -n 5 hello.exe

Machinefile versions:

Works:

- - - - - - - - - - - - - - - - -
geoff:3
study:2
- - - - - - - - - - - - - - - - -

Fails:

- - - - - - - - - - - - - - - - -
geoff
study
geoff
study
geoff
- - - - - - - - - - - - - - - - -

I was attempting to use the expanded form to ensure the processes would be evenly distributed between the two hosts when n was < 5.
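One possible way to get the same even spread while staying with the compact form is to generate a machinefile for each value of n. The following is only a sketch: the function name `balanced_machinefile` is made up for illustration, and it assumes the compact `host:count` syntax shown above is the only format needed.

```python
def balanced_machinefile(hosts, n):
    """Return compact 'host:count' lines spreading n processes over hosts.

    Hypothetical helper: each host is listed exactly once, so the file
    never returns to a host that was already mentioned.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not hosts:
        raise ValueError("need at least one host")
    # Every host gets `base` processes; the first `extra` hosts get one more.
    base, extra = divmod(n, len(hosts))
    lines = []
    for index, host in enumerate(hosts):
        count = base + (1 if index < extra else 0)
        if count:  # leave out hosts that would receive no processes
            lines.append(f"{host}:{count}")
    return lines


if __name__ == "__main__":
    for n in (2, 4, 5):
        print(n, balanced_machinefile(["geoff", "study"], n))
```

The returned lines could be written to mpimach.txt before each run. Note that this places consecutive ranks on the same host, which is not the rank ordering the interleaved file would have produced.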

The symptoms of the failure are that geoff has three copies of hello.exe running simultaneously, study has two copies of hello.exe running simultaneously, and none of the programs complete. The behaviour is as though there is a deadlock (deadly embrace). If I kill any one of the five programs, the whole suite terminates - with error, naturally enough.

Can anyone shed any light as to what is happening?

TIA

Cheers, Geoff

13 Replies
James_T_Intel
Moderator
Hi Geoff,

I'm not sure why the second machine file form isn't working; it should be. I'm setting up some test systems to try to replicate the scenario.

As a side note, with a typical MPI job, if one application ends, all of them will be ended. There are ways to end a single process in the job, but it must be handled within the application so that it safely exits the job. This requires disconnecting the processes from each other. If you want more information on this, I would recommend starting with Chapter 10 of the MPI 2.2 Standard (available at http://mpi-forum.org/docs/docs.html), specifically section 10.5.4.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Geoff_Hall
Beginner
Hi James,

Thanks for looking into the -machinefile issue.

As you stated in another thread (and as I have observed myself), I knew that all apps would end when one ended badly. Our "real" app will be able to deal with that situation well because it traps most/all situations that could cause a "crash". Your info above that there are other ways of dealing with such situations is new. Thanks. I'll do some (more) reading!

Cheers, Geoff
Dmitry_K_Intel2
Employee
Hi Geoff,

You probably need to look at chapter 5 in the Intel MPI Reference Manual - this is about fault tolerance.

Regards!
Dmitry

James_T_Intel
Moderator
Hi Geoff,

I've reproduced what you've seen. For additional clarification, if you use a machinefile such as:

[plain]geoff
geoff
study
study
[/plain]

You should see no problems. The error occurs when you switch hosts and then return to one previously listed, as in:

[plain]geoff
study
geoff
study
[/plain]

This will cause a hang. This also happens when you are using a configuration file that changes back to an earlier host. I will be filing a defect report on this behavior.
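Until a fix is available, a machinefile could be checked for this pattern before launching. The sketch below is an illustration only: the function names are hypothetical, and it assumes each non-blank line is either `host` or `host:count`, as in the examples in this thread. It reports hosts that are listed again after a different host, and regroups the entries so that each host appears once.

```python
def parse_machinefile(text):
    """Parse 'host' or 'host:count' lines into (host, count) pairs."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        host, sep, count = line.partition(":")
        entries.append((host, int(count) if sep else 1))
    return entries


def revisited_hosts(entries):
    """Return hosts that are listed again after a different host."""
    seen = set()
    revisited = []
    previous = None
    for host, _count in entries:
        if host != previous and host in seen and host not in revisited:
            revisited.append(host)
        seen.add(host)
        previous = host
    return revisited


def regroup(entries):
    """Merge entries so each host appears once, in first-seen order."""
    totals = {}
    for host, count in entries:
        totals[host] = totals.get(host, 0) + count
    return [f"{host}:{count}" for host, count in totals.items()]


if __name__ == "__main__":
    interleaved = parse_machinefile("geoff\nstudy\ngeoff\nstudy\n")
    print(revisited_hosts(interleaved))  # hosts that trigger the pattern
    print(regroup(interleaved))          # compact form, one line per host
```

Regrouping keeps the number of processes per host but changes which ranks land on which host, so it only suits jobs that do not depend on the interleaved rank order.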

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Geoff_Hall
Beginner
Yes James, that's the situation.
Thanks for submitting the bug report. Any (approximate) idea when we might see a fix?

->sigh<- I guess I'll have to find a workaround in the meantime!

Cheers, Geoff
Geoff_Hall
Beginner
Hi James,

As an addendum to the above (it might be additional information on the same bug, or something different), I've noticed that if I use a machinefile with

geoff:2
study:2

and -n 5 (i.e. one more than the number of host processors specified in the machinefile) the whole job hangs as well.
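A small pre-flight check can flag this case before mpiexec is started. This is a sketch under assumptions: the helper names are invented for illustration, a bare `host` line is counted as one process, and nothing here reflects how mpiexec itself handles the extra process.

```python
def total_slots(machinefile_text):
    """Sum the process counts listed in 'host' or 'host:count' lines."""
    total = 0
    for raw in machinefile_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        _host, sep, count = line.partition(":")
        total += int(count) if sep else 1  # bare host counts as one
    return total


def check_process_count(machinefile_text, n):
    """Return a warning string if n exceeds the listed total, else None."""
    slots = total_slots(machinefile_text)
    if n > slots:
        return f"-n {n} exceeds the {slots} processes listed in the machinefile"
    return None


if __name__ == "__main__":
    machinefile = "geoff:2\nstudy:2\n"
    print(check_process_count(machinefile, 5))
    print(check_process_count(machinefile, 4))
```

With the machinefile above, the check warns for -n 5 and stays silent for -n 4.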

Cheers, Geoff
James_T_Intel
Moderator
Hi Geoff,

I do not have a timeline for a fix at the moment. I will add this information to the report.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Geoff_Hall
Beginner
Thanks James.
Geoff
James_T_Intel
Moderator
Hi Geoff,

Please check the firewall on both computers and make sure both are allowing the program you are running through.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Geoff_Hall
Beginner
Hi James,

I had had to check that.

1. To get it to work in Win7 I had to turn off Windows Firewall, even though I had "allow" defined for both smpd and mpiexec. Don't know what's going on there. By my understanding it should be working. WinXP firewall is managed by Sophos and that works fine.

2. When the host names don't repeat themselves the mpiexec job runs/works, so I assume the firewall settings are satisfactory (but as I said, I have to turn off Win7 Windows Firewall).

3. So the hostnames interleaving themselves is the only difference when the job fails.

Cheers, Geoff
James_T_Intel
Moderator
Hi Geoff,

In my Windows* 7 firewall, I was able to allow smpd, mpiexec, and the program I was running through, but leave the rest enabled. I did this on both computers, and it worked.

What mechanism is used by the Sophos* firewall (port, program, packet inspection, etc.)? Would you be willing (and able) to turn it completely off temporarily to try running with both firewalls disabled?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Geoff_Hall
Beginner
Hi James,

Windows 7 (the 'geoff' computer) Firewall is the problem for me. I have worked around the issue by turning it off already - but it can't be a permanent solution. By the way, the firewall problem is that it stops an mpiexec job initiated on another machine (the 'study' computer) from running on this machine. It doesn't stop mpiexec from running on 'geoff' and sending jobs to 'study'.

Sophos, on the WinXP machine ('study'), is fine. I've added smpd and mpiexec to the "checksums in the firewall definition" (that's all I have to do) and it works as I would expect it to. Is it causing or contributing to this (thread topic) problem? No. I checked (turned it off) and the results were unchanged from what I described at the start of this thread.

Cheers, Geoff
James_T_Intel
Moderator
Hi Geoff,

Thanks for checking the firewalls.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools