I was using -machinefile to direct the number of processes to run on different hosts. It works when I use the compact form, but fails when I use the expanded form. Both formats (I believed) are equally valid.
The command to run the processes is identical for the two cases:
mpiexec -l -genvnone -machinefile mpimach.txt -n 5 hello.exe
machinefile version:
Works:
- - - - - - - - - - - - - - - - -
geoff:3
study:2
- - - - - - - - - - - - - - - - -
Fails:
- - - - - - - - - - - - - - - - -
geoff
study
geoff
study
geoff
- - - - - - - - - - - - - - - - -
I was attempting to use the expanded form to ensure the processes would be evenly distributed between the two hosts when n was < 5.
The symptoms of the failure are that geoff has three copies of hello.exe running simultaneously, study has two copies of hello.exe running simultaneously, and none of the programs complete. The behaviour is as though there is a deadlock (deadly embrace). If I kill any one of the five programs, the whole suite terminates - with error, naturally enough.
Can anyone shed any light as to what is happening?
TIA
Cheers, Geoff
I'm not sure why the second machine file form isn't working; it should be. I'm setting up some test systems to try to replicate the scenario.
As a side note, with a typical MPI job, if one application ends, all of them will be ended. There are ways to end a single process in the job, but it must be handled within the application so that it safely exits the job. This requires disconnecting the processes from each other. If you want more information on this, I would recommend starting with Chapter 10 of the MPI 2.2 Standard (available at http://mpi-forum.org/docs/docs.html), specifically section 10.5.4.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Thanks for looking into the -machinefile issue.
As you had stated to me in another thread (and from my own observations), I knew that all apps would end when one ended badly. Our "real" app will be able to deal with that situation well because it traps most/all situations that could cause a "crash". Your info above that there are other ways of dealing with such situations is new. Thanks. I'll do some (more) reading!
Cheers, Geoff
You probably need to look at Chapter 5 of the Intel MPI Reference Manual, which is about fault tolerance.
Regards!
Dmitry
I've reproduced what you've seen. For additional clarification, if you use a machinefile such as:
geoff
geoff
study
study
You should see no problems. The error is when you have switched hosts and return to one previously mentioned, as:
geoff
study
geoff
study
This will cause a hang. This also happens when you are using a configuration file that changes back to an earlier host. I will be filing a defect report on this behavior.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
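Until a fix is available, one possible workaround is to generate a compact machinefile for each value of n, so that no host name reappears after a different host. The sketch below is only an illustration: the function name `compact_machinefile` is hypothetical, and it assumes that mpiexec assigns ranks to the first n entries of an expanded machinefile in order, which is the behaviour the original poster was relying on but which this thread does not confirm.

```python
from collections import Counter


def compact_machinefile(hosts, n):
    """Return 'host:count' lines covering the first n entries of an expanded host list.

    Assumption: mpiexec would place ranks on the first n entries, in order.
    """
    if n > len(hosts):
        raise ValueError("n exceeds the number of entries in the host list")
    # Counter is a dict subclass, so hosts stay in first-seen order (Python 3.7+).
    counts = Counter(hosts[:n])
    return [f"{host}:{count}" for host, count in counts.items()]


# The interleaved list from the first post, compacted for two values of n.
interleaved = ["geoff", "study", "geoff", "study", "geoff"]
print("\n".join(compact_machinefile(interleaved, 5)))
print("\n".join(compact_machinefile(interleaved, 3)))
```

Writing the returned lines to mpimach.txt before each run keeps the even spread across the two hosts for small n while avoiding the interleaved pattern that triggers the hang.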
Thanks for submitting the bug report. Any (approximate) idea of when we might see a fix?
->sigh<- I guess I'll have to find a workaround in the meantime!
Cheers, Geoff
As an addendum to the above (it might be part of the same bug, additional information, or something different): I've noticed that if I use a machinefile with
geoff:2
study:2
and -n 5 (i.e. one more than the number of host processors specified in the machinefile) the whole job hangs as well.
Cheers, Geoff
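A pre-flight check can catch the case above before mpiexec is launched. This is a minimal sketch, not part of any Intel tool: the helper names `total_slots` and `fits` are hypothetical, and it assumes a machinefile contains only bare host names (one slot each) or `host:count` lines, the two forms shown in this thread.

```python
def total_slots(machinefile_text):
    """Sum the process slots listed in a machinefile.

    Assumes each non-blank line is either 'host' (one slot) or 'host:count'.
    """
    total = 0
    for raw in machinefile_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        _host, sep, count = line.partition(":")
        total += int(count) if sep else 1
    return total


def fits(machinefile_text, n):
    """Return True if n processes fit within the slots the machinefile lists."""
    return n <= total_slots(machinefile_text)


# The case reported above: four slots listed, five processes requested.
print(fits("geoff:2\nstudy:2\n", 5))
```

Whether requesting more processes than listed slots is meant to be supported is not settled in this thread; the check only flags the combination reported to hang.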
I do not have a timeline for a fix at the moment. I will add this information to the report.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Geoff
Please check the firewall on both computers and make sure both are allowing the program you are running through.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
I had already had to check that.
1. To get it to work in Win7 I had to turn off Windows Firewall, even though I had "allow" defined for both smpd and mpiexec. Don't know what's going on there. By my understanding it should be working. WinXP firewall is managed by Sophos and that works fine.
2. When the host names don't repeat themselves the mpiexec job runs/works, so I assume the firewall settings are satisfactory (but as I said, I have to turn off Win7 Windows Firewall).
3. So the hostnames interleaving themselves is the only difference when the job fails.
Cheers, Geoff
In my Windows* 7 firewall, I was able to allow smpd, mpiexec, and the program I was running through, but leave the rest enabled. I did this on both computers, and it worked.
What mechanism is used by the Sophos* firewall (port, program, packet inspection, etc.)? Would you be willing (and able) to turn it completely off temporarily to try running with both firewalls disabled?
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Windows 7 (the 'geoff' computer) Firewall is the problem for me. I have worked around the issue by turning it off already - but it can't be a permanent solution. By the way, the firewall problem is that it stops an mpiexec job initiated on another machine (the 'study' computer) from running on this machine. It doesn't stop mpiexec from running on 'geoff' and sending jobs to 'study'.
Sophos, on the WinXP machine ('study'), is fine. I've added smpd and mpiexec to the "checksums in the firewall definition" (that's all I have to do) and it works as I would expect it to. Is it causing or contributing to this (thread topic) problem? No. I checked (turned it off) and the results were unchanged from what I described at the start of this thread.
Cheers, Geoff
Thanks for checking the firewalls.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools