Hello,
I've recently set up a new cluster that uses Slurm for resource allocation. When I start a new MPI job with:
$ salloc -n 32 sh
$ mpirun -np 32 -nolocal a.out
I get the following errors:
failed to connect to the socket (sock2): {socket.gaierror, (-2, 'Name or service not known')}. Probable reason: host "node2" is invalid
(mpdboot 494): failed to connect to the socket (sock2): {socket.error, (9, 'Bad file descriptor')}. Probable reason: host "node2" is invalid
totalnum=3 numhosts=2
there are not enough hosts on which to start all processes
I then checked to see if slurm nodelist is set properly with:
$ echo $SLURM_NODELIST
node[01-02]
and to confirm it's Intel MPI:
$ which mpirun
/opt/intel/impi/4.0.0.028/intel64/bin/mpirun
The nodes below 10 have a leading zero in their hostname. It looks like mpirun is dropping this leading zero in the hostname.
Am I missing something really simple? Is there a straightforward fix?
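To make the leading-zero issue concrete, here is a minimal sketch of how a compressed nodelist such as node[01-02] expands when the zero padding is preserved. The function name expand_nodelist is hypothetical and it only handles a single prefix[lo-hi] range; it is not part of Slurm or Intel MPI and is not the fix for the mpirun bug. On a real Slurm cluster, `scontrol show hostnames "$SLURM_NODELIST"` performs the same expansion and covers the full nodelist syntax.

```shell
# Hypothetical helper: expand a single Slurm-style range such as node[01-02]
# into one hostname per line, keeping the zero padding of the lower bound.
# Anything that is not a simple prefix[lo-hi] range is printed unchanged.
expand_nodelist() {
    spec=$1
    case $spec in
        *\[*-*\]) ;;
        *) printf '%s\n' "$spec"; return 0 ;;
    esac

    prefix=${spec%%\[*}      # text before the first '['
    range=${spec#*\[}        # text after the first '['
    range=${range%\]}        # drop the trailing ']'
    lo=${range%%-*}
    hi=${range#*-}

    # Both bounds must be non-empty and purely numeric.
    for part in "$lo" "$hi"; do
        case $part in
            ''|*[!0-9]*) printf '%s\n' "$spec"; return 0 ;;
        esac
    done

    width=${#lo}             # padding width, e.g. 2 for "01"

    # Strip leading zeros so values like 08 are not read as octal.
    i=$(printf '%s' "$lo" | sed 's/^0*\([0-9]\)/\1/')
    last=$(printf '%s' "$hi" | sed 's/^0*\([0-9]\)/\1/')

    while [ "$i" -le "$last" ]; do
        printf "%s%0${width}d\n" "$prefix" "$i"
        i=$((i + 1))
    done
}

expand_nodelist 'node[01-02]'
```

Each name in the padded output can then be checked against name resolution (for example with `getent hosts`), which makes the contrast with the unpadded "node2" in the error messages easy to see.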
3 Replies
Hi ccladmin,
>It looks like mpirun is dropping this leading zero in the hostname.
Unfortunately, this is a bug in mpirun.
Could you rename your nodes so that the numbering starts from 100 (or maybe from 1000)? That way no hostname has a leading zero, and it's the easiest workaround for this issue so far.
Regards!
Dmitry
ccladmin,
Could you submit a tracker through Premier Support so that I can provide you with a patch?
Regards!
Dmitry
Dmitry,
I'm glad to know it was a known problem.
Thanks for your assistance. I have submitted the tracker in Premier Support as you requested.