Intel MPI version 4.1.3, Slurm version 2.6.9-1
I am trying to follow the Intel MPI documentation to run a job under Slurm with -bootstrap jmi, but I
get the error messages shown below:
salloc -N 1 :
export I_MPI_HYDRA_JMI_LIBRARY=/opt/intel/impi/4.1.3/lib/intel64/lib/libjmi_slurm.so
mpiexec.hydra -bootstrap slurm -n 2 hostname ## << this works
mpiexec.hydra -bootstrap jmi -n 2 hostname ## <<this does not work
srun: error: Unable to create job step: Requested node configuration is not available
srun: error: Unable to create job step: Requested node configuration is not available
If I look at the Slurm logs, it is trying to get a node assignment for the FQDN of the node, even though
I only use short names in slurm.conf. Not sure if this has anything to do with the JMI/Slurm interaction.
If I use
I_MPI_PMI_LIBRARY=/opt/slurm/14.03.1-2/lib64/libpmi.so
srun -n 2 mympiprog
it works too.
Can you provide the output from the -bootstrap jmi version with I_MPI_HYDRA_DEBUG=1?
Hi James
Attaching the debug output. The key part seems to be that srun requests the node by its FQDN:
[jmi-slurm@builder] Launch arguments: srun --nodelist builder.hpc8888.com -N 1 -n 1 ./hello
The resulting error message is:
srun: error: Unable to create job step: Requested node configuration is not available
Slurm log:
[2014-05-08T21:29:02.760] sched: job_complete for JobId=133 successful, exit code=0
[2014-05-08T21:29:09.647] sched: _slurm_rpc_allocate_resources JobId=135 NodeList=builder,ruchba usec=13039
[2014-05-08T21:29:31.848] sched: _slurm_rpc_job_step_create: StepId=135.0 builder,ruchba usec=7566
[2014-05-08T21:29:31.974] sched: _slurm_rpc_step_complete StepId=135.0 usec=11970
[2014-05-08T21:29:47.668] sched: _slurm_rpc_job_step_create: StepId=135.1 builder,ruchba usec=14242
[2014-05-08T21:29:47.750] sched: _slurm_rpc_step_complete StepId=135.1 usec=13498
[2014-05-08T21:29:52.374] sched: _slurm_rpc_job_step_create: StepId=135.2 builder,ruchba usec=15330
[2014-05-08T21:29:52.416] error: find_node_record: lookup failure for builder.hpc8888.com
[2014-05-08T21:29:52.416] error: node_name2bitmap: invalid node specified builder.hpc8888.com
[2014-05-08T21:29:52.416] _pick_step_nodes: invalid node list builder.hpc8888.com
[2014-05-08T21:29:52.416] _slurm_rpc_job_step_create for job 135: Requested node configuration is not available
[2014-05-08T21:29:52.419] error: find_node_record: lookup failure for builder.hpc8888.com
[2014-05-08T21:29:52.419] error: node_name2bitmap: invalid node specified builder.hpc8888.com
[2014-05-08T21:29:52.419] _pick_step_nodes: invalid node list builder.hpc8888.com
[2014-05-08T21:29:52.419] _slurm_rpc_job_step_create for job 135: Requested node configuration is not available
[2014-05-08T21:29:52.446] sched: _slurm_rpc_step_complete StepId=135.2 usec=11696
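The lookup failures in the log can be reproduced without Slurm by comparing names as plain strings. The sketch below is only an illustration: `short_name` and `in_nodelist` are hypothetical helper names, and it assumes the allocation is a plain comma-separated list like the `builder,ruchba` shown in the log (a compressed list such as `node[01-04]` would first need expanding, for example with `scontrol show hostnames`).

```shell
# Hypothetical helpers, not part of Intel MPI or Slurm.

# Strip the domain part of a host name: builder.hpc8888.com -> builder
short_name() {
    printf '%s\n' "${1%%.*}"
}

# Return 0 if the name appears verbatim in a comma-separated node list.
in_nodelist() {
    case ",$2," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

requested=builder.hpc8888.com   # name passed to srun --nodelist in the debug output
allocated=builder,ruchba        # NodeList reported in the Slurm log

if in_nodelist "$requested" "$allocated"; then
    echo "$requested is in the allocation"
else
    echo "$requested is not in the allocation"
fi

if in_nodelist "$(short_name "$requested")" "$allocated"; then
    echo "$(short_name "$requested") is in the allocation"
fi
```

The FQDN fails the check while the short name passes, which matches the "invalid node specified" lines above.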
If you are logged in on the nodes, what does hostname return? If it returns the FQDN, can you change it to return only the short name?
Both nodes return the short name using hostname.
slurm.conf refers to the nodes using short names as well.
Can you try this with the Intel® MPI Library 5.0 Beta? If you're not already registered, go to http://bit.ly/sw-dev-tools-2015-beta for details.
Updated to Slurm 14.03.3-2 and Intel Cluster Studio XE beta; IMPI is v5.0.0.016.
Unfortunately, I get exactly the same error messages with JMI.
srun is being invoked with the fqdn of the node, and Slurm responds with "invalid node specified".
[jmi-slurm@builder] Launch arguments: srun --nodelist builder.hpc8888.com -N 1 -n 1 ./hello
[jmi-slurm@ruchba] Launch arguments: srun --nodelist ruchba.hpc8888.com -N 1 -n 1 ./hello
JMI chooses to use the FQDN, yet Slurm's allocation shows the short names:
SLURM_JOB_NODELIST=builder,ruchba
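To compare what JMI passes to srun against the allocation, the `--nodelist` value can be pulled out of a "Launch arguments" debug line. This is a rough sketch: `nodelist_arg` is a hypothetical helper name, and it assumes the debug line has the same shape as the two lines quoted above.

```shell
# Hypothetical helper: print the value that follows --nodelist in a
# Hydra "Launch arguments" debug line (empty output if the flag is absent).
nodelist_arg() {
    printf '%s\n' "$1" | sed -n 's/.*--nodelist[ =]\([^ ]*\).*/\1/p'
}

line='[jmi-slurm@ruchba] Launch arguments: srun --nodelist ruchba.hpc8888.com -N 1 -n 1 ./hello'
echo "srun was asked for: $(nodelist_arg "$line")"
echo "allocation contains: builder,ruchba"
```

Inside a real allocation, the second value would come from `$SLURM_JOB_NODELIST` instead of the literal used here.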
The other two methods of invocation still work, i.e.,
I_MPI_PMI_LIBRARY=/opt/slurm/slurm/lib64/libpmi.so srun -n 4 hello
mpiexec.hydra -bootstrap slurm -n 4 ./hello
We're submitting this to our developers for further investigation.
I have been informed by support that this will be fixed in Intel MPI 5.0.2.