Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel MPI with JMI and Slurm: Requested node configuration is not available

Beaver6675
Novice

Intel MPI version 4.1.3, Slurm version 2.6.9-1

I am trying to follow the Intel MPI documentation to run a job under Slurm with -bootstrap jmi, but I am getting the error message shown below:

salloc -N 1 :

export I_MPI_HYDRA_JMI_LIBRARY=/opt/intel/impi/4.1.3/lib/intel64/lib/libjmi_slurm.so

mpiexec.hydra -bootstrap slurm -n 2 hostname ## << this works

mpiexec.hydra -bootstrap jmi -n 2 hostname ## <<this does not work

srun: error: Unable to create job step: Requested node configuration is not available

srun: error: Unable to create job step: Requested node configuration is not available

If I look at the Slurm logs, it is trying to get a node assignment for the FQDN of the node, even though I only use short names in slurm.conf. I am not sure if this has anything to do with the JMI/Slurm interaction.

If I use

I_MPI_PMI_LIBRARY=/opt/slurm/14.03.1-2/lib64/libpmi.so

srun -n 2 mympiprog

it works too.

James_T_Intel
Moderator

Can you provide the output from the -bootstrap jmi version with I_MPI_HYDRA_DEBUG=1?

Beaver6675
Novice

Hi James,

I am attaching the debug output; the key part seems to be that the node is requested by FQDN:

[jmi-slurm@builder] Launch arguments: srun --nodelist builder.hpc8888.com -N 1 -n 1 ./hello

- the node is requested by its FQDN

- the error message is:

srun: error: Unable to create job step: Requested node configuration is not available

Slurm log

[2014-05-08T21:29:02.760] sched: job_complete for JobId=133 successful, exit code=0
[2014-05-08T21:29:09.647] sched: _slurm_rpc_allocate_resources JobId=135 NodeList=builder,ruchba usec=13039
[2014-05-08T21:29:31.848] sched: _slurm_rpc_job_step_create: StepId=135.0 builder,ruchba usec=7566
[2014-05-08T21:29:31.974] sched: _slurm_rpc_step_complete StepId=135.0 usec=11970
[2014-05-08T21:29:47.668] sched: _slurm_rpc_job_step_create: StepId=135.1 builder,ruchba usec=14242
[2014-05-08T21:29:47.750] sched: _slurm_rpc_step_complete StepId=135.1 usec=13498
[2014-05-08T21:29:52.374] sched: _slurm_rpc_job_step_create: StepId=135.2 builder,ruchba usec=15330
[2014-05-08T21:29:52.416] error: find_node_record: lookup failure for builder.hpc8888.com
[2014-05-08T21:29:52.416] error: node_name2bitmap: invalid node specified builder.hpc8888.com
[2014-05-08T21:29:52.416] _pick_step_nodes: invalid node list builder.hpc8888.com
[2014-05-08T21:29:52.416] _slurm_rpc_job_step_create for job 135: Requested node configuration is not available
[2014-05-08T21:29:52.419] error: find_node_record: lookup failure for builder.hpc8888.com
[2014-05-08T21:29:52.419] error: node_name2bitmap: invalid node specified builder.hpc8888.com
[2014-05-08T21:29:52.419] _pick_step_nodes: invalid node list builder.hpc8888.com
[2014-05-08T21:29:52.419] _slurm_rpc_job_step_create for job 135: Requested node configuration is not available
[2014-05-08T21:29:52.446] sched: _slurm_rpc_step_complete StepId=135.2 usec=11696
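
The rejected node names can be pulled out of a slurmctld log like the one above with a short script. This is only a sketch: it assumes the "node_name2bitmap: invalid node specified" message format shown in this log, and the function name invalid_nodes is made up for illustration.

```python
import re

# Matches the slurmctld error line seen in the log above (format assumed from that log).
INVALID_NODE = re.compile(r"node_name2bitmap: invalid node specified (\S+)")


def invalid_nodes(log_text):
    """Return the distinct node names slurmctld rejected, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = INVALID_NODE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Three lines copied from the log above.
sample = """\
[2014-05-08T21:29:52.416] error: find_node_record: lookup failure for builder.hpc8888.com
[2014-05-08T21:29:52.416] error: node_name2bitmap: invalid node specified builder.hpc8888.com
[2014-05-08T21:29:52.419] error: node_name2bitmap: invalid node specified builder.hpc8888.com
"""

print(invalid_nodes(sample))  # ['builder.hpc8888.com']
```

Any name reported here that contains a domain suffix, while slurm.conf lists only short names, points to the same mismatch.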

 

 

James_T_Intel
Moderator

If you are logged in on the nodes, what does hostname return?  If it returns the FQDN, can you change it to return only the short name?

Beaver6675
Novice

Both nodes return the short name using hostname.

slurm.conf refers to the nodes using short names as well.
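
Even when the hostname command prints the short name, a resolver lookup on the same node can still return the FQDN. The sketch below compares the two using Python's standard socket module. Whether JMI obtains node names this way is not established in this thread, and name_report is a hypothetical helper.

```python
import socket


def name_report(short=None, fqdn=None):
    """Compare the kernel hostname with the resolver's fully qualified name.

    Both arguments default to live lookups on the current node; passing
    explicit values lets the comparison be checked without any lookup.
    """
    short = short if short is not None else socket.gethostname()
    fqdn = fqdn if fqdn is not None else socket.getfqdn()
    return {"hostname": short, "fqdn": fqdn, "differs": short != fqdn}


# Explicit values taken from this thread; call name_report() with no
# arguments on each compute node to see the live result.
print(name_report("builder", "builder.hpc8888.com"))
```

If "differs" is true on a node, any tool that relies on the resolver rather than the hostname command would see the long name.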

James_T_Intel
Moderator

Can you try this with the Intel® MPI Library 5.0 Beta?  If you're not already registered, go to http://bit.ly/sw-dev-tools-2015-beta for details.

Beaver6675
Novice

Updated to Slurm 14.03.3-2 and Intel Cluster Studio XE beta; IMPI is v5.0.0.016.

Unfortunately, I get exactly the same error messages with JMI.

srun is being invoked with the fqdn of the node, and Slurm responds with "invalid node specified".

[jmi-slurm@builder] Launch arguments: srun --nodelist builder.hpc8888.com -N 1 -n 1 ./hello

[jmi-slurm@ruchba] Launch arguments: srun --nodelist ruchba.hpc8888.com -N 1 -n 1 ./hello

JMI chooses to use the FQDN, yet Slurm's allocation shows the short names:

SLURM_JOB_NODELIST=builder,ruchba

The other two methods of invocation still work, i.e.,

I_MPI_PMI_LIBRARY=/opt/slurm/slurm/lib64/libpmi.so srun -n 4 hello

mpiexec.hydra -bootstrap slurm -n 4 ./hello
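
The mismatch described above can be checked mechanically by comparing the --nodelist value in a JMI "Launch arguments" debug line with SLURM_JOB_NODELIST. This is a minimal sketch under two assumptions: the debug line has the format shown in this post, and the allocation is a plain comma-separated list (compressed ranges such as node[01-04] are not expanded). The function names are hypothetical.

```python
import re


def requested_node(launch_line):
    """Extract the --nodelist value from a JMI 'Launch arguments' debug line."""
    match = re.search(r"--nodelist\s+(\S+)", launch_line)
    return match.group(1) if match else None


def diagnose(launch_line, job_nodelist):
    """Classify the requested node against a plain comma-separated allocation."""
    allocated = job_nodelist.split(",")  # assumption: no bracketed ranges
    node = requested_node(launch_line)
    if node is None:
        return "no --nodelist found"
    if node in allocated:
        return "match"
    # Strip the domain part and retry: a hit means only the naming differs.
    if node.split(".", 1)[0] in allocated:
        return "fqdn-mismatch"
    return "unknown-node"


line = "[jmi-slurm@builder] Launch arguments: srun --nodelist builder.hpc8888.com -N 1 -n 1 ./hello"
print(diagnose(line, "builder,ruchba"))  # fqdn-mismatch
```

A result of "fqdn-mismatch" means the node is in the allocation under its short name, so srun rejects the request only because of the domain suffix.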

 

 

James_T_Intel
Moderator

We're submitting this to our developers for further investigation.

Beaver6675
Novice

I have been informed by support that this will be fixed in Intel MPI 5.0.2.
