Paul_E_1
Beginner
87 Views

Bug in Intel MPI 4.1.0.024 with slurm-2.5.4

It looks like there is a bug in the way Intel MPI interacts with SLURM.  I had the following host list in SLURM_JOB_NODELIST:

itc[011-012,021,101]

Other MPI implementations, such as OpenMPI, have had no problems interpreting this.  However, when Intel MPI used that node list, it tried to find itc017.  That isn't even a valid hostname, let alone one in that host list.

I wrote a script to bypass this and generate the correct host list and explicitly pass it to Intel MPI.  However, it would be better to fix this inside of Intel MPI itself.
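For readers who need a similar workaround, here is a minimal sketch of expanding such a host list by hand. It is not the script mentioned above; the function names `strip_zeros` and `expand_hostlist` are hypothetical, and it only handles the simple form `prefix[a-b,c,...]` with a single bracket group. Where SLURM's `scontrol show hostnames "$SLURM_JOB_NODELIST"` is available, that is probably the more robust choice.

```shell
#!/bin/sh
# Hypothetical workaround sketch; not the script from the original post.
# Handles only the simple form prefix[a-b,c,...] with one bracket group.

# Strip every leading zero (keeping a lone "0") so the number is read as decimal.
strip_zeros() {
    n=$1
    while [ "${#n}" -gt 1 ] && [ "${n#0}" != "$n" ]; do
        n=${n#0}
    done
    printf '%s\n' "$n"
}

# Print one host name per line for a list such as itc[011-012,021,101].
# The body runs in a subshell so the IFS change does not leak out.
expand_hostlist() (
    list=$1
    case $list in
        *\[*\])
            prefix=${list%%\[*}
            ranges=${list#*\[}
            ranges=${ranges%\]}
            IFS=,
            for r in $ranges; do
                first=${r%-*}
                last=${r#*-}
                width=${#first}          # keep the original zero padding
                i=$(strip_zeros "$first")
                hi=$(strip_zeros "$last")
                while [ "$i" -le "$hi" ]; do
                    printf "%s%0${width}d\n" "$prefix" "$i"
                    i=$((i + 1))
                done
            done
            ;;
        *)
            # No bracket group: treat the argument as a single host name.
            printf '%s\n' "$list"
            ;;
    esac
)

expand_hostlist 'itc[011-012,021,101]'
```

The output, one host per line, can be written to a file and handed to the MPI launcher explicitly instead of letting it parse SLURM_JOB_NODELIST itself.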

4 Replies
James_T_Intel
Moderator

Hi Paul,

Thank you for this report.  Can you please run a test program (the provided MPI test programs will work perfectly) with I_MPI_DEBUG=5 and send the output?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Paul_E_1
Beginner

Sure, here is the output.  The host list was: itc[011-012,092,101]

/n/sw/intel_cluster_studio-2013/impi-4.1.0.024/bin64/mpirun: line 262: printf: 092: invalid octal number
srun: error: Unable to create job step: Requested node configuration is not available
[mpiexec@itc011.rc.fas.harvard.edu] HYD_pmcd_pmiserv_send_signal (./pm/pmiserv/pmiserv_cb.c:221): assert (!closed) failed
[mpiexec@itc011.rc.fas.harvard.edu] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:128): unable to send SIGUSR1 downstream
[mpiexec@itc011.rc.fas.harvard.edu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@itc011.rc.fas.harvard.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:388): error waiting for event
[mpiexec@itc011.rc.fas.harvard.edu] main (./ui/mpich/mpiexec.c:718): process manager error waiting for completion

ccnhpc
Beginner

Hi,

This is a bug in mpirun (6000024691). The printf command in the mpirun script interprets host numbers that begin with 0 as octal.

Lines 262 and 626 of the mpirun script:

${base_name}%0${first_node_length}d" ${host}`

should be replaced by

${base_name}%0${first_node_length}d" ${host#0}`

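The ${host#0} expansion removes one leading '0' from the host number, so a value such as 092 is no longer interpreted as octal. Note that it strips only a single zero: a number with two leading zeros, such as 008, becomes 08 and would still be read as octal.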
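To illustrate the octal issue outside of mpirun, here is a minimal sketch. The helper name `pad_host` and its arguments are hypothetical and not taken from the mpirun script; it strips every leading zero before formatting, which is one possible way to cover numbers like 008 as well.

```shell
#!/bin/sh
# Hypothetical helper, not the code from the mpirun script.
# With a %d format, the shell's printf reads a leading 0 as octal:
#   printf "%03d\n" 021   prints 017
#   printf "%03d\n" 092   fails with "invalid octal number"

# Format a zero-padded host name after stripping all leading zeros.
# The body runs in a subshell so its variables stay local.
pad_host() (
    base_name=$1
    width=$2
    num=$3
    # Remove leading zeros one at a time, but keep a lone "0".
    while [ "${#num}" -gt 1 ] && [ "${num#0}" != "$num" ]; do
        num=${num#0}
    done
    printf "%s%0${width}d\n" "$base_name" "$num"
)

pad_host itc 3 021   # itc021
pad_host itc 3 092   # itc092
pad_host itc 3 008   # itc008
```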

Regards,

Bruno

James_T_Intel
Moderator

Please try using Version 4.1 Update 3.  This problem should be corrected.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
