It looks like there is a bug in the way Intel MPI interacts with SLURM. I had the following host list in SLURM_JOB_NODELIST:
itc[011-012,021,101]
Other MPI implementations such as OpenMPI have had no problem interpreting this. However, when Intel MPI used that node list it tried to find itc017, which is not even a valid hostname, let alone one in that host list.
I wrote a script to work around this by generating the correct host list and passing it explicitly to Intel MPI, but it would be better to fix this inside Intel MPI itself.
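For reference, a minimal sketch of that kind of workaround, assuming scontrol is available on the launch node (the machine file name and program name here are just placeholders):

    # Expand the bracketed SLURM nodelist into one hostname per line
    scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.$SLURM_JOB_ID
    # Hand the expanded list to Intel MPI explicitly instead of letting it
    # parse SLURM_JOB_NODELIST itself
    mpirun -machinefile hosts.$SLURM_JOB_ID -n $SLURM_NTASKS ./my_mpi_program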
Hi Paul,
Thank you for this report. Can you please run a test program (the provided MPI test programs will work perfectly) with I_MPI_DEBUG=5 and send the output?
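For example, something along these lines (the test source ships under the test directory of the Intel MPI installation; the process count here is just illustrative):

    # Build one of the provided test programs and run it with debug output
    mpiicc $I_MPI_ROOT/test/test.c -o mpi_test
    I_MPI_DEBUG=5 mpirun -n 4 ./mpi_test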
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
Sure, here is the output. The host list was itc[011-012,092,101].
/n/sw/intel_cluster_studio-2013/impi-4.1.0.024/bin64/mpirun: line 262: printf: 092: invalid octal number
srun: error: Unable to create job step: Requested node configuration is not available
[mpiexec@itc011.rc.fas.harvard.edu] HYD_pmcd_pmiserv_send_signal (./pm/pmiserv/pmiserv_cb.c:221): assert (!closed) failed
[mpiexec@itc011.rc.fas.harvard.edu] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:128): unable to send SIGUSR1 downstream
[mpiexec@itc011.rc.fas.harvard.edu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@itc011.rc.fas.harvard.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:388): error waiting for event
[mpiexec@itc011.rc.fas.harvard.edu] main (./ui/mpich/mpiexec.c:718): process manager error waiting for completion
Hi,
This is a bug in mpirun (tracked as 6000024691). The printf command is in the mpirun script itself; host numbers beginning with 0 are converted as octal.
Lines 262 and 626 of the mpirun script:
${base_name}%0${first_node_length}d" ${host}`
should be replaced by
${base_name}%0${first_node_length}d" ${host#0}`
${host#0} removes a leading '0' from the host number, so it is no longer interpreted as octal.
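A quick way to see the behavior and the effect of the fix outside of mpirun (a standalone sketch; the $((10#...)) line is just an alternative that handles any number of leading zeros):

    host=021
    printf "itc%03d\n" ${host}        # prints itc017 because 021 is read as octal
    printf "itc%03d\n" ${host#0}      # prints itc021 once the leading zero is stripped
    host=092
    printf "itc%03d\n" ${host#0}      # prints itc092 instead of "invalid octal number"
    printf "itc%03d\n" $((10#$host))  # alternative: force base-10 interpretation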
Regards,
Bruno
Please try using Version 4.1 Update 3. This problem should be corrected.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
