I have a Fortran code that uses both MPI and OpenMP. I am trying to run it on a cluster running Red Hat Enterprise Linux Server 7.8 with Intel Parallel Studio XE 2020 (1.217) installed. The system uses Sun Grid Engine as the job scheduler. I can successfully submit my job on some of the newer nodes, but I keep getting the following message when I try it on the older nodes:
/net/ihn02/opt/intel/compilers_and_libraries_2020.1.217/linux/mpi/intel64/bin/mpiexec.hydra
[mpiexec@cn28] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on cn08 (pid 3745, exit code 65280)
[mpiexec@cn28] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@cn28] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@cn28] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:770): error waiting for event
[mpiexec@cn28] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1956): error setting up the boostrap proxies
I checked that all the nodes are running the same OS version, and the Intel tools are installed in a shared location that all the nodes can access. Does anyone know what might cause this failure? Occasionally I can get a successful run, but it fails most of the time (I would say >95% failure). Thanks in advance for any help.
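For reference, a hybrid MPI/OpenMP job like this is typically submitted to SGE with a script roughly like the sketch below; the parallel environment name, slot counts, and binary name are placeholders, not my actual settings, and the mpivars.sh path is only assumed from the mpiexec.hydra location above.

#!/bin/bash
#$ -N hybrid_mpi_omp        # job name (placeholder)
#$ -pe mpi 32               # SGE parallel environment and slot count (placeholders)
#$ -cwd
# Source the Intel MPI environment (path assumed to sit next to mpiexec.hydra above):
source /net/ihn02/opt/intel/compilers_and_libraries_2020.1.217/linux/mpi/intel64/bin/mpivars.sh
export OMP_NUM_THREADS=4    # OpenMP threads per MPI rank (placeholder)
mpiexec.hydra -np 8 ./my_fortran_app    # 8 ranks x 4 threads = 32 slots; binary name is a placeholder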
I am also facing the same issue with a TensorFlow application running on GPUs. I am using Intel MPI Library 2019 Update 7 for Linux on RTX 5000 GPUs. I have also tried setting export UCX_TLS="knem,rc", but I still hit this error most of the time.
Any suggestions on this issue?
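(In case it is useful for debugging: the UCX transports and devices actually available on a node can be listed with ucx_info, which may show whether knem and rc are even usable there.)

ucx_info -d    # lists the UCX transports/devices available on the current node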
Unfortunately, there is no answer yet. Our system administrator said he would take a look at it, but I have not heard back from him yet.
For anyone coming from Google with this problem: it turned out that the cluster's job scheduler was placing the job on a basic node without InfiniBand. Modifying the submission script to request standard nodes solved the issue. Our admins also mentioned that if you do want to run MPI on basic nodes without InfiniBand, you can use export UCX_TLS=tcp.
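For example, the relevant job-script changes on an SGE cluster might look roughly like this; the queue name is only a placeholder for whatever your site calls its InfiniBand-equipped nodes, and the launch line is an example:

#$ -q standard.q                      # request the node class that has InfiniBand (placeholder name)
# ...or, to run on basic nodes without InfiniBand, force UCX onto TCP:
export UCX_TLS=tcp
mpiexec.hydra -np $NSLOTS ./my_app    # $NSLOTS is filled in by SGE; binary name is a placeholder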
The solution that worked for me for these exact same error messages was setting the environment variable I_MPI_HYDRA_IFACE to the InfiniBand interface, e.g.:
export I_MPI_HYDRA_IFACE="ib0"
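A quick way to confirm the interface name on a node before exporting the variable (the launch line below is only an example):

ip addr show ib0                  # check that the InfiniBand interface exists and is up
export I_MPI_HYDRA_IFACE="ib0"    # tell the Hydra process manager which interface to use
mpirun -np 16 ./my_app            # example launch; rank count and binary name are placeholders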
Thank you for that!
We had the same problem: the master node had no InfiniBand connection, and InfiniBand was only available between the HPC nodes. That setting worked for us.
Looks like it is running now.
