On the Lomonosov-2 supercomputer (http://hpc.msu.ru/node/159, partition "pascal") with Intel MPI (IMPI) version 2019.4.243, mpiexec.hydra occasionally fails to start, especially at full ppn (ppn=12) and with a large number of nodes in the task.
The problem is actually related to the parallel filesystem on this machine, which produces occasional delays under higher parallel load. The message is:
[proxy:0:0@n54104.10p.parallel.ru] HYD_spawn (../../../../../src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:117): execvp error on file ./testapp (Input/output error)
There are two observations related to this:
(1) If we move the executable from this parallel filesystem to local storage (for example, by copying it to the node-local /tmp/ directory before launch), the failure disappears.
(2) With other MPI implementations, startup on this machine always works well.
The suggestion is to change the hydra_spawn.c code and introduce a retry loop around execvp(), so that the exec() syscall can be attempted several times when it fails with a potentially recoverable error such as "i/o error".
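To make the idea concrete, here is a minimal sketch of such a retry loop in C. The function name, the set of errno values treated as recoverable, and the back-off policy are all illustrative assumptions, not taken from the actual hydra_spawn.c sources:

```c
/* Hypothetical sketch of a retry loop around execvp().
 * exec_with_retries(), the errno list, and the back-off delay are
 * illustrative assumptions, not code from hydra_spawn.c. */
#include <errno.h>
#include <unistd.h>

static int exec_with_retries(const char *file, char *const argv[],
                             int max_attempts, unsigned delay_us)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        execvp(file, argv);             /* returns only on failure */
        if (errno != EIO && errno != EAGAIN && errno != ETXTBSY)
            break;                      /* treat as non-recoverable */
        usleep(delay_us);               /* back off before retrying */
    }
    return -1;                          /* exec never succeeded */
}
```

On success execvp() does not return, so the loop body is only reached on failure; the retry is attempted only for errors that might plausibly be transient filesystem conditions.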
What is your opinion on the topic?
--
Regards,
Alexey
Hi Alexey,
Thanks for reaching out to us!
Could you please provide the following details to help us debug your issue:
- The parallel filesystem you are using
- Since you mentioned that "mpiexec.hydra occasionally fails to start", how often you are facing this issue
- The error logs you get when it fails (you can collect this information by setting the environment variables I_MPI_DEBUG=10 and I_MPI_HYDRA_DEBUG=on)
Regarding 'The suggestion is to change hydra_spawn.c code and introduce some loop with attempts around execvp() to be able to make several attempts of exec() syscall if there is kind of recoverable error state like "i/o error".'
Thanks for the suggestion, we will discuss this with the concerned internal team.
Thanks & Regards
Goutham
Hi Goutham,
Some details:
1) The parallel filesystem on Lomonosov-2 that causes the problems is Lustre. As an ordinary user I can only guess a few details of its configuration: it is mounted with the flags "rw,localflock,lazystatfs" and it uses the "o2ib" LND in some way (presumably over Mellanox hardware). For more details I'll need to ask the sysadmins to expand on the topic. Please let me know if that is required.
2) I experimented just now: for 8-node, 14-ppn startups, all 3 out of 3 attempts failed with the same "Input/output error" diagnostics. The difference is in which ranks showed the error message: ranks 71 and 73 in the first attempt, 72 in the second, 74 in the third. It seems to be rather random.
I used to work around this by copying the executable to local /tmp/ on each node before launch, but that workaround has now stopped working due to some system configuration changes related to /tmp. So IMPI 2019 is currently unusable on Lomonosov-2.
3) With debug options in env:
---
mpiexec.hydra output:
---
++ mpiexec.hydra -np 112 -ppn 14 --errfile-pattern=err.1179112.%r --outfile-pattern=out.1179112.%r ./simple-iallreduce.sh
[mpiexec@n48630.10p.parallel.ru] Launch arguments: /opt/slurm/15.08.1/bin/srun -N 8 -n 8 --nodelist n48630,n50009,n50010,n50011,n50012,n50013,n50014,n50015 --input none /opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host n48630 --upstream-port 43336 --pgid 0 --launcher slurm --launcher-number 1 --base-path /opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug /opt/intel/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[mpiexec@n48630.10p.parallel.ru] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:357): write error (Bad file descriptor)
[mpiexec@n48630.10p.parallel.ru] cmd_bcast_root (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:164): error sending cmd 9 to proxy
[mpiexec@n48630.10p.parallel.ru] send_signal_downstream (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:582): unable to send response downstream
[mpiexec@n48630.10p.parallel.ru] control_cb (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1581): unable to send signal to downstreams
[mpiexec@n48630.10p.parallel.ru] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[mpiexec@n48630.10p.parallel.ru] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1962): error waiting for event
---
stderr output for one of the ranks:
---
[proxy:0:5@n50013.10p.parallel.ru] HYD_spawn (../../../../../src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:117): execvp error on file ./simple-iallreduce.sh (Input/output error)
---
Thanks for the help!
--
Regards,
Alexey
Hi Alexey,
Thanks for providing the necessary details.
We assume that mpiexec.hydra is unable to find the executable in time due to delays in the parallel filesystem.
However, since you are working on a Haswell processor, could you please try running the executable with the legacy mpiexec.hydra and let us know the outcome?
You can find the legacy mpiexec.hydra in the bin/legacy path.
Command:
<mpi installed directory>/intel64/bin/legacy/mpiexec.hydra -np 112 -ppn 14 --errfile-pattern=err.1179112.%r --outfile-pattern=out.1179112.%r ./simple-iallreduce.sh
Thanks & Regards
Goutham
Hi Goutham,
I tried the legacy mpiexec.hydra. The result is mostly the same, although I managed to execute the program successfully a few times. But some additional issues appear:
1) the output from mpiexec now contains lines:
[0] MPI startup(): I_MPI_HYDRA_UUID environment variable is not supported.
[0] MPI startup(): Similar variables:
I_MPI_HYDRA_ENV
I_MPI_HYDRA_RMK
[0] MPI startup(): I_MPI_PM environment variable is not supported.
[0] MPI startup(): Similar variables:
I_MPI_PMI_LIBRARY
[0] MPI startup(): I_MPI_RANK_CMD environment variable is not supported.
[0] MPI startup(): I_MPI_CMD environment variable is not supported.
[0] MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
The stderr output when execution is not successful:
[proxy:0:0@n54208.10p.parallel.ru] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file ./IMB-ASYNC (Input/output error)
[proxy:0:0@n54208.10p.parallel.ru] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file ./IMB-ASYNC (Input/output error)
[proxy:0:0@n54208.10p.parallel.ru] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file ./IMB-ASYNC (Input/output error)
[proxy:0:0@n54208.10p.parallel.ru] HYDU_create_process (../../utils/launch/launch.c:825): execvp error on file ./IMB-ASYNC (Input/output error)
[mpiexec@n54208.10p.parallel.ru] HYDT_bscu_wait_for_completion (../../tools/bootstrap/utils/bscu_wait.c:151): one of the processes terminated badly; aborting
[mpiexec@n54208.10p.parallel.ru] HYDT_bsci_wait_for_completion (../../tools/bootstrap/src/bsci_wait.c:36): launcher returned error waiting for completion
[mpiexec@n54208.10p.parallel.ru] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:527): launcher returned error waiting for completion
[mpiexec@n54208.10p.parallel.ru] main (../../ui/mpich/mpiexec.c:1148): process manager error waiting for completion
And one more thing: when the execution is successful, all nodes besides the one where mpiexec.hydra was launched (the head node) appear to run calculations slowly, as if their CPU clock were 3 times slower than normal. That is quite a strange experience, but this problem reproduces fairly well with the legacy hydra, and is not reproduced with either the normal IMPI 2019 hydra or OpenMPI's mpirun.
So switching to the legacy hydra doesn't solve the problem, and most likely adds more issues.
I also just made sure once again that I don't see any issues like these with OpenMPI's mpirun, and I suspect that all users of Lomonosov-2 are in the same situation and are simply forced to switch to OpenMPI.
--
Regards,
Alexey
Hi Alexey,
Thanks for providing the logs.
We think that a delay in the parallel filesystem might be causing mpiexec.hydra to fail, as it is unable to find the executable at the right time.
As another experiment, could you please try upgrading Intel MPI to the latest version, run the executable again, and let us know the outcome?
Have a Good day!
Thanks & Regards
Goutham
Hi Goutham,
I'll try to upgrade to 2019u8, but this will take some time. Will return with the outcome then.
--
Regards,
Alexey
Hi Goutham,
On 2019u8 I see the same Input/output errors.
--
Regards,
Alexey
Hi Alexey,
Thanks for trying with the latest version.
As you suggested a workaround in your initial query:
"The suggestion is to change hydra_spawn.c code and introduce some loop with attempts around execvp() to be able to make several attempts of exec() syscall if there is kind of recoverable error state like 'i/o error'."
Could you please let us know how you came up with this workaround? Have you found any similar implementation in OpenMPI?
If yes, please provide the details and a link, so that we can investigate your issue further.
Also, could you please test and confirm whether OpenMPI works every time?
Thanks & Regards
Goutham
Hi Alexey,
Could you please provide the information requested in our previous post? It will help us investigate your issue further.
Thanks
Goutham
Hi Goutham,
>> Could you please let us know how did you come up with this workaround, have you found any similar implementation in OpenMPI?
I'm not sure we can call this "a workaround"; it is rather a suggestion, an idea of a possible workaround. I explored the OpenMPI source and found out that it uses quite a different architecture for its parallel process launcher, so there is no (and cannot be) similar code there. The pattern of wrapping a Unix system call in a loop that checks for errors and retries on some of them (like EINTR or EAGAIN) is generic and well described, so this idea just came from my general thoughts on the matter.
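The generic retry idiom referred to above can be sketched, for example, around read(). This is only an illustration of the well-known EINTR pattern; the function name read_retry is a made-up example, not code from any MPI implementation:

```c
/* Classic EINTR-retry idiom, shown for read() as an illustration.
 * read_retry() is a hypothetical name for this example. */
#include <errno.h>
#include <unistd.h>

ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);        /* may fail with EINTR */
    } while (n < 0 && errno == EINTR); /* retry only if interrupted */
    return n;                          /* final result or real error */
}
```

The same shape, with a different set of "retryable" errno values, is what the execvp() suggestion amounts to.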
>> Also, could you please test and confirm whether OpenMPI works every time?
I'm going to invest some time in extensively checking that OpenMPI always works, just to be absolutely sure. This will eat some portion of my time, so I will come back with the results a bit later.
--
Regards,
Alexey
Hi Alexey,
Thanks for spending your time exploring the OpenMPI source code and helping us understand more about the suggestion you made in your previous post.
Regarding "I'm going to invest some time to extensively check that OpenMPI always works, just to be absolutely sure. This will eat some portions of my time, so I will come back with the results a bit later.":
Please let us know once you have the results, so that we can forward that information, along with your earlier suggestion, to the concerned internal team.
Have a Good day!
Regards
Goutham
Hi Alexey,
>> "I'm going to invest some time to extensively check that OpenMPI always works, just to be absolutely sure. This will eat some portions of my time, so I will come back with the results a bit later."
Could you please let us know the status of your issue?
Also, please provide the requested details.
Thanks
Goutham
Hi Goutham,
>> Could you please let us know the status of your issue? Also, please provide the requested details.
I'm on vacation till next week, will return with the requested details then. Thank you.
--
Regards,
Alexey
Hi Alexey,
Please let us know once you are back and have the results, so that we can forward that information, along with your earlier suggestion, to the concerned internal team.
Have a Happy Vacation!
Stay safe and enjoy!
Regards
Goutham
Hi Alexey,
As we didn't hear back from you, per our process we are closing this thread for now. Once you are back from vacation you are always welcome to start a new thread with your results attached, and we will be happy to help you there.
We will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community-only.
Have a Good day!
Thanks & Regards
Goutham