Intel® MPI Library

Job dies when I use multiple nodes with a large processor count

Hdragon
Beginner

Hi all,

 

I've recently been running a large simulation that requires a lot of memory, so I started using multiple nodes.

Before that, I had no problems because everything ran on a single node, but since switching to multiple nodes my jobs keep dying.

 

Sometimes the job runs without any problem, but more often it fails.

I've encountered two kinds of errors.

 

First one:

[mpiexec@node7] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on node8 (pid 212699, exit code 256)
[mpiexec@node7] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node7] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node7] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@node7] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@node7] Possible reasons:
[mpiexec@node7] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node7] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node7] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node7] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher
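
(For reference, reasons 3 and 4 above mention two knobs. Here is a minimal sketch of how I understand they would be set, assuming ssh works between the nodes and that the port range below is only a placeholder that has to match whatever the firewall actually allows:)

export I_MPI_PORT_RANGE=10000:10100   # placeholder range; must match ports open in the firewall
mpirun -bootstrap ssh -np 216 ./enzo.exe parameter_file.txt 1>stdout 2>stderr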

 

Second one:

[proxy:0:0@node4] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:0@node4] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@node4] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:3@node1] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:3@node1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:3@node1] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:1@node5] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:1@node5] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@node5] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:2@node6] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:2@node6] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@node6] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:5@node3] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:5@node3] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:5@node3] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@node4] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@node4] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node4] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@node4] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

I've tried to figure out what the problem is, but I couldn't find it. I would appreciate it if you could let me know how to deal with it.

 

Here is my system information:

OS: CentOS

Library: oneapi-2021.4.0 or mpich-3.1.4

Job scheduler: Torque

Job script:

------

#!/bin/sh
#PBS -V
#PBS -v LD_LIBRARY_PATH=$LD_LIBRARY_PATH

#PBS -N Dark-matter-simulation
#PBS -q workq

#PBS -l nodes=6:ppn=36
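# 6 nodes x 36 processes per node = 216 MPI ranks, matching -np 216 below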

#PBS -l walltime=120:00:00

#PBS -m abe

#PBS -M my_email@intel.com


cd $PBS_O_WORKDIR
echo $PBS_NODEFILE


mpirun -np 216 ./enzo.exe parameter_file.txt 1>stdout 2>stderr


exit 0

----

What I've tried:

Using the following command:

mpirun -bootstrap ssh -machinefile $PBS_NODEFILE -np 216 ./enzo.exe parameter_file.txt 1>stdout 2>stderr

--> It also failed, without printing any error message.
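
(In case it helps, I can also rerun with more verbose output. A minimal sketch, assuming the standard Intel MPI debug variables are the right ones to use here:)

export I_MPI_DEBUG=10        # print rank placement and fabric selection details
export I_MPI_HYDRA_DEBUG=1   # verbose output from the hydra process launcher
mpirun -bootstrap ssh -machinefile $PBS_NODEFILE -np 216 ./enzo.exe parameter_file.txt 1>stdout 2>stderr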

 

Checking the memory and max user processes limits with the ulimit -a command:

--> they are set to unlimited.
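
(Below is a sketch of how the same limits could be checked on every allocated node instead of only the head node; running one rank per node is an assumption based on my 6-node allocation:)

mpirun -machinefile $PBS_NODEFILE -np 6 -ppn 1 bash -c 'hostname; ulimit -a'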

 

What drives me crazy is that it sometimes works... What can I do? Again, any help would be appreciated.

 

Sincerely,

HY

TobiasK
Moderator

Dear @Hdragon 

Note that CentOS is not supported by oneAPI 2024.1.
In this forum, we can only provide help on issues with the latest oneAPI package.

Just as general advice: before trying a large application, please make sure your job scheduler is set up correctly. For example, run something like
mpirun ... hostname
and some simple MPI benchmarks, such as the IMB-MPI1 benchmarks that we bundle with Intel MPI:

mpirun ... IMB-MPI1
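
For instance, inside the same Torque allocation, something along these lines could be used (a sketch only; the process counts and machinefile usage are copied from your job script and may need adjusting):

mpirun -machinefile $PBS_NODEFILE -np 6 -ppn 1 hostname
mpirun -machinefile $PBS_NODEFILE -np 216 IMB-MPI1 PingPong Allreduce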

 

If those tests succeed, then you can try to run more complex applications.
