Hi all,
I recently started running a large simulation that needs a lot of memory, so I moved from a single node to multiple nodes.
On a single node my simulations ran without any problem, but since I started using multiple nodes my jobs have been dying.
Sometimes a job runs fine, but often it fails.
I have seen two kinds of errors.
First one:
[mpiexec@node7] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on node8 (pid 212699, exit code 256)
[mpiexec@node7] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node7] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node7] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1062): error waiting for event
[mpiexec@node7] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1015): error setting up the bootstrap proxies
[mpiexec@node7] Possible reasons:
[mpiexec@node7] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node7] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node7] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node7] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher
Second one:
[proxy:0:0@node4] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:0@node4] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@node4] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:3@node1] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:3@node1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:3@node1] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:1@node5] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:1@node5] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@node5] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:2@node6] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:2@node6] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@node6] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:5@node3] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:5@node3] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:5@node3] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@node4] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@node4] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node4] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@node4] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
I've tried to track down the cause but couldn't find it. I would appreciate any advice on how to deal with this.
My setup:
OS: CentOS
Library: oneAPI 2021.4.0 or MPICH 3.1.4
Job scheduler: Torque
Job script:
------
#!/bin/sh
#PBS -V
#PBS -v LD_LIBRARY_PATH=$LD_LIBRARY_PATH
#PBS -N Dark-matter-simulation
#PBS -q workq
#PBS -l nodes=6:ppn=36
#PBS -l walltime=120:00:00
#PBS -m abe
#PBS -M my_email@intel.com
cd $PBS_O_WORKDIR
echo $PBS_NODEFILE
mpirun -np 216 ./enzo.exe parameter_file.txt 1>stdout 2>stderr
exit 0
------
What I've tried:
Using the following command:
mpirun -bootstrap ssh -machinefile $PBS_NODEFILE -np 216 ./enzo.exe parameter_file.txt 1>stdout 2>stderr
--> It also failed, without any error message.
Checking the memory and max user processes limits with the ulimit -a command
--> They are set to unlimited.
What drives me crazy is that it sometimes works... What can I do? Again, any help would be appreciated.
Sincerely,
HY
Dear @Hdragon,
Note that CentOS is not supported by oneAPI 2024.1.
In this forum, we can only provide help on issues with the latest oneAPI package.
As general advice: before trying a large application, please make sure your job scheduler is set up correctly. For example, you can run something like
mpirun ... hostname
and some simple MPI benchmarks like the IMB-MPI1 benchmarks that we bundle with Intel MPI.
mpirun ... IMB-MPI1
If those tests succeed, then you can try to run more complex applications.
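As a rough sketch of the two checks above, a small script like the following could be run from inside a Torque job. It is only an illustration under assumptions, not a verified recipe for this cluster: it assumes that Torque sets PBS_NODEFILE with one line per allocated slot, and that mpirun and IMB-MPI1 are on the PATH inside the job. The function names nodefile_summary and run_sanity_checks are made up for this example, and the choice of PingPong and Allreduce is just a short subset of the IMB-MPI1 benchmarks.

```shell
#!/bin/sh
# Sanity checks to run inside a batch job before launching the real application.
# Assumptions (not from the original post): PBS_NODEFILE lists one hostname per
# allocated slot, and mpirun / IMB-MPI1 are available on PATH in the job.

# Print "<unique hosts> <total slots>" for a nodefile, ignoring blank lines.
nodefile_summary() {
    awk 'NF { n++; if (!seen[$1]++) h++ } END { print h+0, n+0 }' "$1"
}

# Hypothetical helper: run the hostname test, then a short IMB-MPI1 run.
run_sanity_checks() {
    nodefile=$1
    summary=$(nodefile_summary "$nodefile")
    nhosts=${summary% *}
    nslots=${summary#* }
    echo "Allocated $nhosts host(s), $nslots slot(s)"

    # Step 1: launch a trivial non-MPI program on every slot.
    # Every allocated host should show up in the per-host counts.
    out=$(mpirun -np "$nslots" hostname) || return 1
    echo "$out" | sort | uniq -c

    # Step 2: a simple MPI benchmark across all ranks.
    mpirun -np "$nslots" IMB-MPI1 PingPong Allreduce || return 1
}

# Only launch anything when actually running under the scheduler.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpirun >/dev/null 2>&1; then
    run_sanity_checks "$PBS_NODEFILE"
fi
```

If step 1 already fails intermittently, the problem is most likely in the launcher or scheduler setup rather than in the application. Submitting this several times with the same nodes=6:ppn=36 request may help show whether failures are tied to a particular node.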