Hi, I recently encountered an error when running a large model on up to 20 HPC nodes. When I run a small model on, e.g., 2 nodes, the errors are gone. Has anyone met this error before? Thanks very much.
Abort(1687183) on node 84 (rank 84 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffe8b171570, status=0x7ffe8b171990) failed
MPID_Iprobe(389)...............:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)
Abort(405913231) on node 47 (rank 47 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffe001734c0, status=0x7ffe001738e0) failed
MPID_Iprobe(385)...............:
MPIDI_iprobe_safe(246).........:
MPIDI_iprobe_unsafe(72)........:
MPIDIG_mpi_iprobe(48)..........:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)
Abort(672775823) on node 79 (rank 79 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffce3ea9430, status=0x7ffce3ea9850) failed
MPID_Iprobe(389)...............:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)
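With many ranks aborting at once, a short summary of which ranks failed, and in which MPI call, can make a log like the one above easier to report. The following is only a sketch: the function names `summarize_aborts` and `count_ofi_errors` are hypothetical helpers, and the patterns assume the `Abort(...) on node N (rank N in comm 0): Fatal error in ...` line format shown above.

```shell
#!/bin/bash
# Hypothetical helpers (not part of Intel MPI or MUMPS).
# Both read a log on standard input.

# Print one line per aborted rank, sorted by rank number.
summarize_aborts() {
    sed -n -E 's/^Abort\([0-9]+\) on node [0-9]+ \(rank ([0-9]+) in comm [0-9]+\): Fatal error in ([A-Za-z_]+):.*/rank \1 failed in \2/p' \
        | sort -k2,2n
}

# Count how many "OFI poll failed" lines the log contains.
count_ofi_errors() {
    grep -c 'OFI poll failed'
}
```

For example, `summarize_aborts < mumps.12345.err` (the job ID is made up here) would list the failing ranks, and `count_ofi_errors < mumps.12345.err` would show how many of the failures came from the OFI layer.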
Job script (compile and run commands):
#!/bin/bash
#SBATCH --nodes=20
#SBATCH --ntasks=150
#SBATCH --partition=prod
#SBATCH --exclusive
#SBATCH --job-name=MumS4
#SBATCH --time=8:00:00
#SBATCH -e mumps.%j.err
#SBATCH --output=MOUT.%j.out
#SBATCH --account=kunf0069
module purge
module load intel/2023.2-gcc-9.4
module load impi/2021.10.0
module load mumps/5.4.1
filename=MUVSC3thrust_S
mpiifort -O2 -xHost -nofor-main -DBLR_MT -qopenmp -c $filename.f90 -o $filename.o
mpiifort -o $filename -O2 -xHost -nofor-main -qopenmp $filename.o -lcmumps -lmumps_common -lmpi -lpord -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lparmetis -lmetis -lptesmumps -lptscotch -lptscotcherr -lscotch
export OMP_NUM_THREADS=1
mpirun -np 150 ./${filename} |tee MUMPS.log
@Guoqi_Ma
Sorry, with the information provided we can neither help you nor reproduce your issue.
Also, this error looks like an error in MUMPS. Have you reached out to the MUMPS developers?
PS:
2023.2 is too old, please use 2024.2.
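One possible way to collect more detail for a report like this is to rerun the job with extra logging enabled. The lines below are a sketch of what could be added to the job script before its `mpirun` line. They assume the installation honors Intel MPI's `I_MPI_DEBUG` and libfabric's `FI_LOG_LEVEL` environment variables; the values chosen are illustrative, not a recommendation from this thread.

```shell
# Assumed additions to the job script, placed before the mpirun line.
export I_MPI_DEBUG=10      # Intel MPI: print startup, fabric/provider and pinning details
export FI_LOG_LEVEL=warn   # libfabric: log warnings from the OFI provider

# The run line from the original script, with stderr also captured in the log:
# mpirun -np 150 ./${filename} 2>&1 | tee MUMPS.log
```

The resulting log shows which OFI provider was selected on each node, which is useful context for a "Transport endpoint is not connected" failure that appears only at larger node counts.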