Intel® MPI Library

Error with MUMPS when running a large model

Guoqi_Ma
Beginner

Hi, I have recently been encountering an error when I run a large model on up to 20 HPC nodes; when I run a small model (e.g., on 2 nodes), the error goes away. Has anyone seen this error before? Thanks very much.

 

 

Abort(1687183) on node 84 (rank 84 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffe8b171570, status=0x7ffe8b171990) failed
MPID_Iprobe(389)...............:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)
Abort(405913231) on node 47 (rank 47 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffe001734c0, status=0x7ffe001738e0) failed
MPID_Iprobe(385)...............:
MPIDI_iprobe_safe(246).........:
MPIDI_iprobe_unsafe(72)........:
MPIDIG_mpi_iprobe(48)..........:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)
Abort(672775823) on node 79 (rank 79 in comm 0): Fatal error in internal_Iprobe: Other MPI error, error stack:
internal_Iprobe(14309).........: MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm=0x84000006, flag=0x7ffce3ea9430, status=0x7ffce3ea9850) failed
MPID_Iprobe(389)...............:
MPIDI_Progress_test(105).......:
MPIDI_OFI_handle_cq_error(1127): OFI poll failed (ofi_events.c:1127:MPIDI_OFI_handle_cq_error:Transport endpoint is not connected)

 

Compile and run commands (SLURM job script):

#!/bin/bash
#SBATCH --nodes=20
#SBATCH --ntasks=150
#SBATCH --partition=prod
#SBATCH --exclusive
#SBATCH --job-name=MumS4
#SBATCH --time=8:00:00
#SBATCH -e mumps.%j.err
#SBATCH --output=MOUT.%j.out
#SBATCH --account=kunf0069
module purge
module load intel/2023.2-gcc-9.4
module load impi/2021.10.0
module load mumps/5.4.1

filename=MUVSC3thrust_S

# Compile the driver source against MUMPS (BLR multithreading enabled) with OpenMP
mpiifort -O2 -xHost -nofor-main -DBLR_MT -qopenmp -c $filename.f90 -o $filename.o

# Link against MUMPS (complex single precision), MKL ScaLAPACK/BLACS, and the ordering libraries
mpiifort -o $filename -O2 -xHost -nofor-main -qopenmp $filename.o -lcmumps -lmumps_common -lmpi -lpord -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lparmetis -lmetis -lptesmumps -lptscotch -lptscotcherr -lscotch

# Run with 150 MPI ranks across the 20 nodes, one OpenMP thread per rank
export OMP_NUM_THREADS=1
mpirun -np 150 ./${filename} | tee MUMPS.log
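
In case it helps, here is a minimal sketch of how the same run could be repeated with verbose Intel MPI and libfabric logging, so that the selected OFI provider and the failing endpoint show up in the output. I_MPI_DEBUG, I_MPI_HYDRA_DEBUG, and FI_LOG_LEVEL are standard Intel MPI / libfabric debug variables; the log file name is just an example.

# Diagnostic re-run (sketch): same binary and rank count as above,
# but with Intel MPI and libfabric debug output enabled.
export I_MPI_DEBUG=10        # prints the selected OFI provider, pinning, and transport details
export I_MPI_HYDRA_DEBUG=1   # verbose output from the Hydra process manager
export FI_LOG_LEVEL=warn     # libfabric warnings, including endpoint/connection issues
export OMP_NUM_THREADS=1

mpirun -np 150 ./${filename} 2>&1 | tee MUMPS_debug.log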

 


 

TobiasK
Moderator

@Guoqi_Ma 
Sorry, with the information provided we can neither help you nor reproduce your issue.
Also, this error looks like an error inside MUMPS itself; have you reached out to the MUMPS developers?

PS:
2023.2 is too old; please update to 2024.2.
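
For reference, a sketch of the basic fabric checks that usually help narrow down "Transport endpoint is not connected" failures: fi_info is part of libfabric, and the IMB-MPI1 line assumes the Intel MPI Benchmarks shipped with Intel MPI are on the PATH (the module name is taken from the job script above).

# List the libfabric providers visible on a compute node
module load impi/2021.10.0
fi_info -l

# Report which provider Intel MPI selects and exercise the fabric across the same 20 nodes
export I_MPI_DEBUG=10
mpirun -np 150 IMB-MPI1 Sendrecv Alltoall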
