
MPI_Comm_dup may hang with Intel MPI 4.1


The attached program simple_repro.c reproduces what I believe is a bug in the Intel MPI implementation version 4.1.

In short, it spawns <num_threads> threads on each of 2 processes, such that thread i on rank 0 communicates with thread i on rank 1 over a communicator private to that pair. The only difference between the two processes is that the threads on rank 0 are gated by a semaphore, so they cannot all be active at the same time. Threads on rank 1 run freely.
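For readers without the attachment, here is a minimal sketch of the rank 0 gating pattern only. It is not the attached simple_repro.c: all names (gate_state, worker, run_gated) are hypothetical, the MPI calls are left out and marked by a comment, and it assumes POSIX unnamed semaphores as found on Linux. It also records the peak number of threads inside the gated region, which should never exceed sem_value.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is not the attached simple_repro.c. */
typedef struct {
    sem_t gate;            /* counting semaphore, initialised to sem_value */
    pthread_mutex_t lock;  /* protects the two counters below */
    int active;            /* threads currently inside the gated region */
    int max_active;        /* highest value of active seen so far */
    int num_reps;          /* iterations per thread */
} gate_state;

static void *worker(void *arg)
{
    gate_state *st = arg;
    for (int rep = 0; rep < st->num_reps; ++rep) {
        /* Blocks once sem_value threads are inside; retry if interrupted. */
        while (sem_wait(&st->gate) == -1 && errno == EINTR) { }

        pthread_mutex_lock(&st->lock);
        if (++st->active > st->max_active)
            st->max_active = st->active;
        pthread_mutex_unlock(&st->lock);

        /* Assumption: in the real reproducer, the MPI traffic on this
           thread's private communicator, including MPI_Comm_dup(),
           would happen here. */

        pthread_mutex_lock(&st->lock);
        --st->active;
        pthread_mutex_unlock(&st->lock);

        sem_post(&st->gate);
    }
    return NULL;
}

/* Runs num_threads gated workers and returns the peak number of threads
   that were inside the gated region at the same time. */
int run_gated(int num_threads, int num_reps, unsigned sem_value)
{
    gate_state st = { .active = 0, .max_active = 0, .num_reps = num_reps };
    int rc = sem_init(&st.gate, 0, sem_value);
    assert(rc == 0);
    rc = pthread_mutex_init(&st.lock, NULL);
    assert(rc == 0);

    pthread_t *tids = malloc(sizeof *tids * (size_t)num_threads);
    assert(tids != NULL);
    for (int i = 0; i < num_threads; ++i) {
        rc = pthread_create(&tids[i], NULL, worker, &st);
        assert(rc == 0);
    }
    for (int i = 0; i < num_threads; ++i)
        pthread_join(tids[i], NULL);

    free(tids);
    pthread_mutex_destroy(&st.lock);
    sem_destroy(&st.gate);
    return st.max_active;
}
```

Calling run_gated(4, 10, 1) corresponds to the -Dnum_threads=4 -Dnum_reps=10 -Dsem_value=1 build: at most one thread is ever inside the gated region. In this sketch, rank 1 would run the same loop without the sem_wait/sem_post pair.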

The problem is that if the communication between a pair of threads involves creating a child communicator via MPI_Comm_dup(), the program is very likely to deadlock. Writing each pair's state as (rank 0 thread, rank 1 thread), <sem_value> pairs of threads are stuck in (comm_dup, comm_dup) and the remaining <num_threads> - <sem_value> pairs are stuck in (sem_wait, comm_dup). See the attached stack traces. This looks to me like a starvation problem.

$ mpigcc -mt_mpi -O3 -Dnum_threads=4 -Dnum_reps=10 -Dsem_value=1 simple_repro.c -o simple_repro
$ mpirun -n 2 `pwd`/simple_repro
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name          Pin cpu
[0] MPI startup(): 0       24808    localhost          {0,1,4,5}
[0] MPI startup(): 1       24809    localhost          {2,3,6,7}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 2

The exact same program never hangs with Intel MPI 5.0, not even with high values for <num_threads> and <num_reps>.

Can anybody confirm this is a library issue that has been fixed in version 5.0?
Thank you!


1 Reply

I cannot find a specific bug report for this, but it is very likely the root cause was fixed under a different symptom.