Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI is blocking on MPI_Test, Trace Analyser question

gert_massa
Beginner
847 Views

Hi all,

I'm implementing a dynamic scheduler for solving several sparse matrices in parallel (using the well-known MUMPS solver). Each process asks the work manager for new work (a new matrix; actually just the number of the matrix) when it completes its task. The manager code is run as a separate thread in the master process, so the master process can do some work as well. This works 9 out of 10 times, but sometimes everything just hangs. When I attach the debugger, the processes appear to be blocked in MPI_Test for some reason. This should not happen, because MPI_Test is the non-blocking counterpart of MPI_Wait. Any idea what could be wrong, or how I can debug this?

I'm trying to use Intel Trace Analyser, but I'm only able to get traces of working runs. When my program hangs (some kind of deadlock, I guess), I have to kill all the processes, which also means I don't get a trace.

I tried using VTmt.lib to check for errors but got none.
I tried using VTfs.lib to automatically detect deadlocks while tracing, but it is unable to detect this case.

Please advise me on what could cause MPI_Test to become blocking, or how I can debug this case.

Thanks in advance

11 Replies
James_T_Intel
Moderator
Hi Gert,

Are you linking with the multithreaded MPI library?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
gert_massa
Beginner
Yes and I'm using MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &MpiThreadLevel);
James_T_Intel
Moderator
Hi Gert,

Do you have a small reproducer for this behavior? If you prefer, you can either post it in a private reply or email it to me directly.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
gert_massa
Beginner
Not yet; I'm still trying to reproduce it in a smaller code sample.
James_T_Intel
Moderator
Hi Gert,

Could you run it with -verbose or link with VTmc.lib?

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
gert_massa
Beginner
It doesn't detect any errors, not even a deadlock situation. However, the master process is blocking on MPI_Test.

Edit: it actually did detect a no-progress condition after 5 minutes, after I made some changes.
James_T_Intel
Moderator
Hi Gert,

Do you have the output after running with -verbose? Please send that and I'll see if there's anything obvious there. You can also use "-genv I_MPI_DEBUG 5" for more information.
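For example, something like this (executable name and rank count are placeholders):

```shell
# -genv sets the environment variable for all ranks
mpiexec -verbose -genv I_MPI_DEBUG 5 -n 4 ./your_app.exe
```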

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
gert_massa
Beginner
0] WARNING: GLOBAL:DEADLOCK:NO_PROGRESS: warning
0] WARNING: Processes have been blocked on average inside MPI for the last 5:05 minutes:
0] WARNING: either the application has a load imbalance or a deadlock which is not detected
0] WARNING: because at least one process polls for message completion instead of blocking
0] WARNING: inside MPI.
0] WARNING: [0] last MPI call:
0] WARNING: MPI_COMM_FREE(*comm=0x0000000006dcc380, *ierr=0x00000000084da4cc)
0] WARNING: ZMUMPS (sysnoise)
0] WARNING: ZMUMPSCPP (...\mumpscpp.cpp:8)
0] WARNING: SOLVERMUMPS_CLEAR (...\mumps.f:256)
James_T_Intel
Moderator
Hi Gert,

Based on that, I would check for something still using the communicator that you are attempting to free. Make sure you are not hitting a race condition somewhere. I don't think the -verbose (or I_MPI_DEBUG) output will help here, but if you want to send it, feel free to do so.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
gert_massa
Beginner
Hi James,

The free is done by the MUMPS solver itself (on a duplicate of MPI_COMM_SELF, I believe). The debug output doesn't give any more information. But the good news is that I managed to reproduce the issue where MPI_Test becomes blocking in a small code example. It's 7 MB including data and the MUMPS libs. How can I send this to you? I can also upload it to Intel Premier Support.
James_T_Intel
Moderator
Hi Gert,

Since you have Premier access, that would probably be the best option. Just attach it to a new issue and mention this thread, in case someone else runs into the same issue.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools