Hi all,
I'm implementing a dynamic scheduler for solving several sparse matrices in parallel (using the well-known MUMPS solver). Each process asks the work manager for new work (a new matrix; actually just the number of the matrix) when it completes its current task. The manager code runs as a separate thread in the master process, so the master process can do some work as well. This works well nine times out of ten, but sometimes everything just hangs. When I attach the debugger in that situation, the processes appear to be blocked inside MPI_Test, which should not happen because MPI_Test is the non-blocking counterpart of MPI_Wait. Any idea what could be wrong, or how I can debug this?
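Roughly, the worker side looks like this (a simplified sketch with illustrative names and tags, not the actual code):

#include <mpi.h>

#define TAG_REQUEST 1
#define TAG_WORK    2
#define NO_MORE_WORK (-1)

/* Placeholder for the MUMPS solve of matrix 'matrix_id'. */
static void solve_matrix(int matrix_id) { (void)matrix_id; }

static void worker_loop(int manager_rank)
{
    for (;;) {
        int dummy = 0, matrix_id = 0, done = 0;
        MPI_Request req;

        /* Ask the manager thread for the number of the next matrix. */
        MPI_Send(&dummy, 1, MPI_INT, manager_rank, TAG_REQUEST, MPI_COMM_WORLD);

        /* Receive the answer; poll with MPI_Test so the process stays
           responsive instead of blocking in MPI_Wait. */
        MPI_Irecv(&matrix_id, 1, MPI_INT, manager_rank, TAG_WORK,
                  MPI_COMM_WORLD, &req);
        while (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);

        if (matrix_id == NO_MORE_WORK)
            break;
        solve_matrix(matrix_id);
    }
}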
I'm trying to use the Intel Trace Analyzer, but I'm only able to get traces of working runs. When my program hangs (some kind of deadlock, I guess), I have to kill all the processes, which also means I don't get a trace.
I tried linking with VTmt.lib to check for errors, but none were reported.
I tried using VTfs.lib to automatically detect deadlocks while tracing, but it was unable to detect this case.
Please advise me on what could cause MPI_Test to block, or on how I can debug this case.
Thanks in advance
Are you linking with the multithreaded MPI library?
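A quick way to verify at runtime is to request full thread support explicitly and check what the library actually grants; a minimal sketch:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided = 0;

    /* Request full thread support; 'provided' reports what the library
       actually gives. Anything below MPI_THREAD_MULTIPLE makes concurrent
       MPI calls from the manager thread and the worker unsafe. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI only provides thread level %d\n", provided);
    MPI_Finalize();
    return 0;
}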
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Do you have a small reproducer for this behavior? If you prefer, you can either post it in a private reply or email it to me directly.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Could you run it with -verbose or link with VTmc.lib?
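On Windows, VTmc.lib (the correctness-checking version of the Trace Collector library) goes on the link line ahead of the MPI library; it might look roughly like this (library names and paths depend on your installation, so treat this as an assumption):

link your_app.obj VTmc.lib impi.lib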
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Edit: actually, it did detect a "no progress" situation after 5 minutes, once I made some changes.
Do you have the output after running with -verbose? Please send that and I'll see if there's anything obvious there. You can also use "-genv I_MPI_DEBUG 5" for more information.
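For example (the process count and executable name here are placeholders):

mpiexec -verbose -genv I_MPI_DEBUG 5 -n 4 your_app.exe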
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
[0] WARNING: Processes have been blocked on average inside MPI for the last 5:05 minutes:
[0] WARNING: either the application has a load imbalance or a deadlock which is not detected
[0] WARNING: because at least one process polls for message completion instead of blocking
[0] WARNING: inside MPI.
[0] WARNING: [0] last MPI call:
[0] WARNING: MPI_COMM_FREE(*comm=0x0000000006dcc380, *ierr=0x00000000084da4cc)
[0] WARNING: ZMUMPS (sysnoise)
[0] WARNING: ZMUMPSCPP (...\mumpscpp.cpp:8)
[0] WARNING: SOLVERMUMPS_CLEAR (...\mumps.f:256)
Based on that, I would check for something still using the communicator that you are attempting to free. Ensure that you are not reaching a race condition somewhere. I don't think the -verbose (or I_MPI_DEBUG) output will help here, but if you want to send that, feel free to do so.
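To illustrate the kind of race I mean (a hypothetical sketch, not your code): if one thread still has a request pending on a communicator while another thread frees it, the behavior is undefined. Completing every outstanding request before the free avoids that:

#include <mpi.h>

/* Drain all requests posted on a communicator, then release it.
   'pending' and 'npending' are hypothetical bookkeeping for the
   requests the application still has in flight on 'comm'. */
static void shutdown_comm(MPI_Comm *comm, MPI_Request *pending, int npending)
{
    MPI_Waitall(npending, pending, MPI_STATUSES_IGNORE);
    MPI_Comm_free(comm);
}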
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
The free is done by the MUMPS solver itself (on a duplicate of MPI_COMM_SELF, I believe). The debug output doesn't give any more information. But the good news is that I managed to reproduce the issue where MPI_Test becomes blocking in a small code example. It's 7 MB including data and the MUMPS libs. How can I send it to you? I can also upload it to Intel Premier Support.
Since you have Premier access, that would probably be the best option. Just attach it to a new issue and mention this thread, in case someone else runs into the same issue.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
