Hi,
I tried running the following code on a Linux cluster with Intel MPI (Version 4.0 Update 3, Build 20110824) and SLURM 2.2.7, on 2 nodes with 8 cores each (16 tasks).
Unfortunately, it hangs in the MPI_Win_unlock call during the 11th or 12th iteration. I have tried both the Intel compiler and gcc, with no success.
[cpp]#include <mpi.h>
#include <iostream>

// Tested by value (not #ifdef) so that USE_BARRIER 0 actually disables the barrier
#define USE_BARRIER 1
#define LOCAL_RANK 10
#define REMOTE_RANK 3

int main(int argc, char** argv) {
    int rank, error;
    MPI_Win win;
    double* value;
    double local_value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    error = MPI_Alloc_mem(sizeof(double), MPI_INFO_NULL, &value);
    if (error != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, error);

    // Expose one double on every rank
    error = MPI_Win_create(value, sizeof(double), sizeof(double), MPI_INFO_NULL,
                           MPI_COMM_WORLD, &win);
    if (error != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, error);

    // One rank repeatedly reads a double from a remote rank via passive-target RMA
    if (rank == LOCAL_RANK)
        for (int i = 0; i < 25; i++) {
            std::cout << "Iteration " << i << " in rank " << rank << std::endl;

            error = MPI_Win_lock(MPI_LOCK_SHARED, REMOTE_RANK, 0, win);
            if (error != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, error);

            error = MPI_Get(&local_value, 1, MPI_DOUBLE, REMOTE_RANK, 0, 1,
                            MPI_DOUBLE, win);
            if (error != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, error);

            error = MPI_Win_unlock(REMOTE_RANK, win);   // hangs here in iteration 11/12
            if (error != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, error);
        }

#if USE_BARRIER
    MPI_Barrier(MPI_COMM_WORLD);
#endif

    MPI_Win_free(&win);
    MPI_Free_mem(value);
    MPI_Finalize();
}[/cpp]
Other MPI libraries work as expected, and other "configurations" work as well, e.g.:
[cpp]#define USE_BARRIER 0
#define LOCAL_RANK 10
#define REMOTE_RANK 3[/cpp]
or
[cpp]#define USE_BARRIER 1
#define LOCAL_RANK 2
#define REMOTE_RANK 3[/cpp]
If you need more information, let me know.
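For reference, I build and launch the program roughly like this (a sketch, not the exact commands from my job script; mpiicpc is the Intel MPI C++ compiler wrapper, and the actual module setup depends on the cluster):
[bash]# Illustrative build/run
mpiicpc -o test test.cpp
# 2 nodes x 8 cores = 16 tasks, launched through SLURM
srun -N 2 -n 16 ./test[/bash]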
Thanks for your help,
Sebastian
Try adding "-env I_MPI_DEBUG 5" to the mpirun command. This will generate additional debug information and might provide some indication of what is causing the hang. I am able to run the original program you provided without any hangs. I will try some other combinations and see if I can cause the hang.
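For example, for a 16-rank job (adjust the rank count to match yours):
[bash]mpirun -n 16 -env I_MPI_DEBUG 5 ./test[/bash]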
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
This is the output of "I_MPI_DEBUG=5 srun ./test":
[bash][-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[-1] MPI startup(): Imported environment partly inaccesible. Map=0 Info=607830
[8] MPI startup(): shm and ofa data transfer modes
[9] MPI startup(): shm and ofa data transfer modes
[2] MPI startup(): shm and ofa data transfer modes
[6] MPI startup(): shm and ofa data transfer modes
[4] MPI startup(): shm and ofa data transfer modes
[5] MPI startup(): shm and ofa data transfer modes
[0] MPI startup(): shm and ofa data transfer modes
[1] MPI startup(): shm and ofa data transfer modes
[3] MPI startup(): shm and ofa data transfer modes
[10] MPI startup(): shm and ofa data transfer modes
[7] MPI startup(): shm and ofa data transfer modes
[14] MPI startup(): shm and ofa data transfer modes
[11] MPI startup(): shm and ofa data transfer modes
[12] MPI startup(): shm and ofa data transfer modes
[13] MPI startup(): shm and ofa data transfer modes
[15] MPI startup(): shm and ofa data transfer modes
[0] MPI startup(): Rank    Pid      Node name    Pin cpu
[0] MPI startup(): 0       22239    r1i0n0       +1
[0] MPI startup(): 1       22240    r1i0n0       +1
[0] MPI startup(): 2       22241    r1i0n0       +1
[0] MPI startup(): 3       22242    r1i0n0       +1
[0] MPI startup(): 4       22243    r1i0n0       +1
[0] MPI startup(): 5       22244    r1i0n0       +1
[0] MPI startup(): 6       22245    r1i0n0       +1
[0] MPI startup(): 7       22246    r1i0n0       +1
[0] MPI startup(): 8       14354    r1i1n0       +1
[0] MPI startup(): 9       14355    r1i1n0       +1
[0] MPI startup(): 10      14356    r1i1n0       +1
[0] MPI startup(): 11      14357    r1i1n0       +1
[0] MPI startup(): 12      14358    r1i1n0       +1
[0] MPI startup(): 13      14359    r1i1n0       +1
[0] MPI startup(): 14      14360    r1i1n0       +1
[0] MPI startup(): 15      14361    r1i1n0       +1
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:ofa
Iteration 0 in rank 10
Iteration 1 in rank 10
Iteration 2 in rank 10
Iteration 3 in rank 10
Iteration 4 in rank 10
Iteration 5 in rank 10
Iteration 6 in rank 10
Iteration 7 in rank 10
Iteration 8 in rank 10
Iteration 9 in rank 10
Iteration 10 in rank 10
Iteration 11 in rank 10[/bash]
Are you able to test outside of SLURM? What distribution are you using? Please try these configurations:
[cpp]#define LOCAL_RANK 11
#define REMOTE_RANK 3[/cpp]
[cpp]#define LOCAL_RANK 11
#define REMOTE_RANK 4[/cpp]
[cpp]#define LOCAL_RANK 3
#define REMOTE_RANK 10[/cpp]
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
[cpp]#define LOCAL_RANK 11
#define REMOTE_RANK 4[/cpp]
Hangs.
[cpp]#define LOCAL_RANK 3
#define REMOTE_RANK 10[/cpp]
Hangs as well.
The distribution is SUSE Linux Enterprise Server 11.
I wasn't able to run the program outside of SLURM, at least not on this cluster. If you need this information, I can contact the help desk; maybe they know a way to run the program without SLURM.
I'll set up some virtual machines here to replicate your setup. Would you be able to run all of the processes on a single node (technically oversubscribing resources, but for this program it shouldn't cause a problem)? One way to launch that is sketched below.
For the two new configurations that hang, do they hang at the same iteration as the original?
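A single-node run under SLURM might look like this (a sketch; --overcommit allows placing more tasks on the node than it has allocated cores):
[bash]# Sketch: all 16 ranks on one 8-core node
srun -N 1 -n 16 --overcommit ./test[/bash]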
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
And yes, they all hang at the same iteration.
It definitely appears to be related to having the tasks involved in the communication on different nodes. Are you able to reliably run other MPI programs involving these two nodes? Have you tried using a different fabric for your connection? What is the output from "env | grep I_MPI"?
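For example, to take the OFA path out of the picture, you could fall back to TCP between nodes (a sketch; fabric names as in Intel MPI 4.x):
[bash]# Sketch: shared memory within a node, TCP between nodes, instead of shm:ofa
I_MPI_FABRICS=shm:tcp srun ./test[/bash]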
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
Output of "env | grep I_MPI":
[bash]I_MPI_FABRICS=shm:ofa
I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
I_MPI_JOB_FAST_STARTUP=0
I_MPI_HOSTFILE=/tmp/di56zem/mpd.hosts.11693
I_MPI_ROOT=/lrz/sys/intel/mpi_40_3_00[/bash]
I haven't tried any other MPI programs, but according to the service provider, the Intel MPI library should work.
I have been able to reproduce the hang you are seeing by matching the fabric. I'm going to make some further modifications to your code to see if I can get a more general reproducer, and I'll be submitting a defect report for this.
Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
It would be nice if you could post an update here as soon as this gets fixed in a release.
If this is still an issue for you, we have an engineering build that our developers have verified fixes it. Please let me know if you are still encountering the hang, and I will send you the engineering build.
