I am testing Intel MPI 4.1 with test.c (the provided test program).
Whenever I run with more than 2000 ranks, the program executes correctly but fails to end gracefully.
Running:
mpiexec.hydra -n 2001 -genv I_MPI_FABRICS shm:ofa -f hostfile ./testc
It stalls at
...
....
Hello World: Rank 2000 running on host xxxx
##<stalls here; does not return to command prompt>
(If I use -n 2000 or less, it runs perfectly.)
I have tested 3000 ranks using OpenMPI, so it doesn't seem to be a cluster/network issue.
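For reference, based on the output format above and on the behavior described later in the thread (rank 0 receiving one message from each other rank), the test program is roughly the following. This is only an illustrative sketch, not necessarily the exact test.c shipped with Intel MPI:

/* Illustrative sketch of the test program: every rank sends its hostname
 * to rank 0, and rank 0 prints one "Hello World" line per rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, namelen, i;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    if (rank == 0) {
        printf("Hello World: Rank %d running on host %s\n", rank, name);
        /* Rank 0 receives one message (the hostname) from every other rank. */
        for (i = 1; i < size; i++) {
            MPI_Recv(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, i, 0,
                     MPI_COMM_WORLD, &stat);
            printf("Hello World: Rank %d running on host %s\n", i, name);
        }
    } else {
        /* All other ranks send their hostname to rank 0. */
        MPI_Send(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();  /* the reported hang happens after all output has been printed */
    return 0;
}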
Does it work with I_MPI_FABRICS=shm:dapl or I_MPI_FABRICS=shm:tcp? Please attach output with I_MPI_DEBUG=5.
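For example, reusing the command line from the first post (rank count and hostfile as in your setup):
mpiexec.hydra -n 2001 -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DEBUG 5 -f hostfile ./testc
mpiexec.hydra -n 2001 -genv I_MPI_FABRICS shm:tcp -genv I_MPI_DEBUG 5 -f hostfile ./testc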
Are you running more ranks than there are slots available? If so, did you enable pinning with OpenMPI, and does it help to turn pinning off for Intel MPI?
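For example, pinning can be turned off for a test run with something like the following (I_MPI_PIN is the standard Intel MPI pinning switch; the rest of the command mirrors the one in the first post):
mpiexec.hydra -n 2001 -genv I_MPI_PIN off -genv I_MPI_FABRICS shm:ofa -f hostfile ./testc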
1. DAPL UD works with more than 2000 ranks (an illustrative command is at the end of this post).
2. Attached is the output from I_MPI_FABRICS shm:ofa - it stalls after rank 0 has received a single message from each of ranks 1-2000.
3. The cluster has more than 2000 slots; for OpenMPI/OFA I use --map-by socket with no oversubscription, to force the MPI job across all the nodes.
I am using Mellanox OFED 2.2-1.0.1 on an mlx4 card.
The problem seems to be an MPI -> OFED interaction.
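For reference, a DAPL UD run like the one in point 1 can be requested along these lines (illustrative command; I_MPI_DAPL_UD switches the DAPL provider to the UD transport):
mpiexec.hydra -n 2001 -genv I_MPI_FABRICS shm:dapl -genv I_MPI_DAPL_UD enable -f hostfile ./testc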