I'm hoping the Intel MPI gurus can help with this. Recently I've tried transitioning some code I help maintain (GEOS, a climate model) from using HPE MPT (2.17, in this case) to Intel MPI (18.0.1; 18.0.2 I'll test soon). In both cases, the compiler (Intel 18.0.1) is the same, both running on the same set of Haswell nodes on an SGI/HPE cluster. The only difference is the MPI stack.
Now one part of the code (AGCM, the physics/dynamics part) is actually a little bit faster with Intel MPI than MPT, even on an SGI machine. That is nice. It's maybe 5-10% faster in some cases. Huzzah!
But another code (GSI, analysis of observation data) really, really, really does not like Intel MPI. This code displays two issues. First, after the code starts (it launches very fast under both MPI stacks), it eventually hits what we believe is the first collective, and at that point the whole code stalls as it...initializes buffers? Something with InfiniBand maybe? We don't know. MPT slows a bit too, but doesn't show this issue nearly as badly as IMPI. We had another place like this in the AGCM where moving from a collective to an Isend/Recv/Wait type paradigm really helped. This "stall" is annoying and, worse, it gets longer and longer as the number of cores increases. (We might have a reproducer for this one.)
But that stall is minor really, a minute or so, compared to the second issue: the overall performance. On 240 cores, MPT 2.17 runs this code in 15:03 (minutes:seconds) and Intel MPI 18.0.1 in 28:12. On 672 cores, MPT 2.17 runs the code in 12:02 and Intel MPI 18.0.2 in 21:47; the code doesn't scale well with either stack.
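To put numbers on "slower" and "doesn't scale well", here is a rough sketch that converts the wall-clock times above to seconds and computes the Intel MPI slowdown relative to MPT, plus the speedup going from 240 to 672 cores. The helper names (to_seconds, slowdown, speedup) and the timings table are just illustrative, and the only inputs are the four timings quoted above.

```python
# Wall-clock timings quoted in the post, as "minutes:seconds".
timings = {
    240: {"MPT": "15:03", "Intel MPI": "28:12"},
    672: {"MPT": "12:02", "Intel MPI": "21:47"},
}


def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def slowdown(cores: int) -> float:
    """How many times longer Intel MPI takes than MPT at a given core count."""
    run = timings[cores]
    return to_seconds(run["Intel MPI"]) / to_seconds(run["MPT"])


def speedup(stack: str) -> float:
    """Speedup of one MPI stack when going from 240 to 672 cores."""
    return to_seconds(timings[240][stack]) / to_seconds(timings[672][stack])


if __name__ == "__main__":
    ideal = 672 / 240  # perfect strong scaling would be 2.8x
    for cores in timings:
        print(f"{cores} cores: Intel MPI takes {slowdown(cores):.2f}x as long as MPT")
    for stack in ("MPT", "Intel MPI"):
        print(f"{stack}: {speedup(stack):.2f}x speedup 240 -> 672 (ideal {ideal:.1f}x)")
```

With these inputs, Intel MPI takes roughly 1.87x as long as MPT at 240 cores and 1.81x at 672, and both stacks get only about a 1.25-1.3x speedup from 2.8x the cores.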
Using I_MPI_STATS, I see that Alltoallv accounts for ~60% of MPI time (20% of wall time) at 240 cores; at 672 cores, Barrier starts to win, but Alltoallv is still 40% of MPI time and 23% of wall time. I've tried both I_MPI_ADJUST_ALLTOALLV settings (1 and 2) and they make little difference (28:44 and 28:25 at 240 cores).
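As a back-of-the-envelope check on how much of the gap Alltoallv could explain, the sketch below turns those percentages into seconds. It assumes the I_MPI_STATS percentages apply to the Intel MPI wall-clock times quoted earlier (28:12 at 240 cores, 21:47 at 672), which is my reading of the numbers rather than something the stats output states; the runs table and breakdown function are illustrative names.

```python
# Inputs taken from the post. ASSUMPTION: the Alltoallv percentages refer to
# the Intel MPI runs whose wall-clock times are quoted (28:12 and 21:47).
runs = {
    240: {"mpt_s": 15 * 60 + 3, "impi_s": 28 * 60 + 12,
          "alltoallv_of_wall": 0.20, "alltoallv_of_mpi": 0.60},
    672: {"mpt_s": 12 * 60 + 2, "impi_s": 21 * 60 + 47,
          "alltoallv_of_wall": 0.23, "alltoallv_of_mpi": 0.40},
}


def breakdown(cores: int) -> dict:
    """Estimate Alltoallv seconds, the Intel MPI vs MPT gap, and the MPI share of wall time."""
    run = runs[cores]
    alltoallv_s = run["alltoallv_of_wall"] * run["impi_s"]
    gap_s = run["impi_s"] - run["mpt_s"]
    # If Alltoallv is X% of wall and Y% of MPI time, MPI is X/Y of wall time.
    mpi_of_wall = run["alltoallv_of_wall"] / run["alltoallv_of_mpi"]
    return {
        "alltoallv_s": round(alltoallv_s),
        "gap_s": gap_s,
        "mpi_of_wall": round(mpi_of_wall, 3),
    }


if __name__ == "__main__":
    for cores in runs:
        print(cores, breakdown(cores))
```

Under that assumption, Alltoallv is about 338 s of the 240-core Intel MPI run while the gap to MPT is 789 s, and about 301 s against a 585 s gap at 672 cores. So even a free Alltoallv would not close the gap on its own, which may be part of why the I_MPI_ADJUST_ALLTOALLV settings changed so little.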
I'm going to try and see if I can request/reserve a set of nodes for a long time to do an mpitune run, but since each run is ~30 minutes...mpitune will not be fun as it'd be 90 minutes for each option test.
Any ideas on what might be happening? Any advice for flags/environment variables to try? I understand that HPE MPT might/should work best on an SGI/HPE machine (like how Intel compilers seem to do best with Intel chips), but this seems a bit beyond the usual difference. I've requested MVAPICH2 be installed as well for another comparison.
Bit of an update: I was able to run with Open MPI, and it took about 30 minutes as well (about the same as Intel MPI; our disks seem to be having a bad day, which slowed some of the initialization). I tried MVAPICH2 too, but it just exploded, so I'm guessing it's some low-level InfiniBand interaction combined with me doing something wrong in calling/compiling.
Perhaps HPE just has a really well-optimized Alltoallv for an HPE machine? I believe we are getting an Omni-Path cluster soon, so perhaps Intel MPI will come into its own on it?