Extraordinarily Slow First AllToAllV Performance with Intel MPI Compared to MPT

Matt_Thompson · ‎09-28-2018

Dear Intel MPI Gurus,

We've been trying to track down why a code that runs quite well with HPE MPT on our Haswell-based SGI/HPE InfiniBand cluster is far too slow when we use Intel MPI. We think we have traced it to the first MPI_Alltoallv call in the program, where Intel MPI appears to stall before proceeding. We've created a reproducer that seems to support this theory. Its core is below (I can share the full reproducer):

   do iter = 1,3
      t0 = mpi_wtime()
      call MPI_Alltoallv(s_buf, s_count, s_disp, MPI_INTEGER, &
            r_buf, r_count, r_disp, MPI_INTEGER, MPI_COMM_WORLD, ierror)
      t1 = mpi_wtime()

      if (rank == 0) then
         write(*,*)"Iter: ", iter, " T = ", t1 - t0
      end if
   end do

We are doing 3 iterations of an MPI_AllToAllV call. That's it. The reproducer can vary the size of the buffers, etc.
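
For anyone who wants to try this without the full reproducer, a minimal self-contained sketch of such a timing loop might look like the following. It is not the full reproducer. The uniform exchange (every rank sends n integers to every other rank), the meaning of n as the "problem size", the program name, and the MPI_Barrier before each timed call are all assumptions made for illustration.

```fortran
program alltoallv_timing
   use mpi
   implicit none

   ! Assumption: n integers go from each rank to each other rank.
   ! The real reproducer may define its "problem size" differently.
   integer, parameter :: n = 10000
   integer :: ierror, rank, nprocs, iter, i
   integer, allocatable :: s_buf(:), r_buf(:)
   integer, allocatable :: s_count(:), r_count(:), s_disp(:), r_disp(:)
   double precision :: t0, t1

   call MPI_Init(ierror)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierror)

   allocate(s_buf(n*nprocs), r_buf(n*nprocs))
   allocate(s_count(nprocs), r_count(nprocs), s_disp(nprocs), r_disp(nprocs))

   s_buf = rank      ! payload content is irrelevant for the timing
   r_buf = -1
   s_count = n       ! same count to and from every rank
   r_count = n
   do i = 1, nprocs
      s_disp(i) = (i - 1) * n   ! displacements are zero-based element offsets
      r_disp(i) = (i - 1) * n
   end do

   do iter = 1, 3
      ! Assumption: synchronize first so rank 0 does not time process skew.
      call MPI_Barrier(MPI_COMM_WORLD, ierror)
      t0 = MPI_Wtime()
      call MPI_Alltoallv(s_buf, s_count, s_disp, MPI_INTEGER, &
                         r_buf, r_count, r_disp, MPI_INTEGER, &
                         MPI_COMM_WORLD, ierror)
      t1 = MPI_Wtime()

      if (rank == 0) then
         write(*,*) "Iter: ", iter, " T = ", t1 - t0
      end if
   end do

   deallocate(s_buf, r_buf, s_count, r_count, s_disp, r_disp)
   call MPI_Finalize(ierror)
end program alltoallv_timing
```

Built with the MPI Fortran compiler wrapper of whichever stack is being tested and launched on, say, 72 ranks, this should print one line per iteration from rank 0, which makes it easy to compare the first call against the later ones.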

With HPE MPT 2.17 on this cluster, we see the following for a "problem size" of 10000. (It's hard to work out the actual sizes in the code where we first saw this, but my guess is that 10000 is smaller than reality.)

# nprocs T1 T2 T3
 72 3.383068833500147E-003 1.581361982971430E-003 1.497713848948479E-003
192 8.767310064285994E-003 3.687836695462465E-003 3.472075797617435E-003
312 1.676454907283187E-002 8.718995843082666E-003 8.802385069429874E-003
432 1.770043326541781E-002 1.390126813203096E-002 1.413645874708891E-002
552 2.205356908962131E-002 1.850109826773405E-002 1.872858591377735E-002
672 3.307574009522796E-002 3.664174210280180E-002 3.548037912696600E-002

The first column is the number of processes and the next three are the elapsed times in seconds, measured with MPI_Wtime, for each iteration of the loop.

Now let's look at Intel MPI 18.0.3.222:

# nprocs T1 T2 T3
 72 0.476876974105835 4.508972167968750E-003 5.246162414550781E-003
192 2.92623281478882 1.846385002136230E-002 1.933908462524414E-002
312 4.00109887123108 3.393721580505371E-002 3.367590904235840E-002
432 6.74378299713135 5.490398406982422E-002 5.541920661926270E-002
552 8.19235110282898 8.167219161987305E-002 8.110594749450684E-002
672 12.1262009143829 0.103807926177979 0.107892990112305

Well, that's not good. The first MPI_Alltoallv call is much slower, and the more processes, the worse it gets. At 672 processes the first call is roughly 370 times slower than with HPE MPT (12.1 s versus 0.033 s), which is about two and a half orders of magnitude. (Our cluster admins are working on getting Intel MPI 19 installed, but license server changes are making that fun, so I can't report those numbers yet. This is also the *best* I can do, by ignoring SLURM and running 10 cores per node. A straight mpirun is about 3x slower.)

Now, I do have access to a new cluster that is Skylake/OmniPath-based rather than Infiniband. If I run with Intel MPI 18.0.3.222 there:

# nprocs T1 T2 T3
 72 3.640890121459961E-003 2.669811248779297E-003 2.519130706787109E-003
192 9.490966796875000E-003 8.697032928466797E-003 8.977174758911133E-003
312 1.729822158813477E-002 1.571893692016602E-002 1.684498786926270E-002
432 2.593088150024414E-002 2.414894104003906E-002 2.196598052978516E-002
552 3.740596771240234E-002 3.293609619140625E-002 3.402209281921387E-002
672 5.194902420043945E-002 4.933309555053711E-002 5.183196067810059E-002

Better! So, plus side, OmniPath doesn't show this issue. Downside, the OmniPath cluster isn't available for general use yet and there will be far more HPE nodes for users to use even when it is.

My question for you is: Are there some environment variables we can set to allow Intel MPI to have comparable performance on the HPE nodes? It would be nice to start transitioning users from MPT to Intel MPI because the newer OmniPath cluster is not an HPE machine, so it can't have the HPE MPI stack on it. If we can start shaking out issues now and make sure Intel MPI works and performs well, the transition will be easy when we eventually need to move users to Intel MPI for portability.

Thanks,

Matt