Intel® oneAPI Math Kernel Library

performance issue

Sangamesh_B_
Beginner

I'm facing a performance issue with a scientific application (Fortran). It runs fast on a single node but much slower across multiple nodes. For example, a 16-core job on a single node finishes in 1 hr 2 min, but the same job on two nodes (i.e. 8 cores per node, with the remaining 8 cores kept free) takes 3 hr 20 min. The code is compiled with ifort 13.1.1, OpenMPI 1.4.5 and the Intel MKL libraries (LAPACK, BLAS, ScaLAPACK, BLACS and FFTW). What could the problem be here?

I suspect it may be a problem with the Intel MKL libraries (ScaLAPACK and BLACS), since the HPL benchmark compiled with the Intel compilers and OpenMPI gives equivalent/accurate results on both single-node and multi-node runs.

Sangamesh_B_
Beginner

More info: The cluster has Intel Sandy Bridge processors (E5-2670) and InfiniBand, and Hyper-Threading is enabled. Jobs are submitted through the LSF scheduler.

Zhang_Z_Intel
Employee

Hello,

Without more details of your application (for example, which MKL routines it uses), I can only guess. My guess is that when you run the job on one node it has only one MPI rank, so essentially it is a shared-memory model. When you run the job on two nodes it has two MPI ranks; that is distributed memory, and the communication between the two ranks has to go through the InfiniBand interconnect. This communication is likely to be much more expensive than in the shared-memory case, which may explain why it runs slower.

I'd suggest you do some benchmarking to understand better where the bottleneck is. If the problem does lie in MKL, we can go from there and see how to improve it.

Another piece of information: the HPL benchmark does not use the MKL ScaLAPACK or BLACS routines.

Sangamesh_B_
Beginner

Zhang Z (Intel) wrote:
My guess is that when you run the job on one node it has only one MPI rank, so essentially it is a shared-memory model. When you run the job on two nodes it has two MPI ranks; that is distributed memory, and the communication between the two ranks has to go through the InfiniBand interconnect. This communication is likely to be much more expensive than in the shared-memory case, which may explain why it runs slower.
No. I'm very much aware of how to run MPI and OpenMP applications on a cluster. 16 cores means 16 MPI processes here, so I used 16 MPI processes on one node versus 8 MPI processes per node on two nodes. I linked against the sequential MKL libraries only; there is no OpenMP threading here.
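
For reference, the link line is roughly of the following form (a sketch, not our exact build command; object files and paths are placeholders):

mpif90 -o app.exe app.o \
    -L${MKLROOT}/lib/intel64 \
    -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 \
    -lmkl_intel_lp64 -lmkl_sequential -lmkl_core \
    -lpthread -lm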

Zhang Z (Intel) wrote:
I'd suggest you do some benchmarking to understand better where the bottleneck is. If the problem does lie in MKL, we can go from there and see how to improve it.

Another piece of information: the HPL benchmark does not use the MKL ScaLAPACK or BLACS routines.

I mentioned HPL because it shows that there is no problem with the InfiniBand connectivity. So I have to suspect MKL, OpenMPI, Hyper-Threading, or the hardware.

Zhang_Z_Intel
Employee

When you have 16 MPI processes on the same node, they communicate through the shared-memory Byte Transfer Layer (BTL). But when you split the 16 MPI processes across two nodes, half of the communication goes through InfiniBand. See the info in the OpenMPI documentation: http://www.open-mpi.org/faq/?category=sm.

You should check what kind of InfiniBand you have on your system and what bandwidth it sustains for the message sizes your application uses. Remember that the QPI links on the E5-2670 run at 8 GT/s. On-node communication through shared memory is in general faster than off-node communication, and this can explain the performance difference you see. Again, MPI profiling can help you pinpoint the communication bottleneck.
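
If it helps, here is a minimal MPI ping-pong sketch in Fortran (the 4 MiB message size and 100 iterations are arbitrary assumptions). Run it with two ranks once on a single node and once with one rank per node and compare the reported bandwidth:

program pingpong
  use mpi
  implicit none
  integer, parameter :: nbytes = 4*1024*1024   ! 4 MiB message -- an arbitrary choice
  integer, parameter :: niter  = 100           ! iteration count -- an arbitrary choice
  integer :: ierr, rank, i
  integer :: status(MPI_STATUS_SIZE)
  double precision :: t0, t1
  character, allocatable :: buf(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  allocate(buf(nbytes))
  buf = 'x'

  ! Ranks 0 and 1 bounce a fixed-size message back and forth.
  if (rank == 0) then
    t0 = MPI_Wtime()
    do i = 1, niter
      call MPI_Send(buf, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, ierr)
      call MPI_Recv(buf, nbytes, MPI_CHARACTER, 1, 1, MPI_COMM_WORLD, status, ierr)
    end do
    t1 = MPI_Wtime()
    print '(A,F10.2,A)', 'sustained bandwidth: ', &
          2.0d0*niter*nbytes/(t1-t0)/1.0d6, ' MB/s'
  else if (rank == 1) then
    do i = 1, niter
      call MPI_Recv(buf, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, status, ierr)
      call MPI_Send(buf, nbytes, MPI_CHARACTER, 0, 1, MPI_COMM_WORLD, ierr)
    end do
  end if

  call MPI_Finalize(ierr)
end program pingpong

A large gap between the single-node and two-node numbers would confirm that the interconnect, not MKL itself, is what your application is paying for.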

Sangamesh_B_
Beginner

I profiled the application with mpiP. The following is the difference between the two runs:

Run 1: 16 mpi processes on single node 

@--- MPI Time (seconds) --------------------------------------------------- 
--------------------------------------------------------------------------- 
Task AppTime MPITime MPI% 
   0 3.61e+03 661 18.32 
   1 3.61e+03 627 17.37 
   2 3.61e+03 700 19.39 
   3 3.61e+03 665 18.41 
   4 3.61e+03 702 19.45 
   5 3.61e+03 703 19.48 
   6 3.61e+03 740 20.50 
   7 3.61e+03 763 21.14 
... 
... 

Run 2: 16 mpi processes on two nodes - 8 mpi processes per node 

@--- MPI Time (seconds) --------------------------------------------------- 
--------------------------------------------------------------------------- 
Task AppTime MPITime MPI% 
   0 1.27e+04 1.06e+04 84.14 
   1 1.27e+04 1.07e+04 84.34 
   2 1.27e+04 1.07e+04 84.20 
   3 1.27e+04 1.07e+04 84.20 
   4 1.27e+04 1.07e+04 84.22 
   5 1.27e+04 1.07e+04 84.25 
   6 1.27e+04 1.06e+04 84.02 
   7 1.27e+04 1.07e+04 84.35 
   8 1.27e+04 1.07e+04 84.29 

The time spent in MPI functions in run 1 is less than 20%, whereas it is more than 80% in run 2. For more details, I've attached both output files. Please go through these files and suggest what optimization we can do with OpenMPI or Intel MKL.

Zhang_Z_Intel
Employee

What is your optimization goal? What do you want to achieve? If one node with 16 ranks gives better performance, why bother with two nodes at all?

The profiling results you collected for your second case (2 nodes with 8 ranks per node) clearly show that MPI collective operations (those involving all MPI ranks), including Alltoall, Allreduce, Alltoallv, and Bcast, account for more than 80% of the application run time. These are the bottlenecks. They are inherently expensive operations because every rank communicates with every other rank simultaneously, and in this case half of the communication has to go through the interconnect, which has lower bandwidth. It is no surprise at all that this case runs much slower. If your goal is to improve the MPI performance of your code, you should consider how to avoid these collective operations as much as possible.
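
Since your code uses ScaLAPACK and BLACS, the process-grid shape and block size chosen when the grid is created are also worth revisiting, because they determine how the matrices are distributed and therefore how much data those collectives move. A minimal sketch of the usual grid setup follows; the 4x4 grid and block size of 64 are assumptions for illustration, not your application's actual values:

program grid_sketch
  implicit none
  integer :: iam, nprocs, ictxt, nprow, npcol, myrow, mycol
  integer, parameter :: nb = 64        ! block size -- an assumption; worth tuning

  ! Query the number of processes available to BLACS.
  call BLACS_PINFO(iam, nprocs)

  ! Choose the process-grid shape; a 4x4 grid for 16 ranks is assumed here.
  ! Grid shape and block size both affect the volume of collective traffic.
  nprow = 4
  npcol = nprocs / nprow

  call BLACS_GET(-1, 0, ictxt)
  call BLACS_GRIDINIT(ictxt, 'R', nprow, npcol)
  call BLACS_GRIDINFO(ictxt, nprow, npcol, myrow, mycol)

  ! ... set up array descriptors with DESCINIT and call the ScaLAPACK
  !     routines the application needs here ...

  call BLACS_GRIDEXIT(ictxt)
  call BLACS_EXIT(0)
end program grid_sketch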
