
Slow MPI communication between MICs

Chaulio_F_

Hi,

I am running my application on a cluster with 2 MICs in each node (Salomon cluster). However, I've noticed that the MPI communication between the two MICs is extremely slow whenever the message size is somewhere between 4KB and 128KB. I've been able to reproduce the problem with the Intel MPI Benchmarks using these arguments: "PingPong -off_cache 16". This is the output of the benchmark with 2 ranks, one on each MIC of the same node:

#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.1 Update 1, MPI-1 part    
#------------------------------------------------------------
# Date                  : Tue Nov  8 16:37:10 2016
# Machine               : k1om
# System                : Linux
# Release               : 2.6.38.8+mpss3.7.1
# Version               : #1 SMP Tue May 24 06:35:05 EDT 2016
# MPI Version           : 3.1
# MPI Thread Environment: 

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down 
# dynamically when a certain run time (per message size sample) 
# is expected to be exceeded. Time limit is defined by variable 
# "SECS_PER_SAMPLE" (=> IMB_settings.h) 
# or through the flag => -time 
  


# Calling sequence was: 

# /home/chaulio/IMB-MPI1.mic PingPong -off_cache 16

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong 
# #processes = 2 
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         5.06         0.00
            1         1000         5.48         0.17
            2         1000         5.54         0.34
            4         1000         5.49         0.69
            8         1000         5.46         1.40
           16         1000         5.58         2.73
           32         1000         6.17         4.95
           64         1000         6.20         9.84
          128         1000         6.81        17.93
          256         1000         7.91        30.86
          512         1000         9.72        50.23
         1024         1000        13.86        70.46
         2048         1000        22.39        87.24
         4096         1000        37.53       104.07
         8192         1000      2130.88         3.67  <----- Problem starts here
        16384         1000      2104.48         7.42
        32768         1000      2235.86        13.98
        65536          640      1027.19        60.85
       131072          320      3057.71        40.88
       262144          160       257.40       971.26
       524288           80       325.60      1535.62
      1048576           40       459.33      2177.11
      2097152           20       737.25      2712.78
      4194304           10      1188.40      3365.88


# All processes entering MPI_Finalize
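
For completeness, the run above was launched roughly like this. The hostnames node1-mic0 and node1-mic1 are just placeholders for the two coprocessors of one node (the real names depend on the cluster), and I believe I_MPI_MIC=enable is needed for MIC runs, so take this as a sketch rather than the exact command:

    $ export I_MPI_MIC=enable
    $ mpirun -n 1 -host node1-mic0 /home/chaulio/IMB-MPI1.mic PingPong -off_cache 16 : \
             -n 1 -host node1-mic1 /home/chaulio/IMB-MPI1.mic PingPong -off_cache 16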

 

Some additional information that might be helpful for those reading this:

  1. I am using the fabrics configuration suggested in the cluster documentation, i.e.:
    $ export I_MPI_FABRICS=shm:dapl
    $ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1
  2. I have experienced the same problem when running the same benchmark on a different cluster, also with 2 MICs per node (SuperMIC cluster).
  3. I have the same problem when trying Host-MIC communication (although it is a bit faster).
  4. If I omit the flag "-off_cache 16" when running the benchmark, this problem doesn't happen. But I am using it intentionally, so that the communication pattern is very similar to that of my application. (Without this flag the benchmark uses a very small buffer, which is not the case in my application.)
  5. At first I thought it was a cache issue, but I also tried the benchmark with two MPI ranks running on the same MIC, and I didn't experience any slow-down then.
  6. I also experimented with setting I_MPI_EAGER_THRESHOLD to different values such as 512, 16KB, and 64KB, among others, and the slow-down for messages larger than 4KB happened in all cases. So the eager protocol threshold does not seem to be the issue. (A sketch of how these runs were launched is shown after this list.)
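
As mentioned in item 6, the eager-threshold experiments were launched roughly as follows. Again, the hostnames are placeholders, I am assuming that passing the variable with -genv makes it reach both ranks, and the value shown is just one of the ones I tried:

    $ mpirun -genv I_MPI_EAGER_THRESHOLD 65536 \
             -n 1 -host node1-mic0 /home/chaulio/IMB-MPI1.mic PingPong -off_cache 16 : \
             -n 1 -host node1-mic1 /home/chaulio/IMB-MPI1.mic PingPong -off_cache 16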

 

Has anyone experienced the same problem? Could you please try to reproduce it in other environments and check whether you are able to solve it? Maybe there is an Intel MPI option that could fix this? I would appreciate any advice on this issue.

Thank you very much,

Chaulio
