I'm getting bad performance with MPI_Put (and MPI_Get) in IMB-RMA All_put_all microbenchmark on this system configuration:
- Single and multiple Xeon Phi coprocessors
- Intel MPSS 3.5.1 (June 2015), Linux
- Intel MPI Library 5.1.0.079
- OFED-3.12-1 or OFED-3.18-rc3 (It doesn't really matter.)
Intel MPI runtime environment variables:
export I_MPI_MIC=1
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
export I_MPI_PIN_MODE=lib
export I_MPI_PIN_CELL=core
export I_MPI_ADJUST_BARRIER=4
My results are below. IMB-RMA All_get_all is even worse: 209616.98 microseconds for the 1-byte all_get_all case.
$ mpirun -hosts mic0 -n 30 IMB-RMA All_put_all -npmin 30

#bytes     Intel MPI 5.1 [usec]   MPICH 3.1.4 [usec]
1          167157.12              1645.26
2          156630.21              1638.76
4          170201.55              1744.68
8          157363.03              1795.65
16         167803.04              1918.96
32         167826.86              1421.19
64         168686.47              1852.14
128        168729.71              2477.16
256        177143.31              1922.09
512        175115.94              2242.02
1024       160964.3               2603.5
2048       162915.96              3565.54
4096       178165.97              7120.21
8192       148391.07              9664.84
16384      (timeout)              5854.41
32768      -                      8571.92
65536      -                      9698
131072     -                      16402.45
262144     -                      35356.35
524288     -                      82430.8
1048576    -                      137650.29
2097152    -                      275713.1
4194304    -                      430387.23
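For anyone trying to reproduce this, the single-coprocessor runs can be captured in one script, sketched below; it only combines the exports and the command above, plus the All_get_all case mentioned. The absolute mpirun and IMB-RMA paths are taken from later posts in this thread (Intel MPI 5.1.0.079) and may differ on your system.

#!/bin/sh
# Reproduction sketch for the single-coprocessor All_put_all / All_get_all runs.
# Paths follow the Intel MPI 5.1.0.079 layout used later in this thread;
# adjust them for your installation.
export I_MPI_MIC=1
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
export I_MPI_PIN_MODE=lib
export I_MPI_PIN_CELL=core
export I_MPI_ADJUST_BARRIER=4

MPIRUN=/opt/intel/impi/5.1.0.079/bin64/mpirun
IMB=/opt/intel/impi/5.1.0.079/mic/bin/IMB-RMA

# All_put_all, plus the All_get_all case mentioned above
$MPIRUN -hosts mic0 -n 30 $IMB All_put_all -npmin 30
$MPIRUN -hosts mic0 -n 30 $IMB All_get_all -npmin 30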
More numbers, this time with multiple coprocessors. My applications need to run with -ppn 60, but cannot because of the extremely high latency of this communication pattern; the numbers here are with -ppn 10. The average time is also suspiciously low given the maximum time and the number of repetitions.
/opt/intel/impi/5.1.0.079/bin64/mpirun -hosts mic0,mic1 -ppn 10 -n 20 /opt/intel/impi/5.1.0.079/mic/bin/IMB-RMA All_put_all -npmin 20

#----------------------------------------------------------------
# Benchmarking All_put_all
# #processes = 20
#----------------------------------------------------------------
#bytes  #repetitions    t_min[usec]     t_max[usec]     t_avg[usec]
     0          1000           9.01            9.03            9.01
     1             1     6044455.05     25502341.03     12162591.02
     2             1     6043900.01     63057655.10     14085288.83
     4             1     6047574.04    119880264.04     16803435.92
     8    (timed out)
What do you get if you set no environment variables other than I_MPI_MIC=1? It strikes me as very strange that the times are not just long but keep bouncing around the same values.
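A minimal-environment check along those lines might look like the sketch below; the unset list simply mirrors the variables from the original post, and the run command is copied from it.

# Drop the tuning variables from the original post, keep only I_MPI_MIC=1,
# and let Intel MPI choose its defaults.
unset I_MPI_DEBUG I_MPI_FABRICS I_MPI_DAPL_PROVIDER \
      I_MPI_PIN_MODE I_MPI_PIN_CELL I_MPI_ADJUST_BARRIER
export I_MPI_MIC=1
mpirun -hosts mic0 -n 30 IMB-RMA All_put_all -npmin 30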
Performance is about the same with these environment variables:
export I_MPI_MIC=1
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
I have to keep I_MPI_DAPL_PROVIDER set; otherwise SCIF connection errors occur because of a recently installed Mellanox InfiniBand ConnectX card:
...
mic1.hostname:MCM:1437:ab2e7b40: 3813 us(3813 us): scif_connect() to port 68, failed with error Connection refused
mic1.hostname:MCM:1437:ab2e7b40: 3879 us(66 us): open_hca: SCIF init ERR for mlx4_0
...
These are the results:
/opt/intel/impi/5.1.0.079/bin64/mpirun -hosts mic0,mic1 -ppn 10 -n 20 /opt/intel/impi/5.1.0.079/mic/bin/IMB-RMA All_put_all -npmin 20

#----------------------------------------------------------------
# Benchmarking All_put_all
# #processes = 20
#----------------------------------------------------------------
#bytes  #repetitions    t_min[usec]     t_max[usec]     t_avg[usec]
     0          1000           9.00            9.02            9.01
     1             1     5862047.91     16005118.13     11632434.27
     2             1     6052767.04     51715856.08     13243087.48
     4             1     5856038.09     79225661.04     14525348.28
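As a side note, the provider can also be selected per run on the mpirun command line instead of via export; a sketch using Intel MPI's -genv option, which sets an environment variable for all ranks, with the paths from the run above:

/opt/intel/impi/5.1.0.079/bin64/mpirun \
    -genv I_MPI_MIC 1 -genv I_MPI_DAPL_PROVIDER ofa-v2-scif0 \
    -hosts mic0,mic1 -ppn 10 -n 20 \
    /opt/intel/impi/5.1.0.079/mic/bin/IMB-RMA All_put_all -npmin 20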
Hi Bryant,
Have you already created an Intel Premier support ticket for this issue?
Also, could you please provide your MPICH 3.1.4 build options? They may help us reproduce the performance results.
Hi Artem. I did create an Intel Premier support ticket (Issue ID: 6000118541). The support ticket has an Excel spreadsheet attached to it that has more information about build options, system setup, etc.
MPICH build options were:
mkdir build && cd build
PREFIX=/opt/xeon-phi/mpich
../configure CC="icc -mmic" CXX="icpc -mmic" FC="ifort -mmic" F77="ifort -mmic" \
    LIBS="-lscif" \
    --prefix=$PREFIX --host=x86_64-k1om-linux \
    --disable-fortran --disable-cxx --disable-romio \
    --with-device=ch3:nemesis:scif,tcp
Note that it doesn't matter which NETMOD is used. Both SCIF and TCP perform the same for the provided intra-coprocessor results.
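For completeness, the MPICH 3.1.4 comparison numbers could be collected along the lines sketched below. The IMB-RMA benchmark would have to be rebuilt against this MPICH install, and the binary name and location shown here are assumptions rather than details from the thread.

# Put the MPICH 3.1.4 install built above (prefix from the configure line) first in PATH
export PATH=/opt/xeon-phi/mpich/bin:$PATH
# Run an IMB-RMA binary built against this MPICH on the coprocessor
mpiexec -hosts mic0 -n 30 ./IMB-RMA All_put_all -npmin 30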
Hi all,
Sorry for the delay in response (vacation :) ).
Could you try setting the environment variable
I_MPI_SCALABLE_OPTIMIZATION=off
and running the benchmarks again?
Thank you,
--Sergey
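For reference, applying this suggestion to the earlier multi-coprocessor run is just a matter of exporting the variable before mpirun; a sketch using the paths from earlier in the thread:

export I_MPI_SCALABLE_OPTIMIZATION=off
/opt/intel/impi/5.1.0.079/bin64/mpirun -hosts mic0,mic1 -ppn 10 -n 20 \
    /opt/intel/impi/5.1.0.079/mic/bin/IMB-RMA All_put_all -npmin 20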
Hi,
Glad to see this trick helped. Unfortunately, some 'optimizations' can negatively affect the performance of some parts of the library. We will look into how to improve the performance of Intel MPI.
Thank you for your help.
--Sergey
Thanks Sergey. Results are a lot better now. Any other tips are appreciated!
# mic0 only
#----------------------------------------------------------------
# Benchmarking All_put_all
# #processes = 20
#----------------------------------------------------------------
#bytes  #repetitions    t_min[usec]     t_max[usec]     t_avg[usec]
     0          1000           9.01            9.28            9.04
     1          1000        1958.35         6149.54         5213.35
     2          1000        1933.73         5842.54         4912.68
     4          1000        1844.58         4922.10         4166.31
     8          1000         813.38         5254.21         3675.51
    16          1000        1990.48         4291.77         3553.58
    32          1000        2118.31         4424.42         4174.93
    64          1000        1914.85         5242.47         4364.39
   128          1000        2135.56         5652.61         4803.70
   256          1000         496.82         2641.65         1850.21
   512          1000         458.69         3307.59         2266.34
  1024          1000         409.40         3778.35         2688.41
  2048          1000         726.40         5068.91         4359.04
  4096          1000         635.39         5475.80         3997.22
  8192          1000         438.93         7427.49         4597.22
 16384          1000         835.50         5956.76         4511.96
 32768          1000        2124.04         8104.35         5886.51
 65536           640        1669.24         6120.93         4126.35
131072           320        5436.16        11125.58         8086.85
262144           160        6827.09        16458.80        12331.01
524288            80       16029.50        29737.80        23515.52
1048576           40       29509.77        56208.28        44146.67
2097152           20       45546.20       102373.10        76828.56
4194304           10      106930.99       199623.89       150745.65
# mic0 and mic1, ppn=10
#----------------------------------------------------------------
# Benchmarking All_put_all
# #processes = 20
#----------------------------------------------------------------
#bytes  #repetitions    t_min[usec]     t_max[usec]     t_avg[usec]
     0          1000           9.01            9.04            9.02
     1          1000         938.84         2009.12         1647.31
     2          1000        1263.50         3198.81         2661.80
     4          1000         737.21         2470.52         2001.17
     8          1000        1114.56         3115.82         2308.15
    16          1000        1005.04         2620.89         2126.73
    32          1000        1118.84         3075.02         2413.24
    64          1000         467.71         2009.17         1557.10
   128          1000         641.46         2397.41         1531.80
   256          1000         801.47         2340.76         2061.25
   512          1000         706.79         3172.71         1982.05
  1024          1000         999.00         3606.51         2265.85
  2048          1000        1646.90         3568.09         2741.84
  4096          1000        2990.36         5275.52         4402.02
  8192           405       15819.40        25067.44        21854.30
 16384           405       15215.45        24422.11        21221.95
 32768           387       15266.97        26824.13        23057.98
 65536           363       18548.76        28672.06        24969.52
131072           280       26215.21        36583.42        31850.96
262144           160       33443.30        59622.76        49245.38
524288            80       26353.59        47578.89        40713.24
1048576           40       27929.85        88806.33        76020.28
2097152           20       42961.11       112424.41        92187.03
4194304           10       77796.48       202296.09       159834.60
EDIT 1 (2015-08-12 10:05): Updated results for the mic0,mic1 run.
