Software Archive

Performance issues with Intel MPI (barriers) between Xeon Phi coprocessors

Bryant_L_
New Contributor I

I'm getting bad performance with MPI barriers in a microbenchmark on this system configuration:

  • multiple Xeon Phi coprocessors
  • Intel MPSS 3.5 (April 2015), Linux
  • Intel MPI 5.0 update 3
  • OFED-3.12-1
export I_MPI_MIC=1
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
export I_MPI_PIN_MODE=lib
export I_MPI_PIN_CELL=core
/opt/intel/impi/5.0.3.048/intel64/bin/mpirun -hosts mic0,mic1 -ppn 30 -n 60 ./exe

(( omitted many debug lines: DAPL provider selection and processor pinning are occurring correctly ))
[0] MPI startup(): I_MPI_DAPL_PROVIDER=ofa-v2-scif0
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:dapl
[0] MPI startup(): I_MPI_MIC=1
[0] MPI startup(): I_MPI_PIN_MAPPING=30:0 1,1 9,2 17,3 25,4 33,5 41,6 49,7 57,8 65,9 73,10 81,11 89,12 97,13 105,14 113,15 121,16 129,17 137,18 145,19 153,20 161,21 169,22 177,23 185,24 193,25 201,26 209,27 217,28 225,29 0

# OSU MPI Barrier Latency Test
# Avg Latency(us)
          1795.31

I'm not sure whether these results are to be expected for the mpirun configuration I've used (two coprocessors, ppn=30).

Additional results (ppn=60, 120 PEs):

/opt/intel/impi/5.0.3.048/intel64/bin/mpirun -hosts mic0,mic1 -ppn 60 -n 120 ./exe
# OSU MPI Barrier Latency Test
# Avg Latency(us)
          5378.48
6 Replies
Bryant_L_
New Contributor I

Some more system details: there is no InfiniBand card in this system. Should the link_layer for scif0 on the host report Ethernet or InfiniBand in this configuration? Most examples I can find show the link_layer reported as InfiniBand for scif0, but those were systems with a Mellanox card (mlx4_0).

host$ ibv_devinfo
hca_id: scif0
        transport:                      iWARP (1)
        fw_ver:                         0.0.1
        node_guid:                      4c79:baff:fe34:0f11
        sys_image_guid:                 4c79:baff:fe34:0f11
        vendor_id:                      0x8086
        vendor_part_id:                 0
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1000
                        port_lmc:               0x00
                        link_layer:             Ethernet

host$ ssh mic0 ibv_devinfo
hca_id: scif0
        transport:                      SCIF (2)
        fw_ver:                         0.0.1
        node_guid:                      4c79:baff:fe34:0f10
        sys_image_guid:                 4c79:baff:fe34:0f10
        vendor_id:                      0x8086
        vendor_part_id:                 0
        hw_ver:                         0x1
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1001
                        port_lmc:               0x00
                        link_layer:             SCIF

 

Frances_R_Intel
Employee

I hate to admit it, but I had never noticed that in some cases the link layer shows up as InfiniBand and in other cases as Ethernet. It is not just a matter of whether an adapter is present.

While I am thinking about this (or until someone more knowledgeable chimes in), have you tried running the benchmarks Intel includes with Intel MPI? How did those perform? Is it only the barrier test that is a problem?

Bryant_L_
New Contributor I

Looking over the examples again, the correlation I thought I saw (scif0's link_layer appearing as InfiniBand because a hardware card is present) doesn't appear to hold. In any case, it's not much of an issue unless it's contributing to this barrier performance anomaly; it was just something I found interesting and wanted to see if others knew more about.

I hadn't considered the Intel MPI benchmarks, so I ran some numbers at your suggestion (-hosts mic0,mic1 -ppn 30 -n 60):

host$  mpirun -hosts mic0,mic1 -ppn 30 -n 60 \
  /opt/intel/impi/5.0.3.048/mic/bin/IMB-MPI1 Barrier
[0] MPI startup(): Multi-threaded optimized library
[58] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
... omitted
[8] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[34] MPI startup(): DAPL provider ofa-v2-scif0
[45] MPI startup(): DAPL provider ofa-v2-scif0
[44] MPI startup(): DAPL provider ofa-v2-scif0
[34] MPI startup(): shm and dapl data transfer modes
[44] MPI startup(): shm and dapl data transfer modes
[45] MPI startup(): shm and dapl data transfer modes
... omitted
[31] MPI startup(): DAPL provider ofa-v2-scif0
[46] MPI startup(): DAPL provider ofa-v2-scif0
[39] MPI startup(): DAPL provider ofa-v2-scif0
[5] MPI startup(): shm and dapl data transfer modes
[6] MPI startup(): shm and dapl data transfer modes
[11] MPI startup(): shm and dapl data transfer modes
[31] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[31] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
... omitted
[0] MPI startup(): Rank    Pid      Node name               Pin cpu
... affinity setting
[0] MPI startup(): I_MPI_DAPL_PROVIDER=ofa-v2-scif0
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:dapl
[0] MPI startup(): I_MPI_MIC=1
[0] MPI startup(): I_MPI_PIN_MAPPING=30:0 1,1 9,2 17,3 25,4 33,5 41,6 49,7 57,8 65,9 73,10 81,11 89,12 97,13 105,14 113,15 121,16 129,17 137,18 145,19 153,20 161,21 169,22 177,23 185,24 193,25 201,26 209,27 217,28 225,29 0
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.0 Update 1, MPI-1 part    
#------------------------------------------------------------
# Date                  : Mon Jun 29 19:38:26 2015
# Machine               : k1om
# System                : Linux
# Release               : 2.6.38.8+mpss3.5
# Version               : #1 SMP Thu Apr 2 02:17:02 PDT 2015
# MPI Version           : 3.0
# MPI Thread Environment: 

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down 
# dynamically when a certain run time (per message size sample) 
# is expected to be exceeded. Time limit is defined by variable 
# "SECS_PER_SAMPLE" (=> IMB_settings.h) 
# or through the flag => -time 
  


# Calling sequence was: 

# /opt/intel/impi/5.0.3.048/mic/bin/IMB-MPI1 Barrier

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# Barrier

#---------------------------------------------------
# Benchmarking Barrier 
# #processes = 2 
# ( 58 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000         4.63         4.64         4.63

#---------------------------------------------------
# Benchmarking Barrier 
# #processes = 4 
# ( 56 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        10.53        10.53        10.53

#---------------------------------------------------
# Benchmarking Barrier 
# #processes = 8 
# ( 52 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        15.86        15.87        15.87

#---------------------------------------------------
# Benchmarking Barrier 
# #processes = 16 
# ( 44 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        21.68        21.69        21.68

#---------------------------------------------------
# Benchmarking Barrier 
# #processes = 32 
# ( 28 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000       445.00       445.35       445.27

#---------------------------------------------------
# Benchmarking Barrier 
# #processes = 60 
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000      1875.15      1877.34      1876.12


# All processes entering MPI_Finalize

At n=60, Intel MPI reports about 1876 us of barrier latency, approximately the same as the OSU results above. My baselines for comparison are MPICH 3.1.4 built with ch3:nemesis:scif (280 us) and my own OpenSHMEM library, which I've been developing for multiple Xeon Phi coprocessors over SCIF (14 us).
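
In case it's useful as context for the comparison, the MPICH baseline is built with the scif netmod roughly along these lines. This is a sketch only; the cross-compilation settings for the k1om target (host triplet, compiler flags, prefix) are placeholders rather than a verified recipe:

# Sketch: MPICH 3.1.4 with the SCIF netmod, cross-compiled for the coprocessors
# (configure variables below are placeholders; adapt to your toolchain)
./configure --with-device=ch3:nemesis:scif \
            --host=x86_64-k1om-linux \
            CC="icc -mmic" CXX="icpc -mmic" FC="ifort -mmic" \
            --prefix=/opt/mpich-3.1.4-mic
make && make install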

So far, barrier performance seems to be the only result out of line with expectations. Here are some of the other results from IMB-MPI1. If it helps, the MPI library and compilers are from Parallel Studio XE Cluster Edition Update 3. Thanks for your help!

Runtime Configuration: -hosts mic0,mic1 -ppn 1 -n 2

#---------------------------------------------------
# Benchmarking PingPong 
# #processes = 2 
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        21.58         0.00
            1         1000        21.67         0.04
            2         1000        16.63         0.11
            4         1000        16.73         0.23
            8         1000        16.70         0.46
           16         1000        16.82         0.91
           32         1000        17.78         1.72
           64         1000        17.86         3.42
          128         1000        18.53         6.59
          256         1000        19.21        12.71
          512         1000        20.15        24.23
         1024         1000        21.75        44.90
         2048         1000        24.78        78.81
         4096         1000        31.99       122.12
         8192         1000       164.00        47.64
        16384         1000       192.12        81.33
        32768         1000       165.40       188.94
        65536          640       227.15       275.14
       131072          320       591.85       211.20
       262144          160       240.43      1039.81
       524288           80       287.78      1737.44
      1048576           40       409.79      2440.29
      2097152           20       630.00      3174.62
      4194304           10       969.10      4127.54

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# #processes = 2 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        23.97        23.98        23.98         0.00
            1         1000        24.82        24.82        24.82         0.08
            2         1000        24.80        24.81        24.80         0.15
            4         1000        24.85        24.86        24.86         0.31
            8         1000        24.87        24.88        24.88         0.61
           16         1000        24.88        24.88        24.88         1.23
           32         1000        26.83        26.84        26.83         2.27
           64         1000        26.84        26.84        26.84         4.55
          128         1000        27.49        27.49        27.49         8.88
          256         1000        28.36        28.37        28.37        17.21
          512         1000        29.42        29.42        29.42        33.20
         1024         1000        30.99        30.99        30.99        63.03
         2048         1000        33.95        33.96        33.95       115.02
         4096         1000        41.27        41.28        41.27       189.27
         8192         1000       200.06       200.08       200.07        78.09
        16384         1000       245.54       245.56       245.55       127.26
        32768         1000       202.43       202.44       202.43       308.74
        65536          640       179.31       179.34       179.32       697.02
       131072          320       194.58       194.61       194.59      1284.65
       262144          160       266.17       266.23       266.20      1878.06
       524288           80       333.99       334.20       334.09      2992.22
      1048576           40       463.47       463.65       463.56      4313.57
      2097152           20       608.85       610.26       609.55      6554.62
      4194304           10      1024.10      1024.51      1024.31      7808.62

#-----------------------------------------------------------------------------
# Benchmarking Exchange 
# #processes = 2 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        48.23        48.24        48.24         0.00
            1         1000        49.82        49.84        49.83         0.08
            2         1000        49.89        49.89        49.89         0.15
            4         1000        49.92        49.93        49.93         0.31
            8         1000        50.07        50.07        50.07         0.61
           16         1000        50.08        50.09        50.09         1.22
           32         1000        53.63        53.65        53.64         2.28
           64         1000        53.88        53.88        53.88         4.53
          128         1000        55.43        55.44        55.43         8.81
          256         1000        56.83        56.84        56.84        17.18
          512         1000        58.72        58.73        58.72        33.26
         1024         1000        61.50        61.52        61.51        63.50
         2048         1000        67.96        67.97        67.97       114.94
         4096         1000        82.46        82.49        82.48       189.41
         8192         1000       406.44       406.45       406.44        76.89
        16384         1000       447.09       447.11       447.10       139.79
        32768         1000       415.22       415.24       415.23       301.03
        65536          640       363.31       363.32       363.32       688.09
       131072          320       400.70       400.73       400.72      1247.71
       262144          160       547.68       547.73       547.70      1825.71
       524288           80       676.29       676.60       676.44      2955.96
      1048576           40       938.20       938.27       938.24      4263.15
      2097152           20      1301.10      1302.55      1301.83      6141.79
      4194304           10      2078.39      2080.80      2079.59      7689.36

#----------------------------------------------------------------
# Benchmarking Allreduce 
# #processes = 2 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.55         0.59         0.57
            4         1000        32.70        32.70        32.70
            8         1000        32.64        32.64        32.64
           16         1000        32.76        32.76        32.76
           32         1000        34.72        34.73        34.72
           64         1000        34.96        34.97        34.97
          128         1000        37.40        37.41        37.41
          256         1000        38.32        38.32        38.32
          512         1000        39.54        39.54        39.54
         1024         1000        41.38        41.39        41.38
         2048         1000        45.91        45.91        45.91
         4096         1000        55.61        55.61        55.61
         8192         1000       103.17       103.18       103.18
        16384         1000       385.73       385.75       385.74
        32768         1000       742.72       742.75       742.73
        65536          640      1282.00      1282.04      1282.02
       131072          320      2346.32      2346.40      2346.36
       262144          160      4770.80      4770.97      4770.89
       524288           80      8553.10      8553.42      8553.26
      1048576           40     17021.97     17022.68     17022.32
      2097152           20      7646.10      7658.95      7652.52
      4194304           10     12146.31     12148.62     12147.46

#---------------------------------------------------
# Benchmarking Barrier 
# #processes = 2 
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        26.09        26.09        26.09

[other tests available on request]
Bryant_L_
New Contributor I

I tried changing the Intel compiler suite I was using (I completely uninstalled one and installed the other). These are the barrier-latency numbers I get for the two suites (see EDIT 1; a sketch of how I switch between installations follows the table). Notably, I tested Intel MPI 4.1.3.04{8,9} and Intel MPI 5.0.3.048 with BOTH suites, and it did not affect the numbers at all. This led me to believe that Intel MPI itself was not causing the performance difference, but rather something else installed alongside the suite.

    mpirun options                            Intel MPI 4.1.3.049    Intel MPI 5.0.3.048
                                              (avg latency, us)      (avg latency, us)
-hosts mic0,mic1 -ppn 30 -n 60                      215.84                 1848.86
-hosts mic0,mic2 -ppn 30 -n 60                      214.67                 1906.14
-hosts mic0,mic3 -ppn 30 -n 60                      283.05                 1828.22
-hosts mic0,mic1 -ppn 60 -n 120                    4828.81                 4733.04
-hosts mic0,mic2 -ppn 60 -n 120                    4676.96                 4636.11
-hosts mic0,mic3 -ppn 60 -n 120                    4588.63                 4304.11
-hosts mic0,mic1,mic2,mic3 -ppn 15 -n 60            229.26                  920.36
-hosts mic0,mic1,mic2,mic3 -ppn 60 -n 240           436.72                 6099.88
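
For these comparisons I switch between the two installations by sourcing the corresponding mpivars.sh before each run, roughly as sketched below (the 4.1.3.049 path assumes the default install location on my system; adjust as needed):

# Select Intel MPI 4.1.3.049 for this shell (path is assumed, not verified here)
source /opt/intel/impi/4.1.3.049/intel64/bin/mpivars.sh
mpirun -hosts mic0,mic1 -ppn 30 -n 60 ./exe

# ...or select Intel MPI 5.0.3.048
source /opt/intel/impi/5.0.3.048/intel64/bin/mpivars.sh
mpirun -hosts mic0,mic1 -ppn 30 -n 60 ./exe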

I also tested with OFED 3.12-1 and OFED 3.18-rc3, with no difference in performance from the numbers above.

Here's some system information:

  • CentOS 6.6
  • Kernel 2.6.32-504.23.4.el6.x86_64
  • Dual-socket Super Micro motherboard (X9DRG-QF)
  • Two Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
  • Four Xeon Phi 5110P coprocessors (mic0 and mic1 are separated from mic2 and mic3 by Intel QPI through the host processors)

I'm talking with Steve from Intel Premier Support, and he mentioned that his system configuration performs at 292 us (-ppn 30 -n 60) and 421 us (-ppn 60 -n 120), which is much closer to the numbers I would expect. Any further suggestions would be greatly appreciated!

EDIT 1 (2015-07-05-00-46 EST): I uninstalled both development suites and tested Intel MPI on its own, using the runtime distributables and the included IMB-MPI1. The respective Intel MPI versions from the two development suites match the results in the table above, so at least I've now traced the performance issue to the Intel MPI versions specifically.

Artem_R_Intel1
Employee

Hi Bryant,
I've reproduced the performance regression on my system with the IMB Barrier test.
I'll create an internal tracker for the issue.
Thanks for reporting it!

Bryant_L_
New Contributor I

Thanks Artem. This issue is related to an Intel Premier support ticket with Issue ID: 6000114596.

I was able to get a sufficient resolution by setting I_MPI_ADJUST_BARRIER=4 (the topology-aware recursive-doubling algorithm), but the underlying issue may affect the Xeon Phi more broadly than this single environment variable addresses; other algorithms or operations may also be affected.
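
For reference, this is the workaround applied to my original launch; the environment variables are the same as in my first post, with only the barrier algorithm selection added:

# Force barrier algorithm 4 (topology-aware recursive doubling)
export I_MPI_ADJUST_BARRIER=4
# Same launch configuration as the original report
export I_MPI_MIC=1
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
export I_MPI_PIN_MODE=lib
export I_MPI_PIN_CELL=core
/opt/intel/impi/5.0.3.048/intel64/bin/mpirun -hosts mic0,mic1 -ppn 30 -n 60 ./exe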

On a related note, can you also look at the performance of:

mpirun -hosts mic0 -np 30 IMB-RMA All_put_all -npmin 30

See this thread:  https://software.intel.com/en-us/forums/topic/562296
