Software Archive
Read-only legacy content

Switching point between DAPL providers

Andrey_Vladimirov
New Contributor III

This article by James Tullos explains how to use a list of DAPL providers in Intel MPI:

http://software.intel.com/en-us/articles/using-multiple-dapl-providers-with-the-intel-mpi-library

The first provider is used for small messages, the second for large messages within a node, and the third for large messages between nodes. My question is: how do I control the switching point between the providers?
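For reference, the setup from the article boils down to something like this (a minimal sketch; the provider names are the ones available on my systems and may differ elsewhere):

[plain]
# Minimal sketch of the multi-provider setup from the article
# (provider names are from my systems; adjust for your fabric)
export I_MPI_FABRICS=dapl
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0,ofa-v2-mcm-1
# 1st provider: small messages, 2nd: large intra-node, 3rd: large inter-node
[/plain]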

 

7 Replies
Andrey_Vladimirov
New Contributor III

[accidental post, deleted]

Andrey_Vladimirov
New Contributor III

Just wanted to clarify to avoid the message being moved to a different forum: this is a Xeon Phi-related problem.

We have a system on which the DAPL provider list works as expected. Namely, with I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0,ofa-v2-mcm-1 we get the short latency of "mlx4" and the high bandwidth of "scif0" in PingPong between mic0 and mic1.

However, on another system, Intel MPI seems to be stuck with ofa-v2-mlx4_0-1 for all message sizes, which results in very low bandwidth for large messages (~200 MB/s between mic0 and mic1). At the same time, there are no complaints in the MPI debug information (I_MPI_DEBUG=5) about any of the three providers. Furthermore, this system can use ofa-v2-scif0 if it is specified as the only DAPL provider. This gets good bandwidth, but has greater latency for short messages than "mlx4".

So I was thinking that maybe the switching point is set by some default (and possibly incorrect) heuristic, and that I need to enforce a switching point between providers explicitly in order to get it "unstuck" from the "mlx4" provider at large message sizes.
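For reference, this is roughly how I check which provider gets picked (a sketch; the exact wording of the I_MPI_DEBUG output may vary between MPI versions):

[plain]
# Sketch: run PingPong with debug output and look at the DAPL provider lines
export I_MPI_MIC=1
export I_MPI_FABRICS=dapl
export I_MPI_DEBUG=5
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0,ofa-v2-mcm-1
mpirun -np 1 -host mic0 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong : \
       -np 1 -host mic1 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 2>&1 | grep -i dapl
[/plain]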

Loc_N_Intel
Employee

Hi Andrey,

Let me contact the author of that article or someone on the MPI team to answer for you. Thank you.

Gergana_S_Intel
Employee

Hey Andrey,

We provide a detailed description of the I_MPI_DAPL_PROVIDER_LIST environment variable, and of the appropriate ways to change the threshold values, in the Intel® MPI Library Reference Manual. There is a dedicated section for Intel® Xeon Phi™ Coprocessor Support, and the details are listed under Environment Variables there.

In short, you can use the I_MPI_DAPL_DIRECT_COPY_THRESHOLD variable to change the provider thresholds. You can give that a try, but I doubt it's an issue with the cutoffs. Are both of your systems set up the same? For example, are the hostnames of your Phi cards <host>-mic0 and <host>-mic1? Are there any significant differences between the two systems?
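For example, something like the following (the values are purely illustrative; with a provider list you give one threshold per provider, as described in the Reference Manual):

[plain]
# Illustrative only: one threshold value per provider in I_MPI_DAPL_PROVIDER_LIST
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0,ofa-v2-mcm-1
export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=20000,200000,200000
[/plain]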

If you hit problems, providing the debug output will be helpful.

Let us know how it goes,
~Gergana

Andrey_Vladimirov
New Contributor III

Hi Gergana,

Thanks! I would very much appreciate some help in getting this resolved. I have experimented with I_MPI_DAPL_DIRECT_COPY_THRESHOLD, but it does not change the results.

The two systems are very different:

- The one that works as expected contains 4 Xeon Phi coprocessors, and the one that has problems contains 8.

- The one that works has a single-port Mellanox FDR card; the one that does not has a dual-port Mellanox FDR card with only one port connected to the switch (connecting both ports to the switch does not resolve the issue; I checked).

- However, both systems are running the same CentOS 6.5, MPSS 3.1.2, Cluster Studio 14.0.0.080 Build 20130728, and Intel MPI 4.1.1.036, and networking is configured similarly (external Ethernet bridge with a static IP, NFS sharing of /opt/intel and /home).

I am sorry for the long listings below, but I thought I would rather give more information than less.


Here is my /etc/hosts:

[plain]
[root@c002-n002 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.32.0.1 mgmt
10.33.2.2 c002-n002
10.34.2.2 c002-n002-ib0
10.33.2.22    c002-n002-mic0 mic0
10.33.2.42    c002-n002-mic1 mic1
10.33.2.62    c002-n002-mic2 mic2
10.33.2.82    c002-n002-mic3 mic3
10.33.2.102    c002-n002-mic4 mic4
10.33.2.122    c002-n002-mic5 mic5
10.33.2.142    c002-n002-mic6 mic6
10.33.2.162    c002-n002-mic7 mic7
[/plain]

 

Here is my ifconfig:

[plain]
[cfxuser@c002-n002 ~]$ ifconfig
br0       Link encap:Ethernet  HWaddr 00:E0:81:E4:84:EC  
          inet addr:10.33.2.2  Bcast:10.35.255.255  Mask:255.252.0.0
          inet6 addr: fe80::2e0:81ff:fee4:84ec/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5043 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3731 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:634298 (619.4 KiB)  TX bytes:733850 (716.6 KiB)

eth0      Link encap:Ethernet  HWaddr 00:E0:81:E4:84:EC  
          inet6 addr: fe80::2e0:81ff:fee4:84ec/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:103979 errors:0 dropped:0 overruns:0 frame:0
          TX packets:36209 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:7326461 (6.9 MiB)  TX bytes:8302421 (7.9 MiB)
          Memory:f7720000-f7740000

Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly.
Ifconfig is obsolete! For replacement check ip.
ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:10.34.2.2  Bcast:10.35.255.255  Mask:255.254.0.0
          inet6 addr: fe80::2e0:8100:2a:d033/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5 errors:0 dropped:4 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:0 (0.0 b)  TX bytes:380 (380.0 b)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:237 errors:0 dropped:0 overruns:0 frame:0
          TX packets:237 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12400 (12.1 KiB)  TX bytes:12400 (12.1 KiB)

mic0      Link encap:Ethernet  HWaddr 4C:79:BA:1A:0E:E1  
          inet6 addr: fe80::4e79:baff:fe1a:ee1/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:16918 errors:0 dropped:0 overruns:0 frame:0
          TX packets:88688 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1855865 (1.7 MiB)  TX bytes:12611321 (12.0 MiB)

mic0:ib   Link encap:Ethernet  HWaddr 4C:79:BA:1A:0E:E1  
          inet addr:192.0.2.100  Bcast:0.0.0.0  Mask:255.255.255.0
          UP BROADCAST RUNNING  MTU:1500  Metric:1

mic1      Link encap:Ethernet  HWaddr 4C:79:BA:1A:0C:CD  
          inet6 addr: fe80::4e79:baff:fe1a:ccd/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:15700 errors:0 dropped:0 overruns:0 frame:0
          TX packets:87998 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1696077 (1.6 MiB)  TX bytes:12587620 (12.0 MiB)

mic2      Link encap:Ethernet  HWaddr 4C:79:BA:1A:0D:3F  
          inet6 addr: fe80::4e79:baff:fe1a:d3f/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:592 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70864 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:111099 (108.4 KiB)  TX bytes:4454779 (4.2 MiB)

mic3      Link encap:Ethernet  HWaddr 4C:79:BA:1A:0D:85  
          inet6 addr: fe80::4e79:baff:fe1a:d85/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:592 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70864 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:111099 (108.4 KiB)  TX bytes:4454827 (4.2 MiB)

mic4      Link encap:Ethernet  HWaddr 4C:79:BA:1A:0E:09  
          inet6 addr: fe80::4e79:baff:fe1a:e09/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:592 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70865 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:111099 (108.4 KiB)  TX bytes:4454869 (4.2 MiB)

mic5      Link encap:Ethernet  HWaddr 4C:79:BA:1A:0D:81  
          inet6 addr: fe80::4e79:baff:fe1a:d81/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:591 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70877 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:111057 (108.4 KiB)  TX bytes:4455571 (4.2 MiB)

mic6      Link encap:Ethernet  HWaddr 4C:79:BA:1A:0F:4F  
          inet6 addr: fe80::4e79:baff:fe1a:f4f/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:591 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70863 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:111045 (108.4 KiB)  TX bytes:4454725 (4.2 MiB)

mic7      Link encap:Ethernet  HWaddr 4C:79:BA:1A:0E:C7  
          inet6 addr: fe80::4e79:baff:fe1a:ec7/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:591 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70877 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:111057 (108.4 KiB)  TX bytes:4455523 (4.2 MiB)
[/plain]


Here is my ibstatus:

[plain]
[cfxuser@c002-n002 ~]$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
    default gid:     fe80:0000:0000:0000:00e0:8100:002a:d033
    base lid:     0x5
    sm lid:         0x1
    state:         4: ACTIVE
    phys state:     5: LinkUp
    rate:         40 Gb/sec (4X QDR)
    link_layer:     InfiniBand

Infiniband device 'mlx4_0' port 2 status:
    default gid:     fe80:0000:0000:0000:00e0:8100:002a:d034
    base lid:     0x0
    sm lid:         0x0
    state:         1: DOWN
    phys state:     3: Disabled
    rate:         40 Gb/sec (4X QDR)
    link_layer:     InfiniBand

Infiniband device 'scif0' port 1 status:
    default gid:     fe80:0000:0000:0000:4e79:baff:fe1a:0ee1
    base lid:     0x3e8
    sm lid:         0x1
    state:         4: ACTIVE
    phys state:     5: LinkUp
    rate:         40 Gb/sec (4X QDR)
    link_layer:     Ethernet

[cfxuser@c002-n002 ~]$
[/plain]


Here is the run with only the ofa-v2-scif0 provider (notice high latency and high bandwidth):

[plain]
[cfxuser@c002-n002 ~]$ export I_MPI_MIC=1
[cfxuser@c002-n002 ~]$ export I_MPI_FABRICS=dapl
[cfxuser@c002-n002 ~]$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-scif0
[cfxuser@c002-n002 ~]$ mpirun -np 1 -host mic0 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong : -np 1 -host mic1 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1
 benchmarks to run PingPong
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part    
#---------------------------------------------------
# Date                  : Thu Feb 27 11:03:58 2014
# Machine               : k1om
# System                : Linux
# Release               : 2.6.38.8+mpss3.1.2
# Version               : #1 SMP Wed Dec 18 19:09:36 PST 2013
# MPI Version           : 2.2
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time


# Calling sequence was:

# /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        27.19         0.00
            1         1000        17.26         0.06
            2         1000        16.38         0.12
            4         1000        16.32         0.23
            8         1000        15.84         0.48
           16         1000        15.97         0.96
           32         1000        16.45         1.86
           64         1000        16.54         3.69
          128         1000        17.34         7.04
          256         1000        17.99        13.57
          512         1000        18.61        26.24
         1024         1000        20.22        48.30
         2048         1000        23.29        83.85
         4096         1000        29.39       132.89
         8192         1000        48.51       161.06
        16384         1000        83.50       187.13
        32768         1000       201.53       155.06
        65536          640       279.01       224.01
       131072          320       333.50       374.81
       262144          160       167.37      1493.74
       524288           80       222.23      2249.97
      1048576           40       321.46      3110.78
      2097152           20       534.68      3740.57
      4194304           10       955.61      4185.83


# All processes entering MPI_Finalize

[cfxuser@c002-n002 ~]$
[/plain]


Here is a run with only the provider ofa-v2-mlx4_0-1 (notice low latency and low bandwidth):

[plain]
[cfxuser@c002-n002 ~]$ export I_MPI_MIC=1
[cfxuser@c002-n002 ~]$ export I_MPI_FABRICS=dapl
[cfxuser@c002-n002 ~]$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1
[cfxuser@c002-n002 ~]$ mpirun -np 1 -host mic0 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong : -np 1 -host mic1 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1
 benchmarks to run PingPong
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part    
#---------------------------------------------------
# Date                  : Thu Feb 27 11:05:01 2014
# Machine               : k1om
# System                : Linux
# Release               : 2.6.38.8+mpss3.1.2
# Version               : #1 SMP Wed Dec 18 19:09:36 PST 2013
# MPI Version           : 2.2
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time


# Calling sequence was:

# /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         8.06         0.00
            1         1000        10.08         0.09
            2         1000         9.64         0.20
            4         1000         9.60         0.40
            8         1000         9.61         0.79
           16         1000         9.71         1.57
           32         1000        11.98         2.55
           64         1000        12.05         5.07
          128         1000        13.31         9.17
          256         1000         9.22        26.48
          512         1000        11.20        43.60
         1024         1000        14.86        65.74
         2048         1000        22.02        88.68
         4096         1000        33.90       115.24
         8192         1000        49.31       158.43
        16384         1000        80.52       194.05
        32768         1000       147.94       211.23
        65536          640       278.79       224.18
       131072          320       529.51       236.07
       262144          160       968.77       258.06
       524288           80      1927.34       259.42
      1048576           40      3810.97       262.40
      2097152           20      7582.88       263.75
      4194304           10     15113.25       264.67


# All processes entering MPI_Finalize

[cfxuser@c002-n002 ~]$
[/plain]


And here is a run with multiple providers that fails to switch providers (notice the low latency and low bandwidth, just as with the ofa-v2-mlx4_0-1 provider alone). The results produced by this run are identical to the results produced by the default configuration, i.e., with I_MPI_DAPL_PROVIDER_LIST not set.

[plain]
[cfxuser@c002-n002 ~]$ export I_MPI_MIC=1
[cfxuser@c002-n002 ~]$ export I_MPI_FABRICS=dapl
[cfxuser@c002-n002 ~]$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0,ofa-v2-mcm-1
[cfxuser@c002-n002 ~]$ export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=20000,200000,200000
[cfxuser@c002-n002 ~]$ mpirun -np 1 -host mic0 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong : -np 1 -host mic1 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1
 benchmarks to run PingPong
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part    
#---------------------------------------------------
# Date                  : Thu Feb 27 10:57:43 2014
# Machine               : k1om
# System                : Linux
# Release               : 2.6.38.8+mpss3.1.2
# Version               : #1 SMP Wed Dec 18 19:09:36 PST 2013
# MPI Version           : 2.2
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time


# Calling sequence was:

# /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        13.38         0.00
            1         1000         9.14         0.10
            2         1000         8.70         0.22
            4         1000         8.62         0.44
            8         1000         8.58         0.89
           16         1000         9.14         1.67
           32         1000        11.79         2.59
           64         1000        11.84         5.16
          128         1000        13.41         9.11
          256         1000         9.26        26.36
          512         1000        10.83        45.09
         1024         1000        14.43        67.69
         2048         1000        21.14        92.38
         4096         1000        32.04       121.92
         8192         1000        56.98       137.10
        16384         1000        88.17       177.21
        32768         1000       142.84       218.77
        65536          640       274.48       227.71
       131072          320       513.25       243.55
       262144          160      1013.34       246.71
       524288           80      1982.09       252.26
      1048576           40      3885.59       257.36
      2097152           20      7593.22       263.39
      4194304           10     14822.21       269.87


# All processes entering MPI_Finalize

[cfxuser@c002-n002 ~]$
[/plain]

Dmitry_K_Intel2
Employee

Hi Andrey,

To have the right provider selected, you have to use the long names for each Intel® Xeon Phi™ coprocessor. In your particular case the command line should look like:

mpirun -np 1 -host c002-n002-mic0 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong : -np 1 -host c002-n002-mic1 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1

An alternative way to run applications is to use a hostfile. Create a hostfile:
$ cat hostfile.txt
c002-n002-mic0
c002-n002-mic1
$ export I_MPI_MIC_PREFIX=$I_MPI_ROOT/mic/bin/
$ mpirun -f hostfile.txt -ppn 1 -n 2 IMB-MPI1 PingPong
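To confirm that the provider switch actually happens, you can enable debug output (a sketch; the exact text of the debug lines may differ between versions):

[plain]
# Sketch: enable startup debug output and check which DAPL provider(s) are reported
$ export I_MPI_DEBUG=5
$ mpirun -f hostfile.txt -ppn 1 -n 2 IMB-MPI1 PingPong 2>&1 | grep -i dapl
[/plain]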

And please be careful with setting I_MPI_FABRICS explicitly! I'd recommend leaving it at its default (or using shm:dapl at least).

Best wishes!

 

Andrey_Vladimirov
New Contributor III

Thank you, Dmitry! That was it, I needed long hostnames.
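For anyone finding this thread later, the combination that works for me looks roughly like this (a sketch based on the commands above; I_MPI_FABRICS is left at its default, as Dmitry suggested):

[plain]
# Working setup on my system (sketch): long hostnames plus the provider list
export I_MPI_MIC=1
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0,ofa-v2-mcm-1
mpirun -np 1 -host c002-n002-mic0 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong : \
       -np 1 -host c002-n002-mic1 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1
[/plain]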

 
