This article by James Tullos explains how to use a list of DAPL providers in Intel MPI:
http://software.intel.com/en-us/articles/using-multiple-dapl-providers-with-the-intel-mpi-library
The first provider is used for small messages, the second for large messages within a node, and the third for large messages between nodes. My question is: how do I control the switching point between the providers?
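For concreteness, here is the kind of setup I mean; the provider names below are the ones from my system, which appear in the listings later in this thread:
[plain]
# Per the article: provider 1 handles small messages, provider 2 large
# messages within a node, provider 3 large messages between nodes.
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0,ofa-v2-mcm-1
[/plain]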
Just wanted to clarify to avoid the message being moved to a different forum: this is a Xeon Phi-related problem.
We have a system on which the DAPL provider list works as expected. Namely, with I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0,ofa-v2-mcm-1 we get the short latency of "mlx4" and the high bandwidth of "scif0" in PingPong between mic0 and mic1.
However, on another system, Intel MPI seems to be stuck with ofa-v2-mlx4_0-1 for all message sizes, which results in very low bandwidth for large messages (~200 MB/s between mic0 and mic1). At the same time, there are no complaints in the MPI debug information (I_MPI_DEBUG=5) about any of the three providers. Furthermore, this system can use ofa-v2-scif0 if it is specified as the only DAPL provider. This gets good bandwidth, but has greater latency for short messages than "mlx4".
So I was thinking that maybe the switching point is chosen by some default (and in this case incorrect) heuristic, and that I need to enforce a switching point between providers explicitly in order to get it "unstuck" from the "mlx4" provider at large message sizes.
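For reference, this is how a debug-enabled run looks; it is the same invocation as in the listings below, with I_MPI_DEBUG=5 added, and it reports no provider errors:
[plain]
export I_MPI_MIC=1
export I_MPI_FABRICS=dapl
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0,ofa-v2-mcm-1
export I_MPI_DEBUG=5
mpirun -np 1 -host mic0 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong : -np 1 -host mic1 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1
[/plain]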
Hi Andrey,
Let me contact the author of that article, or someone on the MPI team, to get you an answer. Thank you.
Hey Andrey,
A detailed description of the I_MPI_DAPL_PROVIDER_LIST environment variable, and of the appropriate ways to change the threshold values, is given in the Intel® MPI Library Reference Manual. There is a dedicated section on Intel® Xeon Phi™ Coprocessor support, and the details are listed under Environment Variables there.
In short, you can use I_MPI_DAPL_DIRECT_COPY_THRESHOLD to change the provider thresholds. You can give that a try, but I doubt it's an issue with the cutoffs. Are both of your systems set up the same? For example, are the hostnames of your Phi cards <host>-mic0 and <host>-mic1 on both? Are there any significant differences between the two systems?
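For example, a sketch of what changing the thresholds could look like, with one comma-separated value per provider in your list (the numbers here are placeholders, not a recommendation; please check the Reference Manual for the exact semantics):
[plain]
# One threshold value per entry in I_MPI_DAPL_PROVIDER_LIST; placeholder values.
export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=20000,200000,200000
[/plain]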
If you hit problems, providing the debug output will be helpful.
Let us know how it goes,
~Gergana
Hi Gergana,
Thanks! I would very much appreciate some help in getting this resolved. I have experimented with I_MPI_DAPL_DIRECT_COPY_THRESHOLD, but it does not change the results.
The two systems are very different:
- The one that works as expected contains 4 Xeon Phi coprocessors, and the one that has problems contains 8.
- The one that works has a Mellanox FDR card with 1 port; the one that does not has a Mellanox FDR card with 2 ports, only one of which is connected to the switch (connecting both ports to the switch does not resolve the issue; I checked).
- However, both systems are running the same CentOS 6.5, MPSS 3.1.2, and Cluster Studio 14.0.0.080 Build 20130728 with MPI 4.1.1.036, and networking is configured similarly (external Ethernet bridge with static IP, NFS-sharing of /opt/intel and /home).
I am sorry for the long listings below, but I thought I would rather give more information than less.
Here is my /etc/hosts:
[plain]
[root@c002-n002 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.32.0.1 mgmt
10.33.2.2 c002-n002
10.34.2.2 c002-n002-ib0
10.33.2.22 c002-n002-mic0 mic0
10.33.2.42 c002-n002-mic1 mic1
10.33.2.62 c002-n002-mic2 mic2
10.33.2.82 c002-n002-mic3 mic3
10.33.2.102 c002-n002-mic4 mic4
10.33.2.122 c002-n002-mic5 mic5
10.33.2.142 c002-n002-mic6 mic6
10.33.2.162 c002-n002-mic7 mic7
[/plain]
Here is my ifconfig:
[plain]
[cfxuser@c002-n002 ~]$ ifconfig
br0 Link encap:Ethernet HWaddr 00:E0:81:E4:84:EC
inet addr:10.33.2.2 Bcast:10.35.255.255 Mask:255.252.0.0
inet6 addr: fe80::2e0:81ff:fee4:84ec/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:5043 errors:0 dropped:0 overruns:0 frame:0
TX packets:3731 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:634298 (619.4 KiB) TX bytes:733850 (716.6 KiB)
eth0 Link encap:Ethernet HWaddr 00:E0:81:E4:84:EC
inet6 addr: fe80::2e0:81ff:fee4:84ec/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:103979 errors:0 dropped:0 overruns:0 frame:0
TX packets:36209 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:7326461 (6.9 MiB) TX bytes:8302421 (7.9 MiB)
Memory:f7720000-f7740000
Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly.
Ifconfig is obsolete! For replacement check ip.
ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
inet addr:10.34.2.2 Bcast:10.35.255.255 Mask:255.254.0.0
inet6 addr: fe80::2e0:8100:2a:d033/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:5 errors:0 dropped:4 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:0 (0.0 b) TX bytes:380 (380.0 b)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:237 errors:0 dropped:0 overruns:0 frame:0
TX packets:237 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:12400 (12.1 KiB) TX bytes:12400 (12.1 KiB)
mic0 Link encap:Ethernet HWaddr 4C:79:BA:1A:0E:E1
inet6 addr: fe80::4e79:baff:fe1a:ee1/64 Scope:Link
UP BROADCAST RUNNING MTU:1500 Metric:1
RX packets:16918 errors:0 dropped:0 overruns:0 frame:0
TX packets:88688 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1855865 (1.7 MiB) TX bytes:12611321 (12.0 MiB)
mic0:ib Link encap:Ethernet HWaddr 4C:79:BA:1A:0E:E1
inet addr:192.0.2.100 Bcast:0.0.0.0 Mask:255.255.255.0
UP BROADCAST RUNNING MTU:1500 Metric:1
mic1 Link encap:Ethernet HWaddr 4C:79:BA:1A:0C:CD
inet6 addr: fe80::4e79:baff:fe1a:ccd/64 Scope:Link
UP BROADCAST RUNNING MTU:1500 Metric:1
RX packets:15700 errors:0 dropped:0 overruns:0 frame:0
TX packets:87998 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1696077 (1.6 MiB) TX bytes:12587620 (12.0 MiB)
mic2 Link encap:Ethernet HWaddr 4C:79:BA:1A:0D:3F
inet6 addr: fe80::4e79:baff:fe1a:d3f/64 Scope:Link
UP BROADCAST RUNNING MTU:1500 Metric:1
RX packets:592 errors:0 dropped:0 overruns:0 frame:0
TX packets:70864 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:111099 (108.4 KiB) TX bytes:4454779 (4.2 MiB)
mic3 Link encap:Ethernet HWaddr 4C:79:BA:1A:0D:85
inet6 addr: fe80::4e79:baff:fe1a:d85/64 Scope:Link
UP BROADCAST RUNNING MTU:1500 Metric:1
RX packets:592 errors:0 dropped:0 overruns:0 frame:0
TX packets:70864 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:111099 (108.4 KiB) TX bytes:4454827 (4.2 MiB)
mic4 Link encap:Ethernet HWaddr 4C:79:BA:1A:0E:09
inet6 addr: fe80::4e79:baff:fe1a:e09/64 Scope:Link
UP BROADCAST RUNNING MTU:1500 Metric:1
RX packets:592 errors:0 dropped:0 overruns:0 frame:0
TX packets:70865 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:111099 (108.4 KiB) TX bytes:4454869 (4.2 MiB)
mic5 Link encap:Ethernet HWaddr 4C:79:BA:1A:0D:81
inet6 addr: fe80::4e79:baff:fe1a:d81/64 Scope:Link
UP BROADCAST RUNNING MTU:1500 Metric:1
RX packets:591 errors:0 dropped:0 overruns:0 frame:0
TX packets:70877 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:111057 (108.4 KiB) TX bytes:4455571 (4.2 MiB)
mic6 Link encap:Ethernet HWaddr 4C:79:BA:1A:0F:4F
inet6 addr: fe80::4e79:baff:fe1a:f4f/64 Scope:Link
UP BROADCAST RUNNING MTU:1500 Metric:1
RX packets:591 errors:0 dropped:0 overruns:0 frame:0
TX packets:70863 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:111045 (108.4 KiB) TX bytes:4454725 (4.2 MiB)
mic7 Link encap:Ethernet HWaddr 4C:79:BA:1A:0E:C7
inet6 addr: fe80::4e79:baff:fe1a:ec7/64 Scope:Link
UP BROADCAST RUNNING MTU:1500 Metric:1
RX packets:591 errors:0 dropped:0 overruns:0 frame:0
TX packets:70877 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:111057 (108.4 KiB) TX bytes:4455523 (4.2 MiB)
[/plain]
Here is my ibstatus:
[plain]
[cfxuser@c002-n002 ~]$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:00e0:8100:002a:d033
base lid: 0x5
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: InfiniBand
Infiniband device 'mlx4_0' port 2 status:
default gid: fe80:0000:0000:0000:00e0:8100:002a:d034
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 40 Gb/sec (4X QDR)
link_layer: InfiniBand
Infiniband device 'scif0' port 1 status:
default gid: fe80:0000:0000:0000:4e79:baff:fe1a:0ee1
base lid: 0x3e8
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 40 Gb/sec (4X QDR)
link_layer: Ethernet
[cfxuser@c002-n002 ~]$
[/plain]
Here is the run with only the ofa-v2-scif0 provider (notice high latency and high bandwidth):
[plain]
[cfxuser@c002-n002 ~]$ export I_MPI_MIC=1
[cfxuser@c002-n002 ~]$ export I_MPI_FABRICS=dapl
[cfxuser@c002-n002 ~]$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-scif0
[cfxuser@c002-n002 ~]$ mpirun -np 1 -host mic0 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong : -np 1 -host mic1 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1
benchmarks to run PingPong
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part
#---------------------------------------------------
# Date : Thu Feb 27 11:03:58 2014
# Machine : k1om
# System : Linux
# Release : 2.6.38.8+mpss3.1.2
# Version : #1 SMP Wed Dec 18 19:09:36 PST 2013
# MPI Version : 2.2
# MPI Thread Environment:
# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
# Calling sequence was:
# /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 27.19 0.00
1 1000 17.26 0.06
2 1000 16.38 0.12
4 1000 16.32 0.23
8 1000 15.84 0.48
16 1000 15.97 0.96
32 1000 16.45 1.86
64 1000 16.54 3.69
128 1000 17.34 7.04
256 1000 17.99 13.57
512 1000 18.61 26.24
1024 1000 20.22 48.30
2048 1000 23.29 83.85
4096 1000 29.39 132.89
8192 1000 48.51 161.06
16384 1000 83.50 187.13
32768 1000 201.53 155.06
65536 640 279.01 224.01
131072 320 333.50 374.81
262144 160 167.37 1493.74
524288 80 222.23 2249.97
1048576 40 321.46 3110.78
2097152 20 534.68 3740.57
4194304 10 955.61 4185.83
# All processes entering MPI_Finalize
[cfxuser@c002-n002 ~]$
[/plain]
Here is a run with only the provider ofa-v2-mlx4_0-1 (notice low latency and low bandwidth):
[plain]
[cfxuser@c002-n002 ~]$ export I_MPI_MIC=1
[cfxuser@c002-n002 ~]$ export I_MPI_FABRICS=dapl
[cfxuser@c002-n002 ~]$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1
[cfxuser@c002-n002 ~]$ mpirun -np 1 -host mic0 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong : -np 1 -host mic1 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1
benchmarks to run PingPong
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part
#---------------------------------------------------
# Date : Thu Feb 27 11:05:01 2014
# Machine : k1om
# System : Linux
# Release : 2.6.38.8+mpss3.1.2
# Version : #1 SMP Wed Dec 18 19:09:36 PST 2013
# MPI Version : 2.2
# MPI Thread Environment:
# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
# Calling sequence was:
# /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 8.06 0.00
1 1000 10.08 0.09
2 1000 9.64 0.20
4 1000 9.60 0.40
8 1000 9.61 0.79
16 1000 9.71 1.57
32 1000 11.98 2.55
64 1000 12.05 5.07
128 1000 13.31 9.17
256 1000 9.22 26.48
512 1000 11.20 43.60
1024 1000 14.86 65.74
2048 1000 22.02 88.68
4096 1000 33.90 115.24
8192 1000 49.31 158.43
16384 1000 80.52 194.05
32768 1000 147.94 211.23
65536 640 278.79 224.18
131072 320 529.51 236.07
262144 160 968.77 258.06
524288 80 1927.34 259.42
1048576 40 3810.97 262.40
2097152 20 7582.88 263.75
4194304 10 15113.25 264.67
# All processes entering MPI_Finalize
[cfxuser@c002-n002 ~]$
[/plain]
And here is a run with multiple providers that fails to switch providers (notice the low latency and low bandwidth, just like with only the ofa-v2-mlx4_0-1 provider). The results produced by this run are identical to the results produced by the default configuration, i.e., with I_MPI_DAPL_PROVIDER_LIST not set.
[plain]
[cfxuser@c002-n002 ~]$ export I_MPI_MIC=1
[cfxuser@c002-n002 ~]$ export I_MPI_FABRICS=dapl
[cfxuser@c002-n002 ~]$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0,ofa-v2-mcm-1
[cfxuser@c002-n002 ~]$ export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=20000,200000,200000
[cfxuser@c002-n002 ~]$ mpirun -np 1 -host mic0 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong : -np 1 -host mic1 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1
benchmarks to run PingPong
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V3.2.4, MPI-1 part
#---------------------------------------------------
# Date : Thu Feb 27 10:57:43 2014
# Machine : k1om
# System : Linux
# Release : 2.6.38.8+mpss3.1.2
# Version : #1 SMP Wed Dec 18 19:09:36 PST 2013
# MPI Version : 2.2
# MPI Thread Environment:
# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
# Calling sequence was:
# /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 13.38 0.00
1 1000 9.14 0.10
2 1000 8.70 0.22
4 1000 8.62 0.44
8 1000 8.58 0.89
16 1000 9.14 1.67
32 1000 11.79 2.59
64 1000 11.84 5.16
128 1000 13.41 9.11
256 1000 9.26 26.36
512 1000 10.83 45.09
1024 1000 14.43 67.69
2048 1000 21.14 92.38
4096 1000 32.04 121.92
8192 1000 56.98 137.10
16384 1000 88.17 177.21
32768 1000 142.84 218.77
65536 640 274.48 227.71
131072 320 513.25 243.55
262144 160 1013.34 246.71
524288 80 1982.09 252.26
1048576 40 3885.59 257.36
2097152 20 7593.22 263.39
4194304 10 14822.21 269.87
# All processes entering MPI_Finalize
[cfxuser@c002-n002 ~]$
[/plain]
Hi Andrey,
To have the right provider selected, you have to use the long hostnames for each Intel® Xeon Phi™ Coprocessor. In your particular case the command line should look like:
mpirun -np 1 -host c002-n002-mic0 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong : -np 1 -host c002-n002-mic1 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1
An alternative way to run applications is to use a hostfile. Create one:
$ cat hostfile.txt
c002-n002-mic0
c002-n002-mic1
$ export I_MPI_MIC_PREFIX=$I_MPI_ROOT/mic/bin/
$ mpirun -f hostfile.txt -ppn 1 -n 2 IMB-MPI1 PingPong
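Here I_MPI_MIC_PREFIX supplies the path prefix for the coprocessor-side binaries, so the bare IMB-MPI1 name resolves on the cards.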
And please be careful with setting I_MPI_FABRICS explicitly! I'd recommend leaving it at the default (or using shm:dapl at least).
Best wishes!
Thank you, Dmitry! That was it, I needed long hostnames.
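For anyone who finds this thread later, the combination that ended up working for me is, roughly, the long hostnames together with the provider list (I_MPI_FABRICS left at its default, per Dmitry's advice):
[plain]
export I_MPI_MIC=1
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0,ofa-v2-mcm-1
mpirun -np 1 -host c002-n002-mic0 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1 PingPong : -np 1 -host c002-n002-mic1 /opt/intel/impi/4.1.1.036/mic/bin/IMB-MPI1
[/plain]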