I'm getting bad performance with MPI barriers in a microbenchmark on this system configuration:
- multiple Xeon Phi coprocessors
- Intel MPSS 3.5 (April 2015), Linux
- Intel MPI 5.0 update 3
- OFED-3.12-1
```
export I_MPI_MIC=1
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
export I_MPI_PIN_MODE=lib
export I_MPI_PIN_CELL=core

/opt/intel/impi/5.0.3.048/intel64/bin/mpirun -hosts mic0,mic1 -ppn 30 -n 60 ./exe

(( omitted tons of debug lines: DAPL and processor pinning are occurring correctly ))

[0] MPI startup(): I_MPI_DAPL_PROVIDER=ofa-v2-scif0
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:dapl
[0] MPI startup(): I_MPI_MIC=1
[0] MPI startup(): I_MPI_PIN_MAPPING=30:0 1,1 9,2 17,3 25,4 33,5 41,6 49,7 57,8 65,9 73,10 81,11 89,12 97,13 105,14 113,15 121,16 129,17 137,18 145,19 153,20 161,21 169,22 177,23 185,24 193,25 201,26 209,27 217,28 225,29 0

# OSU MPI Barrier Latency Test
# Avg Latency(us)
1795.31
```
I'm not sure whether these results are what should be expected for the mpirun configuration I used (two coprocessors, ppn=30).
Additional results (ppn=60, 120 PEs):
```
/opt/intel/impi/5.0.3.048/intel64/bin/mpirun -hosts mic0,mic1 -ppn 60 -n 120 ./exe

# OSU MPI Barrier Latency Test
# Avg Latency(us)
5378.48
```
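For reference on what is being measured: the reported latency is just the average wall time of repeated MPI_Barrier calls across all ranks. A minimal sketch of such a measurement (this is not the OSU source; the warm-up and iteration counts are illustrative assumptions) looks like:

```c
/* barrier_lat.c - minimal barrier-latency sketch (not the OSU benchmark
 * source; warm-up and iteration counts are illustrative assumptions). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int warmup = 100, iters = 1000;
    double t0, local, avg;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < warmup; i++)                /* let connections settle */
        MPI_Barrier(MPI_COMM_WORLD);

    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    local = (MPI_Wtime() - t0) * 1e6 / iters;   /* us per barrier */

    /* Average the per-rank latencies on rank 0. */
    MPI_Reduce(&local, &avg, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("# Avg Latency(us)\n%.2f\n", avg / size);

    MPI_Finalize();
    return 0;
}
```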
More system details. No InfiniBand card in this system. Should the link_layer for scif0 on the host report Ethernet or InfiniBand for this system? Most examples I can find show link_layer reporting InfiniBand for scif0, but those were systems with a Mellanox card (mlx4_0).
```
host$ ibv_devinfo
hca_id: scif0
        transport:         iWARP (1)
        fw_ver:            0.0.1
        node_guid:         4c79:baff:fe34:0f11
        sys_image_guid:    4c79:baff:fe34:0f11
        vendor_id:         0x8086
        vendor_part_id:    0
        hw_ver:            0x1
        phys_port_cnt:     1
        port: 1
                state:         PORT_ACTIVE (4)
                max_mtu:       4096 (5)
                active_mtu:    4096 (5)
                sm_lid:        1
                port_lid:      1000
                port_lmc:      0x00
                link_layer:    Ethernet

host$ ssh mic0 ibv_devinfo
hca_id: scif0
        transport:         SCIF (2)
        fw_ver:            0.0.1
        node_guid:         4c79:baff:fe34:0f10
        sys_image_guid:    4c79:baff:fe34:0f10
        vendor_id:         0x8086
        vendor_part_id:    0
        hw_ver:            0x1
        phys_port_cnt:     1
        port: 1
                state:         PORT_ACTIVE (4)
                max_mtu:       4096 (5)
                active_mtu:    4096 (5)
                sm_lid:        1
                port_lid:      1001
                port_lmc:      0x00
                link_layer:    SCIF
```
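For what it's worth, the link_layer string can also be read straight out of sysfs (this assumes the standard RDMA sysfs layout; the device and port numbers are taken from the output above):

```
host$ cat /sys/class/infiniband/scif0/ports/1/link_layer
host$ ssh mic0 cat /sys/class/infiniband/scif0/ports/1/link_layer
```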
I hate to admit it but I had never noticed that in some cases, the link layer shows up as InfiniBand and in other cases, as Ethernet. It is not just a matter of there being an adapter or not.
While I am thinking about this (or until someone more knowledgeable chimes in): have you tried running the benchmarks Intel includes with Intel MPI? How did those perform? Is it only the barrier test that's a problem?
Looking over the examples again, the correlation I thought I saw (scif0's link_layer reporting InfiniBand because a hardware card is present) doesn't actually appear to hold. In any case, it's not much of an issue as long as it isn't causing this barrier performance anomaly; just something I thought was interesting and wanted to see if others knew more about.
I hadn't considered the Intel MPI benchmarks, so I ran some numbers at your suggestion (-hosts mic0,mic1 -ppn 30 -n 60):
```
host$ mpirun -hosts mic0,mic1 -ppn 30 -n 60 \
        /opt/intel/impi/5.0.3.048/mic/bin/IMB-MPI1 Barrier
[0] MPI startup(): Multi-threaded optimized library
[58] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
... omitted
[8] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0
[34] MPI startup(): DAPL provider ofa-v2-scif0
[45] MPI startup(): DAPL provider ofa-v2-scif0
[44] MPI startup(): DAPL provider ofa-v2-scif0
[34] MPI startup(): shm and dapl data transfer modes
[44] MPI startup(): shm and dapl data transfer modes
[45] MPI startup(): shm and dapl data transfer modes
... omitted
[31] MPI startup(): DAPL provider ofa-v2-scif0
[46] MPI startup(): DAPL provider ofa-v2-scif0
[39] MPI startup(): DAPL provider ofa-v2-scif0
[5] MPI startup(): shm and dapl data transfer modes
[6] MPI startup(): shm and dapl data transfer modes
[11] MPI startup(): shm and dapl data transfer modes
[31] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[31] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
... omitted
[0] MPI startup(): Rank    Pid    Node name    Pin cpu
... affinity setting omitted
[0] MPI startup(): I_MPI_DAPL_PROVIDER=ofa-v2-scif0
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:dapl
[0] MPI startup(): I_MPI_MIC=1
[0] MPI startup(): I_MPI_PIN_MAPPING=30:0 1,1 9,2 17,3 25,4 33,5 41,6 49,7 57,8 65,9 73,10 81,11 89,12 97,13 105,14 113,15 121,16 129,17 137,18 145,19 153,20 161,21 169,22 177,23 185,24 193,25 201,26 209,27 217,28 225,29 0
benchmarks to run Barrier
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.0 Update 1, MPI-1 part
#------------------------------------------------------------
# Date                  : Mon Jun 29 19:38:26 2015
# Machine               : k1om
# System                : Linux
# Release               : 2.6.38.8+mpss3.5
# Version               : #1 SMP Thu Apr 2 02:17:02 PDT 2015
# MPI Version           : 3.0
# MPI Thread Environment:

# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time

# Calling sequence was:
# /opt/intel/impi/5.0.3.048/mic/bin/IMB-MPI1 Barrier

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#

# List of Benchmarks to run:
# Barrier

#---------------------------------------------------
# Benchmarking Barrier
# #processes = 2
# ( 58 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000         4.63         4.64         4.63

#---------------------------------------------------
# Benchmarking Barrier
# #processes = 4
# ( 56 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        10.53        10.53        10.53

#---------------------------------------------------
# Benchmarking Barrier
# #processes = 8
# ( 52 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        15.86        15.87        15.87

#---------------------------------------------------
# Benchmarking Barrier
# #processes = 16
# ( 44 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        21.68        21.69        21.68

#---------------------------------------------------
# Benchmarking Barrier
# #processes = 32
# ( 28 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000       445.00       445.35       445.27

#---------------------------------------------------
# Benchmarking Barrier
# #processes = 60
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000      1875.15      1877.34      1876.12

# All processes entering MPI_Finalize
```
At n=60, Intel MPI reports 1876 us of latency, approximately the same as the results reported above. My baselines for comparison are MPICH 3.1.4 built with ch3:nemesis:scif (280 us) and my own OpenSHMEM library, which I've been developing for multiple Xeon Phi coprocessors on top of SCIF (14 us).
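In case anyone wants to reproduce the MPICH baseline: the SCIF netmod is selected when MPICH is configured. Roughly like the following (the prefix is mine, and the k1om cross-compilation details are omitted, so treat this as a sketch):

```
./configure --with-device=ch3:nemesis:scif --prefix=$HOME/mpich-3.1.4-scif
make && make install
```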
So far, barrier performance is the only result out of line with expectations. Here are some of the other results from IMB-MPI1. If it helps, the MPI library and compilers are from Parallel Studio XE Cluster Edition Update 3. Thanks for your help!
Runtime configuration: -hosts mic0,mic1 -ppn 1 -n 2

```
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        21.58         0.00
            1         1000        21.67         0.04
            2         1000        16.63         0.11
            4         1000        16.73         0.23
            8         1000        16.70         0.46
           16         1000        16.82         0.91
           32         1000        17.78         1.72
           64         1000        17.86         3.42
          128         1000        18.53         6.59
          256         1000        19.21        12.71
          512         1000        20.15        24.23
         1024         1000        21.75        44.90
         2048         1000        24.78        78.81
         4096         1000        31.99       122.12
         8192         1000       164.00        47.64
        16384         1000       192.12        81.33
        32768         1000       165.40       188.94
        65536          640       227.15       275.14
       131072          320       591.85       211.20
       262144          160       240.43      1039.81
       524288           80       287.78      1737.44
      1048576           40       409.79      2440.29
      2097152           20       630.00      3174.62
      4194304           10       969.10      4127.54

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        23.97        23.98        23.98         0.00
            1         1000        24.82        24.82        24.82         0.08
            2         1000        24.80        24.81        24.80         0.15
            4         1000        24.85        24.86        24.86         0.31
            8         1000        24.87        24.88        24.88         0.61
           16         1000        24.88        24.88        24.88         1.23
           32         1000        26.83        26.84        26.83         2.27
           64         1000        26.84        26.84        26.84         4.55
          128         1000        27.49        27.49        27.49         8.88
          256         1000        28.36        28.37        28.37        17.21
          512         1000        29.42        29.42        29.42        33.20
         1024         1000        30.99        30.99        30.99        63.03
         2048         1000        33.95        33.96        33.95       115.02
         4096         1000        41.27        41.28        41.27       189.27
         8192         1000       200.06       200.08       200.07        78.09
        16384         1000       245.54       245.56       245.55       127.26
        32768         1000       202.43       202.44       202.43       308.74
        65536          640       179.31       179.34       179.32       697.02
       131072          320       194.58       194.61       194.59      1284.65
       262144          160       266.17       266.23       266.20      1878.06
       524288           80       333.99       334.20       334.09      2992.22
      1048576           40       463.47       463.65       463.56      4313.57
      2097152           20       608.85       610.26       609.55      6554.62
      4194304           10      1024.10      1024.51      1024.31      7808.62

#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        48.23        48.24        48.24         0.00
            1         1000        49.82        49.84        49.83         0.08
            2         1000        49.89        49.89        49.89         0.15
            4         1000        49.92        49.93        49.93         0.31
            8         1000        50.07        50.07        50.07         0.61
           16         1000        50.08        50.09        50.09         1.22
           32         1000        53.63        53.65        53.64         2.28
           64         1000        53.88        53.88        53.88         4.53
          128         1000        55.43        55.44        55.43         8.81
          256         1000        56.83        56.84        56.84        17.18
          512         1000        58.72        58.73        58.72        33.26
         1024         1000        61.50        61.52        61.51        63.50
         2048         1000        67.96        67.97        67.97       114.94
         4096         1000        82.46        82.49        82.48       189.41
         8192         1000       406.44       406.45       406.44        76.89
        16384         1000       447.09       447.11       447.10       139.79
        32768         1000       415.22       415.24       415.23       301.03
        65536          640       363.31       363.32       363.32       688.09
       131072          320       400.70       400.73       400.72      1247.71
       262144          160       547.68       547.73       547.70      1825.71
       524288           80       676.29       676.60       676.44      2955.96
      1048576           40       938.20       938.27       938.24      4263.15
      2097152           20      1301.10      1302.55      1301.83      6141.79
      4194304           10      2078.39      2080.80      2079.59      7689.36

#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 2
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.55         0.59         0.57
            4         1000        32.70        32.70        32.70
            8         1000        32.64        32.64        32.64
           16         1000        32.76        32.76        32.76
           32         1000        34.72        34.73        34.72
           64         1000        34.96        34.97        34.97
          128         1000        37.40        37.41        37.41
          256         1000        38.32        38.32        38.32
          512         1000        39.54        39.54        39.54
         1024         1000        41.38        41.39        41.38
         2048         1000        45.91        45.91        45.91
         4096         1000        55.61        55.61        55.61
         8192         1000       103.17       103.18       103.18
        16384         1000       385.73       385.75       385.74
        32768         1000       742.72       742.75       742.73
        65536          640      1282.00      1282.04      1282.02
       131072          320      2346.32      2346.40      2346.36
       262144          160      4770.80      4770.97      4770.89
       524288           80      8553.10      8553.42      8553.26
      1048576           40     17021.97     17022.68     17022.32
      2097152           20      7646.10      7658.95      7652.52
      4194304           10     12146.31     12148.62     12147.46

#---------------------------------------------------
# Benchmarking Barrier
# #processes = 2
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        26.09        26.09        26.09
```

[other tests available on request]
I tried changing the Intel compiler suite I was using (completely uninstalling one and installing the other). These are the barrier-latency numbers I am getting for the two suites (see EDIT 1). Notably, I tested Intel MPI 4.1.3.04{8,9} and Intel MPI 5.0.3.048 with BOTH suites, and doing so did not affect the numbers at all. This leads me to believe that Intel MPI is not causing this performance difference, but rather something else installed alongside the suite.
| mpirun options | Intel MPI 4.1.3.049 (us) | Intel MPI 5.0.3.048 (us) |
| --- | --- | --- |
| -hosts mic0,mic1 -ppn 30 -n 60 | 215.84 | 1848.86 |
| -hosts mic0,mic2 -ppn 30 -n 60 | 214.67 | 1906.14 |
| -hosts mic0,mic3 -ppn 30 -n 60 | 283.05 | 1828.22 |
| -hosts mic0,mic1 -ppn 60 -n 120 | 4828.81 | 4733.04 |
| -hosts mic0,mic2 -ppn 60 -n 120 | 4676.96 | 4636.11 |
| -hosts mic0,mic3 -ppn 60 -n 120 | 4588.63 | 4304.11 |
| -hosts mic0,mic1,mic2,mic3 -ppn 15 -n 60 | 229.26 | 920.36 |
| -hosts mic0,mic1,mic2,mic3 -ppn 60 -n 240 | 436.72 | 6099.88 |
I also tested with OFED 3.12-1 and OFED 3.18-rc3, with no difference in performance from the numbers above.
Here's some system information:
- CentOS 6.6, kernel 2.6.32-504.23.4.el6.x86_64
- Dual-socket Super Micro motherboard (X9DRG-QF)
- Two Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
- Four Xeon Phi 5110P coprocessors
- mic0 and mic1 are separated from mic2 and mic3 via Intel QPI through the host processors
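If it's useful, the card placement can be double-checked from sysfs. This one-liner assumes the MPSS sysfs entries link back to each card's PCI device (a sketch, not verified on every MPSS release):

```
host$ for d in /sys/class/mic/mic[0-3]; do
          echo "$d -> NUMA node $(cat $d/device/numa_node)"
      done
```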
I'm talking with Steve from Intel Premier Support, and he mentioned that his system configuration performs at 292 us (-ppn 30 -n 60) and 421 us (-ppn 60 -n 120), which is much closer to the numbers I would expect. Any further suggestions would be greatly appreciated!
EDIT 1 (2015-07-05-00-46 EST): I uninstalled both development suites and tested only Intel MPI, using the runtime redistributables and the included IMB-MPI1. The respective Intel MPI versions from the two development suites match the results in the table above. At least now I've traced the performance issue to the Intel MPI versions specifically.
Hi Bryant,
I've reproduced the performance regression on my system with the IMB barrier test.
I'll create an internal tracker for the issue.
Thanks for reporting it!
Thanks Artem. This issue is related to an Intel Premier support ticket, Issue ID 6000114596.
I was able to get a sufficient resolution by setting I_MPI_ADJUST_BARRIER=4 (the topology-aware recursive-doubling algorithm), but the underlying issue may affect the Xeon Phi more broadly than this one environment variable can cover; other algorithms or operations may also be affected.
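For anyone else who hits this, the workaround is just that one environment variable at launch time, e.g. with the same run as above:

```
export I_MPI_ADJUST_BARRIER=4   # topology-aware recursive doubling
/opt/intel/impi/5.0.3.048/intel64/bin/mpirun -hosts mic0,mic1 -ppn 30 -n 60 ./exe
```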
On a related note, can you also look at the performance of:
```
mpirun -hosts mic0 -np 30 IMB-RMA All_put_all -npmin 30
```
See this thread: https://software.intel.com/en-us/forums/topic/562296