I'm getting bad performance with MPI_Put (and MPI_Get) in IMB-RMA All_put_all microbenchmark on this system configuration:
- Single and multiple Xeon Phi coprocessors
- Intel MPSS 3.5.1 (June 2015), Linux
- Intel MPI Library 5.1.0.079
- OFED-3.12-1 or OFED-3.18-rc3 (It doesn't really matter.)
Intel MPI runtime environment variables:
export I_MPI_MIC=1
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
export I_MPI_PIN_MODE=lib
export I_MPI_PIN_CELL=core
export I_MPI_ADJUST_BARRIER=4
My results are below. IMB-RMA All_get_all is even worse: 209616.98 microseconds for the 1-byte all_get_all case.
$ mpirun -hosts mic0 -n 30 IMB-RMA All_put_all -npmin 30

#bytes     Intel MPI 5.1 [usec]   MPICH 3.1.4 [usec]
1          167157.12              1645.26
2          156630.21              1638.76
4          170201.55              1744.68
8          157363.03              1795.65
16         167803.04              1918.96
32         167826.86              1421.19
64         168686.47              1852.14
128        168729.71              2477.16
256        177143.31              1922.09
512        175115.94              2242.02
1024       160964.3               2603.5
2048       162915.96              3565.54
4096       178165.97              7120.21
8192       148391.07              9664.84
16384      (timeout)              5854.41
32768      -                      8571.92
65536      -                      9698
131072     -                      16402.45
262144     -                      35356.35
524288     -                      82430.8
1048576    -                      137650.29
2097152    -                      275713.1
4194304    -                      430387.23
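For anyone trying to reproduce this, the single-coprocessor runs can be captured in one script, sketched below; it only combines the exports and the command above, plus the All_get_all case mentioned. The absolute mpirun and IMB-RMA paths are taken from later posts in this thread (Intel MPI 5.1.0.079) and may differ on your system.

#!/bin/sh
# Reproduction sketch for the single-coprocessor All_put_all / All_get_all runs.
# Paths follow the Intel MPI 5.1.0.079 layout used later in this thread;
# adjust them for your installation.
export I_MPI_MIC=1
export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
export I_MPI_PIN_MODE=lib
export I_MPI_PIN_CELL=core
export I_MPI_ADJUST_BARRIER=4

MPIRUN=/opt/intel/impi/5.1.0.079/bin64/mpirun
IMB=/opt/intel/impi/5.1.0.079/mic/bin/IMB-RMA

# All_put_all, plus the All_get_all case mentioned above
$MPIRUN -hosts mic0 -n 30 $IMB All_put_all -npmin 30
$MPIRUN -hosts mic0 -n 30 $IMB All_get_all -npmin 30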
More numbers, this time with multiple coprocessors. My applications need to run with -ppn 60, but cannot because of the extremely high latency of this communication pattern; the numbers here are with -ppn 10. The average time is also suspiciously low given the maximum time and the number of repetitions.
/opt/intel/impi/5.1.0.079/bin64/mpirun -hosts mic0,mic1 -ppn 10 -n 20 /opt/intel/impi/5.1.0.079/mic/bin/IMB-RMA All_put_all -npmin 20

#----------------------------------------------------------------
# Benchmarking All_put_all
# #processes = 20
#----------------------------------------------------------------
#bytes  #repetitions    t_min[usec]     t_max[usec]     t_avg[usec]
     0          1000           9.01            9.03            9.01
     1             1     6044455.05     25502341.03     12162591.02
     2             1     6043900.01     63057655.10     14085288.83
     4             1     6047574.04    119880264.04     16803435.92
     8    (timed out)
What do you get if you set no environment variables other than I_MPI_MIC=1? It strikes me as very strange that the times are not just long but keep bouncing around the same values.
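A minimal-environment check along those lines might look like the sketch below; the unset list simply mirrors the variables from the original post, and the run command is copied from it.

# Drop the tuning variables from the original post, keep only I_MPI_MIC=1,
# and let Intel MPI choose its defaults.
unset I_MPI_DEBUG I_MPI_FABRICS I_MPI_DAPL_PROVIDER \
      I_MPI_PIN_MODE I_MPI_PIN_CELL I_MPI_ADJUST_BARRIER
export I_MPI_MIC=1
mpirun -hosts mic0 -n 30 IMB-RMA All_put_all -npmin 30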
Performance is about the same with these environment variables:
export I_MPI_MIC=1
export I_MPI_DAPL_PROVIDER=ofa-v2-scif0
I have to keep I_MPI_DAPL_PROVIDER set; otherwise SCIF connection errors occur because of a recently installed Mellanox InfiniBand ConnectX card:
...
mic1.hostname:MCM:1437:ab2e7b40: 3813 us(3813 us): scif_connect() to port 68, failed with error Connection refused
mic1.hostname:MCM:1437:ab2e7b40: 3879 us(66 us): open_hca: SCIF init ERR for mlx4_0
...
These are the results:
/opt/intel/impi/5.1.0.079/bin64/mpirun -hosts mic0,mic1 -ppn 10 -n 20 /opt/intel/impi/5.1.0.079/mic/bin/IMB-RMA All_put_all -npmin 20

#----------------------------------------------------------------
# Benchmarking All_put_all
# #processes = 20
#----------------------------------------------------------------
#bytes  #repetitions    t_min[usec]     t_max[usec]     t_avg[usec]
     0          1000           9.00            9.02            9.01
     1             1     5862047.91     16005118.13     11632434.27
     2             1     6052767.04     51715856.08     13243087.48
     4             1     5856038.09     79225661.04     14525348.28
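As a side note, the provider can also be selected per run on the mpirun command line instead of via export; a sketch using Intel MPI's -genv option, which sets an environment variable for all ranks, with the paths from the run above:

/opt/intel/impi/5.1.0.079/bin64/mpirun \
    -genv I_MPI_MIC 1 -genv I_MPI_DAPL_PROVIDER ofa-v2-scif0 \
    -hosts mic0,mic1 -ppn 10 -n 20 \
    /opt/intel/impi/5.1.0.079/mic/bin/IMB-RMA All_put_all -npmin 20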
Hi Bryant,
Have you already created an Intel Premier support ticket for this issue?
Also, could you please provide your MPICH 3.1.4 build options? They may help us reproduce the performance results.
Hi Artem. I did create an Intel Premier support ticket (Issue ID: 6000118541). The support ticket has an Excel spreadsheet attached to it that has more information about build options, system setup, etc.
MPICH build options were:
mkdir build && cd build
PREFIX=/opt/xeon-phi/mpich
../configure CC="icc -mmic" CXX="icpc -mmic" FC="ifort -mmic" F77="ifort -mmic" \
    LIBS="-lscif" \
    --prefix=$PREFIX --host=x86_64-k1om-linux \
    --disable-fortran --disable-cxx --disable-romio \
    --with-device=ch3:nemesis:scif,tcp
Note that it doesn't matter which NETMOD is used. Both SCIF and TCP perform the same for the provided intra-coprocessor results.
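For completeness, the MPICH 3.1.4 comparison numbers could be collected along the lines sketched below. The IMB-RMA benchmark would have to be rebuilt against this MPICH install, and the binary name and location shown here are assumptions rather than details from the thread.

# Put the MPICH 3.1.4 install built above (prefix from the configure line) first in PATH
export PATH=/opt/xeon-phi/mpich/bin:$PATH
# Run an IMB-RMA binary built against this MPICH on the coprocessor
mpiexec -hosts mic0 -n 30 ./IMB-RMA All_put_all -npmin 30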
Hi all,
Sorry for the delay in response (vacation :) ).
Could you try setting the environment variable
I_MPI_SCALABLE_OPTIMIZATION=off
and running the benchmarks again?
Thank you,
--Sergey
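For reference, applying this suggestion to the earlier multi-coprocessor run is just a matter of exporting the variable before mpirun; a sketch using the paths from earlier in the thread:

export I_MPI_SCALABLE_OPTIMIZATION=off
/opt/intel/impi/5.1.0.079/bin64/mpirun -hosts mic0,mic1 -ppn 10 -n 20 \
    /opt/intel/impi/5.1.0.079/mic/bin/IMB-RMA All_put_all -npmin 20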
Hi,
Glad to see this trick helped. Unfortunately, some 'optimizations' can negatively affect the performance of some parts of the library. We will look into how to improve the performance of Intel MPI.
Thank you for your help.
--Sergey
Thanks Sergey. Results are a lot better now. Any other tips are appreciated!
# mic0 only
#----------------------------------------------------------------
# Benchmarking All_put_all
# #processes = 20
#----------------------------------------------------------------
#bytes  #repetitions    t_min[usec]     t_max[usec]     t_avg[usec]
     0          1000           9.01            9.28            9.04
     1          1000        1958.35         6149.54         5213.35
     2          1000        1933.73         5842.54         4912.68
     4          1000        1844.58         4922.10         4166.31
     8          1000         813.38         5254.21         3675.51
    16          1000        1990.48         4291.77         3553.58
    32          1000        2118.31         4424.42         4174.93
    64          1000        1914.85         5242.47         4364.39
   128          1000        2135.56         5652.61         4803.70
   256          1000         496.82         2641.65         1850.21
   512          1000         458.69         3307.59         2266.34
  1024          1000         409.40         3778.35         2688.41
  2048          1000         726.40         5068.91         4359.04
  4096          1000         635.39         5475.80         3997.22
  8192          1000         438.93         7427.49         4597.22
 16384          1000         835.50         5956.76         4511.96
 32768          1000        2124.04         8104.35         5886.51
 65536           640        1669.24         6120.93         4126.35
131072           320        5436.16        11125.58         8086.85
262144           160        6827.09        16458.80        12331.01
524288            80       16029.50        29737.80        23515.52
1048576           40       29509.77        56208.28        44146.67
2097152           20       45546.20       102373.10        76828.56
4194304           10      106930.99       199623.89       150745.65
# mic0 and mic1, ppn=10
#----------------------------------------------------------------
# Benchmarking All_put_all
# #processes = 20
#----------------------------------------------------------------
#bytes  #repetitions    t_min[usec]     t_max[usec]     t_avg[usec]
     0          1000           9.01            9.04            9.02
     1          1000         938.84         2009.12         1647.31
     2          1000        1263.50         3198.81         2661.80
     4          1000         737.21         2470.52         2001.17
     8          1000        1114.56         3115.82         2308.15
    16          1000        1005.04         2620.89         2126.73
    32          1000        1118.84         3075.02         2413.24
    64          1000         467.71         2009.17         1557.10
   128          1000         641.46         2397.41         1531.80
   256          1000         801.47         2340.76         2061.25
   512          1000         706.79         3172.71         1982.05
  1024          1000         999.00         3606.51         2265.85
  2048          1000        1646.90         3568.09         2741.84
  4096          1000        2990.36         5275.52         4402.02
  8192           405       15819.40        25067.44        21854.30
 16384           405       15215.45        24422.11        21221.95
 32768           387       15266.97        26824.13        23057.98
 65536           363       18548.76        28672.06        24969.52
131072           280       26215.21        36583.42        31850.96
262144           160       33443.30        59622.76        49245.38
524288            80       26353.59        47578.89        40713.24
1048576           40       27929.85        88806.33        76020.28
2097152           20       42961.11       112424.41        92187.03
4194304           10       77796.48       202296.09       159834.60
EDIT 1 (2015-08-12 10:05): Updated results for the mic0,mic1 run.
