Hi:
Our software product uses Intel MPI for parallel computing on Windows.
Recently, many of our customers have encountered the error below. Due to COVID-19, they all work from home over a VPN connection to the office.
They run our software for parallel computing at home, but as soon as they disconnect the VPN, the parallel computation stops as well.
I can reproduce the error with the following steps (see also the command sketch after the output below):
1. Run IMB-MPI1.exe with the command: mpiexec.exe -localonly -n 4 C:\test\IMB-MPI1.exe
2. While IMB-MPI1.exe is still running, disable any of the network interfaces (I have 3 NICs: 2 created by VMware and 1 physical). The run then aborts with the following errors:
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.06 0.06
1 1000 0.77 0.79 0.78
2 1000 0.86 0.87 0.86
4 1000 0.86 0.90 0.88
8 1000 0.78 0.80 0.79
16 1000 1.01 1.15 1.08
32 1000 1.18 1.22 1.20
64 1000 0.87 0.90 0.88
128 1000 1.31 1.36 1.34
256 1000 1.29 1.35 1.31
512 1000 1.52 1.57 1.55
1024 1000 1.45 1.47 1.46
2048 1000 2.57 2.77 2.67
4096 1000 3.60 4.05 3.88
8192 1000 4.99 5.31 5.13
16384 1000 8.44 8.74 8.52
32768 1000 14.06 14.34 14.14
65536 640 35.06 35.86 35.47
131072 320 59.00 67.31 63.18
262144 160 156.66 167.80 161.57
524288 80 869.78 896.27 880.66
1048576 40 2402.85 2564.92 2484.02
2097152 20 4692.22 4907.47 4789.06
[mpiexec@PCAcer144006] ..\hydra\pm\pmiserv\pmiserv_cb.c (863): connection to proxy 0 at host PCAcer144006 failed
[mpiexec@PCAcer144006] ..\hydra\tools\demux\demux_select.c (103): callback returned error status
[mpiexec@PCAcer144006] ..\hydra\pm\pmiserv\pmiserv_pmci.c (520): error waiting for event
[mpiexec@PCAcer144006] ..\hydra\ui\mpich\mpiexec.c (1157): process manager error waiting for completion
C:\Program Files\Intel MPI 2018\x64>
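For reference, here is roughly what the reproduction looks like as plain commands. This is only a sketch: "Ethernet" is a placeholder for whatever the adapter is called under Network Connections on your machine, and the netsh command needs an administrator command prompt.
REM Window 1: start the benchmark
mpiexec.exe -localonly -n 4 C:\test\IMB-MPI1.exe
REM Window 2 (run as administrator): disable one NIC while the benchmark is still running
netsh interface set interface name="Ethernet" admin=disabled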
Is there a workaround? Thank you.
Hi Jim,
The error you are seeing occurs because one of the processes in your job ended abnormally, or because one of your processes had a segmentation fault (that is, it read from or wrote to an area of memory it is not permitted to access).
Could you provide the logs after setting FI_LOG_LEVEL=debug and I_MPI_DEBUG=5?
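For example, a minimal sketch (reusing the -localonly command from your first post; adjust the paths for your installation), passing both variables through mpiexec:
mpiexec.exe -localonly -n 4 -genv I_MPI_DEBUG 5 -genv FI_LOG_LEVEL debug C:\test\IMB-MPI1.exe
Alternatively, set them in the command prompt before launching:
set I_MPI_DEBUG=5
set FI_LOG_LEVEL=debug
mpiexec.exe -localonly -n 4 C:\test\IMB-MPI1.exe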
Are you sure that, even after disconnecting the VPN, the system can still communicate with the node on which the processes were launched?
Regards
Prasanth
Hi Jim,
Are you still facing the issue?
If so, please provide the details we asked for; they will help us debug the issue.
Regards
Prasanth
Hi Prasanth:
The VPN issue occurred in my client's environment, but I think it is similar to what I can reproduce in my own environment.
In my environment I have only 1 PC, which has 1 physical network interface and 2 virtual network interfaces created by VMware.
I run IMB-MPI1.exe, and before it finishes I disable one virtual network interface; then I get the MPI error.
My question is: why is a run with "-localonly" affected by disabling a network interface at all?
C:\>"C:\Program Files\Intel MPI 2018\x64\mpiexec.exe" -localonly -n 4 -genv I_MPI_DEBUG 5 -genv FI_LOG_LEVEL debug C:\test\IMB-MPI1.exe
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[1] MPI startup(): Internal info: pinning initialization was done
[0] MPI startup(): Internal info: pinning initialization was done
[2] MPI startup(): Internal info: pinning initialization was done
[3] MPI startup(): Internal info: pinning initialization was done
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 1124 PCAcer144006 {0,1}
[0] MPI startup(): 1 8256 PCAcer144006 {2,3}
[0] MPI startup(): 2 10636 PCAcer144006 {4,5}
[0] MPI startup(): 3 7604 PCAcer144006 {6,7}
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_PIN_MAPPING=4:0 0,1 2,2 4,3 6
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 2018, MPI-1 part
#------------------------------------------------------------
# Date : Wed May 06 09:38:14 2020
# Machine : Intel(R) 64 Family 6 Model 60 Stepping 3, GenuineIntel
# Release : 6.2.9200
# Version :
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# C:\test\IMB-MPI1.exe
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_scatter
# Allgather
# Allgatherv
# Gather
# Gatherv
# Scatter
# Scatterv
# Alltoall
# Alltoallv
# Bcast
# Barrier
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.17 0.00
1 1000 0.19 5.38
2 1000 0.23 8.74
4 1000 0.17 22.88
8 1000 0.23 34.39
16 1000 0.22 72.46
32 1000 0.23 139.98
64 1000 0.25 253.62
128 1000 0.26 483.66
256 1000 0.28 930.91
512 1000 0.31 1627.72
1024 1000 0.34 2978.48
2048 1000 0.46 4430.98
4096 1000 0.68 6066.35
8192 1000 1.23 6647.46
16384 1000 2.32 7072.43
32768 1000 3.73 8775.11
65536 640 8.29 7902.37
131072 320 15.10 8682.69
262144 160 29.76 8810.08
524288 80 53.14 9865.82
1048576 40 104.31 10052.62
2097152 20 249.32 8411.40
4194304 10 814.33 5150.65
#---------------------------------------------------
# Benchmarking PingPing
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 0.26 0.00
1 1000 0.51 1.96
2 1000 0.26 7.68
4 1000 0.26 15.23
8 1000 0.26 30.77
16 1000 0.35 45.18
32 1000 0.30 107.02
64 1000 0.28 225.19
128 1000 0.38 340.52
256 1000 0.39 659.62
512 1000 0.42 1221.67
1024 1000 0.44 2319.89
2048 1000 0.69 2974.15
4096 1000 0.90 4527.97
8192 1000 1.33 6174.72
16384 1000 2.62 6245.33
32768 1000 4.79 6843.49
65536 640 11.94 5487.05
131072 320 19.68 6660.06
262144 160 36.10 7261.10
524288 80 67.10 7813.10
1048576 40 131.38 7981.40
2097152 20 439.42 4772.55
4194304 10 1596.74 2626.79
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 0.40 0.40 0.40 0.00
1 1000 0.39 0.39 0.39 5.18
2 1000 0.67 0.67 0.67 6.01
4 1000 0.38 0.38 0.38 21.29
8 1000 0.40 0.40 0.40 40.26
16 1000 0.45 0.45 0.45 71.06
32 1000 0.46 0.46 0.46 138.20
64 1000 0.46 0.46 0.46 279.05
128 1000 1.05 1.05 1.05 244.00
256 1000 3.01 3.06 3.04 167.15
512 1000 0.58 0.58 0.58 1780.87
1024 1000 0.93 0.93 0.93 2204.76
2048 1000 0.99 0.99 0.99 4133.62
4096 1000 1.02 1.02 1.02 8014.09
8192 1000 1.62 1.62 1.62 10091.78
16384 1000 3.03 3.03 3.03 10807.74
32768 1000 4.29 4.29 4.29 15290.71
65536 640 11.81 11.81 11.81 11094.87
131072 320 37.92 37.92 37.92 6913.37
262144 160 74.20 74.20 74.20 7065.64
524288 80 72.93 72.93 72.93 14377.10
1048576 40 127.65 127.66 127.65 16427.96
2097152 20 399.46 399.52 399.49 10498.49
4194304 10 1420.54 1420.81 1420.68 5904.10
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 4
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 0.50 0.50 0.50 0.00
1 1000 0.45 0.45 0.45 4.45
2 1000 0.47 0.47 0.47 8.52
4 1000 0.45 0.45 0.45 17.91
8 1000 0.59 0.59 0.59 27.16
16 1000 0.46 0.46 0.46 69.04
32 1000 0.46 0.46 0.46 139.01
64 1000 0.48 0.48 0.48 267.06
128 1000 0.60 0.60 0.60 425.60
256 1000 0.59 0.59 0.59 861.08
512 1000 0.66 0.66 0.66 1540.31
1024 1000 0.71 0.71 0.71 2881.67
2048 1000 0.86 0.86 0.86 4766.67
4096 1000 1.16 1.17 1.17 7011.30
8192 1000 1.76 1.76 1.76 9288.51
16384 1000 2.89 2.89 2.89 11345.08
32768 1000 4.66 4.66 4.66 14051.76
65536 640 18.42 18.44 18.43 7109.17
131072 320 33.06 33.12 33.09 7915.72
262144 160 60.25 60.41 60.33 8678.74
524288 80 108.77 109.02 108.90 9618.09
1048576 40 336.31 340.51 338.59 6158.81
2097152 20 1149.49 1168.46 1159.12 3589.62
4194304 10 2544.51 2626.74 2585.81 3193.54
#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 0.87 0.87 0.87 0.00
1 1000 0.60 0.60 0.60 6.72
2 1000 0.68 0.68 0.68 11.68
4 1000 0.60 0.60 0.60 26.74
8 1000 0.57 0.57 0.57 55.99
16 1000 0.69 0.69 0.69 93.00
32 1000 0.65 0.65 0.65 198.30
64 1000 0.69 0.69 0.69 371.66
128 1000 0.88 0.88 0.88 584.01
256 1000 0.87 0.87 0.87 1172.30
512 1000 0.94 0.94 0.94 2175.02
1024 1000 1.00 1.00 1.00 4086.19
2048 1000 1.29 1.29 1.29 6335.65
4096 1000 1.74 1.74 1.74 9401.50
8192 1000 2.80 2.80 2.80 11708.71
16384 1000 5.07 5.07 5.07 12936.18
32768 1000 8.74 8.74 8.74 14990.79
65536 640 24.43 24.43 24.43 10729.59
131072 320 33.24 33.24 33.24 15774.44
262144 160 72.37 72.38 72.37 14487.97
524288 80 136.95 136.97 136.96 15311.17
1048576 40 294.08 294.14 294.11 14259.43
2097152 20 772.77 772.93 772.85 10852.93
4194304 10 2644.70 2645.16 2644.93 6342.61
#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 4
#-----------------------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] Mbytes/sec
0 1000 1.03 1.03 1.03 0.00
1 1000 1.00 1.01 1.01 3.98
2 1000 1.13 1.13 1.13 7.07
4 1000 1.04 1.04 1.04 15.32
8 1000 0.97 0.97 0.97 33.10
16 1000 1.00 1.00 1.00 64.13
32 1000 1.12 1.12 1.12 113.85
64 1000 1.10 1.10 1.10 232.35
128 1000 1.29 1.29 1.29 396.81
256 1000 1.24 1.25 1.24 822.16
512 1000 1.33 1.34 1.33 1533.97
1024 1000 1.48 1.48 1.48 2758.99
2048 1000 1.81 1.81 1.81 4530.22
4096 1000 2.92 2.92 2.92 5615.00
8192 1000 3.98 3.98 3.98 8228.41
16384 1000 6.92 6.93 6.93 9461.09
32768 1000 13.41 13.42 13.41 9769.97
65536 640 25.53 25.55 25.54 10259.79
131072 320 43.11 43.13 43.12 12156.87
262144 160 75.47 75.59 75.53 13871.89
524288 80 198.98 199.03 199.01 10536.93
1048576 40 635.41 636.96 636.41 6584.85
2097152 20 1987.81 1993.80 1991.49 4207.35
4194304 10 4715.18 4717.40 4716.28 3556.45
#----------------------------------------------------------------
# Benchmarking Allreduce
# #processes = 2
# ( 2 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
0 1000 0.05 0.05 0.05
4 1000 0.41 0.42 0.42
8 1000 0.42 0.43 0.42
16 1000 0.44 0.44 0.44
32 1000 0.43 0.44 0.44
64 1000 0.44 0.44 0.44
128 1000 0.59 0.60 0.59
256 1000 0.59 0.60 0.60
512 1000 0.66 0.72 0.69
1024 1000 0.73 0.80 0.77
2048 1000 0.87 0.91 0.89
4096 1000 1.22 1.25 1.23
8192 1000 2.30 2.37 2.34
16384 1000 3.96 4.02 3.99
32768 1000 7.19 7.19 7.19
65536 640 11.92 11.96 11.94
131072 320 63.15 63.70 63.43
262144 160 71.74 72.39 72.06
524288 80 126.79 127.65 127.22
1048576 40 459.69 469.03 464.36
[mpiexec@PCAcer144006] ..\hydra\pm\pmiserv\pmiserv_cb.c (863): connection to proxy 0 at host PCAcer144006 failed
[mpiexec@PCAcer144006] ..\hydra\tools\demux\demux_select.c (103): callback returned error status
[mpiexec@PCAcer144006] ..\hydra\pm\pmiserv\pmiserv_pmci.c (520): error waiting for event
[mpiexec@PCAcer144006] ..\hydra\ui\mpich\mpiexec.c (1157): process manager error waiting for completion
C:\>
Hi Jim,
We want to replicate your scenario at our end.
Could you please provide the following details:
1) VMware version
2) OS version running in VMware
3) Network adapter details
Regards
Prasanth
Hi Prasanth:
1. VMware version:
2. OS versions running in VMware: CentOS 7.2 x64 and CentOS 8.1 x64 (both VMs are powered off when running IMB-MPI1.exe)
3. Network adapter details:
(a) VMnet1
(b) VMnet8
(c) Physical NIC
regards,
Jim
Hi Jim,
We have tried to replicate your issue; here are the setup details:
1) VMware Workstation Pro 15
2) 2 CentOS 8.1 VMs
3) For the virtual network interface, we set "Custom: Specific virtual network" to VMnet0
We tried to reproduce the error by recreating 3 scenarios based on your inputs:
1) Disconnect the VPN while the benchmark is running.
2) Disable/switch off the virtual network interface while the benchmark is running.
(We disabled the network adapter in Control Panel\Network and Internet\Network Connections; as another case, we also switched off the virtual machine.)
3) Disconnect the VPN and disable/switch off the virtual network interface while the benchmark is running.
We did not get any errors in any of those scenarios.
We are not sure what we are missing here; can you give us any more inputs?
Thanks
Prasanth
Hi Prasanth:
I uninstalled VMware, so now there is only 1 physical network adapter in my computer.
I reinstalled Intel MPI 2018 Update 5 (w_mpi_p_2018.5.287.exe).
I cd to C:\Program Files (x86)\IntelSWTools\mpi\2018.5.287\intel64\bin
and run:
mpiexec.exe -localonly -n 4 IMB-MPI1.exe
While IMB-MPI1.exe is running, I disable the physical network adapter, and IMB-MPI1.exe aborts with the same errors.
The video: https://test-bucket-3a7c4c82.s3.amazonaws.com/mpi.mp4
regards,
Jim
Hi Jim,
Previously we tested with the latest IMPI version rather than IMPI 2018u5, which is why we did not see that error.
We have now tested with 2018u5 and reproduced the error, so there appears to be a bug that has been fixed in later versions.
We suggest you upgrade to the latest version of IMPI (currently 2019u7).
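After installing, you can double-check which mpiexec is picked up and its version, for example (assuming the -V option is available in your mpiexec build):
where mpiexec.exe
mpiexec.exe -V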
Regards
Prasanth
Hi Jim,
Could you please confirm that updating IMPI resolved your issue?
If not, please let us know what problem you are still facing.
Regards
Prasanth
Hi Prasanth:
Intel MPI 2019 update 7 indeed resolves the VPN issue.
Thank you.
regards,
Jim