Internal MPI Errors

Hi all,

I'm running Intel MPI Library 2017 Update 1 (v2017.1.143) for Windows on two identical Windows Server 2012 R2 Standard 64-bit nodes. Each node has the following specs:

CPU:  Intel Xeon CPU E5-2450 v2 @ 2.50GHz

RAM: DDR3, 49086 MB, triple channel (800 MHz)

GPU: NVIDIA Tesla K40c (driver version 24.21.14.1229)

Network: Mellanox ConnectX-3 Pro Ethernet Adapter (2)

  • Driver: Mellanox Infiniband 40Gbit ConnectX 3 Pro HBA driver, Version 5.10 (MLNX_VPI_WinOF-5_10_All_win2012R2_x64.exe)

I have tried two fabric settings, dapl:dapl and shm:tcp. The DAPL version is "DAPL-ND - DAPL NetworkDirect Stand Alone installer v1.4.5 [06-02-2016]".

When using shm:tcp, PMPI_Init_thread fails on rank 10 with "read from socket failed - No error". Here is the command and the full trace:

mpiexec -l -genv I_MPI_FABRICS shm:tcp -genv I_MPI_PIN_DOMAIN=omp -genv I_MPI_WAIT_MODE=1 -genv I_MPI_DEBUG=1000 -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe master --do stitching --dted --exposure --config-path ..\input\cape\cape_fl_veo_50_dual.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id CAPE50 --dds-id 121 -i 10.4.1.* --gcp-db-ip 10.4.1.122 --group-name EO50 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe slave --do stabilization --group-name EO50 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe slave --do bgsubtraction --group-name EO50 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe slave --do tracking --group-name EO50 : -n 1 -host 10.0.0.2 ../ped/Release/Cape.exe slave -d -o dds file --group-name EO50 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe master --do stitching --exposure --config-path ..\input\cape\cape_fl_veo_100_dual.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id CAPE100 --dds-id 121 -i 10.4.1.* --gcp-db-ip 10.4.1.122 --group-name EO100 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe slave --do stabilization --group-name EO100 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe slave --do bgsubtraction --group-name EO100 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe slave --do tracking --group-name EO100 : -n 1 -host 10.0.0.2 ../ped/Release/Cape.exe slave -d -o dds file --group-name EO100 : -n 1 -host 10.0.0.1 ../ped/Release/CameraController.exe --config-path ..\input\cc\cc_veo.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id CC : -n 1 -host 10.0.0.1 ../ped/Release/CameraGroupProxy.exe --config-path ..\input\cgp\configuration.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id CGP : -n 1 -host 10.0.0.2 ../ped/Release/GroupMetadataSynchronizer.exe --config-path ..\input\gms\configuration.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id GMS --dds-reader-id 121 --dds-reader-allow-interface 10.4.1.* --dds-writer-id 122 --dds-writer-allow-interface 10.4.1.*
[11] WARNING: Logging before InitGoogleLogging() is written to STDERR
[11] I0430 19:24:42.494819 13396 ArgParser.cpp:66] CGP MQTT IP set to 10.4.1.121
[11] I0430 19:24:42.494819 13396 ArgParser.cpp:73] CGP MQTT Port set to 1883
[11] I0430 19:24:42.494819 13396 ArgParser.cpp:80] CGP MQTT ID set to CGP
[10] WARNING: Logging before InitGoogleLogging() is written to STDERR
[10] I0430 19:24:42.501842 3068 ArgParser.cpp:93] CC Config path set to ..\input\cc\cc_veo.xml
[10] I0430 19:24:42.504853 3068 ArgParser.cpp:181] CC MQTT IP set to 10.4.1.121
[10] I0430 19:24:42.504853 3068 ArgParser.cpp:188] CC MQTT Port set to 1883
[10] I0430 19:24:42.504853 3068 ArgParser.cpp:195] CC MQTT ID set to CC
[10] I0430 19:24:42.504853 3068 Executor.cpp:53] initMpi
[2] WARNING: Logging before InitGoogleLogging() is written to STDERR
[3] WARNING: Logging before InitGoogleLogging() is written to STDERR
[3] W0430 19:24:42.506860 9924 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: tracking
[3] I0430 19:24:42.506860 9924 ArgParser.cpp:406] Group name set to EO50
[6] WARNING: Logging before InitGoogleLogging() is written to STDERR
[6] W0430 19:24:42.506860 5404 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: stabilization
[6] I0430 19:24:42.506860 5404 ArgParser.cpp:406] Group name set to EO100
[6] I0430 19:24:42.506860 5404 Executor.cpp:53] initMpi
[2] W0430 19:24:42.506860 16752 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: bgsubtraction
[2] I0430 19:24:42.506860 16752 ArgParser.cpp:406] Group name set to EO50
[2] I0430 19:24:42.506860 16752 Executor.cpp:53] initMpi
[3] I0430 19:24:42.506860 9924 Executor.cpp:53] initMpi
[5] WARNING: Logging before InitGoogleLogging() is written to STDERR
[8] WARNING: Logging before InitGoogleLogging() is written to STDERR
[8] W0430 19:24:42.506860 11428 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: tracking
[8] I0430 19:24:42.506860 11428 ArgParser.cpp:406] Group name set to EO100
[8] I0430 19:24:42.506860 11428 Executor.cpp:53] initMpi
[1] WARNING: Logging before InitGoogleLogging() is written to STDERR
[1] W0430 19:24:42.506860 14572 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: stabilization
[1] I0430 19:24:42.506860 14572 ArgParser.cpp:406] Group name set to EO50
[5] W0430 19:24:42.506860 16108 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: stitching
[1] I0430 19:24:42.506860 14572 Executor.cpp:53] initMpi
[5] I0430 19:24:42.506860 16108 ArgParser.cpp:356] Config path is set to ..\input\cape\cape_fl_veo_100_dual.xml
[5] I0430 19:24:42.506860 16108 ArgParser.cpp:363] MQTT IP is set to 10.4.1.121
[5] I0430 19:24:42.506860 16108 ArgParser.cpp:370] MQTT Port is set to 1883
[5] I0430 19:24:42.506860 16108 ArgParser.cpp:377] MQTT ID is set to CAPE100
[5] I0430 19:24:42.506860 16108 ArgParser.cpp:384] DDS ID is set to 121
[5] I0430 19:24:42.506860 16108 ArgParser.cpp:391] DDS Allow Interface is set to 10.4.1.*
[5] I0430 19:24:42.506860 16108 ArgParser.cpp:398] GCP DB IP set to 10.4.1.122
[5] I0430 19:24:42.506860 16108 ArgParser.cpp:406] Group name set to EO100
[5] I0430 19:24:42.506860 16108 ArgParser.cpp:184] Exposure will be executed along with provided processes
[7] WARNING: Logging before InitGoogleLogging() is written to STDERR
[7] W0430 19:24:42.506860 1656 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: bgsubtraction
[7] I0430 19:24:42.506860 1656 ArgParser.cpp:406] Group name set to EO100
[5] I0430 19:24:42.506860 16108 Executor.cpp:53] initMpi
[7] I0430 19:24:42.506860 1656 Executor.cpp:53] initMpi
[0] WARNING: Logging before InitGoogleLogging() is written to STDERR
[0] W0430 19:24:42.506860 14212 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: stitching
[0] I0430 19:24:42.507864 14212 ArgParser.cpp:356] Config path is set to ..\input\cape\cape_fl_veo_50_dual.xml
[0] I0430 19:24:42.507864 14212 ArgParser.cpp:363] MQTT IP is set to 10.4.1.121
[0] I0430 19:24:42.507864 14212 ArgParser.cpp:370] MQTT Port is set to 1883
[0] I0430 19:24:42.507864 14212 ArgParser.cpp:377] MQTT ID is set to CAPE50
[0] I0430 19:24:42.507864 14212 ArgParser.cpp:384] DDS ID is set to 121
[0] I0430 19:24:42.507864 14212 ArgParser.cpp:391] DDS Allow Interface is set to 10.4.1.*
[0] I0430 19:24:42.507864 14212 ArgParser.cpp:398] GCP DB IP set to 10.4.1.122
[0] I0430 19:24:42.507864 14212 ArgParser.cpp:406] Group name set to EO50
[0] I0430 19:24:42.507864 14212 ArgParser.cpp:184] Exposure will be executed along with provided processes
[0] I0430 19:24:42.507864 14212 ArgParser.cpp:240] CAPE will attempt to calculate elevation matrix from dted file if any processes require it.
[0] I0430 19:24:42.507864 14212 Executor.cpp:53] initMpi
[12] WARNING: Logging before InitGoogleLogging() is written to STDERR
[12] I0430 19:24:42.523279 261596 ArgParser.cpp:56] GMS MQTT IP set to 10.4.1.121
[12] I0430 19:24:42.523279 261596 ArgParser.cpp:63] GMS MQTT Port set to 1883
[12] I0430 19:24:42.523279 261596 ArgParser.cpp:70] GMS MQTT ID set to GMS
[12] I0430 19:24:42.523279 261596 ArgParser.cpp:77] GMS DDS Reader Domain ID set to 121
[12] I0430 19:24:42.523279 261596 ArgParser.cpp:84] GMS DDS Writer Domain ID set to 122
[12] I0430 19:24:42.523279 261596 ArgParser.cpp:91] GMS DDS Allow Interface for Reader set to 10.4.1.*
[12] I0430 19:24:42.523279 261596 ArgParser.cpp:98] GMS DDS Allow Interface for Writer set to 10.4.1.*
[4] WARNING: Logging before InitGoogleLogging() is written to STDERR
[4] I0430 19:24:42.537328 260632 ArgParser.cpp:133] Dissemination Medium Type is set as: 'DDS+FILE'
[4] I0430 19:24:42.537328 260632 ArgParser.cpp:406] Group name set to EO50
[4] I0430 19:24:42.537328 260632 Executor.cpp:53] initMpi
[9] WARNING: Logging before InitGoogleLogging() is written to STDERR
[9] I0430 19:24:42.538331 259540 ArgParser.cpp:133] Dissemination Medium Type is set as: 'DDS+FILE'
[9] I0430 19:24:42.538331 259540 ArgParser.cpp:406] Group name set to EO100
[9] I0430 19:24:42.538331 259540 Executor.cpp:53] initMpi
[11] I0430 19:24:43.496381 13396 Manager.cpp:150] Application Mode : INITIALIZING published!
[11] I0430 19:24:43.496381 13396 Executor.cpp:53] initMpi
[12] I0430 19:24:43.524725 261596 GMSApplication.cpp:134] DDS Initialization ...
[12] I0430 19:24:44.087589 261596 GMSApplication.cpp:144] DDS Reader initialized ! [121 - 10.4.1.*][12]
[12] I0430 19:24:44.087589 261596 GMSApplication.cpp:145] DDS Writer initialized ! [122 - 10.4.1.*]
[12] I0430 19:24:44.087589 261596 GMSApplication.cpp:87] MPI Initialization ...
[12] I0430 19:24:44.087589 261596 Executor.cpp:53] initMpi
[0] [0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 1 Build 20161016
[0] [0] MPI startup(): Copyright (C) 2003-2016 Intel Corporation. All rights reserved.
[0] [0] MPI startup(): Multi-threaded optimized library
[12] [12] MPI startup(): shm and tcp data transfer modes
[4] [4] MPI startup(): shm and tcp data transfer modes
[9] [9] MPI startup(): shm and tcp data transfer modes
[2] [2] MPI startup(): shm and tcp data transfer modes
[3] [3] MPI startup(): shm and tcp data transfer modes
[1] [1] MPI startup(): shm and tcp data transfer modes
[0] [0] MPI startup(): shm and tcp data transfer modes
[11] [11] MPI startup(): shm and tcp data transfer modes
[5] [5] MPI startup(): shm and tcp data transfer modes
[7] [7] MPI startup(): shm and tcp data transfer modes
[6] [6] MPI startup(): shm and tcp data transfer modes
[10] [10] MPI startup(): shm and tcp data transfer modes
[8] [8] MPI startup(): shm and tcp data transfer modes
[10] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
[10] MPIR_Init_thread(805)......................: fail failed
[10] MPID_Init(1783)............................: channel initialization failed
[10] MPIDI_CH3_Init(147)........................: fail failed
[10] MPID_nem_tcp_post_init(351)................: fail failed
[10] MPID_nem_newtcp_module_connpoll(3116)......: fail failed
[10] recv_id_or_tmpvc_info_success_handler(1336): read from socket failed - No error
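
Because the output of the 13 processes is interleaved, it can help to regroup the trace by the `[rank]` prefix that `mpiexec -l` adds to each line. The following is only a minimal sketch for reading such a trace: the function names `group_by_rank` and `ranks_with_fatal_error` are hypothetical helpers, not part of Intel MPI, and the sample lines are copied from the trace above.

```python
import re
from collections import defaultdict

# Matches the leading "[rank] " prefix that `mpiexec -l` puts on each output line.
RANK_PREFIX = re.compile(r"^\[(\d+)\]\s?(.*)$")


def group_by_rank(lines):
    """Return {rank: [line text without the prefix]} for lines carrying a rank prefix."""
    grouped = defaultdict(list)
    for line in lines:
        match = RANK_PREFIX.match(line.rstrip("\n"))
        if match:
            grouped[int(match.group(1))].append(match.group(2))
    return dict(grouped)


def ranks_with_fatal_error(grouped):
    """Return the sorted ranks whose output contains a 'Fatal error' line."""
    return sorted(
        rank
        for rank, messages in grouped.items()
        if any("Fatal error" in message for message in messages)
    )


# Sample lines taken from the shm:tcp trace (hypothetical selection, for illustration).
sample = [
    "[10] I0430 19:24:42.504853 3068 Executor.cpp:53] initMpi",
    "[0] [0] MPI startup(): shm and tcp data transfer modes",
    "[10] Fatal error in PMPI_Init_thread: Other MPI error, error stack:",
    "[10] MPID_Init(1783)............................: channel initialization failed",
]

grouped = group_by_rank(sample)
print(ranks_with_fatal_error(grouped))
```

Counting the colon-separated program blocks of the command from 0, rank 10 is the CameraController.exe process on host 10.0.0.1.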

When using dapl:dapl, PMPI_Init_thread fails on rank 5 with "Internal MPI error!" after an unexpected DAPL event. Here is the command and the full trace:

mpiexec -l -genv I_MPI_FABRICS dapl:dapl -genv I_MPI_PIN_DOMAIN=omp -genv I_MPI_WAIT_MODE=1           -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe master --do stitching --dted --exposure --config-path ..\input\cape\cape_fl_veo_50_dual.xml  --mqtt-ip 10.4.1.121  --mqtt-port 1883  --mqtt-id CAPE50  --dds-id 121  -i 10.4.1.*  --gcp-db-ip 10.4.1.122 --group-name EO50          : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave --do stabilization --group-name EO50          : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave --do bgsubtraction --group-name EO50          : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave --do tracking --group-name EO50          : -n 1 -host 10.0.0.2 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave -d -o dds file --group-name EO50          : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe master --do stitching --exposure --config-path ..\input\cape\cape_fl_veo_100_dual.xml  --mqtt-ip 10.4.1.121  --mqtt-port 1883  --mqtt-id CAPE100  --dds-id 121  -i 10.4.1.*  --gcp-db-ip 10.4.1.122 --group-name EO100          : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave --do stabilization --group-name EO100          : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave --do bgsubtraction --group-name EO100          : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave --do tracking --group-name EO100          : -n 1 -host 10.0.0.2 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave -d -o dds file --group-name EO100          : -n 1 -host 10.0.0.1 ../CameraController-2.3.1-48-SNAPSHOT-windows-amd64-vc14/bin/CameraController.exe --config-path ..\input\cc\cc_veo.xml  --mqtt-ip 10.4.1.121  --mqtt-port 1883  --mqtt-id CC          : -n 1 -host 10.0.0.1 
../CameraGroupProxy-0.0.2-41-SNAPSHOT-windows-amd64-vc14/bin/CameraGroupProxy.exe --config-path ..\input\cgp\configuration.xml  --mqtt-ip 10.4.1.121  --mqtt-port 1883  --mqtt-id CGP          : -n 1 -host 10.0.0.2 ../GroupMetadataSynchronizer-0.0.2-42-SNAPSHOT-windows-amd64-vc14/bin/GroupMetadataSynchronizer.exe --config-path ..\input\gms\configuration.xml  --mqtt-ip 10.4.1.121  --mqtt-port 1883  --mqtt-id GMS  --dds-reader-id 121  --dds-reader-allow-interface 10.4.1.*  --dds-writer-id 122  --dds-writer-allow-interface 10.4.1.*
[11] WARNING: Logging before InitGoogleLogging() is written to STDERR
[11] I0503 09:44:51.214726  8132 ArgParser.cpp:66] CGP MQTT IP set to 10.4.1.121
[11] I0503 09:44:51.215728  8132 ArgParser.cpp:73] CGP MQTT Port set to 1883
[11] I0503 09:44:51.215728  8132 ArgParser.cpp:80] CGP MQTT ID set to CGP
[12] WARNING: Logging before InitGoogleLogging() is written to STDERR
[12] I0503 09:45:04.446153  7932 ArgParser.cpp:56] GMS MQTT IP set to 10.4.1.121
[12] I0503 09:45:04.446153  7932 ArgParser.cpp:63] GMS MQTT Port set to 1883
[12] I0503 09:45:04.447154  7932 ArgParser.cpp:70] GMS MQTT ID set to GMS
[12] I0503 09:45:04.447154  7932 ArgParser.cpp:77] GMS DDS Reader Domain ID set to 121
[12] I0503 09:45:04.447154  7932 ArgParser.cpp:84] GMS DDS Writer Domain ID set to 122
[12] I0503 09:45:04.447154  7932 ArgParser.cpp:91] GMS DDS Allow Interface for Reader set to 10.4.1.*
[12] I0503 09:45:04.447154  7932 ArgParser.cpp:98] GMS DDS Allow Interface for Writer set to 10.4.1.*
[9] WARNING: Logging before InitGoogleLogging() is written to STDERR
[4] WARNING: Logging before InitGoogleLogging() is written to STDERR
[4] I0503 09:45:04.471177  7412 ArgParser.cpp:133] Dissemination Medium Type is set as: 'DDS+FILE'
[4] I0503 09:45:04.471177  7412 ArgParser.cpp:406] Group name set to EO50
[9] I0503 09:45:04.471177  8356 ArgParser.cpp:133] Dissemination Medium Type is set as: 'DDS+FILE'
[9] I0503 09:45:04.471177  8356 ArgParser.cpp:406] Group name set to EO100
[12] I0503 09:45:05.447875  7932 GMSApplication.cpp:134] DDS Initialization ...
[12] I0503 09:45:06.021322  7932 GMSApplication.cpp:144] DDS Reader initialized ! [121 - 10.4.1.*]
[12] I0503 09:45:06.021322  7932 GMSApplication.cpp:145] DDS Writer initialized ! [122 - 10.4.1.*]
[12] I0503 09:45:06.021322  7932 GMSApplication.cpp:87] MPI Initialization ...
[11] I0503 09:44:53.217674  8132 Manager.cpp:150] Application Mode : INITIALIZING published!
[4] dapls_ib_get_dto_status() Unknown NT Error 0xc000021b? ret DAT_INTERNAL_ERR
[5] [5:10.0.0.1] unexpected DAPL event 0x4005
[5] Fatal error in PMPI_Init_thread: Internal MPI error!, error stack:
[5] MPIR_Init_thread(805): fail failed
[5] MPID_Init(1783)......: channel initialization failed
[5] MPIDI_CH3_Init(147)..: fail failed
[5] (unknown)(): Internal MPI error!

Any ideas?
