Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI 4 multi-node running problem on Windows

Yongjun
Beginner

I just upgraded the Intel MPI Library for Windows from 3.0.012 to 4.0.0.011. After the upgrade I can run a parallel case on a single node without problems, but if I run a parallel case across multiple nodes, my program always stops. When I debugged the run, the processes started up in shm data transfer mode. If I set I_MPI_FABRICS to shm:tcp, the program also stops. If I set I_MPI_FABRICS to tcp, the program runs. If I set I_MPI_FABRICS to dapl and set I_MPI_FALLBACK to enable, the program also runs.

But that is not what I want. We are developing commercial software, and we want MPI to select the fabric automatically, because our users may not know how to set these environment variables. The problem happens on both the Windows XP 64-bit and Windows 7 64-bit versions. Has anyone met the same problem? Thanks,
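For reference, the runs I tried look roughly like this (a sketch; the host names, process counts, and executable name are placeholders, not my exact commands):

rem fails across nodes: default (automatic) fabric selection
mpiexec -hosts 2 node1 2 node2 2 myapp.exe
rem also fails: shm:tcp
mpiexec -genv I_MPI_FABRICS shm:tcp -hosts 2 node1 2 node2 2 myapp.exe
rem works: tcp only
mpiexec -genv I_MPI_FABRICS tcp -hosts 2 node1 2 node2 2 myapp.exe
rem works: dapl with fallback enabled
mpiexec -genv I_MPI_FABRICS dapl -genv I_MPI_FALLBACK enable -hosts 2 node1 2 node2 2 myapp.exe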

Dmitry_K_Intel2
Employee
Hello,

Could you check the version of smpd running on your nodes? You can get this information by running the following command in a command window:
smpd -get binary
It should be from version 4.0.

Could you also check the old environment - some env variables might be left over from the previous version. At the very least there should be no I_MPI_DEVICE.

When you run Intel MPI with the default settings, fallback (I_MPI_FALLBACK) is enabled, so the library checks all available fast fabrics and, if none of them is available, falls back to tcp. You can see which fast fabric has been selected by setting I_MPI_DEBUG=2 (or higher).
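For example, a run along these lines (the host names and the executable are just placeholders) will print the selected data transfer mode for each rank during startup:

rem I_MPI_DEBUG set to 2 or higher reports the chosen fabric at MPI startup
mpiexec -genv I_MPI_DEBUG 2 -hosts 2 node1 2 node2 2 myapp.exe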

If setting I_MPI_FABRICS=shm:tcp makes everything work fine, that means something prevents the library from running the same way in the default mode.

BTW: You can upgrade 4.0.0.011 to 4.0.1 (and very soon to 4.0.2)

Regards!
Dmitry
Yongjun
Beginner
I ran smpd -get binary, and it shows:
C:\Program Files (x86)\Intel\MPI-RT\4.0.0.012\em64t\bin\smpd.exe
If I run smpd -V, it shows:
Intel MPI Library for Windows* OS, Version 4.0 Build 2/18/2010 1:00:47 PM
Copyright (C) 2007-2010, Intel Corporation. All rights reserved.
If I run smpd -version, it shows:
3.1
I installed our package on two freshly installed Windows 7 64-bit computers. I think there are some bugs in MPI version 4.0.0.012.
I want to know whether I need to install any other libraries.
Dmitry_K_Intel2
Employee
>If I set I_MPI_FABRICS to shm:tcp, program also stopped.
Could you run your program like:
mpiexec -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS shm:tcp -hosts [your hosts and number of processes] ./app_name

And place the output here.

Regards!
Dmitry
Yongjun
Beginner
Thanks, Dmitry
I ran 5 cases.
-------------------------------------------------------------------------------------------------------
First case: I ran without -genv I_MPI_FABRICS shm:tcp and set both host names to the same machine, gems3. The output is below.
mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -hosts 2 gems3 3 gems3 2 -pwdfile "mypassword" "myfile"
[3] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[4] MPI startup(): shm data transfer mode
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS_LIST=dapl,tcp
[0] MPI startup(): I_MPI_FALLBACK=enable
[0] MPI startup(): NUMBER_OF_PROCESSORS=1
[0] MPI startup(): PROCESSOR_IDENTIFIER=AMD64 Family 16 Model 5 Stepping 3, AuthenticAMD
[0] Rank Pid Node name Pin cpu
[0] 0 672 gems3 n/a
[0] 1 3056 gems3 n/a
[0] 2 2960 gems3 n/a
[0] 3 2116 gems3 n/a
[0] 4 2700 gems3 n/a
Running time : 0:00:01.
-------------------------------------------------------------------------------------------------------
Second case: I ran with -genv I_MPI_FABRICS shm:tcp and again set both host names to the same machine, gems3. The output is below.
mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS shm:tcp -hosts 2 gems3 3 gems3 2 -pwdfile "mypassword" "myfile"
[4] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:tcp
[0] MPI startup(): I_MPI_FABRICS_LIST=dapl,tcp
[0] MPI startup(): I_MPI_FALLBACK=enable
[0] MPI startup(): NUMBER_OF_PROCESSORS=1
[0] MPI startup(): PROCESSOR_IDENTIFIER=AMD64 Family 16 Model 5 Stepping 3, AuthenticAMD
[0] Rank Pid Node name Pin cpu
[0] 0 2348 gems3 n/a
[0] 1 2712 gems3 n/a
[0] 2 2568 gems3 n/a
[0] 3 2192 gems3 n/a
[0] 4 2408 gems3 n/a
Running time : 0:00:01.
The above two cases both work fine, because all processes start on the same computer.
-------------------------------------------------------------------------------------------------------
Third case: I ran without -genv I_MPI_FABRICS shm:tcp and set the host names to two different machines, gems3 and gems4. The output is below.
mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -hosts 2 gems3 3 gems4 2 -pwdfile "mypassword" "myfile"
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[4] MPI startup(): shm data transfer mode
job aborted:
rank: node: exit code[: error message]
0: gems3: -1073741819: process 0 exited without calling finalize
1: gems3: -1073741819: process 1 exited without calling finalize
2: gems3: 123
3: gems4: 123
4: gems4: 123
-------------------------------------------------------------------------------------------------------
Fourth case: I ran with -genv I_MPI_FABRICS shm:tcp and set the host names to two different machines, gems3 and gems4. The output is below.
mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS shm:tcp -hosts 2 gems3 3 gems4 2 -pwdfile "mypassword" "myfile"
[1] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[4] MPI startup(): shm and tcp data transfer modes
job aborted:
rank: node: exit code[: error message]
0: gems3: -1073741819: process 0 exited without calling finalize
1: gems3: -1073741819: process 1 exited without calling finalize
2: gems3: 123
3: gems4: 123
4: gems4: 123
The above two cases don't work, because the processes start on different computers.
-------------------------------------------------------------------------------------------------------
Fifth case: I ran with -genv I_MPI_FABRICS tcp and set the host names to two different machines, gems3 and gems4. The output is below.
mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -genv I_MPI_FABRICS tcp -hosts 2 gems3 3 gems4 2 -pwdfile "mypassword" "myfile"
[0] MPI startup(): tcp data transfer mode
[2] MPI startup(): tcp data transfer mode
[1] MPI startup(): tcp data transfer mode
[3] MPI startup(): tcp data transfer mode
[4] MPI startup(): tcp data transfer mode
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=tcp
[0] MPI startup(): I_MPI_FABRICS_LIST=dapl,tcp
[0] MPI startup(): I_MPI_FALLBACK=enable
[0] MPI startup(): NUMBER_OF_PROCESSORS=1
[0] MPI startup(): PROCESSOR_IDENTIFIER=AMD64 Family 16 Model 5 Stepping 3, AuthenticAMD
[0] Rank Pid Node name Pin cpu
[0] 0 2080 gems3 n/a
[0] 1 2908 gems3 n/a
[0] 2 920 gems3 n/a
[0] 3 1464 gems4 n/a
[0] 4 2096 gems4 n/a
Running time : 0:00:01.
This case works fine.
Dmitry_K_Intel2
Employee
Well, it's not clear why the library doesn't work in the shm:tcp case. Do gems3 and gems4 have the same CPUs?
Could you try the following command:
mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -genv I_MPI_PLATFORM 0 -genv I_MPI_FABRICS shm:tcp -hosts 2 gems3 3 gems4 2 -pwdfile "mypassword" "myfile"

If I_MPI_PLATFORM doesn't help, please download Intel MPI Library version 4.0 Update 1 and give it a try. Remember that it should be updated on all nodes.

Regards!
Dmitry
Yongjun
Beginner
Dear Dmitry,
Thanks. I tried -genv I_MPI_PLATFORM 0, but it still doesn't help. The two computers have the same CPUs.
I also upgraded the MPI library from 4.0.0.012 to 4.0.1.007 and recompiled the program, but the problem is the same. If I use MPI library 3.2.012, my program works fine.
Our customers have different networks, such as InfiniBand, Myrinet, and Ethernet, but many of them don't understand network settings. We want to run our program with the default settings and have it work properly. With MPI 3.2.012 our program selects the network automatically, but with MPI version 4 it cannot even start between two computers.
Dmitry_K_Intel2
Employee
Could you please check the environment on both computers? It is quite possible that something is left over from your previous installation; look especially for I_MPI_DEVICE. Somehow your default run on 2 computers starts in shm-only mode - that's not what we are expecting.

"Third case, I ran without -genv I_MPI_FABRICS shm:tcp, and I set host name to the two different name, gem3 and gems4. The output is below.

mpiexec -wdir "mydir" -genv I_MPI_DEBUG 5 -hosts 2 gems3 3 gems4 2 -pwdfile "mypassword" "myfile"

[1] MPI startup(): shm data transfer mode

[0] MPI startup(): shm data transfer mode"
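As a quick way to check for leftovers on each node, the standard cmd prefix listing can be used (plain Windows behavior, nothing Intel-specific):

rem lists every environment variable whose name starts with I_MPI
set I_MPI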


You don't need any other library - everything should work fine.
Do you use a script to run your application? Maybe you make some settings there?

Regards!
Dmitry
Yongjun
Beginner
Hi Dmitry,
The following are the environment settings. Both computers are almost the same. There is no I_MPI_DEVICE setting.

BTW, the two computers are fresh installations; no previous MPI was installed.

ALLUSERSPROFILE=C:\ProgramData
APPDATA=C:\Users\gems\AppData\Roaming
CommonProgramFiles=C:\Program Files\Common Files
CommonProgramFiles(x86)=C:\Program Files (x86)\Common Files
CommonProgramW6432=C:\Program Files\Common Files
COMPUTERNAME=GEMS3
ComSpec=C:\Windows\system32\cmd.exe
FP_NO_HOST_CHECK=NO
HOMEDRIVE=C:
HOMEPATH=\Users\gems
INTEL_LICENSE_FILE=C:\Program Files (x86)\Common Files\Intel\Licenses
I_MPI_FABRICS_LIST=dapl,tcp
I_MPI_FALLBACK=enable
I_MPI_ROOT=C:\Program Files (x86)\Intel\MPI\4.0.1.007\
LOCALAPPDATA=C:\Users\gems\AppData\Local
LOGONSERVER=\\GEMS3
NUMBER_OF_PROCESSORS=1
OS=Windows_NT
Path="C:\Program Files (x86)\Intel\MPI\4.0.1.007\em64t\bin";C:\Program Files (x86)\Intel\MPI\4.0.1.007\em64t\bin;C:\Windows\system32;C:\Windows;C:\Windows\Syst
em32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\
PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
PROCESSOR_ARCHITECTURE=AMD64
PROCESSOR_IDENTIFIER=AMD64 Family 16 Model 5 Stepping 3, AuthenticAMD
PROCESSOR_LEVEL=16
PROCESSOR_REVISION=0503
ProgramData=C:\ProgramData
ProgramFiles=C:\Program Files
ProgramFiles(x86)=C:\Program Files (x86)
ProgramW6432=C:\Program Files
PROMPT=$P$G
PSModulePath=C:\Windows\system32\WindowsPowerShell\v1.0\Modules\
PUBLIC=C:\Users\Public
SESSIONNAME=Console
SystemDrive=C:
SystemRoot=C:\Windows
TEMP=C:\Users\gems\AppData\Local\Temp
TMP=C:\Users\gems\AppData\Local\Temp
USERDOMAIN=gems3
USERNAME=gems
USERPROFILE=C:\Users\gems
windir=C:\Windows

I ran the cases using a batch file; we didn't make any settings in the batch file. But even when I run the case on the command line, the problem is the same.
Dmitry_K_Intel2
Employee
Hi Yongjun,

It's not clear why these variables are in the list:
I_MPI_FABRICS_LIST=dapl,tcp
I_MPI_FALLBACK=enable
They can be removed from the environment.

Could you please compile your program (you can compile the HelloWorld example from the test directory instead) with debug information and run it on 2 nodes with I_MPI_FABRICS=shm:tcp and I_MPI_DEBUG=50?
Please send me only lines with "business card" in them.
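For example, something like this should do it (a sketch, assuming the mpiicc wrapper from the Intel MPI bin directory and the test.c example from the test directory; the hosts and process counts are placeholders):

rem build the test example with debug information
mpiicc /Zi test.c
rem run across both nodes and keep only the "business card" lines
mpiexec -genv I_MPI_DEBUG 50 -genv I_MPI_FABRICS shm:tcp -hosts 2 gems3 1 gems4 1 test.exe | findstr /i /c:"business card"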

It looks like gems3 and gems4 are being treated as if they have the same IP address. Could you please check that they have different IP addresses?

Regards!
Dmitry
Yongjun
Beginner
Hi Dmitry,
I compiled the test case that comes with the Intel MPI package. The host names in ma.win are the IP addresses of the computers: 192.168.206.132 and 192.168.206.134. I am quite sure the two computers have different IP addresses.
I ran two cases.
First, I ran with shm:tcp; I cannot find any business card output.
mpiexec -genv I_MPI_DEBUG 50 -genv I_MPI_FABRICS shm:tcp -n 2 -machinefile ma.win -pwdfile pa.win test
[0] MPI startup(): Intel MPI Library, Version 4.0 Update 1 Build 20100910
[0] MPI startup(): Copyright (C) 2003-2010 Intel Corporation. All rights reserved.
[0] MPI startup(): I_MPI_LIBRARY_VERSION: 4.0 Update 1
[0] MPI startup(): I_MPI_VERSION_DATE_OF_BUILD: 9/10/2010 2:02:16 PM
[0] MPI startup(): I_MPI_VERSION_MY_CMD_LINE: winconfigure.wsf
[0] MPI startup(): I_MPI_VERSION_MACHINENAME: SVLMPIBLD07
[0] MPI startup(): I_MPI_DEVICE_VERSION: 4.0 Update 1 9/10/2010
[1] MPID_nem_impi_init_shm_configuration(): shm topology: windows pinning is unavailable
[1] MPID_nem_impi_init_shm_configuration(): shm memcpy: cache bypass thresholds: 16384,2097152,-1,2097152,-1,2097152
[1] MPID_nem_impi_init_shm_configuration(): shm topology: pinning is unavailable
[0] MPID_nem_impi_init_shm_configuration(): shm topology: windows pinning is unavailable
[0] MPID_nem_impi_init_shm_configuration(): shm memcpy: cache bypass thresholds: 16384,2097152,-1,2097152,-1,2097152
[0] MPID_nem_impi_init_shm_configuration(): shm topology: pinning is unavailable
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(527).................: Initialization failed
MPID_Init(171)........................: channel initialization failed
MPIDI_CH3_Init(70)....................:
MPID_nem_init_ckpt(665)...............:
MPIDI_CH3I_Seg_commit(372)............:
MPIU_SHMW_Hnd_deserialize(362)........:
MPIU_SHMW_Seg_open(942)...............:
MPIU_SHMW_Seg_create_attach_templ(826): unable to allocate shared memory - OpenFileMapping The system cannot find the file specified.
job aborted:
rank: node: exit code[: error message]
0: 192.168.206.132: 123
1: 192.168.206.134: 1: process 1 exited without calling finalize



Second, I ran with the tcp option and the program ran OK. We can find the business card output:
mpiexec -genv I_MPI_DEBUG 50 -genv I_MPI_FABRICS tcp -n 2 -machinefile ma.win -pwdfile pa.win test
[0] MPID_nem_init_ckpt(): business card: description="gems3 gems3 " port=23235 ifname="" fabrics_list=tcp
[0] getConnInfoKVS(): got business card: description="gems4 gems4 " port=33985 ifname="" fabrics_list=tcp
[1] MPID_nem_init_ckpt(): business card: description="gems4 gems4 " port=33985 ifname="" fabrics_list=tcp
Dmitry_K_Intel2
Employee
Hi Yongjun,

Well, it seems to me that you are using computer names without a DNS suffix. Please check this suffix under "My Computer" -> System Properties -> Computer Name (tab) -> "Change..." button.
The full computer name should have a DNS suffix. If it doesn't (the computer name looks like just 'gems3'), please press the "More..." button and type a suffix in the "Primary DNS suffix" field.
If you don't have a domain name, you can try to use 'local'.
You need to do this on each computer you are going to use.
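A quick way to verify the suffix afterwards from a command window (standard Windows tools, not an Intel-specific check):

rem the "Primary Dns Suffix" line should no longer be empty
ipconfig /all | findstr /i "suffix"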

Please do it and try to run a program with default parameters.

Regards!
Dmitry
Yongjun
Beginner
Hi Dmitry,
Thanks for your help. I added the DNS suffix, and now our program runs with the default parameters. Everything works fine.
I have a question. In Intel MPI version 3 the DNS suffix isn't needed; we can leave it empty. Is it a new requirement in MPI version 4?
By default, the DNS suffix is empty if the computer doesn't join any domain. Does MPI version 4 require it to be set in that case?
If I don't have a domain name, how do I use "local"?
Regards,
Yongjun
Dmitry_K_Intel2
Employee
Yongjun,
This is not a requirement, but the library sometimes behaves in unexpected ways when there is no DNS suffix. We are investigating the issue. For now just add a suffix - nothing else is needed.

Regards!
Dmitry
Seifer_Lin
Beginner
Hi Dmitry:
Will this issue be fixed in an update of Intel MPI (e.g. 4.0.2.006)?
From our experience:
On a cluster of Windows Server 2008 and Windows XP:
(1) the DNS suffix must be added to prevent the OpenFileMapping failure when using Intel MPI.
(2) once the DNS suffix is added, programs based on MPICH2 (which some of our customers already use) suffer a gethostbyname() failure. Our customers are unhappy about this...
regards,
Seifer
Dmitry_K_Intel2
Employee
Hi Seifer,

This issue will be fixed in the upcoming 4.0 Update 3 release, which should be available to customers sometime in November. I hope that this fix will resolve the inconsistency between different MPI implementations.

Regards!
Dmitry