Intel® MPI Library

Using MSMPI.dll - fortran crash forrtl 157

Reeves__Nathan

I am trying to get a simple example working across two Amazon nodes. When I run on a single machine without -machinefile (for example with -n 4), it all works fine. When I use -machinefile to launch across both nodes, the sample application on the slave node throws an exception: forrtl 157. I can use -machinefile and run both processes locally, and I can use -machinefile and run everything on the remote node. But when I add both addresses to hosts.txt, the sample application on the slave node crashes. The node names are the same, because the nodes are identical Amazon images. I have tried disabling firewalls. It always crashes when the slave node (rank 1) calls MPI_SEND.

Any pointers as to where I should start looking would be appreciated. Thanks.

Here is my sample application:

program hello
  implicit none
  include 'mpif.h'

  integer, parameter :: MASTER = 0

  integer :: numtasks, taskid, len, ierr, count, tag, i, j
  integer :: status(MPI_STATUS_SIZE)
  character(MPI_MAX_PROCESSOR_NAME) :: hostname
  double precision :: data(100)

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, taskid, ierr)

  ! Report which host each rank is running on
  call MPI_GET_PROCESSOR_NAME(hostname, len, ierr)
  write (*,20) taskid, hostname

  if (taskid .eq. MASTER) then
    write (*,30) numtasks
  end if

  ! Fill the first 10 elements with rank-dependent values
  do i = 1, 10
    data(i) = i * (taskid + 1)
  end do

  count = 10
  tag = 666

  ! All slaves send their data back to the master
  if (taskid .ne. MASTER) then
    call MPI_SEND(data, count, MPI_DOUBLE_PRECISION, MASTER, tag, &
                  MPI_COMM_WORLD, ierr)
    write (*,50) taskid
  end if

  ! The master receives from each slave in rank order and prints the data
  if (taskid .eq. MASTER) then
    do i = 1, numtasks - 1
      call MPI_RECV(data, count, MPI_DOUBLE_PRECISION, i, tag, &
                    MPI_COMM_WORLD, status, ierr)
      write (*,40) i
      do j = 1, count
        write (*,*) data(j)
      end do
    end do
  end if

  call MPI_FINALIZE(ierr)

20 format('Hello from task ',I2,' on ',A48)
30 format('MASTER: Number of MPI tasks is: ',I2)
40 format('MASTER: Received from: ',I2)
50 format('SLAVE:',I2,' : Sending to Master')

end program hello

Here is what I see on the console:
Hello from task  0 on WIN-MSDP9MK1V14
MASTER: Number of MPI tasks is:  2
Hello from task  1 on WIN-MSDP9MK1V14
forrtl: severe (157): Program Exception - access violation
Image              PC                Routine            Line        Source
 
msmpi.dll          00007FFCE7018FA6  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE7018C68  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE7018492  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE7012768  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE70130C6  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE707E87F  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE707DE8C  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE70930A5  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE7092785  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE703B2DB  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE70BD99B  Unknown               Unknown  Unknown
F-testapplication  00007FF785EB1240  Unknown               Unknown  Unknown
F-testapplication  00007FF785EB274E  Unknown               Unknown  Unknown
F-testapplication  00007FF785EB2B24  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFD038713D2  Unknown               Unknown  Unknown
ntdll.dll          00007FFD05CF5444  Unknown               Unknown  Unknown
 
job aborted:
[ranks] message
 
[0] terminated
 
[1] process exited without calling finalize
 
---- error analysis -----
 
[1] on 10.249.60.161
F-testapplication-nodebug.exe ended prematurely and may have crashed. exit code 157
 
---- error analysis -----
 
And this is what I see on the master SMPD process:
[-1:1220] Launching SMPD service.
[-1:1220] smpd listening on port 8677
[-1:1220] Authentication completed. Successfully obtained Context for Client.
[-1:1220] version check complete, using PMP version 3.
[-1:1220] create manager process (using smpd daemon credentials)
[-1:1220] smpd reading the port string from the manager
[-1:2700] Launching smpd manager instance.
[-1:2700] created set for manager listener, 236
[-1:2700] smpd manager listening on port 49472
[-1:1220] closing the pipe to the manager
[-1:2700] Authentication completed. Successfully obtained Context for Client.
[-1:2700] Authorization completed.
[-1:2700] version check complete, using PMP version 3.
[-1:2700] Received session header from parent id=2, parent=1, level=1
[02:2700] Connecting back to parent using host WIN-MSDP9MK1V14 and endpoint 49536
[02:2700] Previous attempt failed, trying again with a resolved parent host 10.251.11.178:49536
[02:2700] Authentication completed. Successfully obtained Context for Client.
[02:2700] Authorization completed.
[02:2700] handling command SMPD_COLLECT src=0
[02:2700] handling command SMPD_LAUNCH src=0
[02:2700] Successfully handled bcast nodeids command.
[02:2700] setting environment variable: <MPIEXEC_HOSTNAME> = <WIN-MSDP9MK1V14>
[02:2700] env: PMI_SIZE=2
[02:2700] env: PMI_KVS=267b6f49-eac4-46e3-ad68-6ec4dd9d4e4a
[02:2700] env: PMI_DOMAIN=876a283b-5270-45bf-8a2b-812063bbae3e
[02:2700] env: PMI_HOST=localhost
[02:2700] env: PMI_PORT=6a4c59b6-9ba4-473f-bd95-54efa41e77c8
[02:2700] env: PMI_SMPD_ID=2
[02:2700] env: PMI_APPNUM=0
[02:2700] env: PMI_NODE_IDS=s
[02:2700] env: PMI_RANK_AFFINITIES=a
[02:2700] searching for 'F-testapplication-nodebug.exe' in workdir 'C:\Users\Administrator\Downloads\test'
[02:2700] C>CreateProcess(C:\Users\Administrator\Downloads\test\F-testapplication-nodebug.exe F-testapplication-nodebug.exe)
[02:2700] env: PMI_RANK=1
[02:2700] env: PMI_SMPD_KEY=0
[02:2700] Authentication completed. Successfully obtained Context for Client.
[02:2700] Authorization completed.
[02:2700] version check complete, using PMP version 3.
[02:2700] 2 -> 0 : returning parent_context: 0 < 2
[02:2700] forwarding command SMPD_INIT to 0
[02:2700] posting command SMPD_INIT to parent, src=2, ctx_key=0, dest=0.
[02:2700] Handling cmd=SMPD_INIT result
[02:2700] forward SMPD_INIT result to dest=2 ctx_key=0
[02:2700] 2 -> 1 : returning parent_context: 1 < 2
[02:2700] Caching business card for rank 1
[02:2700] forwarding command SMPD_BCPUT to 1
[02:2700] posting command SMPD_BCPUT to parent, src=2, ctx_key=0, dest=1.
[02:2700] Handling cmd=SMPD_BCPUT result
[02:2700] forward SMPD_BCPUT result to dest=2 ctx_key=0
[02:2700] handling command SMPD_BARRIER src=2 ctx_key=0
[02:2700] Handling SMPD_BARRIER src=2 ctx_key=0
[02:2700] initializing barrier(267b6f49-eac4-46e3-ad68-6ec4dd9d4e4a): in=1 size=1
[02:2700] incrementing barrier(267b6f49-eac4-46e3-ad68-6ec4dd9d4e4a) incount from 0 to 1 out of 1
[02:2700] all in barrier, sending barrier to parent.
[02:2700] posting command SMPD_BARRIER to parent, src=2, ctx_key=65535, dest=1.
[02:2700] Handling cmd=SMPD_BARRIER result
[02:2700] cmd=SMPD_BARRIER result will be handled locally
[02:2700] sending reply to barrier command '267b6f49-eac4-46e3-ad68-6ec4dd9d4e4a'.
[02:2700] read 72 bytes from stdout
[02:2700] posting command SMPD_STDOUT to parent, src=2, dest=0.
[02:2700] 2 -> 1 : returning parent_context: 1 < 2
[02:2700] forwarding command SMPD_BCGET to 1
[02:2700] posting command SMPD_BCGET to parent, src=2, ctx_key=0, dest=1.
[02:2700] Handling cmd=SMPD_STDOUT result
[02:2700] cmd=SMPD_STDOUT result will be handled locally
[02:2700] Handling cmd=SMPD_BCGET result
[02:2700] forward SMPD_BCGET result to dest=2 ctx_key=0
[02:2700] Caching business card for rank 0
[02:2700] read 1024 bytes from stderr
[02:2700] posting command SMPD_STDERR to parent, src=2, dest=0.
[02:2700] read 358 bytes from stderr
[02:2700] posting command SMPD_STDERR to parent, src=2, dest=0.
[02:2700] Handling cmd=SMPD_STDERR result
[02:2700] cmd=SMPD_STDERR result will be handled locally
[02:2700] reading failed, assuming stdout is closed. error 0xc000014b
[02:2700] process_id=0 process refcount == 2, stdout closed.
[02:2700] reading failed, assuming stderr is closed. error 0xc000014b
[02:2700] process_id=0 process refcount == 1, stderr closed.
[02:2700] Handling cmd=SMPD_STDERR result
[02:2700] cmd=SMPD_STDERR result will be handled locally
[02:2700] process_id=0 process refcount == 0, pmi client closed.
[02:2700] process_id=0 rank=1 refcount=0, waiting for the process to finish exiting.
[02:2700] creating an exit command for process id=0  rank=1, pid=836, exit code=157.
[02:2700] posting command SMPD_EXIT to parent, src=2, dest=0.
[02:2700] Handling cmd=SMPD_EXIT result
[02:2700] cmd=SMPD_EXIT result will be handled locally
[02:2700] handling command SMPD_CLOSE from parent
[02:2700] sending 'closed' command to parent context
[02:2700] posting command SMPD_CLOSED to parent, src=2, dest=1.
[02:2700] Handling cmd=SMPD_CLOSED result
[02:2700] cmd=SMPD_CLOSED result will be handled locally
[02:2700] smpd manager successfully stopped listening.
[02:2700] SMPD exiting with error code 0.
 
This is what I see on the slave SMPD process:
[-1:736] Launching SMPD service.
[-1:736] smpd listening on port 8677
[-1:736] Authentication completed. Successfully obtained Context for Client.
[-1:736] version check complete, using PMP version 3.
[-1:736] create manager process (using smpd daemon credentials)
[-1:736] smpd reading the port string from the manager
[-1:2200] Launching smpd manager instance.
[-1:2200] created set for manager listener, 236
[-1:2200] smpd manager listening on port 49546
[-1:736] closing the pipe to the manager
[-1:2200] Authentication completed. Successfully obtained Context for Client.
[-1:2200] Authorization completed.
[-1:2200] version check complete, using PMP version 3.
[-1:2200] Received session header from parent id=1, parent=0, level=0
[01:2200] Connecting back to parent using host WIN-MSDP9MK1V14 and endpoint 49479
[01:2200] Previous attempt failed, trying again with a resolved parent host 10.249.60.161:49479
[01:2200] Authentication completed. Successfully obtained Context for Client.
[01:2200] Authorization completed.
[01:2200] handling command SMPD_CONNECT src=0
[01:2200] now connecting to 10.249.60.161
[01:2200] 1 -> 2 : returning SMPD_CONTEXT_LEFT_CHILD
[01:2200] using spn RestrictedKrbHost/10.249.60.161 to contact server
[01:2200] WIN-MSDP9MK1V14 posting a re-connect to 10.249.60.161:49483 in left child context.
[01:2200] Authentication completed. Successfully obtained Context for Client.
[01:2200] Authorization completed.
[01:2200] version check complete, using PMP version 3.
[01:2200] 1 -> 2 : returning SMPD_CONTEXT_LEFT_CHILD
[01:2200] handling command SMPD_COLLECT src=0
[01:2200] 1 -> 2 : returning left_context
[01:2200] forwarding command SMPD_COLLECT to 2
[01:2200] posting command SMPD_COLLECT to left child, src=0, dest=2.
[01:2200] Handling cmd=SMPD_COLLECT result
[01:2200] forward result SMPD_COLLECT to dest=0
[01:2200] handling command SMPD_STARTDBS src=0
[01:2200] sending start_dbs result command kvs = 67ef6493-0e3a-4e22-a346-857cf781526a.
[01:2200] handling command SMPD_LAUNCH src=0
[01:2200] Successfully handled bcast nodeids command.
[01:2200] setting environment variable: <MPIEXEC_HOSTNAME> = <WIN-MSDP9MK1V14>
[01:2200] env: PMI_SIZE=2
[01:2200] env: PMI_KVS=67ef6493-0e3a-4e22-a346-857cf781526a
[01:2200] env: PMI_DOMAIN=cd05bc63-8a53-4f72-b8d0-46fc66e1ed60
[01:2200] env: PMI_HOST=localhost
[01:2200] env: PMI_PORT=a21dfd59-26f5-487e-bc78-59ff341d4fac
[01:2200] env: PMI_SMPD_ID=1
[01:2200] env: PMI_APPNUM=0
[01:2200] env: PMI_NODE_IDS=s
[01:2200] env: PMI_RANK_AFFINITIES=a
[01:2200] searching for 'F-testapplication-nodebug.exe' in workdir 'C:\Users\Administrator\Downloads\test'
[01:2200] C>CreateProcess(C:\Users\Administrator\Downloads\test\F-testapplication-nodebug.exe F-testapplication-nodebug.exe)
[01:2200] env: PMI_RANK=0
[01:2200] env: PMI_SMPD_KEY=0
[01:2200] 1 -> 2 : returning left_context
[01:2200] forwarding command SMPD_LAUNCH to 2
[01:2200] posting command SMPD_LAUNCH to left child, src=0, dest=2.
[01:2200] Handling cmd=SMPD_LAUNCH result
[01:2200] forward result SMPD_LAUNCH to dest=0
[01:2200] Authentication completed. Successfully obtained Context for Client.
[01:2200] Authorization completed.
[01:2200] version check complete, using PMP version 3.
[01:2200] 1 -> 0 : returning parent_context: 0 < 1
[01:2200] forwarding command SMPD_INIT to 0
[01:2200] posting command SMPD_INIT to parent, src=1, ctx_key=0, dest=0.
[01:2200] Authentication completed. Successfully obtained Context for Client.
[01:2200] Authorization completed.
[01:2200] Handling cmd=SMPD_INIT result
[01:2200] forward SMPD_INIT result to dest=1 ctx_key=0
[01:2200] 1 -> 0 : returning parent_context: 0 < 1
[01:2200] forwarding command SMPD_INIT to 0
[01:2200] posting command SMPD_INIT to parent, src=2, ctx_key=0, dest=0.
[01:2200] handling command SMPD_BCPUT src=1 ctx_key=0
[01:2200] Handling SMPD_BCPUT command from smpd 1
        ctx_key=0
        rank=0
        value=port=49553 description="10.251.11.178 WIN-MSDP9MK1V14 " shm_host=WIN-MSDP9MK1V14 shm_queue=1564:196
        result=success
[01:2200] handling command SMPD_BARRIER src=1 ctx_key=0
[01:2200] Handling SMPD_BARRIER src=1 ctx_key=0
[01:2200] initializing barrier(67ef6493-0e3a-4e22-a346-857cf781526a): in=1 size=1
[01:2200] incrementing barrier(67ef6493-0e3a-4e22-a346-857cf781526a) incount from 0 to 1 out of 2
[01:2200] Handling cmd=SMPD_INIT result
[01:2200] forward SMPD_INIT result to dest=2 ctx_key=0
[01:2200] handling command SMPD_BCPUT src=2 ctx_key=0
[01:2200] Handling SMPD_BCPUT command from smpd 2
        ctx_key=0
        rank=1
        value=port=49487 description="10.249.60.161 WIN-MSDP9MK1V14 " shm_host=WIN-MSDP9MK1V14 shm_queue=2392:196
        result=success
[01:2200] handling command SMPD_BARRIER src=2 ctx_key=65535
[01:2200] Handling SMPD_BARRIER src=2 ctx_key=65535
[01:2200] incrementing barrier(67ef6493-0e3a-4e22-a346-857cf781526a) incount from 1 to 2 out of 2
[01:2200] all in barrier, release the barrier.
[01:2200] sending reply to barrier command '67ef6493-0e3a-4e22-a346-857cf781526a'.
[01:2200] sending reply to barrier command '67ef6493-0e3a-4e22-a346-857cf781526a'.
[01:2200] read 72 bytes from stdout
[01:2200] posting command SMPD_STDOUT to parent, src=1, dest=0.
[01:2200] read 36 bytes from stdout
[01:2200] posting command SMPD_STDOUT to parent, src=1, dest=0.
[01:2200] Handling cmd=SMPD_STDOUT result
[01:2200] cmd=SMPD_STDOUT result will be handled locally
[01:2200] Handling cmd=SMPD_STDOUT result
[01:2200] cmd=SMPD_STDOUT result will be handled locally
[01:2200] 1 -> 0 : returning parent_context: 0 < 1
[01:2200] forwarding command SMPD_STDOUT to 0
[01:2200] posting command SMPD_STDOUT to parent, src=2, dest=0.
[01:2200] Handling cmd=SMPD_STDOUT result
[01:2200] forward result SMPD_STDOUT to dest=2
[01:2200] Authentication completed. Successfully obtained Context for Client.
[01:2200] Authorization completed.
[01:2200] handling command SMPD_BCGET src=2 ctx_key=0
[01:2200] Handling SMPD_BCGET command from smpd 2
        ctx_key=0
        rank=0
        value=port=49553 description="10.251.11.178 WIN-MSDP9MK1V14 " shm_host=WIN-MSDP9MK1V14 shm_queue=1564:196
        result=success
[01:2200] 1 -> 0 : returning parent_context: 0 < 1
[01:2200] forwarding command SMPD_STDERR to 0
[01:2200] posting command SMPD_STDERR to parent, src=2, dest=0.
[01:2200] 1 -> 0 : returning parent_context: 0 < 1
[01:2200] forwarding command SMPD_STDERR to 0
[01:2200] posting command SMPD_STDERR to parent, src=2, dest=0.
[01:2200] Handling cmd=SMPD_STDERR result
[01:2200] forward result SMPD_STDERR to dest=2
[01:2200] Handling cmd=SMPD_STDERR result
[01:2200] forward result SMPD_STDERR to dest=2
[01:2200] 1 -> 0 : returning parent_context: 0 < 1
[01:2200] forwarding command SMPD_EXIT to 0
[01:2200] posting command SMPD_EXIT to parent, src=2, dest=0.
[01:2200] handling command SMPD_SUSPEND src=0
[01:2200] suspending proc_id=0 succeeded, sending result to parent context
[01:2200] Handling cmd=SMPD_EXIT result
[01:2200] forward result SMPD_EXIT to dest=2
[01:2200] handling command SMPD_KILL src=0
[01:2200] process_id=0 rank=0 refcount=3, waiting for the process to finish exiting.
[01:2200] creating an exit command for process id=0  rank=0, pid=1564, exit code=-1.
[01:2200] posting command SMPD_EXIT to parent, src=1, dest=0.
[01:2200] reading failed, assuming stdout is closed. error 0xc000014b
[01:2200] reading failed, assuming stderr is closed. error 0xc000014b
[01:2200] handling command SMPD_CLOSE from parent
[01:2200] sending close command to left child
[01:2200] Handling cmd=SMPD_EXIT result
[01:2200] cmd=SMPD_EXIT result will be handled locally
[01:2200] handling command SMPD_CLOSED src=2
[01:2200] 1 -> 2 : returning SMPD_CONTEXT_LEFT_CHILD
[01:2200] closed command received from left child.
[01:2200] closed context with error 1726.
[01:2200] sending 'closed' command to parent context
[01:2200] posting command SMPD_CLOSED to parent, src=1, dest=0.
[01:2200] Handling cmd=SMPD_CLOSED result
[01:2200] cmd=SMPD_CLOSED result will be handled locally
[01:2200] smpd manager successfully stopped listening.
[01:2200] SMPD exiting with error code 0.
1 Reply
Reeves__Nathan

 

It turns out that hostnames (not just IP addresses) need to be different between machines.
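
For anyone hitting the same thing: the clash is visible in the SMPD log above, where the business cards for rank 0 and rank 1 carry different IP addresses but the same hostname (WIN-MSDP9MK1V14). Below is a rough Python sketch that scans a saved log for that pattern. The function name and the regular expression are my own, written against the `description="<ip> <hostname> "` lines shown in this thread, so treat the log format as an assumption rather than a documented interface. A multi-homed machine can legitimately show up with several IPs, so a hit is a hint, not proof.

```python
import re
from collections import defaultdict

# Assumed format, taken from the SMPD log lines in this thread, e.g.
#   description="10.251.11.178 WIN-MSDP9MK1V14 "
CARD_RE = re.compile(r'description="(\d+\.\d+\.\d+\.\d+)\s+(\S+)\s*"')


def hosts_with_multiple_ips(log_text):
    """Return {hostname: sorted list of IPs} for hostnames seen with more than one IP."""
    ips_by_host = defaultdict(set)
    for ip, host in CARD_RE.findall(log_text):
        ips_by_host[host].add(ip)
    return {host: sorted(ips) for host, ips in ips_by_host.items() if len(ips) > 1}


# Two business-card lines copied from the log in this thread.
sample = '''
        value=port=49553 description="10.251.11.178 WIN-MSDP9MK1V14 " shm_host=WIN-MSDP9MK1V14
        value=port=49487 description="10.249.60.161 WIN-MSDP9MK1V14 " shm_host=WIN-MSDP9MK1V14
'''

for host, ips in hosts_with_multiple_ips(sample).items():
    print(f"{host} is advertised from several addresses: {', '.join(ips)}")
```

If the check reports a hostname, renaming one of the machines (and restarting it) before launching with -machinefile is the fix that worked here.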

 
