Intel® Fortran Compiler

Using MSMPI.dll - fortran crash forrtl 157

Reeves__Nathan
Beginner

I am trying to get a simple example working across 2 Amazon nodes. When I run locally, it's all fine, but when I use -machinefile to launch across both nodes, the sample application on the slave node throws a forrtl severe (157) exception. I can run with -machinefile and both ranks local, and I can run with -machinefile and everything on the remote node, but as soon as I add both addresses to hosts.txt, the sample application on the slave node crashes.

Any pointers as to where I should start looking? Thanks.

Here is my sample application:

      program hello
      include 'mpif.h'
      parameter (MASTER = 0)

      integer numtasks, taskid, len, ierr, rank, size, count
      character(MPI_MAX_PROCESSOR_NAME) hostname
      double precision data(100)
      integer status(MPI_STATUS_SIZE)

!      write (*,*) 'Starting'

      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, taskid, ierr)

      call MPI_GET_PROCESSOR_NAME(hostname, len, ierr)
      write(*,20) taskid, hostname

      if (taskid .eq. MASTER) then
        write(*,30) numtasks
      end if

      do i=1, 10
        data(i) = (i * (taskid + 1))
      end do

      count = 10
      tag = 666

      ! All slaves send back data
!      if (taskid >= 1) then
      if (taskid .ne. MASTER) then
        call MPI_SEND( data, count, MPI_DOUBLE_PRECISION, 0, tag, MPI_COMM_WORLD, ierr )
        write (*,50) taskid
      end if

      count = 10
      if (taskid .eq. MASTER) then
        ! receive from each slave
        do i=1, numtasks-1
          call MPI_RECV(data, count, MPI_DOUBLE_PRECISION, i, tag, MPI_COMM_WORLD, status, ierr )
          write (*,40) i
          do j=1, count
            write (*,*) data(j)
          end do
        end do
      end if

      call MPI_FINALIZE(ierr)

20    format('Hello from task ',I2,' on ',A48)
30    format('MASTER: Number of MPI tasks is: ',I2)
40    format('MASTER: Received from: ',I2)
50    format('SLAVE:',I2,' : Sending to Master')

      end
Here is what I see on the console:
Hello from task  0 on WIN-MSDP9MK1V14
MASTER: Number of MPI tasks is:  2
Hello from task  1 on WIN-MSDP9MK1V14
forrtl: severe (157): Program Exception - access violation
Image              PC                Routine            Line        Source
 
msmpi.dll          00007FFCE7018FA6  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE7018C68  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE7018492  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE7012768  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE70130C6  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE707E87F  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE707DE8C  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE70930A5  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE7092785  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE703B2DB  Unknown               Unknown  Unknown
msmpi.dll          00007FFCE70BD99B  Unknown               Unknown  Unknown
F-testapplication  00007FF785EB1240  Unknown               Unknown  Unknown
F-testapplication  00007FF785EB274E  Unknown               Unknown  Unknown
F-testapplication  00007FF785EB2B24  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFD038713D2  Unknown               Unknown  Unknown
ntdll.dll          00007FFD05CF5444  Unknown               Unknown  Unknown
 
job aborted:
[ranks] message
 
[0] terminated
 
[1] process exited without calling finalize
 
---- error analysis -----
 
[1] on 10.249.60.161
F-testapplication-nodebug.exe ended prematurely and may have crashed. exit code 157
 
---- error analysis -----
 
And this is what I see on the master SMPD process:
[-1:1220] Launching SMPD service.
[-1:1220] smpd listening on port 8677
[-1:1220] Authentication completed. Successfully obtained Context for Client.
[-1:1220] version check complete, using PMP version 3.
[-1:1220] create manager process (using smpd daemon credentials)
[-1:1220] smpd reading the port string from the manager
[-1:2700] Launching smpd manager instance.
[-1:2700] created set for manager listener, 236
[-1:2700] smpd manager listening on port 49472
[-1:1220] closing the pipe to the manager
[-1:2700] Authentication completed. Successfully obtained Context for Client.
[-1:2700] Authorization completed.
[-1:2700] version check complete, using PMP version 3.
[-1:2700] Received session header from parent id=2, parent=1, level=1
[02:2700] Connecting back to parent using host WIN-MSDP9MK1V14 and endpoint 49536
[02:2700] Previous attempt failed, trying again with a resolved parent host 10.251.11.178:49536
[02:2700] Authentication completed. Successfully obtained Context for Client.
[02:2700] Authorization completed.
[02:2700] handling command SMPD_COLLECT src=0
[02:2700] handling command SMPD_LAUNCH src=0
[02:2700] Successfully handled bcast nodeids command.
[02:2700] setting environment variable: <MPIEXEC_HOSTNAME> = <WIN-MSDP9MK1V14>
[02:2700] env: PMI_SIZE=2
[02:2700] env: PMI_KVS=267b6f49-eac4-46e3-ad68-6ec4dd9d4e4a
[02:2700] env: PMI_DOMAIN=876a283b-5270-45bf-8a2b-812063bbae3e
[02:2700] env: PMI_HOST=localhost
[02:2700] env: PMI_PORT=6a4c59b6-9ba4-473f-bd95-54efa41e77c8
[02:2700] env: PMI_SMPD_ID=2
[02:2700] env: PMI_APPNUM=0
[02:2700] env: PMI_NODE_IDS=s
[02:2700] env: PMI_RANK_AFFINITIES=a
[02:2700] searching for 'F-testapplication-nodebug.exe' in workdir 'C:\Users\Administrator\Downloads\test'
[02:2700] C>CreateProcess(C:\Users\Administrator\Downloads\test\F-testapplication-nodebug.exe F-testapplication-nodebug.exe)
[02:2700] env: PMI_RANK=1
[02:2700] env: PMI_SMPD_KEY=0
[02:2700] Authentication completed. Successfully obtained Context for Client.
[02:2700] Authorization completed.
[02:2700] version check complete, using PMP version 3.
[02:2700] 2 -> 0 : returning parent_context: 0 < 2
[02:2700] forwarding command SMPD_INIT to 0
[02:2700] posting command SMPD_INIT to parent, src=2, ctx_key=0, dest=0.
[02:2700] Handling cmd=SMPD_INIT result
[02:2700] forward SMPD_INIT result to dest=2 ctx_key=0
[02:2700] 2 -> 1 : returning parent_context: 1 < 2
[02:2700] Caching business card for rank 1
[02:2700] forwarding command SMPD_BCPUT to 1
[02:2700] posting command SMPD_BCPUT to parent, src=2, ctx_key=0, dest=1.
[02:2700] Handling cmd=SMPD_BCPUT result
[02:2700] forward SMPD_BCPUT result to dest=2 ctx_key=0
[02:2700] handling command SMPD_BARRIER src=2 ctx_key=0
[02:2700] Handling SMPD_BARRIER src=2 ctx_key=0
[02:2700] initializing barrier(267b6f49-eac4-46e3-ad68-6ec4dd9d4e4a): in=1 size=1
[02:2700] incrementing barrier(267b6f49-eac4-46e3-ad68-6ec4dd9d4e4a) incount from 0 to 1 out of 1
[02:2700] all in barrier, sending barrier to parent.
[02:2700] posting command SMPD_BARRIER to parent, src=2, ctx_key=65535, dest=1.
[02:2700] Handling cmd=SMPD_BARRIER result
[02:2700] cmd=SMPD_BARRIER result will be handled locally
[02:2700] sending reply to barrier command '267b6f49-eac4-46e3-ad68-6ec4dd9d4e4a'.
[02:2700] read 72 bytes from stdout
[02:2700] posting command SMPD_STDOUT to parent, src=2, dest=0.
[02:2700] 2 -> 1 : returning parent_context: 1 < 2
[02:2700] forwarding command SMPD_BCGET to 1
[02:2700] posting command SMPD_BCGET to parent, src=2, ctx_key=0, dest=1.
[02:2700] Handling cmd=SMPD_STDOUT result
[02:2700] cmd=SMPD_STDOUT result will be handled locally
[02:2700] Handling cmd=SMPD_BCGET result
[02:2700] forward SMPD_BCGET result to dest=2 ctx_key=0
[02:2700] Caching business card for rank 0
[02:2700] read 1024 bytes from stderr
[02:2700] posting command SMPD_STDERR to parent, src=2, dest=0.
[02:2700] read 358 bytes from stderr
[02:2700] posting command SMPD_STDERR to parent, src=2, dest=0.
[02:2700] Handling cmd=SMPD_STDERR result
[02:2700] cmd=SMPD_STDERR result will be handled locally
[02:2700] reading failed, assuming stdout is closed. error 0xc000014b
[02:2700] process_id=0 process refcount == 2, stdout closed.
[02:2700] reading failed, assuming stderr is closed. error 0xc000014b
[02:2700] process_id=0 process refcount == 1, stderr closed.
[02:2700] Handling cmd=SMPD_STDERR result
[02:2700] cmd=SMPD_STDERR result will be handled locally
[02:2700] process_id=0 process refcount == 0, pmi client closed.
[02:2700] process_id=0 rank=1 refcount=0, waiting for the process to finish exiting.
[02:2700] creating an exit command for process id=0  rank=1, pid=836, exit code=157.
[02:2700] posting command SMPD_EXIT to parent, src=2, dest=0.
[02:2700] Handling cmd=SMPD_EXIT result
[02:2700] cmd=SMPD_EXIT result will be handled locally
[02:2700] handling command SMPD_CLOSE from parent
[02:2700] sending 'closed' command to parent context
[02:2700] posting command SMPD_CLOSED to parent, src=2, dest=1.
[02:2700] Handling cmd=SMPD_CLOSED result
[02:2700] cmd=SMPD_CLOSED result will be handled locally
[02:2700] smpd manager successfully stopped listening.
[02:2700] SMPD exiting with error code 0.
 
This is what I am seeing on the slave SMPD
[-1:736] Launching SMPD service.
[-1:736] smpd listening on port 8677
[-1:736] Authentication completed. Successfully obtained Context for Client.
[-1:736] version check complete, using PMP version 3.
[-1:736] create manager process (using smpd daemon credentials)
[-1:736] smpd reading the port string from the manager
[-1:2200] Launching smpd manager instance.
[-1:2200] created set for manager listener, 236
[-1:2200] smpd manager listening on port 49546
[-1:736] closing the pipe to the manager
[-1:2200] Authentication completed. Successfully obtained Context for Client.
[-1:2200] Authorization completed.
[-1:2200] version check complete, using PMP version 3.
[-1:2200] Received session header from parent id=1, parent=0, level=0
[01:2200] Connecting back to parent using host WIN-MSDP9MK1V14 and endpoint 49479
[01:2200] Previous attempt failed, trying again with a resolved parent host 10.249.60.161:49479
[01:2200] Authentication completed. Successfully obtained Context for Client.
[01:2200] Authorization completed.
[01:2200] handling command SMPD_CONNECT src=0
[01:2200] now connecting to 10.249.60.161
[01:2200] 1 -> 2 : returning SMPD_CONTEXT_LEFT_CHILD
[01:2200] using spn RestrictedKrbHost/10.249.60.161 to contact server
[01:2200] WIN-MSDP9MK1V14 posting a re-connect to 10.249.60.161:49483 in left child context.
[01:2200] Authentication completed. Successfully obtained Context for Client.
[01:2200] Authorization completed.
[01:2200] version check complete, using PMP version 3.
[01:2200] 1 -> 2 : returning SMPD_CONTEXT_LEFT_CHILD
[01:2200] handling command SMPD_COLLECT src=0
[01:2200] 1 -> 2 : returning left_context
[01:2200] forwarding command SMPD_COLLECT to 2
[01:2200] posting command SMPD_COLLECT to left child, src=0, dest=2.
[01:2200] Handling cmd=SMPD_COLLECT result
[01:2200] forward result SMPD_COLLECT to dest=0
[01:2200] handling command SMPD_STARTDBS src=0
[01:2200] sending start_dbs result command kvs = 67ef6493-0e3a-4e22-a346-857cf781526a.
[01:2200] handling command SMPD_LAUNCH src=0
[01:2200] Successfully handled bcast nodeids command.
[01:2200] setting environment variable: <MPIEXEC_HOSTNAME> = <WIN-MSDP9MK1V14>
[01:2200] env: PMI_SIZE=2
[01:2200] env: PMI_KVS=67ef6493-0e3a-4e22-a346-857cf781526a
[01:2200] env: PMI_DOMAIN=cd05bc63-8a53-4f72-b8d0-46fc66e1ed60
[01:2200] env: PMI_HOST=localhost
[01:2200] env: PMI_PORT=a21dfd59-26f5-487e-bc78-59ff341d4fac
[01:2200] env: PMI_SMPD_ID=1
[01:2200] env: PMI_APPNUM=0
[01:2200] env: PMI_NODE_IDS=s
[01:2200] env: PMI_RANK_AFFINITIES=a
[01:2200] searching for 'F-testapplication-nodebug.exe' in workdir 'C:\Users\Administrator\Downloads\test'
[01:2200] C>CreateProcess(C:\Users\Administrator\Downloads\test\F-testapplication-nodebug.exe F-testapplication-nodebug.exe)
[01:2200] env: PMI_RANK=0
[01:2200] env: PMI_SMPD_KEY=0
[01:2200] 1 -> 2 : returning left_context
[01:2200] forwarding command SMPD_LAUNCH to 2
[01:2200] posting command SMPD_LAUNCH to left child, src=0, dest=2.
[01:2200] Handling cmd=SMPD_LAUNCH result
[01:2200] forward result SMPD_LAUNCH to dest=0
[01:2200] Authentication completed. Successfully obtained Context for Client.
[01:2200] Authorization completed.
[01:2200] version check complete, using PMP version 3.
[01:2200] 1 -> 0 : returning parent_context: 0 < 1
[01:2200] forwarding command SMPD_INIT to 0
[01:2200] posting command SMPD_INIT to parent, src=1, ctx_key=0, dest=0.
[01:2200] Authentication completed. Successfully obtained Context for Client.
[01:2200] Authorization completed.
[01:2200] Handling cmd=SMPD_INIT result
[01:2200] forward SMPD_INIT result to dest=1 ctx_key=0
[01:2200] 1 -> 0 : returning parent_context: 0 < 1
[01:2200] forwarding command SMPD_INIT to 0
[01:2200] posting command SMPD_INIT to parent, src=2, ctx_key=0, dest=0.
[01:2200] handling command SMPD_BCPUT src=1 ctx_key=0
[01:2200] Handling SMPD_BCPUT command from smpd 1
        ctx_key=0
        rank=0
        value=port=49553 description="10.251.11.178 WIN-MSDP9MK1V14 " shm_host=WIN-MSDP9MK1V14 shm_queue=1564:196
        result=success
[01:2200] handling command SMPD_BARRIER src=1 ctx_key=0
[01:2200] Handling SMPD_BARRIER src=1 ctx_key=0
[01:2200] initializing barrier(67ef6493-0e3a-4e22-a346-857cf781526a): in=1 size=1
[01:2200] incrementing barrier(67ef6493-0e3a-4e22-a346-857cf781526a) incount from 0 to 1 out of 2
[01:2200] Handling cmd=SMPD_INIT result
[01:2200] forward SMPD_INIT result to dest=2 ctx_key=0
[01:2200] handling command SMPD_BCPUT src=2 ctx_key=0
[01:2200] Handling SMPD_BCPUT command from smpd 2
        ctx_key=0
        rank=1
        value=port=49487 description="10.249.60.161 WIN-MSDP9MK1V14 " shm_host=WIN-MSDP9MK1V14 shm_queue=2392:196
        result=success
[01:2200] handling command SMPD_BARRIER src=2 ctx_key=65535
[01:2200] Handling SMPD_BARRIER src=2 ctx_key=65535
[01:2200] incrementing barrier(67ef6493-0e3a-4e22-a346-857cf781526a) incount from 1 to 2 out of 2
[01:2200] all in barrier, release the barrier.
[01:2200] sending reply to barrier command '67ef6493-0e3a-4e22-a346-857cf781526a'.
[01:2200] sending reply to barrier command '67ef6493-0e3a-4e22-a346-857cf781526a'.
[01:2200] read 72 bytes from stdout
[01:2200] posting command SMPD_STDOUT to parent, src=1, dest=0.
[01:2200] read 36 bytes from stdout
[01:2200] posting command SMPD_STDOUT to parent, src=1, dest=0.
[01:2200] Handling cmd=SMPD_STDOUT result
[01:2200] cmd=SMPD_STDOUT result will be handled locally
[01:2200] Handling cmd=SMPD_STDOUT result
[01:2200] cmd=SMPD_STDOUT result will be handled locally
[01:2200] 1 -> 0 : returning parent_context: 0 < 1
[01:2200] forwarding command SMPD_STDOUT to 0
[01:2200] posting command SMPD_STDOUT to parent, src=2, dest=0.
[01:2200] Handling cmd=SMPD_STDOUT result
[01:2200] forward result SMPD_STDOUT to dest=2
[01:2200] Authentication completed. Successfully obtained Context for Client.
[01:2200] Authorization completed.
[01:2200] handling command SMPD_BCGET src=2 ctx_key=0
[01:2200] Handling SMPD_BCGET command from smpd 2
        ctx_key=0
        rank=0
        value=port=49553 description="10.251.11.178 WIN-MSDP9MK1V14 " shm_host=WIN-MSDP9MK1V14 shm_queue=1564:196
        result=success
[01:2200] 1 -> 0 : returning parent_context: 0 < 1
[01:2200] forwarding command SMPD_STDERR to 0
[01:2200] posting command SMPD_STDERR to parent, src=2, dest=0.
[01:2200] 1 -> 0 : returning parent_context: 0 < 1
[01:2200] forwarding command SMPD_STDERR to 0
[01:2200] posting command SMPD_STDERR to parent, src=2, dest=0.
[01:2200] Handling cmd=SMPD_STDERR result
[01:2200] forward result SMPD_STDERR to dest=2
[01:2200] Handling cmd=SMPD_STDERR result
[01:2200] forward result SMPD_STDERR to dest=2
[01:2200] 1 -> 0 : returning parent_context: 0 < 1
[01:2200] forwarding command SMPD_EXIT to 0
[01:2200] posting command SMPD_EXIT to parent, src=2, dest=0.
[01:2200] handling command SMPD_SUSPEND src=0
[01:2200] suspending proc_id=0 succeeded, sending result to parent context
[01:2200] Handling cmd=SMPD_EXIT result
[01:2200] forward result SMPD_EXIT to dest=2
[01:2200] handling command SMPD_KILL src=0
[01:2200] process_id=0 rank=0 refcount=3, waiting for the process to finish exiting.
[01:2200] creating an exit command for process id=0  rank=0, pid=1564, exit code=-1.
[01:2200] posting command SMPD_EXIT to parent, src=1, dest=0.
[01:2200] reading failed, assuming stdout is closed. error 0xc000014b
[01:2200] reading failed, assuming stderr is closed. error 0xc000014b
[01:2200] handling command SMPD_CLOSE from parent
[01:2200] sending close command to left child
[01:2200] Handling cmd=SMPD_EXIT result
[01:2200] cmd=SMPD_EXIT result will be handled locally
[01:2200] handling command SMPD_CLOSED src=2
[01:2200] 1 -> 2 : returning SMPD_CONTEXT_LEFT_CHILD
[01:2200] closed command received from left child.
[01:2200] closed context with error 1726.
[01:2200] sending 'closed' command to parent context
[01:2200] posting command SMPD_CLOSED to parent, src=1, dest=0.
[01:2200] Handling cmd=SMPD_CLOSED result
[01:2200] cmd=SMPD_CLOSED result will be handled locally
[01:2200] smpd manager successfully stopped listening.
[01:2200] SMPD exiting with error code 0.
 
Yuan_C_Intel
Employee

Hi, nathanreeves

You can try adding the options "-g -traceback" when compiling your code, so that the ifort runtime will report traceback information with the file and source line to help locate the access violation error.
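
For example, a build line with the Windows spellings of those options might look roughly like this (the source file name and the use of the MS-MPI SDK's msmpi.lib import library are just assumptions about your setup, plus whatever /I include path the MS-MPI headers need):

    ifort /debug:full /traceback F-testapplication.f90 /link msmpi.lib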

Hope this helps.

Thanks.

Reeves__Nathan
Beginner

Thanks Yolanda.

I have built a debug version, but the only real debug information produced is that it crashes on line 38:

call MPI_SEND( data, count, MPI_DOUBLE_PRECISION, 0, tag, MPI_COMM_WORLD, ierr )

Which, as suspected, I am sure is on the slave node. It does not give any detail from inside the MPI DLL.

This works when I am running all the MPI tasks locally, but only seems to go wrong when spanned across different physical nodes.

I am a bit stuck on where to go from here.

Thanks.

Roman1
New Contributor I

Hi,

I just had a quick look at your code.  The variable tag should be declared as type integer.  Since you are not using IMPLICIT NONE, and tag isn't declared, it is implicitly declared as real.  Maybe this is the problem.
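
A minimal sketch of that change, assuming the rest of the program stays exactly as posted, would be to switch on IMPLICIT NONE and declare everything explicitly:

      program hello
      implicit none
      include 'mpif.h'
      integer, parameter :: MASTER = 0
      integer numtasks, taskid, len, ierr, rank, size, count
      integer tag, i, j
      character(MPI_MAX_PROCESSOR_NAME) hostname
      double precision data(100)
      integer status(MPI_STATUS_SIZE)
!     ... body unchanged ...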

Roman

Reeves__Nathan
Beginner

Hi Roman

Thanks. Well spotted and a good suggestion, but not the cause of the crash. I am not too sure how I can debug msmpi.dll, if at all. It works on a single machine but crashes when working across two.

Nathan

Roman1
New Contributor I

There is something confusing.  You are saying that if you run your program on a single node, everything works, and if you run it on several nodes, it crashes.  However, in the output you provided above, you have:

Hello from task 0 on WIN-MSDP9MK1V14
MASTER: Number of MPI tasks is: 2
Hello from task 1 on WIN-MSDP9MK1V14
forrtl: severe (157): Program Exception - access violation

The program crashed even though both tasks are on the same node, WIN-MSDP9MK1V14. If the program were running on two different nodes, the hostnames would be different.

Yuan_C_Intel
Employee

Hi, Nathan

I realized you are using the MS-MPI library. I think you need to consult the MS-MPI documentation for its debugging options.

With Intel MPI, for comparison, we have the I_MPI_DEBUG environment variable to generate debug information for an MPI application.
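
For example, with Intel MPI something like the following prints library-level diagnostics at launch (the debug level 5 is just an illustrative value, and example.exe is a placeholder for your binary):

    set I_MPI_DEBUG=5
    mpiexec -n 2 example.exe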

In addition, from your output it looks like both tasks are still on the same node:

Hello from task  0 on WIN-MSDP9MK1V14
MASTER: Number of MPI tasks is:  2
Hello from task  1 on WIN-MSDP9MK1V14

Did I miss anything?

Thanks.

Reeves__Nathan
Beginner

Yes, sorry, the hostnames are adding confusion. The machines are created from the same Amazon snapshot, so they have the same name. I can see that the two machines are communicating by looking at the SMPD debug output.

When they are working on one machine, I am just using mpiexec example.exe -n 2

and when running across 2 machines I add the machine file: mpiexec example.exe -machinefile hosts.txt -n 2

where hosts.txt has the IP addresses of both machines.
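
To be concrete, hosts.txt just has the two addresses, one per line (these are the same IPs that show up in the SMPD logs above):

    10.251.11.178
    10.249.60.161

and the cross-machine run is launched along the lines of:

    mpiexec -n 2 -machinefile hosts.txt F-testapplication-nodebug.exe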

Also, if I remove example.exe on the 'slave' (the one which will be rank 1), it does complain that the file cannot be found.

As far as I can see, this should be a straightforward example, so I do not understand why it is crashing. I am using x64 and I am pretty sure I am linking to all the correct files etc. I can run on either machine by itself, but not with the two communicating together. I can see that they are communicating from the SMPD debug and the console prints. It's just that MPI_SEND always seems to make the slave crash.

I'll see if I can add an MS-MPI debug flag or something. The standard error functions do not report an error, and I wouldn't expect them to either, as it crashes before I get a chance to check for errors. The fact that this is MS-MPI v7.1 would make me think this all works by now, so it must be something I am doing wrong. I have tried disabling firewalls, but I am running out of ideas.
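
As a side note on the error checking: one way to at least get a return code instead of an immediate abort, assuming the failure is actually reported through the MPI layer, is to switch MPI_COMM_WORLD to MPI_ERRORS_RETURN right after MPI_INIT. In this case the access violation is inside msmpi.dll itself, so it may well not help, but roughly:

      call MPI_INIT(ierr)
      ! ask MPI to return error codes rather than aborting the job
      call MPI_COMM_SET_ERRHANDLER(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)
      ! ...
      call MPI_SEND(data, count, MPI_DOUBLE_PRECISION, 0, tag, MPI_COMM_WORLD, ierr)
      if (ierr .ne. MPI_SUCCESS) write(*,*) 'MPI_SEND returned ierr=', ierr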

Thanks,

Yuan_C_Intel
Employee

Hi, Nathan

You can also consult the Intel Clusters and HPC Technology forum. There are experts there who are familiar with the MPI standard, programming, and cluster configurations:

https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology

Hope this helps.

Thanks.

Roman1
New Contributor I

Have you tried using the -d option with mpiexec? From running mpiexec -help2:

-d [level]
-debug [level]
 Print debug output to stderr. Level is: 0=none, 1=error, 2=debug 3=both.
 When level is not specified '2=debug' is used.
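
So, using the executable and host file names from your earlier posts, something along these lines should dump the mpiexec/SMPD debug output to stderr:

    mpiexec -d 3 -n 2 -machinefile hosts.txt F-testapplication-nodebug.exe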

Another thing you can experiment with is to put two identical IP addresses into your hosts.txt file.

Reeves__Nathan
Beginner

Hi Yolanda

Thanks for the link. I have posted on there as well.

Hi Roman

I tried adding -d 3 as you suggested. I do get extra debug output, but all it really adds is the hex dump of the stderr or stdout going from node 1 back to node 0, and the hex dump is just the stack trace I sent. I tried adding error checking in the code as well, but that doesn't help, as it actually just crashes and doesn't return an error.

I am pretty much out of ideas.

Thanks.

Reeves__Nathan
Beginner

OK, fixed!

Thanks to those who came up with suggestions.

The problem was in fact that the hostnames were the same. These were the defaults set on Amazon for cloned nodes. So even though the nodes had different IP addresses and appeared to be able to communicate (i.e. they could send stdout/stderr between master and slave on different nodes), there were problems in msmpi.dll the moment I called MPI_SEND on the slave.
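
For anyone who hits the same thing: a quick way to confirm and fix it, assuming a recent Windows with PowerShell available, is to compare the output of hostname on each node and then rename one of them and reboot (the new name below is arbitrary):

    hostname
    Rename-Computer -NewName "MPI-NODE2" -Restart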

I think this problem may also exist with the Intel MPI DLLs, as my first brief try at recompiling with them also didn't work. I'll check later whether it was the same problem.

Thanks,

Nathan