Intel® Fortran Compiler

Error when running on many cores

Gabriele_B_

Hi

I have a code that works on a cluster when I use 6^3 = 216 cores, but it crashes when I try to run it at a higher resolution using 12^3 = 1728 cores (all parameters are the same except the grid spacing and the number of processors the code runs on).

We checked whether it is a memory issue, but even running the job with 16 tasks per node (108 nodes) didn't help.
I cannot debug the program with something like TotalView because of the limit on the number of processes these debuggers can handle.

I tried compiling the program with -O0 -g -traceback to get better information in the error message.
With these options the program still fails, but it keeps running until the time I requested on the cluster expires.

In this case I get:
srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmstepd-borgt091: *** JOB 5787356 CANCELLED AT 2015-11-02T11:17:00 DUE TO TIME LIMIT on borgt091 ***
slurmstepd-borgt091: *** STEP 5787356.0 CANCELLED AT 2015-11-02T11:17:00 DUE TO TIME LIMIT on borgt091 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
3dpic_full_mpi.ex  000000000088C169  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000088AA3E  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000848F32  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000815663  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000819219  Unknown               Unknown  Unknown
libpthread.so.0    00002AAAAB669810  Unknown               Unknown  Unknown
libpthread.so.0    00002AAAAB6663D0  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
3dpic_full_mpi.ex  000000000088C169  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000088AA3E  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000848F32  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000815663  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000819219  Unknown               Unknown  Unknown
libpthread.so.0    00002AAAAB669810  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000819140  Unknown               Unknown  Unknown
libpthread.so.0    00002AAAAB669810  Unknown               Unknown  Unknown
libmlx5-rdmav2.so  00002AAAACE3F4BB  Unknown               Unknown  Unknown

Stack trace terminated abnormally.

(more similar lines...)

I attach the complete error file (JOBID 5787356)

However, when I run the same simulation without these compiler options, I get a different error and the job fails earlier:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
3dpic_full_mpi.ex  0000000000869189  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000867A5E  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000825B72  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000007F2633  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000007F621B  Unknown               Unknown  Unknown
libpthread.so.0    00002AAAAB669810  Unknown               Unknown  Unknown
libc.so.6          00002AAAAC126C52  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000005389A2  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000004A6643  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000462106  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000041B72F  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000004165C6  Unknown               Unknown  Unknown
libc.so.6          00002AAAAC02FC36  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000004164B9  Unknown               Unknown  Unknown
srun.slurm: error: borgo015: task 0: Exited with exit code 174
MPT ERROR: borgo021 has had continuous IB fabric problems for 10
    (MPI_WATCHDOG_TIMER) minutes trying to reach borgo015. Aborting.
MPT ERROR: borgo020 has had continuous IB fabric problems for 10
    (MPI_WATCHDOG_TIMER) minutes trying to reach borgo015. Aborting.
MPT: Global rank 32 is aborting with error code 0.
     Process ID: 12240, Host: borgo021, Program: /gpfsm/dnb32/gbrambil/Kcode/pulsarSILOF/3dpic_full_mpi.exe

(other stuff later)

I attach the error file of this job too (JOBID 5991137)

Do you have any idea what the problem could be? I saw this topic: https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/558488. Would the same approach work in my case too (I cannot use a debugger as that poster did)?

P.S.: The error file contains the line rm: cannot remove `pcrimth.dat': No such file or directory. Don't worry about it; it always appears, but the code runs.

Thanks

15 Replies
TimP

This question appears to be related to MPI, but I don't see any clue about which MPI implementation you are using. Each major MPI implementation has its own email help list; for Intel MPI (which I guess you aren't using), it would be the clusters and HPC companion forum to this one.

Gabriele_B_

I use SGI-MPT MPI.

I posted on this forum because I saw that a similar question had been answered successfully here (the link I included in my post).

Thanks

GB

 

TimP

It seems slightly related in that your cluster watchdog timer has killed your job, but it has explained that you have been waiting far too long to reach a specified node. If that node was allocated to you by your cluster manager but is unavailable, that is a problem for your sysadmin.

Gabriele_B_

But what about the fact that I get a SIGSEGV error when I compile the program without -O0 -g -traceback?

Which of the two is the "correct" error? Why, with these compiler options, doesn't the cluster kill the job until the time expires?

Is the watchdog what kills the processes when the time expires?

Thanks

GB

John_D_6

Indeed, your job 5991137 shows the 'real' error, in the sense that the segfault on rank 0 is what needs to be fixed. The other ranks then just stop because they can no longer contact that node (borgo015). Extend the walltime of your job as far as your cluster allows and rerun the job with debug info and -O0. If you can't wait that long, you can compile with -O2 -g -traceback and rerun. Note that in that case the reported location of the error is probably (much) less accurate.

Gabriele_B_

Hi John,

Thanks. Is task 0 equivalent to rank 0? If so, that gives me a hint toward the solution (rank 0 does some special work in my code).

"Just extend the walltime of your job as long as is allowed on your cluster and rerun the job with debug info and -O0."

I tried, but when I run the job this way (by debug info you mean -g -traceback, correct?), no matter how much time I request, the simulation uses all of it while in reality staying stuck at the same point.

I'll try -O2 -g -traceback.

Is there anything in the part of the error message below that tells me what is happening to the job that has the problem?

MPT: --------stack traceback-------
MPT: Attaching to program: /proc/18777/exe, process 18777
MPT: Try: zypper install -C "debuginfo(build-id)=d4191084441e39a7b480fc4b41f67083812e9811"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=682e4f7a27ee294a58f17249a0717861db546f2d"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=32c1c7a7a20b54ac3af6b2f436b3375ffeb12f0b"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=2f51a06469a025d507534fe292dcf4e02235bd18"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=91334fd8105f0b62c0bdbbec14b45a9fd043f4c3"
MPT: (no debugging symbols found)...done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: Try: zypper install -C "debuginfo(build-id)=4aee0c3923838575483ebd16be6db85ecb6f0b75"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=3b149eccd897f1f37dce50ad22614043eba757a2"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=a95c0ce9f9752baf052d3b55b3b8ce19f662d2eb"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=719375f80fd84b85b905db2c20ec70e8805b36e5"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=4f3dc8efbe18b50a6abe70d8b2f862ce185542d6"
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: (gdb) #0  0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaaabd6f32c in mpi_sgi_system (command=<optimized out>) at sig.c:99
MPT: #2  MPI_SGI_stacktraceback (header=<optimized out>) at sig.c:319
MPT: #3  0x00002aaaabcc4a4a in print_traceback (ecode=0) at abort.c:197
MPT: #4  0x00002aaaabcc4b3e in MPI_SGI_abort () at abort.c:85
MPT: #5  0x00002aaaabd7ff4f in try_repush (gap=<optimized out>, 
MPT:     now=2886902.8844944318) at ud.c:1599
MPT: #6  0x00002aaaabd80eb6 in MPI_SGI_ud_progress () at ud.c:1769
MPT: #7  0x00002aaaabd57f62 in MPI_SGI_progress_devices () at progress.c:118
MPT: #8  MPI_SGI_progress () at progress.c:241
MPT: #9  0x00002aaaabd690d5 in MPI_SGI_request_wait (request=0x7fffffffa35c, 
MPT:     status=0x2aaaac010970 <mpi_sgi_status_ignore>, set=0x7fffffffa364, 
MPT:     gen_rc=0x7fffffffa360) at req.c:1703
MPT: #10 0x00002aaaabcd6d32 in MPI_SGI_barrier_basic (comm=6) at barrier.c:61
MPT: #11 0x00002aaaabcd6e65 in MPI_SGI_barrier (comm=6) at barrier.c:204
MPT: #12 0x00002aaaabcd70f5 in MPI_SGI_barrier (comm=<optimized out>)
MPT:     at barrier.c:166
MPT: #13 0x00002aaaabcd7193 in PMPI_Barrier (comm=1) at barrier.c:344
MPT: #14 0x00002aaaabcd72df in pmpi_barrier__ ()
MPT:    from /usr/local/sgi/mpi/mpt-2.12/opt/sgi/mpt/mpt-2.12/lib/libmpi.so
MPT: #15 0x000000000041bd07 in MAIN__ ()
MPT: #16 0x00000000004165c6 in main ()
MPT: (gdb) A debugging session is active.
MPT: 
MPT: 	Inferior 1 [process 18777] will be detached.
MPT: 
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/18777/exe, process 18777
MPT: Attaching to program: /proc/5232/exe, process 5232
MPT: Try: zypper install -C "debuginfo(build-id)=d4191084441e39a7b480fc4b41f67083812e9811"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=682e4f7a27ee294a58f17249a0717861db546f2d"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=32c1c7a7a20b54ac3af6b2f436b3375ffeb12f0b"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=2f51a06469a025d507534fe292dcf4e02235bd18"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=91334fd8105f0b62c0bdbbec14b45a9fd043f4c3"
MPT: (no debugging symbols found)...done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: Try: zypper install -C "debuginfo(build-id)=4aee0c3923838575483ebd16be6db85ecb6f0b75"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=3b149eccd897f1f37dce50ad22614043eba757a2"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=a95c0ce9f9752baf052d3b55b3b8ce19f662d2eb"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=719375f80fd84b85b905db2c20ec70e8805b36e5"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=4f3dc8efbe18b50a6abe70d8b2f862ce185542d6"
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: (gdb) #0  0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaaabd6f32c in mpi_sgi_system (command=<optimized out>) at sig.c:99
MPT: #2  MPI_SGI_stacktraceback (header=<optimized out>) at sig.c:319
MPT: #3  0x00002aaaabcc4a4a in print_traceback (ecode=0) at abort.c:197
MPT: #4  0x00002aaaabcc4b3e in MPI_SGI_abort () at abort.c:85
MPT: #5  0x00002aaaabd7ff4f in try_repush (gap=<optimized out>, 
MPT:     now=2886904.607016339) at ud.c:1599
MPT: #6  0x00002aaaabd80eb6 in MPI_SGI_ud_progress () at ud.c:1769
MPT: #7  0x00002aaaabd57f62 in MPI_SGI_progress_devices () at progress.c:118
MPT: #8  MPI_SGI_progress () at progress.c:241
MPT: #9  0x00002aaaabd690d5 in MPI_SGI_request_wait (request=0x7fffffffa3dc, 
MPT:     status=0x2aaaac010970 <mpi_sgi_status_ignore>, set=0x7fffffffa3e4, 
MPT:     gen_rc=0x7fffffffa3e0) at req.c:1703
MPT: #10 0x00002aaaabcd6d32 in MPI_SGI_barrier_basic (comm=6) at barrier.c:61
MPT: #11 0x00002aaaabcd6e65 in MPI_SGI_barrier (comm=6) at barrier.c:204
MPT: #12 0x00002aaaabcd70f5 in MPI_SGI_barrier (comm=<optimized out>)
MPT:     at barrier.c:166
MPT: #13 0x00002aaaabcd7193 in PMPI_Barrier (comm=1) at barrier.c:344
MPT: #14 0x00002aaaabcd72df in pmpi_barrier__ ()
MPT:    from /usr/local/sgi/mpi/mpt-2.12/opt/sgi/mpt/mpt-2.12/lib/libmpi.so
MPT: #15 0x000000000041bd07 in MAIN__ ()
MPT: #16 0x00000000004165c6 in main ()
MPT: (gdb) A debugging session is active.
MPT: 
MPT: 	Inferior 1 [process 5232] will be detached.
MPT: 
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/5232/exe, process 5232
MPT: Attaching to program: /proc/31976/exe, process 31976
MPT: Try: zypper install -C "debuginfo(build-id)=d4191084441e39a7b480fc4b41f67083812e9811"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=682e4f7a27ee294a58f17249a0717861db546f2d"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=32c1c7a7a20b54ac3af6b2f436b3375ffeb12f0b"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=2f51a06469a025d507534fe292dcf4e02235bd18"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=91334fd8105f0b62c0bdbbec14b45a9fd043f4c3"
MPT: (no debugging symbols found)...done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: Try: zypper install -C "debuginfo(build-id)=4aee0c3923838575483ebd16be6db85ecb6f0b75"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=3b149eccd897f1f37dce50ad22614043eba757a2"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=a95c0ce9f9752baf052d3b55b3b8ce19f662d2eb"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=719375f80fd84b85b905db2c20ec70e8805b36e5"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=4f3dc8efbe18b50a6abe70d8b2f862ce185542d6"
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: (gdb) #0  0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaaabd6f32c in mpi_sgi_system (command=<optimized out>) at sig.c:99
MPT: #2  MPI_SGI_stacktraceback (header=<optimized out>) at sig.c:319
MPT: #3  0x00002aaaabcc4a4a in print_traceback (ecode=0) at abort.c:197
MPT: #4  0x00002aaaabcc4b3e in MPI_SGI_abort () at abort.c:85
MPT: #5  0x00002aaaabd7ff4f in try_repush (gap=<optimized out>, 
MPT:     now=2886904.7364443871) at ud.c:1599
MPT: #6  0x00002aaaabd80eb6 in MPI_SGI_ud_progress () at ud.c:1769
MPT: #7  0x00002aaaabd57f62 in MPI_SGI_progress_devices () at progress.c:118
MPT: #8  MPI_SGI_progress () at progress.c:241
MPT: #9  0x00002aaaabd690d5 in MPI_SGI_request_wait (request=0x7fffffffa3dc, 
MPT:     status=0x2aaaac010970 <mpi_sgi_status_ignore>, set=0x7fffffffa3e4, 
MPT:     gen_rc=0x7fffffffa3e0) at req.c:1703
MPT: #10 0x00002aaaabcd6d32 in MPI_SGI_barrier_basic (comm=6) at barrier.c:61
MPT: #11 0x00002aaaabcd6e65 in MPI_SGI_barrier (comm=6) at barrier.c:204
MPT: #12 0x00002aaaabcd70f5 in MPI_SGI_barrier (comm=<optimized out>)
MPT:     at barrier.c:166
MPT: #13 0x00002aaaabcd7193 in PMPI_Barrier (comm=1) at barrier.c:344
MPT: #14 0x00002aaaabcd72df in pmpi_barrier__ ()
MPT:    from /usr/local/sgi/mpi/mpt-2.12/opt/sgi/mpt/mpt-2.12/lib/libmpi.so
MPT: #15 0x000000000041bd07 in MAIN__ ()
MPT: #16 0x00000000004165c6 in main ()
MPT: (gdb) A debugging session is active.
MPT: 
MPT: 	Inferior 1 [process 31976] will be detached.
MPT: 
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/31976/exe, process 31976
MPT: Attaching to program: /proc/24838/exe, process 24838
MPT: Try: zypper install -C "debuginfo(build-id)=d4191084441e39a7b480fc4b41f67083812e9811"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=682e4f7a27ee294a58f17249a0717861db546f2d"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=32c1c7a7a20b54ac3af6b2f436b3375ffeb12f0b"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=2f51a06469a025d507534fe292dcf4e02235bd18"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=91334fd8105f0b62c0bdbbec14b45a9fd043f4c3"
MPT: (no debugging symbols found)...done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: Try: zypper install -C "debuginfo(build-id)=4aee0c3923838575483ebd16be6db85ecb6f0b75"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=3b149eccd897f1f37dce50ad22614043eba757a2"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=a95c0ce9f9752baf052d3b55b3b8ce19f662d2eb"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=719375f80fd84b85b905db2c20ec70e8805b36e5"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=4f3dc8efbe18b50a6abe70d8b2f862ce185542d6"
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: (gdb) #0  0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaaabd6f32c in mpi_sgi_system (command=<optimized out>) at sig.c:99
MPT: #2  MPI_SGI_stacktraceback (header=<optimized out>) at sig.c:319
MPT: #3  0x00002aaaabcc4a4a in print_traceback (ecode=0) at abort.c:197
MPT: #4  0x00002aaaabcc4b3e in MPI_SGI_abort () at abort.c:85
MPT: #5  0x00002aaaabd7ff4f in try_repush (gap=<optimized out>, 
MPT:     now=2886899.4397963542) at ud.c:1599
MPT: #6  0x00002aaaabd80eb6 in MPI_SGI_ud_progress () at ud.c:1769
MPT: #7  0x00002aaaabd57f62 in MPI_SGI_progress_devices () at progress.c:118
MPT: #8  MPI_SGI_progress () at progress.c:241
MPT: #9  0x00002aaaabd690d5 in MPI_SGI_request_wait (request=0x7fffffffa324, 
MPT:     status=0x2aaaac010970 <mpi_sgi_status_ignore>, set=0x7fffffffa320, 
MPT:     gen_rc=0x7fffffffa31c) at req.c:1703
MPT: #10 0x00002aaaabd7324d in MPI_SGI_recv (buf=<optimized out>, 
MPT:     count=<optimized out>, type=<optimized out>, des=<optimized out>, 
MPT:     tag=<optimized out>, comm=<optimized out>, 
MPT:     status=0x2aaaac010970 <mpi_sgi_status_ignore>) at sugar.c:40
MPT: #11 0x00002aaaabcd6df3 in MPI_SGI_barrier_basic (comm=6) at barrier.c:74
MPT: #12 0x00002aaaabcd6e65 in MPI_SGI_barrier (comm=6) at barrier.c:204
MPT: #13 0x00002aaaabcd70f5 in MPI_SGI_barrier (comm=<optimized out>)
MPT:     at barrier.c:166
MPT: #14 0x00002aaaabcd7193 in PMPI_Barrier (comm=1) at barrier.c:344
MPT: #15 0x00002aaaabcd72df in pmpi_barrier__ ()
MPT:    from /usr/local/sgi/mpi/mpt-2.12/opt/sgi/mpt/mpt-2.12/lib/libmpi.so
MPT: #16 0x000000000041bd07 in MAIN__ ()
MPT: #17 0x00000000004165c6 in main ()
MPT: (gdb) A debugging session is active.
MPT: 
MPT: 	Inferior 1 [process 24838] will be detached.
MPT: 
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/24838/exe, process 24838
MPT: Attaching to program: /proc/12467/exe, process 12467
MPT: Try: zypper install -C "debuginfo(build-id)=d4191084441e39a7b480fc4b41f67083812e9811"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=682e4f7a27ee294a58f17249a0717861db546f2d"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=32c1c7a7a20b54ac3af6b2f436b3375ffeb12f0b"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=2f51a06469a025d507534fe292dcf4e02235bd18"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=91334fd8105f0b62c0bdbbec14b45a9fd043f4c3"
MPT: (no debugging symbols found)...done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: Try: zypper install -C "debuginfo(build-id)=4aee0c3923838575483ebd16be6db85ecb6f0b75"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=3b149eccd897f1f37dce50ad22614043eba757a2"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=a95c0ce9f9752baf052d3b55b3b8ce19f662d2eb"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=719375f80fd84b85b905db2c20ec70e8805b36e5"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=4f3dc8efbe18b50a6abe70d8b2f862ce185542d6"
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: (gdb) #0  0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaaabd6f32c in mpi_sgi_system (command=<optimized out>) at sig.c:99
MPT: #2  MPI_SGI_stacktraceback (header=<optimized out>) at sig.c:319
MPT: #3  0x00002aaaabcc4a4a in print_traceback (ecode=0) at abort.c:197
MPT: #4  0x00002aaaabcc4b3e in MPI_SGI_abort () at abort.c:85
MPT: #5  0x00002aaaabd7ff4f in try_repush (gap=<optimized out>, 
MPT:     now=2886904.2856587609) at ud.c:1599
MPT: #6  0x00002aaaabd80eb6 in MPI_SGI_ud_progress () at ud.c:1769
MPT: #7  0x00002aaaabd57f62 in MPI_SGI_progress_devices () at progress.c:118
MPT: #8  MPI_SGI_progress () at progress.c:241
MPT: #9  0x00002aaaabd690d5 in MPI_SGI_request_wait (request=0x7fffffffa35c, 
MPT:     status=0x2aaaac010970 <mpi_sgi_status_ignore>, set=0x7fffffffa364, 
MPT:     gen_rc=0x7fffffffa360) at req.c:1703
MPT: #10 0x00002aaaabcd6d32 in MPI_SGI_barrier_basic (comm=6) at barrier.c:61
MPT: #11 0x00002aaaabcd6e65 in MPI_SGI_barrier (comm=6) at barrier.c:204
MPT: #12 0x00002aaaabcd70f5 in MPI_SGI_barrier (comm=<optimized out>)
MPT:     at barrier.c:166
MPT: #13 0x00002aaaabcd7193 in PMPI_Barrier (comm=1) at barrier.c:344
MPT: #14 0x00002aaaabcd72df in pmpi_barrier__ ()
MPT:    from /usr/local/sgi/mpi/mpt-2.12/opt/sgi/mpt/mpt-2.12/lib/libmpi.so
MPT: #15 0x000000000041bd07 in MAIN__ ()
MPT: #16 0x00000000004165c6 in main ()
MPT: (gdb) A debugging session is active.
MPT: 
MPT: 	Inferior 1 [process 12467] will be detached.
MPT: 
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/12467/exe, process 12467
MPT: Attaching to program: /proc/12240/exe, process 12240
MPT: Try: zypper install -C "debuginfo(build-id)=d4191084441e39a7b480fc4b41f67083812e9811"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=682e4f7a27ee294a58f17249a0717861db546f2d"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=32c1c7a7a20b54ac3af6b2f436b3375ffeb12f0b"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=2f51a06469a025d507534fe292dcf4e02235bd18"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=91334fd8105f0b62c0bdbbec14b45a9fd043f4c3"
MPT: (no debugging symbols found)...done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: Try: zypper install -C "debuginfo(build-id)=4aee0c3923838575483ebd16be6db85ecb6f0b75"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=3b149eccd897f1f37dce50ad22614043eba757a2"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=a95c0ce9f9752baf052d3b55b3b8ce19f662d2eb"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=719375f80fd84b85b905db2c20ec70e8805b36e5"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=4f3dc8efbe18b50a6abe70d8b2f862ce185542d6"
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: (gdb) #0  0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaaabd6f32c in mpi_sgi_system (command=<optimized out>) at sig.c:99
MPT: #2  MPI_SGI_stacktraceback (header=<optimized out>) at sig.c:319
MPT: #3  0x00002aaaabcc4a4a in print_traceback (ecode=0) at abort.c:197
MPT: #4  0x00002aaaabcc4b3e in MPI_SGI_abort () at abort.c:85
MPT: #5  0x00002aaaabd7ff4f in try_repush (gap=<optimized out>, 
MPT:     now=2886905.1025811359) at ud.c:1599
MPT: #6  0x00002aaaabd80eb6 in MPI_SGI_ud_progress () at ud.c:1769
MPT: #7  0x00002aaaabd57f62 in MPI_SGI_progress_devices () at progress.c:118
MPT: #8  MPI_SGI_progress () at progress.c:241
MPT: #9  0x00002aaaabd690d5 in MPI_SGI_request_wait (request=0x7fffffffa3dc, 
MPT:     status=0x2aaaac010970 <mpi_sgi_status_ignore>, set=0x7fffffffa3e4, 
MPT:     gen_rc=0x7fffffffa3e0) at req.c:1703
MPT: #10 0x00002aaaabcd6d32 in MPI_SGI_barrier_basic (comm=6) at barrier.c:61
MPT: #11 0x00002aaaabcd6e65 in MPI_SGI_barrier (comm=6) at barrier.c:204
MPT: #12 0x00002aaaabcd70f5 in MPI_SGI_barrier (comm=<optimized out>)
MPT:     at barrier.c:166
MPT: #13 0x00002aaaabcd7193 in PMPI_Barrier (comm=1) at barrier.c:344
MPT: #14 0x00002aaaabcd72df in pmpi_barrier__ ()
MPT:    from /usr/local/sgi/mpi/mpt-2.12/opt/sgi/mpt/mpt-2.12/lib/libmpi.so
MPT: #15 0x000000000041bd07 in MAIN__ ()
MPT: #16 0x00000000004165c6 in main ()
MPT: (gdb) A debugging session is active.
MPT: 
MPT: 	Inferior 1 [process 12240] will be detached.
MPT: 
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/12240/exe, process 12240
MPT: Attaching to program: /proc/15596/exe, process 15596
MPT: Try: zypper install -C "debuginfo(build-id)=d4191084441e39a7b480fc4b41f67083812e9811"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=682e4f7a27ee294a58f17249a0717861db546f2d"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=32c1c7a7a20b54ac3af6b2f436b3375ffeb12f0b"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=2f51a06469a025d507534fe292dcf4e02235bd18"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=91334fd8105f0b62c0bdbbec14b45a9fd043f4c3"
MPT: (no debugging symbols found)...done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: Try: zypper install -C "debuginfo(build-id)=4aee0c3923838575483ebd16be6db85ecb6f0b75"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=3b149eccd897f1f37dce50ad22614043eba757a2"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=a95c0ce9f9752baf052d3b55b3b8ce19f662d2eb"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=719375f80fd84b85b905db2c20ec70e8805b36e5"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=4f3dc8efbe18b50a6abe70d8b2f862ce185542d6"
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: (gdb) #0  0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaaabd6f32c in mpi_sgi_system (command=<optimized out>) at sig.c:99
MPT: #2  MPI_SGI_stacktraceback (header=<optimized out>) at sig.c:319
MPT: #3  0x00002aaaabcc4a4a in print_traceback (ecode=0) at abort.c:197
MPT: #4  0x00002aaaabcc4b3e in MPI_SGI_abort () at abort.c:85
MPT: #5  0x00002aaaabd7ff4f in try_repush (gap=<optimized out>, 
MPT:     now=2886904.6996221012) at ud.c:1599
MPT: #6  0x00002aaaabd80eb6 in MPI_SGI_ud_progress () at ud.c:1769
MPT: #7  0x00002aaaabd57f62 in MPI_SGI_progress_devices () at progress.c:118
MPT: #8  MPI_SGI_progress () at progress.c:241
MPT: #9  0x00002aaaabd690d5 in MPI_SGI_request_wait (request=0x7fffffffa35c, 
MPT:     status=0x2aaaac010970 <mpi_sgi_status_ignore>, set=0x7fffffffa364, 
MPT:     gen_rc=0x7fffffffa360) at req.c:1703
MPT: #10 0x00002aaaabcd6d32 in MPI_SGI_barrier_basic (comm=6) at barrier.c:61
MPT: #11 0x00002aaaabcd6e65 in MPI_SGI_barrier (comm=6) at barrier.c:204
MPT: #12 0x00002aaaabcd70f5 in MPI_SGI_barrier (comm=<optimized out>)
MPT:     at barrier.c:166
MPT: #13 0x00002aaaabcd7193 in PMPI_Barrier (comm=1) at barrier.c:344
MPT: #14 0x00002aaaabcd72df in pmpi_barrier__ ()
MPT:    from /usr/local/sgi/mpi/mpt-2.12/opt/sgi/mpt/mpt-2.12/lib/libmpi.so
MPT: #15 0x000000000041bd07 in MAIN__ ()
MPT: #16 0x00000000004165c6 in main ()
MPT: (gdb) A debugging session is active.
MPT: 
MPT: 	Inferior 1 [process 15596] will be detached.
MPT: 
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/15596/exe, process 15596

MPT: -----stack traceback ends-----

MPT: -----stack traceback ends-----

MPT: -----stack traceback ends-----

MPT: -----stack traceback ends-----

MPT: -----stack traceback ends-----

MPT: -----stack traceback ends-----

 

 

John_D_6

Yes, rank 0 and task 0 are the same thing.

No, it doesn't tell you much: just that all other ranks are waiting in an MPI_Barrier, presumably until rank 0 finishes its special work. Of course, they'll wait forever (or until the timeout), because rank 0 exited with an error.
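
The cascade described above (one rank dies, the others sit in the barrier until a timeout) suggests looking for the first abnormal task exit in the log and checking whether the later "trying to reach" messages all point at that same host. A minimal sketch in Python, assuming only the message formats quoted earlier in this thread; the function name `summarize` and both regular expressions are illustrative and are not part of Slurm or MPT:

```python
import re

# "srun.slurm: error: <host>: task <n>: Exited with exit code <code>"
EXIT_RE = re.compile(
    r"srun\S*: error: (\S+): task (\d+): Exited with exit code (\d+)"
)

# "MPT ERROR: <host> has had continuous IB fabric problems ... trying to
# reach <host>." The message wraps over two lines, hence re.DOTALL.
UNREACHABLE_RE = re.compile(
    r"MPT ERROR: (\S+) has had continuous IB fabric problems"
    r".*?trying to reach (\S+?)\.",
    re.DOTALL,
)


def summarize(log_text):
    """Return (task exits, {unreachable host: [hosts that reported it]})."""
    exits = [
        (host, int(task), int(code))
        for host, task, code in EXIT_RE.findall(log_text)
    ]
    unreachable = {}
    for reporter, target in UNREACHABLE_RE.findall(log_text):
        unreachable.setdefault(target, []).append(reporter)
    return exits, unreachable


# Lines copied from the error output of job 5991137 shown above.
SAMPLE_LOG = """\
srun.slurm: error: borgo015: task 0: Exited with exit code 174
MPT ERROR: borgo021 has had continuous IB fabric problems for 10
    (MPI_WATCHDOG_TIMER) minutes trying to reach borgo015. Aborting.
MPT ERROR: borgo020 has had continuous IB fabric problems for 10
    (MPI_WATCHDOG_TIMER) minutes trying to reach borgo015. Aborting.
"""

if __name__ == "__main__":
    exits, unreachable = summarize(SAMPLE_LOG)
    for host, task, code in exits:
        reporters = unreachable.get(host, [])
        print(f"task {task} on {host} exited with code {code}; "
              f"{len(reporters)} host(s) later failed to reach it")
```

If every unreachable host is one where a task already exited, the watchdog aborts are most likely a consequence of that exit rather than a separate fabric problem.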

Gabriele_B_

Hey John,

I tried the -O2 build, and it gave me the line where the SIGSEGV occurs:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
3dpic_full_mpi.ex  0000000000862859  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000086112E  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000081F242  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000007EBD03  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000007EF8EB  Unknown               Unknown  Unknown
libpthread.so.0    00002AAAAB669810  Unknown               Unknown  Unknown
libc.so.6          00002AAAAC126C52  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000532072  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000049FD13  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000045B7D6  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000041B001  MAIN__                    545  3dpic_full_mpi.f
3dpic_full_mpi.ex  00000000004165C6  Unknown               Unknown  Unknown
libc.so.6          00002AAAAC02FC36  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  00000000004164B9  Unknown               Unknown  Unknown
srun.slurm: error: borgo010: task 0: Exited with exit code 174

It's what I wanted. Thanks

GB

 

 

John_D_6

Hello Gabriele,

Good to see that you got a location back. Note that the location may be inaccurate because of optimization, so it might be 10 lines earlier or later as well. Furthermore, it seems that only the main routine was compiled with debugging info. There seem to be three more frames in the call stack that are part of your program:

3dpic_full_mpi.ex  0000000000532072  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000049FD13  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000045B7D6  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000041B001  MAIN__                    545  3dpic_full_mpi.f

I suggest recompiling all routines and rerunning your simulation to see those frames as well.
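
To make the check for unresolved frames repeatable, here is a minimal sketch in Python that parses a forrtl traceback table and separates the frames in a given image that carry a line number from those reported as Unknown. It assumes the five-column layout (Image, PC, Routine, Line, Source) shown in the output above; the names `parse_traceback` and `split_frames` are hypothetical helpers, not part of any Intel tool:

```python
import re

# One traceback row: image, 16-digit hex PC, routine, line, source.
# The header row does not match because "PC" is not a hex address.
FRAME_RE = re.compile(
    r"^(?P<image>\S+)\s+(?P<pc>[0-9A-Fa-f]{16})\s+"
    r"(?P<routine>\S+)\s+(?P<line>\S+)\s+(?P<source>\S+)$"
)


def parse_traceback(text):
    """Return one dict per frame row found in a forrtl traceback."""
    frames = []
    for row in text.splitlines():
        match = FRAME_RE.match(row.strip())
        if match:
            frames.append(match.groupdict())
    return frames


def split_frames(frames, image):
    """Split the frames of one image into (resolved, unresolved) lists."""
    own = [f for f in frames if f["image"] == image]
    resolved = [f for f in own if f["line"] != "Unknown"]
    unresolved = [f for f in own if f["line"] == "Unknown"]
    return resolved, unresolved


# Rows copied from the -O2 -g -traceback run shown above.
SAMPLE = """\
Image              PC                Routine            Line        Source
libc.so.6          00002AAAAC126C52  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  0000000000532072  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000049FD13  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000045B7D6  Unknown               Unknown  Unknown
3dpic_full_mpi.ex  000000000041B001  MAIN__                    545  3dpic_full_mpi.f
"""

if __name__ == "__main__":
    resolved, unresolved = split_frames(
        parse_traceback(SAMPLE), "3dpic_full_mpi.ex"
    )
    for frame in resolved:
        print(f"{frame['routine']} at {frame['source']}:{frame['line']}")
    print(f"{len(unresolved)} frame(s) in the executable have no line info")
```

Frames that stay unresolved after a full rebuild with -g -traceback may belong to code that was linked in without debug info, such as a static library.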

Gabriele_B_

What do you mean by compiling all the routines? At the moment I compile the code this way:

set -x

mpif90 -I/usr/local/other/SLES11.3/ncl/gcc-4.3.4/6.3.0-static/include -I/usr/local/other/SLES11.3/silo/4.10.2/include -extend-source -r8 -c -O2 -g -traceback 3dpic_full_mpi.f

mpif90 -o 3dpic_full_mpi.exe 3dpic_full_mpi.o -L/usr/local/other/SLES11.3/ncl/gcc-4.3.4/6.3.0-static/lib -L/usr/local/other/SLES11.3/silo/4.10.2/lib -lsiloh5 -lhdf5_hl -lhdf5 -lsz -lz -lm -lrt -ldl -lstdc++ -O2 -g -traceback

I have many subroutines in the code. I have only two other files, but they contain constants and variables, and I include them in the .f file shown above.

Thanks

GB

John_D_6

My hunch is that your code crashes in a call to the Silo or HDF5 library, since those are not compiled with debug info.

Gabriele_B_

I think you are quite right, and I believe I have found the error (I know my code, so I know that SILO is the critical part).

This time -O2 was very precise: that is the exact line where the code breaks.

How does -O2 work? How can I know how much to trust it? You said it has a 10-line tolerance, but if in the future I insert some useless lines with dummy operations, does it still keep this 10-line accuracy?

thanks

GB

 

 

John_D_6

The 10 lines I mentioned is just a number I made up: it could be more or less, so don't rely on it being exact. The compiler may move statements around, so there is only a loose correspondence between the order in your source code and the order in the executable.

In this case, the compiler can't optimize the call to an external library, which is why you see the exact location.

dkokron

GB,

I gather from the name of the compute node (borg...) that you are running on the Discover cluster at NASA/Goddard. You should contact the NCCS support group at support@nccs.nasa.gov.

Dan

Gabriele_B_

Dan,

I already did. But John's suggestions were more helpful.

GB
