Hi
I have a code that runs on a cluster when I use 6^3 = 216 cores, but it crashes when I run it at a higher resolution on 12^3 = 1728 cores (all parameters are the same except the grid spacing and the number of processors).
We checked whether it is a memory issue, but even running the job with 16 tasks per node (108 nodes) didn't help.
I cannot debug the program with something like TotalView because of the limit on the number of processes these debuggers can handle.
I tried compiling the program with -O0 -g -traceback to get better information in the error message.
With these options the job does not abort when the problem occurs; it keeps running until the time I requested on the cluster expires.
In this case I get:
srun.slurm: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmstepd-borgt091: *** JOB 5787356 CANCELLED AT 2015-11-02T11:17:00 DUE TO TIME LIMIT on borgt091 ***
slurmstepd-borgt091: *** STEP 5787356.0 CANCELLED AT 2015-11-02T11:17:00 DUE TO TIME LIMIT on borgt091 ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
3dpic_full_mpi.ex 000000000088C169 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000088AA3E Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000848F32 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000815663 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000819219 Unknown Unknown Unknown
libpthread.so.0 00002AAAAB669810 Unknown Unknown Unknown
libpthread.so.0 00002AAAAB6663D0 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
3dpic_full_mpi.ex 000000000088C169 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000088AA3E Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000848F32 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000815663 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000819219 Unknown Unknown Unknown
libpthread.so.0 00002AAAAB669810 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000819140 Unknown Unknown Unknown
libpthread.so.0 00002AAAAB669810 Unknown Unknown Unknown
libmlx5-rdmav2.so 00002AAAACE3F4BB Unknown Unknown Unknown
Stack trace terminated abnormally.
(more similar lines...)
I attach the complete error file (JOBID 5787356)
However, when I run the same simulation without these compiler options, I get a different error and the job breaks down earlier:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
3dpic_full_mpi.ex 0000000000869189 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000867A5E Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000825B72 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000007F2633 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000007F621B Unknown Unknown Unknown
libpthread.so.0 00002AAAAB669810 Unknown Unknown Unknown
libc.so.6 00002AAAAC126C52 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000005389A2 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000004A6643 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000462106 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000041B72F Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000004165C6 Unknown Unknown Unknown
libc.so.6 00002AAAAC02FC36 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000004164B9 Unknown Unknown Unknown
srun.slurm: error: borgo015: task 0: Exited with exit code 174
MPT ERROR: borgo021 has had continuous IB fabric problems for 10
(MPI_WATCHDOG_TIMER) minutes trying to reach borgo015. Aborting.
MPT ERROR: borgo020 has had continuous IB fabric problems for 10
(MPI_WATCHDOG_TIMER) minutes trying to reach borgo015. Aborting.
MPT: Global rank 32 is aborting with error code 0.
Process ID: 12240, Host: borgo021, Program: /gpfsm/dnb32/gbrambil/Kcode/pulsarSILOF/3dpic_full_mpi.exe
(other stuff later)
I attach the error file of this job too (JOBID 5991137)
Do you have any idea what the problem could be? I saw this topic (https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/558488). Would the approach there work in my case too (unlike that poster, I cannot use a debugger)?
P.S.: the error file contains the line rm: cannot remove `pcrimth.dat': No such file or directory. Don't worry about it; it always appears, but the code runs.
Thanks
This question appears to be related to MPI, but I don't see any clue about which MPI implementation you use. Each major MPI implementation has its own email help list; for Intel MPI (which I guess you aren't using) the place to ask would be the clusters and HPC companion forum to this one.
I use SGI-MPT MPI.
I posted on this forum because I saw that a similar question had been asked here successfully (the link I included in my post).
Thanks
GB
It seems slightly related in that your cluster watchdog timer has killed your job, but it has explained that you have been waiting far too long to reach a specified node. If that node was allocated to you by your cluster manager but is unavailable, that is a problem for your sysadmin.
But what about the fact that I get a SIGSEGV error when I compile the program without -O0 -g -traceback?
Which of the two is the "correct" error? Why, with these compiler options, doesn't the cluster kill the job until the time expires?
Is WATCHDOG killing the processes when the time expires?
Thanks
GB
Indeed, your job 5991137 shows the 'real' error, in the sense that the segfault on rank 0 is what needs to be fixed. The other ranks then just stop because they can no longer contact that node (borgo015). Extend the walltime of your job to the maximum allowed on your cluster and rerun the job with debug info and -O0. If you can't wait that long, you can compile with -O2 -g -traceback and rerun. Note that in that case the printed location of the error is probably (much) less accurate.
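As a rough aid for picking the primary failure out of a long job error file, the sketch below filters for the two line types seen in this thread: the Intel Fortran run-time "forrtl: severe" message and the srun report of a task that exited. The function name primary_errors and the file name in the usage comment are hypothetical, and the patterns are taken only from the log excerpts above, so they may need adjusting on other clusters.

```shell
# Print, with line numbers, the lines of a job error file (read from stdin)
# that report a primary failure: Intel Fortran "severe" run-time errors and
# srun reports of a task that exited. The MPT watchdog messages that the
# other ranks emit afterwards do not match and are therefore skipped.
primary_errors() {
    grep -n -E 'forrtl: severe|task [0-9]+: Exited with exit code'
}

# Hypothetical usage (the file name is an assumption):
#   primary_errors < job_5991137.err | head
```

The line numbers printed by grep -n make it easy to jump to the matching traceback in the full file.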
Hi John,
thanks. Is task 0 equivalent to rank 0? If so, that gives me a hint toward the solution (rank 0 does something special in my code).
"Just extend the walltime of your job as long as is allowed on your cluster and rerun the job with debug info and -O0."
I tried, but when I run the job this way (by debug info you mean -g -traceback, correct?), however much time I request, the simulation uses all of it while in fact remaining blocked at the same point.
I'll try -O2 -g -traceback.
Is there anything in the part of the error message below that tells me what is happening to the job that has the problem?
MPT: --------stack traceback-------
MPT: Attaching to program: /proc/18777/exe, process 18777
MPT: Try: zypper install -C "debuginfo(build-id)=d4191084441e39a7b480fc4b41f67083812e9811"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=682e4f7a27ee294a58f17249a0717861db546f2d"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=32c1c7a7a20b54ac3af6b2f436b3375ffeb12f0b"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=2f51a06469a025d507534fe292dcf4e02235bd18"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=91334fd8105f0b62c0bdbbec14b45a9fd043f4c3"
MPT: (no debugging symbols found)...done.
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: Try: zypper install -C "debuginfo(build-id)=4aee0c3923838575483ebd16be6db85ecb6f0b75"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=3b149eccd897f1f37dce50ad22614043eba757a2"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=a95c0ce9f9752baf052d3b55b3b8ce19f662d2eb"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=719375f80fd84b85b905db2c20ec70e8805b36e5"
MPT: (no debugging symbols found)...done.
MPT: Try: zypper install -C "debuginfo(build-id)=4f3dc8efbe18b50a6abe70d8b2f862ce185542d6"
MPT: (no debugging symbols found)...done.
MPT: 0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: (gdb) #0 0x00002aaaab66937f in waitpid () from /lib64/libpthread.so.0
MPT: #1 0x00002aaaabd6f32c in mpi_sgi_system (command=<optimized out>) at sig.c:99
MPT: #2 MPI_SGI_stacktraceback (header=<optimized out>) at sig.c:319
MPT: #3 0x00002aaaabcc4a4a in print_traceback (ecode=0) at abort.c:197
MPT: #4 0x00002aaaabcc4b3e in MPI_SGI_abort () at abort.c:85
MPT: #5 0x00002aaaabd7ff4f in try_repush (gap=<optimized out>,
MPT: now=2886902.8844944318) at ud.c:1599
MPT: #6 0x00002aaaabd80eb6 in MPI_SGI_ud_progress () at ud.c:1769
MPT: #7 0x00002aaaabd57f62 in MPI_SGI_progress_devices () at progress.c:118
MPT: #8 MPI_SGI_progress () at progress.c:241
MPT: #9 0x00002aaaabd690d5 in MPI_SGI_request_wait (request=0x7fffffffa35c,
MPT: status=0x2aaaac010970 <mpi_sgi_status_ignore>, set=0x7fffffffa364,
MPT: gen_rc=0x7fffffffa360) at req.c:1703
MPT: #10 0x00002aaaabcd6d32 in MPI_SGI_barrier_basic (comm=6) at barrier.c:61
MPT: #11 0x00002aaaabcd6e65 in MPI_SGI_barrier (comm=6) at barrier.c:204
MPT: #12 0x00002aaaabcd70f5 in MPI_SGI_barrier (comm=<optimized out>)
MPT: at barrier.c:166
MPT: #13 0x00002aaaabcd7193 in PMPI_Barrier (comm=1) at barrier.c:344
MPT: #14 0x00002aaaabcd72df in pmpi_barrier__ ()
MPT: from /usr/local/sgi/mpi/mpt-2.12/opt/sgi/mpt/mpt-2.12/lib/libmpi.so
MPT: #15 0x000000000041bd07 in MAIN__ ()
MPT: #16 0x00000000004165c6 in main ()
MPT: (gdb) A debugging session is active.
MPT:
MPT: Inferior 1 [process 18777] will be detached.
MPT:
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/18777/exe, process 18777
(the same traceback, apart from the timestamp and request addresses, repeats for processes 5232, 31976, 12467, 12240 and 15596; process 24838 is also waiting in the barrier, with one extra frame: MPI_SGI_recv at sugar.c:40, called from MPI_SGI_barrier_basic at barrier.c:74)
MPT: -----stack traceback ends-----
Yes, rank 0 and task 0 are the same thing.
No, it doesn't tell you much: just that all other ranks are waiting in an MPI_Barrier, presumably until rank 0 finishes its special work. Of course, they'll wait forever (or until the timeout), because rank 0 exited with an error.
Hey John,
I tried -O2 -g -traceback and it gave me the line where the SIGSEGV happens:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
3dpic_full_mpi.ex 0000000000862859 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000086112E Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000081F242 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000007EBD03 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000007EF8EB Unknown Unknown Unknown
libpthread.so.0 00002AAAAB669810 Unknown Unknown Unknown
libc.so.6 00002AAAAC126C52 Unknown Unknown Unknown
3dpic_full_mpi.ex 0000000000532072 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000049FD13 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000045B7D6 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000041B001 MAIN__ 545 3dpic_full_mpi.f
3dpic_full_mpi.ex 00000000004165C6 Unknown Unknown Unknown
libc.so.6 00002AAAAC02FC36 Unknown Unknown Unknown
3dpic_full_mpi.ex 00000000004164B9 Unknown Unknown Unknown
srun.slurm: error: borgo010: task 0: Exited with exit code 174
It's what I wanted. Thanks
GB
Hello Gabriele,
good to see that you got a location back. Note that the location may be inaccurate because of optimization, so the real one might be 10 lines earlier or later. Furthermore, it seems that only the main routine was compiled with debugging info. There seem to be 3 more frames in the call stack that are part of your program:
3dpic_full_mpi.ex 0000000000532072 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000049FD13 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000045B7D6 Unknown Unknown Unknown
3dpic_full_mpi.ex 000000000041B001 MAIN__ 545 3dpic_full_mpi.f
I suggest recompiling all routines with debug info and rerunning your simulation to see those frames as well.
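A possible way to get more out of the Unknown frames, assuming GNU binutils (addr2line) is available on the cluster: pull the PC values of the frames that belong to the executable out of the traceback and pass them to addr2line. The helper name extract_pcs and the file name job.err are hypothetical; note that the traceback truncates the image name to 3dpic_full_mpi.ex. For code built without -g, addr2line may report only a function name, or ?? if it finds no symbol.

```shell
# Read a forrtl traceback on stdin and print the PC column (as 0x... hex)
# of every frame whose Image column equals the name given as $1.
# Header lines and frames from other images (libc, libpthread) are skipped.
extract_pcs() {
    awk -v image="$1" '$1 == image && $2 ~ /^[0-9A-Fa-f]+$/ { print "0x" $2 }'
}

# Hypothetical usage (file names are assumptions); addr2line reads the
# addresses from standard input when none are given on the command line:
#   extract_pcs 3dpic_full_mpi.ex < job.err | addr2line -f -e 3dpic_full_mpi.exe
```

The addresses are only meaningful for the exact binary that produced the traceback, so the executable should not be rebuilt between the run and the lookup.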
What do you mean by compiling all the routines? At the moment I'm compiling the code this way:
set -x
mpif90 -I/usr/local/other/SLES11.3/ncl/gcc-4.3.4/6.3.0-static/include -I/usr/local/other/SLES11.3/silo/4.10.2/include -extend-source -r8 -c -O2 -g -traceback 3dpic_full_mpi.f
mpif90 -o 3dpic_full_mpi.exe 3dpic_full_mpi.o -L/usr/local/other/SLES11.3/ncl/gcc-4.3.4/6.3.0-static/lib -L/usr/local/other/SLES11.3/silo/4.10.2/lib -lsiloh5 -lhdf5_hl -lhdf5 -lsz -lz -lm -lrt -ldl -lstdc++ -O2 -g -traceback
I have many subroutines in the code. I have only 2 other files, but they contain constants and variables, and I include them in the .f file shown above.
Thanks
GB
My hunch is that your code crashes in a call to the silo or hdf5 library, since those are not compiled with debug info.
I think you are quite right, and I believe I have found the error (I know my code, so I know that SILO is the critical part).
-O2 was very precise this time: that is the exact line where the code breaks.
How does -O2 work? How can I know how much to trust it? You said it has a 10-line tolerance, but if in the future I insert some useless lines with dummy operations, will it still keep this 10-line accuracy?
thanks
GB
The 10 lines I mentioned is just a number I made up: it could be more or less, just don't rely on it being exact. The compiler can move statements around, so there is only a loose correspondence between the order in your source code and the order in the executable.
In this case, the compiler can't optimize the call to an external library; that's why you see the exact location.
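To see how much the reported line shifts with optimization, one option is to build the same source at both levels while keeping -g -traceback, then compare the tracebacks. A minimal sketch that only prints the compile command rather than running it; the function name build_cmd is hypothetical, and the include paths from the build script posted earlier in the thread are left out for brevity.

```shell
# Print (do not run) the compile command for a given optimization level.
# -g -traceback stay in every variant so the traceback can show source lines.
# The level defaults to -O2 when no argument is given.
build_cmd() {
    opt=${1:--O2}
    printf '%s\n' "mpif90 -extend-source -r8 -c $opt -g -traceback 3dpic_full_mpi.f"
}

build_cmd -O0   # unoptimized build: most reliable line numbers, slowest run
build_cmd -O2   # optimized build: faster, line numbers only approximate
```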
GB,
I gather from the name of the compute node (borg...) that you are running on the discover cluster at NASA/Goddard. You should contact the NCCS support group at support@nccs.nasa.gov.
Dan
Dan,
I had already done that, but John's suggestions were more helpful.
GB