Intel® Fortran Compiler

Seeking debugging tips: crash with optimized code, no crash otherwise

tobias-burnus
Beginner
Hello,

This is with l_fc_c_9.0.032/EM64T.

If I compile with

-g -check -r8

it does not crash, whereas it does crash with

-O3 -r8 -assume buffered_io -ipo -traceback -g

Unfortunately, the traceback does not help:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libc.so.6          00002B4B9973662E  Unknown               Unknown  Unknown
[...]

Any idea?

Tobias
Steven_L_Intel1
Employee
That was the entire traceback?

Start removing switches such as -ipo. Try compiling some sources with lower optimization. Add log code to follow the execution. Run under the debugger to see how far it gets and perhaps get more details.
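One way to "add log code to follow the execution" is a small trace routine that prints a numbered marker and flushes it immediately, so the last message before a crash is not left sitting in an output buffer. This is a minimal sketch; the module `trace_mod` and routine `trace` are hypothetical names, and it relies on the Fortran 2003 `FLUSH` statement and `ISO_FORTRAN_ENV`, which an older compiler such as ifort 9.0 may not support (the vendor extension `call flush(6)` used later in this thread is the older equivalent).

```fortran
module trace_mod
  use, intrinsic :: iso_fortran_env, only: error_unit
  implicit none
  ! Number of trace points passed so far (hypothetical bookkeeping).
  integer, save :: trace_count = 0
contains
  ! Print a numbered marker to standard error and flush it at once,
  ! so the message is visible even if the very next statement crashes.
  subroutine trace(msg)
    character(len=*), intent(in) :: msg
    trace_count = trace_count + 1
    write(error_unit, '(a,i0,2a)') 'TRACE ', trace_count, ': ', msg
    flush(error_unit)
  end subroutine trace
end module trace_mod

program demo_trace
  use trace_mod
  implicit none
  call trace('entering eigen')
  call trace('before DEALLOCATE(tuulo)')
end program demo_trace
```

The numbered markers make it easy to see which trace point was the last one reached before the SIGSEGV.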
tobias-burnus
Beginner
Hello,

Steve_Lionel wrote:
That was the entire traceback?

Well, I managed to get some output. The trace now reads:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
libc.so.6          00002B8CFBB2F62E  Unknown               Unknown  Unknown
libc.so.6          00002B8CFBB2FA8C  Unknown               Unknown  Unknown
fleur_r.x          00000000008DA537  Unknown               Unknown  Unknown
fleur_r.x          00000000008DA4A0  Unknown               Unknown  Unknown
fleur_r.x          00000000007EC6EA  m_eigen_mp_eigen_         767  eigen.F
fleur_r.x          0000000000894DBF  MAIN__                    842  fleur.F
fleur_r.x          000000000040362A  Unknown               Unknown  Unknown
libc.so.6          00002B8CFBAE0154  Unknown               Unknown  Unknown
fleur_r.x          0000000000403569  Unknown               Unknown  Unknown
If I look at line 767 in eigen.F, I find:
      IF (ALLOCATED(tuulo)) THEN   ! line 766
        DEALLOCATE(tuulo)          ! line 767
The whole thing is embedded in nested DO loops:
   do
      do
         allocate(tuulo(...))
         ...
         deallocate(tuulo)
      end do
   end do
Any idea?

Tobias

PS: Compile options: -O2 -r8 -assume buffered_io -ip -traceback -g -axP
Steven_L_Intel1
Employee
The symptom suggests that the compiler's descriptor for the allocatable variable is being corrupted somehow. Again, I'd start removing optimization options to find the minimum set that shows the problem. See whether compiling only certain sources with optimization shows the error. If you get stuck, submit an example to Intel Premier Support.
tobias-burnus
Beginner
Small addition: it does not crash with -O0, but it does crash with -O1.

I was checking how often it allocates and deallocates (since this happens inside the loop), but there does not seem to be a problem: my debugging WRITE(*,*) statements indicate that the arrays are allocated only once and that the failure happens at the end of the first loop iteration. (Actually, 8 arrays are allocated/deallocated, and the 6th deallocation fails.)

Whether it crashes also depends on the program's input parameters. That, plus the fact that it runs for about half an hour, makes it hard to produce a small test case :-(

Tobias
tobias-burnus
Beginner
Hello,

I found out more, but I still don't see the underlying problem (and have not managed to produce a reduced test case). With ifort 9 it crashes only when tlo.F is compiled with -O1, not with -O0. (Steve, thanks for your tip about testing the influence of optimization file by file!)

(With optimization, the PathScale compiler also fails with a deallocation error, albeit a bit earlier.)

I don't see anything obvious that could go wrong, but maybe I am overlooking something. Below is an excerpt.

Tobias
eigen.F
-------------------------
COMPLEX, ALLOCATABLE :: tuulo(:,:,:,:)
...
ALLOCATE(tuulo(0:lmd,-llod:llod,mlotot,j))
tuulo(:,:,:,:) = cmplx(0.0,0.0)
...
CALL tlmplm(tuulo(0,-llod,1,1),mlotot,llod,...)
...
CALL tlmplm(tuulo(0,-llod,1,isp),mlotot,llod,...)
...
IF(ALLOCATED(tuulo)) THEN
  DEALLOCATE(tuulo) ! crashes here with ifort 9
END IF
-------------------------
tlmplm.F
-------------------------
subroutine tlmplm(...)
...
INTEGER, INTENT (IN) :: lmd, llod, mlotot
COMPLEX, INTENT (OUT):: tuulo(0:lmd,-llod:llod,mlotot)
...
CALL tlo(tuulo(0,-llod,mlo),tdulo(0,-llod,mlo),...)
...
end subroutine ! crashes here with Pathscale pgf90 6.0
-------------------------
tlo.F
-------------------------
subroutine tlo(...)
integer, intent(in) :: lmd, llod
COMPLEX, INTENT (OUT) :: tuulo(0:lmd,-llod:llod,*)
...
tuulo(lmp,m,lo) = tuulo(lmp,m,lo) + cil*uvulo(lo,lp,lh)
...
end subroutine
-------------------------
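A side note on this excerpt, not necessarily the cause of the crash: `tuulo` is declared `INTENT (OUT)` in both tlmplm and tlo, yet tlo accumulates into it (`tuulo = tuulo + ...`). An `INTENT(OUT)` dummy becomes undefined on entry, so unless the elided code in tlo defines `tuulo` first, `INTENT(INOUT)` is the conforming declaration. Also, with the assumed-size declaration (`*`) the compiler cannot check the upper bound of the last dimension, so `-check bounds` will not catch an out-of-range `lo`. The sketch below is a hypothetical reduction (the names `tlo_sketch`, `tlo_accumulate` and `nlo` are invented for illustration) that uses `INTENT(INOUT)` and an explicit last extent so every subscript can be bounds-checked.

```fortran
module tlo_sketch
  implicit none
contains
  ! Hypothetical reduction of tlo: the dummy is INTENT(INOUT) because its
  ! existing values are read, and the last extent nlo is passed explicitly
  ! so that bounds checking can verify lo as well.
  subroutine tlo_accumulate(lmd, llod, nlo, tuulo, lmp, m, lo, contrib)
    integer, intent(in)    :: lmd, llod, nlo, lmp, m, lo
    complex, intent(inout) :: tuulo(0:lmd, -llod:llod, nlo)
    complex, intent(in)    :: contrib
    tuulo(lmp, m, lo) = tuulo(lmp, m, lo) + contrib
  end subroutine tlo_accumulate
end module tlo_sketch

program demo_tlo
  use tlo_sketch
  implicit none
  integer, parameter :: lmd = 2, llod = 1, mlotot = 3
  complex :: tuulo(0:lmd, -llod:llod, mlotot)
  integer :: mlo
  tuulo = (0.0, 0.0)
  mlo = 2
  ! Pass the array starting at plane mlo, as in tlo(tuulo(0,-llod,mlo),...);
  ! only mlotot-mlo+1 planes remain from that element onwards.
  call tlo_accumulate(lmd, llod, mlotot - mlo + 1, tuulo(0, -llod, mlo), &
                      1, 0, 1, (1.0, 2.0))
  call tlo_accumulate(lmd, llod, mlotot - mlo + 1, tuulo(0, -llod, mlo), &
                      1, 0, 1, (1.0, 2.0))
  print *, tuulo(1, 0, mlo)
end program demo_tlo
```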
Steven_L_Intel1
Employee
There's no point in looking at excerpts - in my experience, the actual problem is almost always in code you left out. This is why a complete example that shows the problem is required for analysis.
tobias-burnus
Beginner

Steve_Lionel wrote:
There's no point in looking at excerpts - in my experience, the actual problem is almost always in code you left out. This is why a complete example that shows the problem is required for analysis.



Granted. My only problem is that the files involved are rather long,
  871 eigen.F
  325 tlmplm.F
  239 tlo.F,
and they are part of a bigger program (wc -l *F = 88852) that computes for several minutes before it even reaches these files. I'm trying to strip it down, but so far it has either worked (the example was too short) or not run/compiled at all. I can hardly send the whole program to Premier Support and say: "Find the problem yourself: if you compile tlo.F with -O1 it crashes, with -O0 it does not. Figure it out yourself." Can I?

Tobias
Steven_L_Intel1
Employee
All I can suggest is to keep working at it. You CAN send us the whole thing, but it will likely take a lot longer to analyze than if it is done by someone who understands the application.
TimP
Honored Contributor III
You should be able to find out whether -O1 works with -fp-model precise, or -mp1, or -mp, or with vectorization off. You could turn vectorization on loop by loop, if necessary. Once you have isolated the problem to a single function, if you still have questions, maybe you would be willing to show enough detail for someone to comment.
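Turning vectorization off loop by loop can be done in Intel Fortran with the `NOVECTOR` general directive placed immediately before a DO loop; other compilers treat the `!DIR$` line as an ordinary comment. A minimal sketch, assuming your ifort version honors this directive (check its documentation); the module and routine names are hypothetical:

```fortran
module novector_demo
  implicit none
contains
  ! Scale b into a. The directive asks ifort not to vectorize the loop
  ! that follows; other loops in the file are optimized as usual.
  subroutine scale_novec(n, a, b)
    integer, intent(in) :: n
    real, intent(out)   :: a(n)
    real, intent(in)    :: b(n)
    integer :: i
!DIR$ NOVECTOR
    do i = 1, n
      a(i) = 2.0 * b(i)
    end do
  end subroutine scale_novec
end module novector_demo

program demo_novec
  use novector_demo
  implicit none
  integer, parameter :: n = 8
  real :: a(n), b(n)
  integer :: i
  b = (/ (real(i), i = 1, n) /)
  call scale_novec(n, a, b)
  print *, a
end program demo_novec
```

Moving the directive from loop to loop in the suspect routine narrows down which vectorized loop, if any, triggers the failure.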

tobias-burnus
Beginner
Hmm, a minor update of the software ("few fixes") plus updating to Intel Fortran Compiler 9.0.033 did the trick: it no longer crashes at that point, with or without debugging options.

Thanks again for the very helpful debugging suggestions, Steve!

:-)

- - - -

But now I'm stuck with a very weird problem (a few function calls later, same program):

Memory access error.

Loading the core dump into idb and running "where" showed that this occurred in mix.F at "INQUIRE (file='n_mmp_mat',exist=l_ldaU)". I couldn't find anything suspicious before or after that line.

In fact, before the INQUIRE line there is only some integer/floating-point arithmetic (no arrays).

So I cluttered that file with 'write(*,*) "Debug"' followed by 'call flush(6)'. Now I get the same error, and "where" points at my first write(*,*):

write(*,*) "Entering mix.F"

The only things before that line are:
MODULE m_mix
CONTAINS
SUBROUTINE mix( 78 parameters )
use 8 modules
implicit none
arguments declaration
definition of local variables

Everything is compiled with -O0 -g -check all -r8 -traceback
(and runs for 3 h).

Does anything ring a bell? How reliable is the file name/line shown in idb; could the actual fault be elsewhere? Could anything go wrong when passing the arrays that crashes the program merely by passing them, without accessing them? (I'll try to look at the sizes of the 11 arrays passed, as this is the only guess I have; unfortunately, they are allocated in several different routines before "mix" is called in fleur.F, the main file, at line 1086.)

(idb) where
>0 0x00000000007d0dd9 in M_MIX::mix(...., invs=.TRUE. (4294967295),..., .tmp.I.SI64.NAMAT.len=2) "mix.F":99
#1 0x0000000000ee3899 in fleur() "fleur.F":1086
#2 0x000000000040382a in main(...) in /projects/tob/fleur/bin/fleur_r_debug.x
#3 0x00002b813906a154 in __libc_start_main(...) in /lib64/libc-2.4.so
#4 0x0000000000403740 in _start(...) in /projects/tob/fleur/bin/fleur_r_debug.x

Tobias
Steven_L_Intel1
Employee
The line number isn't entirely reliable. When I encounter issues such as this I start stepping through by instruction and see where I am when the error occurs.

Normally, just passing an argument would not cause an error, but "it depends"...
tobias-burnus
Beginner
Hello,

It turned out that the crash was caused by too small a stack size (ulimit -s): a local, statically declared (non-ALLOCATABLE) array in the subroutine was allocated on the stack when the subroutine was called.

Unfortunately, the error was rather opaque. (Thanks go to the premier.intel.com support team [364581] for tracking down the problem in my test case.)

Tobias

PS: It is a pity that "-g -check all -traceback" does not catch this.
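A minimal sketch of this failure mode, assuming the large local array is an automatic array (its size depends on a dummy argument). ifort, for instance, places such arrays on the stack by default (other compilers may differ), so a size beyond the ulimit -s limit can crash the program on entry to the routine, before any of its statements run; that would be consistent with idb pointing at the first WRITE in mix.F. Making the array ALLOCATABLE moves it to the heap and lets STAT= report a failure instead. Raising the limit (ulimit -s unlimited) is the other fix; later ifort releases also offer a -heap-arrays option, but verify that your version has it. The module and routine names are hypothetical.

```fortran
module stack_vs_heap
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Automatic array: its size depends on the dummy argument n. With ifort's
  ! default it lives on the stack, so a large n can exceed the ulimit -s
  ! limit and crash on entry to the routine.
  subroutine sum_automatic(n, total)
    integer, intent(in)   :: n
    real(dp), intent(out) :: total
    real(dp) :: work(n)
    work = 1.0_dp
    total = sum(work)
  end subroutine sum_automatic

  ! Allocatable array: storage comes from the heap, independent of the
  ! stack limit, and a failed ALLOCATE is reported through STAT=.
  subroutine sum_allocatable(n, total, ierr)
    integer, intent(in)   :: n
    real(dp), intent(out) :: total
    integer, intent(out)  :: ierr
    real(dp), allocatable :: work(:)
    total = 0.0_dp
    allocate(work(n), stat=ierr)
    if (ierr /= 0) return
    work = 1.0_dp
    total = sum(work)
    deallocate(work)
  end subroutine sum_allocatable
end module stack_vs_heap

program demo_stack
  use stack_vs_heap
  implicit none
  real(dp) :: t
  integer :: ierr
  call sum_automatic(1000, t)             ! small: fits on any stack
  print *, 'automatic total   =', t
  ! About 16 MB, more than a common 8 MB default stack limit;
  ! the allocatable version takes it from the heap instead.
  call sum_allocatable(2000000, t, ierr)
  if (ierr /= 0) then
    print *, 'allocation failed, stat =', ierr
  else
    print *, 'allocatable total =', t
  end if
end program demo_stack
```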