
hallevison · ‎09-03-2012

Hi Everyone:

I am getting a seg fault that I don't understand. It occurs on the last line of the following code fragment:

cvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         if( (vrel2.lt.0.0d0) .or. (vrel2.ne.vrel2) ) then
            write(*,*) 'Here #DMM30 ',vrel2
         endif
c^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

         vrel = sqrt(vrel2)

vrel and vrel2 are simple real*8's that are local to the subroutine. The code is compiled with the -O -CB -traceback -fpe0 -recursive -openmp flags. The code is parallelized with OMP but this part of the code is not in a parallel loop. It IS at the bottom of a recursive loop, however. The code runs for about 12 hours before the error occurs and the subroutine is called many billions of times. The traceback looks like:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
stryAFdebug        00000000004B57B4 Unknown               Unknown Unknown
libpthread.so.0    0000003F4820EEE0 Unknown               Unknown Unknown
stryAFdebug        00000000005405DB Unknown               Unknown Unknown
stryAFdebug        0000000000533D85 Unknown               Unknown Unknown
stryAFdebug        00000000004823F3 discard_mass_merg        6507 stryAFdebug.f
stryAFdebug        0000000000478870 symba6_merge_            4454 stryAFdebug.f
stryAFdebug        000000000047564F symba6_step_recur       10729 stryAFdebug.f
stryAFdebug        000000000047537A symba6_step_recur       10712 stryAFdebug.f
stryAFdebug        000000000047537A symba6_step_recur       10712 stryAFdebug.f
stryAFdebug        000000000047537A symba6_step_recur       10712 stryAFdebug.f
stryAFdebug        000000000047537A symba6_step_recur       10712 stryAFdebug.f
stryAFdebug        000000000047537A symba6_step_recur       10712 stryAFdebug.f
stryAFdebug        000000000047537A symba6_step_recur       10712 stryAFdebug.f
stryAFdebug        000000000047537A symba6_step_recur       10712 stryAFdebug.f
stryAFdebug        000000000047537A symba6_step_recur       10712 stryAFdebug.f
stryAFdebug        000000000047537A symba6_step_recur       10712 stryAFdebug.f
stryAFdebug        000000000047537A symba6_step_recur       10712 stryAFdebug.f
stryAFdebug        000000000047537A symba6_step_recur       10712 stryAFdebug.f
stryAFdebug        00000000004776CC symba6_step_recur       10614 stryAFdebug.f
stryAFdebug        0000000000474A4A symba6_step_inter        5085 stryAFdebug.f
stryAFdebug        00000000004A8C1B symba6_step_pl_           183 symba6_step_pl.f
stryAFdebug        000000000040B0E2 MAIN__                    661 stryAFdebug.f
stryAFdebug        000000000040476C Unknown               Unknown Unknown
libc.so.6          0000003F4762135D Unknown               Unknown Unknown
stryAFdebug        0000000000404669 Unknown               Unknown Unknown

I would appreciate any insight into what is going on.

Heinz_B_Intel · ‎09-03-2012

I assume your application exceeds the available stack size. Remove any limit on the stack of the main thread (Linux Bash shell: "ulimit -s unlimited"). You mention that the code is not in an OpenMP parallel section, but to be sure, please use the environment variable KMP_STACKSIZE (identical to OMP_STACKSIZE) to increase the stack size of the OpenMP threads to a value like 32 MB or more. There are multiple discussions in this forum about the topic: search for 'KMP_STACKSIZE'. See also the compiler manual. Don't read too much into the source line shown by the traceback: for an optimized application, it is not necessarily correct because of the transformations done by the compiler.

hallevison · ‎09-04-2012

Thanks for the quick reply. I already have set the stacksize to unlimited and have set KMP_STACKSIZE to 64m. I will try to increase KMP_STACKSIZE to see if it makes a difference.

hallevison · ‎09-20-2012

Hi: I just wanted to leave a note concerning the resolution of this issue, particularly because the compiler is not behaving properly, I believe. First, let me point out that I made a mistake in my original posting. vrel2 is a real*16 and not a real*8. vrel is still a real*8, however.

First, I played with the value of KMP_STACKSIZE a bit and it made no difference. However, the flags DID make a difference. The code runs fine if I do not set -fpe0, but gives a seg fault if this flag is set. I then modified the code:

      real*8 vrel2_8,vrel
      real*16 vrel2

cvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         if( (vrel2.lt.0.0d0) .or. (vrel2.ne.vrel2) ) then ! 2nd checks for NaN
            write(*,*) 'Here #DMM30 ',vrel2
         endif
         open(77,file='DMM30.dat')
         write(77,*) vrel2
         write(77,*) 'This is before the sqrt of real*8'
         vrel2_8 = vrel2
         vrel = sqrt(vrel2_8)
         write(77,*) 'This is after the real*8 sqrt'
         write(77,*) vrel
         write(77,*) 'This is before the real*16 sqrt'
c^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

         vrel = sqrt(vrel2)

cvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         write(77,*) vrel
         write(77,*) 'This is after the sqrt of real*16'
         close(77)
c^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The traceback still points to the line that contains vrel = sqrt(vrel2). The file DMM30.dat has:

  0.242983437048908040845063283086347
 This is before the sqrt of real*8
 This is after the real*8 sqrt
  0.492933501649977
 This is before the real*16 sqrt

So, clearly vrel2 contains a reasonable number (which is good for me) and if I move the real*16 to a real*8 before I take the sqrt the code is fine. It dies when I take a sqrt of a real*16 and try to put in a real*8.

Heinz_B_Intel · ‎09-21-2012

I tried your sample code using different compiler versions, including 12.1 and 13.0, but I don't get a fault, and the values written to DMM30.dat are always correct. I have used the options you list above (-O -CB -traceback -fpe0 -recursive -openmp) but also many other combinations. Please let me know exactly which compiler version you use (ifort -V) and the complete option list you used for the sample.

Thanks
Heinz

TimP · ‎09-21-2012

         if( (vrel2.lt.0.0d0) .or. (vrel2.ne.vrel2) ) then

I would guess that the comparison vrel2.ne.vrel2 is replaced at compile time by .false. if you left default optimizations in effect, although your later indication of mixed data types makes this less certain. One of the intrinsics ieee_is_nan or the non-standard legacy equivalent seems more likely to carry out the apparent intent. If you are trying to make this portable to f95 compilers, you will likely need conditional compilation. The ifort directive !dir$ optimize(0) may work, but certain compilers like Open64 will error out on that.
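To illustrate the ieee_is_nan suggestion, here is a minimal free-form sketch. It is not taken from the original code: the program name and the use of a double-precision vrel2 are assumptions made for illustration, and whether ieee_is_nan accepts a real*16 argument depends on the compiler's ieee_arithmetic support for that kind.

```fortran
program nan_check_sketch
  ! Hypothetical standalone example; only the variable name vrel2 and the
  ! message text are borrowed from the thread.
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, ieee_quiet_nan
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: vrel2

  ! Build a quiet NaN directly instead of computing one (e.g. 0/0),
  ! so that creating the test value is not itself an invalid operation.
  vrel2 = ieee_value(vrel2, ieee_quiet_nan)

  ! Test for NaN in its own branch: Fortran does not guarantee
  ! short-circuit evaluation of .or., so a combined condition could
  ! still evaluate the ordered comparison on a NaN.
  if (ieee_is_nan(vrel2)) then
     write(*,*) 'Here #DMM30 (NaN) ', vrel2
  else if (vrel2 < 0.0_dp) then
     write(*,*) 'Here #DMM30 (negative) ', vrel2
  end if
end program nan_check_sketch
```

Unlike vrel2.ne.vrel2, the call to ieee_is_nan states the intent explicitly rather than relying on the compiler keeping a self-comparison.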

hallevison · ‎09-21-2012

Thanks for the quick response.

hal@marvin ~] ifort -v
Version 11.1

The code was compiled with the -traceback -fpe0 -recursive -openmp options.

As for the (vrel2.ne.vrel2): I thought that this was the standard way of determining whether a number is NaN, but I could be wrong. In any case, I included this code as part of my debugging and it will not be part of the production version.

Thanks again

hallevison · ‎09-21-2012

One more thing. These lines of code are called many many times before the error occurs.

Heinz_B_Intel · ‎09-24-2012

Thanks for the update. I assume, then, that the exception occurs only for certain FP values. I only tested this for some constants like 4.0D0 etc. I will look at it again.

Heinz
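For testing with the value reported in DMM30.dat rather than a constant like 4.0D0, a standalone sketch along these lines might be a starting point. The program name and the kind parameters dp and qp are assumptions for illustration; it presumes the compiler provides a quad-precision kind, and the literal is only the decimal value as printed, so it may not reproduce the exact bit pattern that triggered the fault.

```fortran
program sqrt_quad_sketch
  ! Hypothetical reproducer; variable names follow the thread,
  ! everything else is illustrative.
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)    ! real*8 equivalent
  integer, parameter :: qp = selected_real_kind(33, 4931)   ! real*16 equivalent
  real(qp) :: vrel2
  real(dp) :: vrel2_8, vrel

  ! Value copied from the DMM30.dat output quoted in the thread.
  vrel2 = 0.242983437048908040845063283086347_qp

  ! Path reported to work: narrow to real*8 first, then take the sqrt.
  vrel2_8 = real(vrel2, dp)
  vrel = sqrt(vrel2_8)
  write(*,*) 'real*8 sqrt:  ', vrel

  ! Path reported to fail with -fpe0: quad-precision sqrt,
  ! result narrowed to real*8 on assignment.
  vrel = real(sqrt(vrel2), dp)
  write(*,*) 'real*16 sqrt: ', vrel
end program sqrt_quad_sketch
```

Building this with the same options as the failing run (-traceback -fpe0 -recursive -openmp) and with ifort 11.1 would be the closest match to the reported setup.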

Weird seg fault during sqrt