Beginner

SIGSEGV bug on one platform but not on the other platform

Hi, I'm working with a relatively complex software system. The traceback in the error file points to this line as the offending code:

                    wrt_int_state%BND_VARS_V%VAR_3D(NV)%SOUTH(NA,NB,NC,NT)=BUFF_NTASK(NX)

I will list what I know below, and I would really appreciate any help clarifying this. Is this a user code problem or a compiler bug?

Compile command line:

mpiifort -g -openmp -mkl=sequential -align array32byte -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -openmp -convert big_endian -assume byterecl -mkl=sequential -O3 -g -traceback -ftz

Compiler version:

intel/14.0.2

impi/5.1.1.109

What I have tried:

The SIGSEGV goes away if I print the values involved, and the printed values are all correct. I find it strange that merely touching the memory fixes the problem.

                    print *, NVARS_BC_3D_V, NV, NT, NC, NB, &
                             wrt_int_state%LOCAL_ISTART(NTASK), wrt_int_state%LOCAL_IEND(NTASK), NA, &
                             'wrt_int_state%BND_VARS_V%VAR_3D(NV)%SOUTH', size(wrt_int_state%BND_VARS_V%VAR_3D(NV)%SOUTH)
                    wrt_int_state%BND_VARS_V%VAR_3D(NV)%SOUTH(NA,NB,NC,NT)=BUFF_NTASK(NX)

 

The exact same code, built on a different system that also uses Intel compilers, does not crash at all.

 

Employee

This article may help: "Determining Root Cause of Segmentation Faults SIGSEGV or SIGBUS errors".

Instead of rebuilding the app on the other system where it runs successfully, are you able to take the executable that fails and run it on the other system? If that same executable runs successfully there, then start looking at system differences such as shell stack limits, as discussed in Cause #2 of the above article.
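
As a rough sketch of the stack-limit check mentioned above, the commands below inspect the shell's stack limit and raise the soft limit as far as the hard limit allows. Because the build uses -openmp, the per-thread stack size may matter as well; `OMP_STACKSIZE` is the standard OpenMP environment variable for that, and the `512M` value is only an illustrative assumption, not a recommendation from the article.

```shell
# Current soft and hard stack limits for this shell (KiB, or "unlimited").
stack_soft=$(ulimit -S -s)
stack_hard=$(ulimit -H -s)
echo "soft stack limit: ${stack_soft}"
echo "hard stack limit: ${stack_hard}"

# Raise the soft limit to the hard limit; going above the hard limit
# normally requires administrator changes, so this sketch does not try.
ulimit -S -s "${stack_hard}"

# Stack size for OpenMP worker threads (illustrative value, an assumption).
export OMP_STACKSIZE=512M
echo "OMP_STACKSIZE=${OMP_STACKSIZE}"
```

Comparing these values on the two platforms is one quick way to spot a difference. For an MPI job the limits must also be in effect in the environment where the ranks actually run, which depends on the launcher and batch system.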


IVF 14.0.2 is relatively old.

Often, when an optimized program (-O3 in this case) fails but then works once you insert a trace PRINT, as you have done, this indicates that there may be an optimization problem. A newer version of the compiler may fix this issue.

As a workaround, you might try this:

Leave the diagnostic print in place, but enclose it in an IF(expression) THEN ... ENDIF

*** where the expression is lightweight, always .false., and cannot be evaluated at compile time.

Perhaps IF(NTASK == -9999) THEN ...

The above requires NTASK to be a variable determined only at run time. If this is not the case, then use a different expression that is known to be always false (at run time).

Jim Dempsey
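
One hedged way to separate an optimization problem from a user-code bug is to rebuild with the same flags at a lower optimization level and with run-time array bounds checking turned on. The sketch below only derives and prints the two alternative command lines from the flags in the original post; `-check bounds` is the Intel Fortran bounds-checking option, and the source files and output name are left out because the post does not give them.

```shell
# Flags copied verbatim from the failing build's command line.
orig_flags="-g -openmp -mkl=sequential -align array32byte -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -openmp -convert big_endian -assume byterecl -mkl=sequential -O3 -g -traceback -ftz"

# Diagnostic variant: swap -O3 for -O0 and add run-time array bounds checking.
debug_flags=$(printf '%s\n' "$orig_flags" | sed 's/-O3/-O0 -check bounds/')

# Intermediate variant: same flags, one optimization level down.
o2_flags=$(printf '%s\n' "$orig_flags" | sed 's/-O3/-O2/')

# Print the two command lines (sources and output name intentionally omitted).
echo "mpiifort ${debug_flags}"
echo "mpiifort ${o2_flags}"
```

If the bounds-checked build reports an out-of-range subscript at the assignment to SOUTH(NA,NB,NC,NT), that points toward user code. If it runs cleanly and only the -O3 build crashes, an optimization problem becomes more plausible, though an uninitialized variable or a stack issue could produce the same pattern.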

Beginner

Davis, Kevin D wrote:

This article may have something that may help, Determining Root Cause of Segmentation Faults SIGSEGV or SIGBUS errors

Instead of rebuilding the app on the other system where it runs successfully, are you able to run the same executable on the other system that you built that fails?    If that same executable runs successfully on the other system then start looking at system differences like shell stack limits as discussed in Cause #2 of the above article.

I have to rebuild the executable; the two systems are quite different.

Beginner

jimdempseyatthecove wrote:

IVF 14.0.2 is relatively old.

Often when an optimized program (-O3 in this case) fails, but then works when you insert a trace PRINT as you have done, then this indicates that there may be an optimization problem. A newer version of the compiler may fix this issue.

As a work around, you might try this:

Leave the diagnostic print in place, however, encapsulate it within an IF(expression) THEN ... ENDIF

*** where the expression is light weight, always .false., and cannot be determined at compile time.

Perhaps IF(NTASK == -9999) THEN ...

The above requires NTASK to be a variable determined only at run time. If this is not the case, then use a different expression that is known to be always false (at run time).

Jim Dempsey

 

Jim, thanks. I tried your suggestion, and it moved the crash to another line later in the code. At this point, I believe it's just a compiler issue with -O3 optimization. Is there an Intel version I should use to avoid this? On the problem platform, I have the following Intel versions installed:

intel/15.1.133                    
intel/15.3.187                    
intel/15.6.233                    
intel/16.1.150                    
intel/12-12.0.4.191
intel/13.1.3      
intel/14.0.2        
intel/15.0.0        

 

Employee

Without knowing what the underlying issue is, there's no way to say with certainty that a particular version will avoid it. You have newer versions than 14.0.2 installed, so try the most recent of those listed, 16.1.150.
