snoweel
Beginner
66 Views

Need help debugging in OpenMP

I am a Fortran programmer with 20 years of experience. I am used to MPI but new to OpenMP. I am having quite a bit of trouble debugging some code that I am trying to run (a coupled weather/land surface model), where I end up getting NaNs in one particular cell. My outputs behave in completely unexpected ways. I have located a place where I can WRITE a particular variable and get a NaN (about 15 minutes into the model run). However, if I back up a couple of lines and try to WRITE the variable (or anything else), the model runs for an hour and times out. (The same thing happens at various other places in the code where I try to print things out.) There are some OpenMP directives near there, like:

!$OMP PARALLEL DO &
!$OMP PRIVATE (ij)

Should that have anything to do with whether I can print out variables, or am I chasing a wild goose? There are also BENCH_START() and BENCH_END() directives which I guess are for some timing diagnostic program.


TimP
Black Belt
66 Views

The OpenMP directives aren't likely to have an effect if you don't enable OpenMP compilation. With OpenMP enabled, no doubt they will worsen the effect of program bugs, including, but not limited to, those which you could uncover before compilation by /Qdiag-enable:sc and at run time by /check.
jimdempseyatthecove
Black Belt
66 Views


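NaNs are generally created by an invalid operation such as 0/0 (a nonzero value divided by 0 gives Inf, which can turn into NaN later, for example in Inf - Inf) or by any operation on another NaN. That other NaN may itself be the result of an operation on a NaN, or it may be junk data that happens to have the bit pattern of a NaN. Once a NaN is generated it tends to be "sticky" until it is overwritten with a non-NaN value.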
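To make the "sticky" behavior concrete, here is a minimal sketch. The module and function names (`nan_demo`, `zero_over_zero`, `is_bad`) are made up for illustration, and it assumes floating-point exceptions are not trapped (the usual default). It uses the standard `ieee_is_nan` from `ieee_arithmetic` rather than the `ISNAN` extension.

```fortran
module nan_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  implicit none
contains
  ! Hypothetical helper: .true. when x is NaN.
  logical function is_bad(x)
    real, intent(in) :: x
    is_bad = ieee_is_nan(x)
  end function is_bad

  ! Hypothetical helper: divides its argument by itself.
  ! Passing 0.0 performs 0/0 at run time, which yields NaN
  ! when floating-point traps are disabled.
  real function zero_over_zero(zero)
    real, intent(in) :: zero
    zero_over_zero = zero / zero
  end function zero_over_zero
end module nan_demo
```

With `q = zero_over_zero(0.0)`, both `is_bad(q)` and `is_bad(q + 1.0)` should be true: any arithmetic on `q` keeps producing NaN until the variable is overwritten.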

OpenMP regions can default to shared or private. If the variable in your WRITE statement is explicitly declared SHARED, the instance used is the one from outside the parallel region; conversely, if it is explicitly declared PRIVATE, the instance used is the one inside the parallel region (one copy per thread). When the variable is not listed explicitly, the default applies.

Note that private variables (inside the parallel region) do NOT automatically get initialized with the contents of the same-named variable outside the parallel region. You must use a FIRSTPRIVATE clause, COPYIN (for THREADPRIVATE variables), or REDUCTION to change how private copies are initialized within the parallel region. Variables that are not initialized hold junk data, which has some probability of being NaN; even when the junk happens to be a valid number, it will affect the results of the computations. Garbage in, garbage out.
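A minimal sketch of the initialization point, under my own assumptions: the names `private_init_demo`, `scaled_sum` and `scale` are hypothetical and are not taken from the model in question. The copy of `scale` in each thread starts with the caller's value only because it is listed in FIRSTPRIVATE; the REDUCTION clause initializes each thread's private `total` to 0 and combines the partial sums at the end.

```fortran
module private_init_demo
  implicit none
contains
  ! Hypothetical example: sums scale_in * ij for ij = 1..n.
  function scaled_sum(n, scale_in) result(total)
    integer, intent(in) :: n
    real, intent(in)    :: scale_in
    real    :: total, scale
    integer :: ij

    scale = scale_in
    total = 0.0
    ! FIRSTPRIVATE(scale): each thread's copy starts equal to scale.
    ! With PRIVATE(scale) instead, each copy would be undefined on
    ! entry (junk, possibly NaN).
    !$OMP PARALLEL DO &
    !$OMP PRIVATE(ij) FIRSTPRIVATE(scale) REDUCTION(+:total)
    do ij = 1, n
      total = total + scale * real(ij)
    end do
    !$OMP END PARALLEL DO
  end function scaled_sum
end module private_init_demo
```

Changing FIRSTPRIVATE to PRIVATE in this sketch is a quick way to see the failure mode: the result becomes unpredictable and may differ from run to run.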

Instead of writing the variable to see whether it is NaN, try

if(ISNAN(yourVar)) CALL FoundNaN()

Then add

subroutine FoundNaN()
write(*,*) "Found NaN" ! place break point here
end subroutine FoundNaN

Place the break point accordingly. Then, when a NaN is detected, use the debugger to walk up (or return through) the call stack. See which variable was the NaN, then look at the expressions used to write to that variable and check their inputs for NaNs. If you find one or more constituents of an expression that are NaN, insert

if(ISNAN(yourConstituentVar)) CALL FoundNaN()

in the appropriate places where that variable is calculated. Keep moving the tests upstream and eventually you will find the error. If it takes you 4 such tests, expect about an hour to find the bug.
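When several such tests are in place at once, it can help to tag each one so you know which test fired. The following is only a sketch of one way to do that: `nan_trap`, `check_nan`, the `label` argument and the `nan_hits` counter are my own illustrative additions, and it uses the standard `ieee_is_nan` in place of the `ISNAN` extension.

```fortran
module nan_trap
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  implicit none
  integer :: nan_hits = 0   ! hypothetical counter of NaNs seen so far
contains
  subroutine FoundNaN(label)
    character(len=*), intent(in) :: label
    ! ATOMIC keeps the shared counter consistent when several
    ! threads detect a NaN at the same time.
    !$OMP ATOMIC
    nan_hits = nan_hits + 1
    write(*,*) "Found NaN in ", label   ! place break point here
  end subroutine FoundNaN

  ! Calls FoundNaN only when x is NaN, so nothing is printed and
  ! no break point is hit on the normal path.
  subroutine check_nan(x, label)
    real, intent(in)             :: x
    character(len=*), intent(in) :: label
    if (ieee_is_nan(x)) call FoundNaN(label)
  end subroutine check_nan
end module nan_trap
```

A call would then look like `call check_nan(yourVar, "after flux update")`, where the label text is whatever helps you locate the test in the source.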

Good luck hunting.

Jim Dempsey
snoweel
Beginner
66 Views


Thanks, I will try this.

Clay
mahmoudgalal1985
Beginner
66 Views


Thanks, really helps.