Intel® Fortran Compiler

Strange NaN when going from -O0 to -O2, please help me

xuzheng97
Beginner
578 Views

Hi,

I recently ran into a strange NaN problem in an MPI Fortran program.

TCOMM is a double precision variable that accumulates communication time, as follows:

TIME1=MPI_WTIME()
......
TIME2=MPI_WTIME()
TCOMM=TCOMM+TIME2-TIME1

There are many identical fragments in the source code.

When I change the optimization option from -O0 to -O2, TCOMM becomes NaN.

But when I add "write(*,*) 'TCOMMdebug',MYID,TCOMM,TIME1,TIME2" after each fragment to debug, TCOMM is normal again.

It is so strange.

Later I used '-O2 -fp-model precise' instead of '-O2'. Runs with 8/16/32 processes were normal, but the run with 4 processes still gave NaN.

Could anyone give suggestions on this?

Thanks

9 Replies
jimdempseyatthecove
Honored Contributor III
Was TCOMM initialized?
If it wasn't initialized, its initial value may have been a NaN.

Does anything in your project reference TCOMM besides this statement?
If TCOMM is not used anywhere else, the optimizer may remove the statement(s).
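
To make the initialization point concrete, here is a minimal sketch of how a NaN in the accumulator survives every later addition. It does not use MPI: the timer readings are made-up constants, and the NaN is planted deliberately with IEEE_VALUE as a stand-in for whatever bits an uninitialized variable might happen to hold.

```fortran
program nan_accumulator
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_is_nan
  implicit none
  double precision :: tcomm, time1, time2

  ! Hypothetical timer readings; a real program would call MPI_WTIME here.
  time1 = 1.0d0
  time2 = 1.5d0

  ! Stand-in for an uninitialized accumulator that happens to contain NaN.
  tcomm = ieee_value(1.0d0, ieee_quiet_nan)
  tcomm = tcomm + time2 - time1
  write(*,*) 'accumulator started as NaN, is NaN now: ', ieee_is_nan(tcomm)

  ! Same update with the accumulator explicitly zeroed first.
  tcomm = 0.0d0
  tcomm = tcomm + time2 - time1
  write(*,*) 'accumulator started as 0.0, is NaN now: ', ieee_is_nan(tcomm)
end program nan_accumulator
```

Once the accumulator is NaN, no amount of adding finite time differences brings it back, so a single bad value anywhere in the run is enough to spoil the final total.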
xuzheng97
Beginner

Thanks.

I tried setting TCOMM=0.0D0 at the start of the main program, but it did not help.

Besides these statements, there is one more statement that prints TCOMM at the end of the program.
So I guess the optimizer should not remove the TCOMM-related statements.

The strange thing is that with '-fp-model precise' added to -O2, runs with 8/16/32 processes no longer produce NaN, while the run with 4 processes still does.

TimP
Honored Contributor III
It goes without saying that we would like to see all of TIME1, TIME2, and MPI_Wtime() correctly declared (the latter presumably by #include 'mpif.h' in each subroutine requiring it). It might be safer to specify the order of operations:
TCOMM = (TIME2-TIME1) + TCOMM
although you would require -assume protect_parens or -fp-model source (|precise) to make it effective.
Your mpi fortran wrapper and header files must of course be the ones built for ifort throughout.
xuzheng97
Beginner
Thanks!

I tested as you suggested, but unfortunately it did not solve the problem.

After some debugging I found that only one TCOMM statement leads to NaN; the others are fine. It looks like the following:

TIME1=MPI_WTIME()
CALL MPI_SENDRECV(.....)
TIME2=MPI_WTIME()
TCOMM=TCOMM+TIME2-TIME1

I used IEEE_IS_NAN to check TIME2 and TCOMM; they are fine.

TIME1 is normal after the first MPI_WTIME call, but after the MPI_SENDRECV call it has been set to NaN.
If I delete the MPI_SENDRECV statement, TIME1 is always fine.

It seems something is wrong in MPI_SENDRECV. The MPI_SENDRECV call transfers 66 MPI_REAL8 values.

What should I do next? I have no idea.....

I also have two questions:
1. Why does the NaN not appear if I add a print or IEEE_IS_NAN statement on TIME1?
2. Why is PGI + MVAPICH always normal?

Thanks

xuzheng97
Beginner

please see #4

TimP
Honored Contributor III
I guess you aren't using Intel or PGI MPI?
Intel MPI provides additional features for checking buffer overflow, but it looks like you have identified the point of your suspected overflow. Such problems commonly have different effects when you add code or change compilers.
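
For reference, below is a sketch of a timed MPI_SENDRECV with every argument declared at its full size. The thread never shows the actual argument list, so everything here is an assumption made for illustration: the ring exchange, the names sendbuf, recvbuf, left and right, and the tag 0. Only the count of 66 MPI_REAL8 values comes from the thread. One classic source of a stray write next to this call is a status argument declared as a scalar INTEGER instead of an array of MPI_STATUS_SIZE elements; whether that applies to the original code is unknown.

```fortran
program sendrecv_timing
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 66             ! element count mentioned in the thread
  double precision :: sendbuf(n), recvbuf(n)
  double precision :: time1, time2, tcomm
  integer :: status(MPI_STATUS_SIZE)       ! an array, not a scalar INTEGER
  integer :: myid, nprocs, left, right, ierr

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Hypothetical ring neighbours; the original communication pattern is not shown.
  right = mod(myid + 1, nprocs)
  left  = mod(myid - 1 + nprocs, nprocs)

  sendbuf = dble(myid)
  tcomm   = 0.0d0

  time1 = MPI_WTIME()
  ! Both buffers hold at least n elements, matching the counts passed here.
  call MPI_SENDRECV(sendbuf, n, MPI_REAL8, right, 0, &
                    recvbuf, n, MPI_REAL8, left,  0, &
                    MPI_COMM_WORLD, status, ierr)
  time2 = MPI_WTIME()
  tcomm = tcomm + (time2 - time1)

  write(*,*) 'rank', myid, 'tcomm', tcomm
  call MPI_FINALIZE(ierr)
end program sendrecv_timing
```

Comparing the real call against a layout like this (buffer extents versus counts, and the size of status) is a quick first check before reaching for heavier tools.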
xuzheng97
Beginner
Oh.

I was using Intel MPI + Intel Compiler when I got the NaN.
Could you tell me how to check for buffer overflow?
I tested with MVAPICH + PGI Compiler, and it always seems safe, without any NaN.

I stepped through the source code and I am still not sure whether an array bound is exceeded.
I have just modified the description in #4.

What should I do to debug further?

Thanks
xuzheng97
Beginner

I added '-save -no-bss-init' and it finally works!!!

Now even -O3 works :-)

Thanks, everyone

jimdempseyatthecove
Honored Contributor III
Is TIME1 located immediately following the message being sent to the other systems?

If so, try

type myTime
real(8) :: padd(3) ! arbitrary padd
real(8) :: TIME1, TIME2, TCOMM
end type myTime
...
type(myTime) :: timeVals

timeVals%padd(1) = 0.0D0
timeVals%padd(2) = 0.0D0
timeVals%padd(3) = 0.0D0
timeVals%TCOMM = 0.0D0
...
Then use timeVals%... in your code.
Run a test and check whether timeVals%padd(1) is anything other than 0.0. If it is, you have a buffer overrun.
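
The check itself could be wrapped up along these lines. This is only a sketch: init_times and guard_intact are hypothetical helper names, the SEQUENCE statement is an added assumption to keep the components in declaration order, and the overrun is faked with a direct assignment because a real one depends on the memory layout of the actual program.

```fortran
module timing_guard
  implicit none

  type :: myTime
    sequence                         ! assumption: keep components in declaration order
    real(8) :: padd(3)               ! guard words, expected to stay 0.0
    real(8) :: TIME1, TIME2, TCOMM
  end type myTime

contains

  ! Zero the guard words and the timing fields before any timing is done.
  subroutine init_times(t)
    type(myTime), intent(out) :: t
    t%padd  = 0.0d0
    t%TIME1 = 0.0d0
    t%TIME2 = 0.0d0
    t%TCOMM = 0.0d0
  end subroutine init_times

  ! True while every guard word still holds 0.0. A NaN written into the
  ! guard also fails this test, because NaN == 0.0 is false.
  logical function guard_intact(t)
    type(myTime), intent(in) :: t
    guard_intact = all(t%padd == 0.0d0)
  end function guard_intact

end module timing_guard

program guard_demo
  use timing_guard
  implicit none
  type(myTime) :: timeVals

  call init_times(timeVals)
  write(*,*) 'guard intact after init:    ', guard_intact(timeVals)

  ! Stand-in for a stray write from an overrunning buffer; in the real
  ! program the timed MPI call would sit here instead.
  timeVals%padd(1) = 1.0d0
  write(*,*) 'guard intact after clobber: ', guard_intact(timeVals)
end program guard_demo
```

Calling guard_intact right after each timed MPI call narrows down which call does the damage.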

Jim Dempsey