I develop a Fortran code on a Linux system, but since we did an upgrade of the system I have noticed some strange behavior. I deal with huge arrays, therefore the -mcmodel=large flag is employed, and the code is parallelized with Open MPI (compiled with the Intel compiler). For optimization the -O3 flag is used. This has always worked, and still does on a different system.
Since the upgrade it seems that the program skips some part of the code. Here is an example:
do i=0,imax
  eta=float(i)/float(imax)
  dr_rel=0.5*(eta)
  Ru(i)=0.5*dr_rel
enddo
do i=0,imax
  write(*,*)i,rank,Ru(i)
enddo
where imax is a constant. This produces:
... 244 40 NaN ...
Whereas if I check the value of "eta" for NaN with the function ISNAN():
do i=0,imax
  eta=float(i)/float(imax)
  dr_rel=0.5*(eta)
  Ru(i)=0.5*dr_rel
  if(ISNAN(eta)) then
    write(*,*)'eta'
    stop
  endif
enddo
do i=0,imax
  write(*,*)i,rank,Ru(i)
enddo
The output gives me a real number:
... 99 25 0.1884974 ...
I suspect that the optimizer simply skips the loop, yet the call to ISNAN() forces the loop to be executed.
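As an aside, ISNAN() is a vendor extension; the portable way to test for NaN since Fortran 2003 is ieee_is_nan from the intrinsic ieee_arithmetic module. A minimal sketch of the check above (imax and the loop body are assumed from the fragment in this thread):

```fortran
program check_nan
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  implicit none
  integer, parameter :: imax = 100   ! assumed value for illustration
  integer :: i
  real :: eta

  do i = 0, imax
     eta = real(i) / real(imax)
     ! Standard-conforming NaN test instead of the ISNAN() extension
     if (ieee_is_nan(eta)) then
        write(*,*) 'eta is NaN at i =', i
        stop
     end if
  end do
end program check_nan
```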
Moreover, if I omit the -mcmodel=large flag it works perfectly fine either way. Has anyone ever seen such behavior?
Any help is greatly appreciated!
In a parallel program, the DO loop may be split into segments that are executed in different threads (or nodes, under MPI). A barrier/synchronization directive is needed before you access the loop variables after the end of the loop. Otherwise, some of the values that you are printing may be stale values that have not yet been updated by one of the cooperating threads or nodes.
Sure, I understand that, but if I write it the way I put it in my first post, then only values that were set on that node are printed. Every node runs through the whole code and therefore sets the values before they are printed.
It is usually unprofitable to speculate on the behavior of a large parallel program by scrutinizing a twelve-line code fragment. On the other hand, I understand that it may not be practical to work with a large body of source code in a forum such as this.
In general, adding a print statement or a function call inhibits some optimizations. If the inhibited optimizations include an optimization that involves an optimizer bug, that bug goes into hiding.
Why confuse the issue by bringing up MPI, etc., if this is a report about a bug in a serial program? If all the nodes are afflicted with the same bug, why not look at a serial version of the program first?
OK, yes, that makes sense! I'll give that a try. The thing that bothers me is that -mcmodel=large gives me NaN, and if I omit it the code produces valid results. But nevertheless I will test it with a serial run.
I finally had time to look into this problem again. As you said, mecej4, MPI had nothing to do with the problem; it is just the -mcmodel=large flag. I also tried changing the arrays from static to dynamic allocation, which gave me the same result as before. I also noticed that the code produces values different from NaN if I add the -CB flag. I run the code on a system with a PBS queuing system; is it possible that something is wrong there? On another system with the same compiler but without a queuing system it also works with the -mcmodel=large flag. I checked "ompi_info | grep ras", with the following result:
MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.2)
MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.2)
MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.2)
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.2)
Almost a year later, and after I had almost completely forgotten about it, I have now figured it out. The problem was connected with floating-point precision. Adding the -fltconsistency flag made it work.
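For reference, a compile line combining the flags discussed in this thread might look like the following. This is only a sketch: the source file name is a placeholder, and note that the Intel compiler documentation says -mcmodel=medium/large should be paired with -shared-intel at link time.

```shell
# Assumed example: Intel Fortran behind the Open MPI wrapper; code.f90 is a placeholder.
mpif90 -O3 -mcmodel=large -shared-intel -fltconsistency -o mycode code.f90
```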