Runtime error when using OpenMP and compiling for 32 bits

Jauch · ‎05-15-2013

Hello, I had this problem in a previous version of Intel Fortran and in the Intel Fortran XE 2013 sp 3 also.
I'm using Visual Studio 2008.

The problem happens only when I use OpenMP and compile the code as 32 bits. The 64-bit version does not cause any error. Also, if I turn off OpenMP only in the module that contains the code that causes the error, even though there are no OpenMP directives there, the error also disappears.

Basically I have this code:

do1:    do j = Me%WorkSize%JLB, Me%WorkSize%JUB
do2:    do i = Me%WorkSize%ILB, Me%WorkSize%IUB

                if (Me%VegetationTypes(Me%VegetationID(i,j))%GrowthDatabase%PlantType == NotAPlant) cycle !<= access violation

...

The error disappears if I change the code to this:

do1:    do j = Me%WorkSize%JLB, Me%WorkSize%JUB
do2:    do i = Me%WorkSize%ILB, Me%WorkSize%IUB

                vegID = Me%VegetationID(i,j)
                veg   = Me%VegetationTypes(vegID)

                if (veg%GrowthDatabase%PlantType == NotAPlant) cycle

...

The funny thing is that Me%VegetationTypes and Me%VegetationID are not used inside any OpenMP block, and there isn't any OpenMP block in the module where this code exists.

Here is the declaration of the variables:

type       T_Vegetation
    integer, dimension(:,:), pointer                        :: VegetationID
    type(T_VegetationType), dimension(:), pointer :: VegetationTypes
end type T_Vegetation

Any ideas on why the access violation happens when compiling for 32 bits with OpenMP?

Thanks for any thoughts that you can give on this! :)

TimP · ‎05-15-2013

It should be useful to file a problem report with full reproducer on premier.intel.com, or at least attach a reproducer here.

With 64-bit mode as well, we have issues with automatic arrays when compiled with -openmp. I can't see from what you posted whether this shares that characteristic. Also there are reported issues of missed optimization associated with -openmp compilation of module arrays.

If you have source files which don't include any OpenMP directives, you might try declaring RECURSIVE procedures and remove -openmp option for building those files. That may be equivalent to what you already tried.

Jauch · ‎05-15-2013

Hello TimP,

First of all, thanks for the fast answer.

In fact, when I build the application for 32 bits with OpenMP, but turn off OpenMP for the module that contains the problematic code, there are no more run-time errors. The same result is achieved with the small change in the code, which is what I used as a workaround.

The code that uses the array that causes the problem is used in exactly the same way in other parts of the code, in at least a dozen locations, without any error. The code that involves the two arrays is also pretty simple: they are created, allocated and initialized, and they don't change anymore until the end of the run, when they are deallocated. So I can't explain the error.

This wasn't the first time that this has happened. Some time ago I got a similar error, with another matrix, when compiling with OpenMP a module that didn't have any OpenMP directives. In fact, it was the same module.

At that time, I made a test to determine if it was a compiler problem. I put a "write (*,*) 'test'" instruction before the line that was giving the problem and the problem disappeared. The solution, as in this case, was to change the code slightly.

Now that this has happened a second time, I chose to come here and ask ;)

I'm not familiar with the term "automatic array", but I assume that it is something like a temporary array created by the compiler to do a copy, to pass as an argument, or something like that?
If it is, I don't think that is the case here, because both arrays are "global" inside the module, so they aren't passed as arguments to subroutines, nor are they changed during the run...

The software I'm working on is a hydrologic model, with hundreds of thousands of lines of code, using external libraries (HDF, NETCDF) and modules in other languages (usually C and C++). I'll take a look at the site you pointed out to see how to file a report with a full reproducer...

Thanks again.

Eduardo.

TimP · ‎05-15-2013

An automatic array is one which exists inside a procedure, with its size determined when the procedure is entered. It looks like an assumed size array but the allocation is local to the procedure, as for a local array. I mention this only because it has caused me trouble for months, not because I can tell whether you are using it. If the array is a normal module array, this would not apply.

It's sometimes advised to use allocatable instead of automatic, if only because it promotes improved error checking.
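To make the terminology concrete, here is a minimal sketch of the two kinds of local array discussed above. It is not code from the model; the names `with_automatic`, `with_allocatable` and `work` are made up for the illustration.

[fortran]
program automatic_vs_allocatable
    implicit none

    call with_automatic(5)
    call with_allocatable(5)

contains

    subroutine with_automatic(n)
        integer, intent(in) :: n
        real :: work(n)               ! automatic array: local, sized from n on entry
        work = 1.0
        print *, 'automatic sum   =', sum(work)
    end subroutine with_automatic

    subroutine with_allocatable(n)
        integer, intent(in) :: n
        real, allocatable :: work(:)  ! allocatable array: sized explicitly at run time
        integer :: ierr
        allocate(work(n), stat=ierr)  ! a failed allocation is reported through ierr
        if (ierr /= 0) stop 'allocation failed'
        work = 1.0
        print *, 'allocatable sum =', sum(work)
        ! work is deallocated automatically when the subroutine returns
    end subroutine with_allocatable

end program automatic_vs_allocatable
[/fortran]

The allocatable version lets the program test `stat=` and react, which is the improved error checking mentioned above; the automatic version gives the program no such hook if the space cannot be obtained.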

The other difference between 32-bit mode and 64-bit mode is the limited extra space available for dynamic arrays in 32-bit mode (none, except with /3GB boot.ini switch, in some versions of Windows).

jimdempseyatthecove · ‎05-15-2013

Can you show a little more code? Include the OpenMP directives such that we can see what is private and public.

Jim Dempsey

Jauch · ‎05-15-2013

Thanks for the explanation, TimP.
I think that the problem is different, since the arrays are not "automatic", but module arrays.

Hello, Jim.
This is one of my concerns. In the module where the error happens there are no OpenMP directives. Also, the two arrays are not used or referenced outside this module. But the error happens ONLY when the conditional OpenMP compilation is activated in the module properties and ONLY when the module is being compiled for 32 bits. If I turn off the conditional OpenMP compilation or change the target to 64 bits, then the error disappears.

jimdempseyatthecove · ‎05-16-2013

Are you perhaps missing an appropriately placed SAVE?

Compiling with -openmp also effectively makes subroutines and functions attributed with RECURSIVE.
Non-OpenMP and non-RECURSIVE (tend to) implicitly make local arrays SAVE.
OpenMP (compiled) or RECURSIVE implicitly makes local arrays on stack.

The module arrays (and scalars) will not be affected.
The module contained procedures with local arrays will be affected.
Code outside a module with local arrays will be affected.

>> In the module where the error happens there are no OpenMP directives

Leaving open the possibility that the module subroutine/function is concurrently called by multiple threads (outside the module).

I will have to admit that the use of local temps for vegID and veg is affecting the operation. These are copy operations.
What happens if you make vegID and veg pointers then use => in place of =?
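As a minimal sketch of the difference between `=` and `=>` suggested here (the type `T_Item` and the variable names are hypothetical stand-ins, not the model's own declarations):

[fortran]
program pointer_instead_of_copy
    implicit none

    type T_Item
        integer :: ID        = 0
        real    :: PlantType = 0.0
    end type T_Item

    type(T_Item), dimension(:), pointer :: items => null()
    type(T_Item), pointer               :: item  => null()
    type(T_Item)                        :: item_copy

    allocate(items(3))
    items(2)%PlantType = 4.0

    item_copy =  items(2)    ! '='  copies the whole element into a local temp
    item      => items(2)    ! '=>' only associates the pointer, nothing is copied

    items(2)%PlantType = 7.0

    print *, 'copy    :', item_copy%PlantType   ! keeps the value from before the update
    print *, 'pointer :', item%PlantType        ! refers to the updated element

    nullify(item)            ! do not leave a dangling pointer behind
    deallocate(items)
end program pointer_instead_of_copy
[/fortran]

With a large derived type, the pointer form also avoids copying every component on each loop iteration.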

Also, from the code you've disclosed, it appears that you are executing different code depending on VegetationTypes. In the multi-threaded code, is VegetationTypes changing during the processing of these loops? IOW, one thread may be dispatching on a type code for a type object that has yet to be fully constructed? Meaning the code accidentally works on x64 and accidentally fails on x32.

Jim Dempsey

TimP · ‎05-16-2013

Jim raises interesting points. Although at one time I think it was said that ifort module data always had SAVE status, there seems to have been a switch, possibly with the aim of exempting them from the 2GB static data+program limit. In the case where the module is visible in main program and treated correctly as shared data, it should work the same.

Steven_L_Intel1 · ‎05-16-2013

Module data is always static. It also implicitly has the SAVE attribute, which the current standard requires.

jimdempseyatthecove · ‎05-16-2013

Steve,

What I said was:

"The module arrays (and scalars) will not be affected."

Meaning module data are unaffected (always static), in line with what you posted.

However, I said further:

"The module contained procedures with local arrays will be affected"

Meaning, when a module has a contained subroutine or function, and if this subroutine or function has a locally scoped array, without SAVE (or AUTOMATIC), then the SAVE-ness will vary depending upon -openmp or the RECURSIVE attribute (or an equivalent command line option enforcing recursive). A new programmer might not make the distinction between "module data" declared between module and contains (or end module), and (mis-assumed) "module data" declared following contains, inside a contained subroutine or function. I am stating this here to make this clearer to the other readers.
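The distinction can be sketched as follows. This is an illustration only: the module `counter_mod`, the routine `bump` and the variable names are invented, and the comments describe the behaviour as explained in this thread rather than anything measured on a particular compiler version.

[fortran]
module counter_mod
    implicit none
    private
    public :: bump

    integer :: module_calls = 0            ! module data: declared before CONTAINS, always static

contains

    subroutine bump()
        integer, save :: saved_calls = 0   ! explicit SAVE: kept between calls with any options
        integer       :: scratch(4)        ! local array without SAVE: static or on the stack,
                                           ! depending on -openmp / RECURSIVE
        scratch = 0                        ! so never rely on values left from a previous call

        module_calls = module_calls + 1
        saved_calls  = saved_calls + 1
        scratch(1)   = saved_calls
        print *, module_calls, saved_calls, scratch(1)
    end subroutine bump

end module counter_mod

program demo
    use counter_mod
    implicit none
    call bump()
    call bump()
end program demo
[/fortran]

Code that states SAVE explicitly where persistence is needed, and initializes everything else on entry, behaves the same whether or not the file is built with -openmp.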

Jim Dempsey

Steven_L_Intel1 · ‎05-16-2013

Yes, that's right. Thanks for the elaboration.

Jauch · ‎05-16-2013

Hello Jim :)

First, I think it will be beneficial to put here all the code that is involved in the use of the two "problematic" arrays. It will probably answer some of the questions that you raised better than I can. I'll separate it by module so it is easier to see. Because of the size of the software, I'll put only the code that is relevant (in my opinion), that is, code that has an impact on the arrays.

This way I think it will be easier for you guys to help me determine whether there is some problem with the code.

[fortran]
!====ModuleGlobalData====!
Module ModuleGlobalData

implicit none
public

real   , parameter :: FillValueReal        = -9.9e15
integer, parameter :: FillValueInt         = -9999999

end module ModuleGlobalData

!====program====!
program MohidLand

use ModuleBasin

implicit none

call ConstructMohidLand
call ModifyMohidLand

contains

!--------------------------------------------------------------------------
subroutine ConstructMohidLand ()
call ConstructBasin
end subroutine ConstructMohidLand
!--------------------------------------------------------------------------
subroutine ModifyMohidLand ()
    logical :: one_more = .true.

    do while (one_more)
        one_more = DoOneTimeStep()
    enddo
end subroutine ModifyMohidLand
!--------------------------------------------------------------------------
logical function DoOneTimeStep ()
    call ModifyBasin
    !will return true or false depending on code that will tell if it is to finish or not the run
end function
!--------------------------------------------------------------------------

end program MohidLand

!====ModuleBasin====!
Module ModuleBasin

use ModuleVegetation

implicit none
private

contains

!--------------------------------------------------------------------------
subroutine ConstructBasin ()
call ConstructCoupledModules
end subroutine ConstructBasin
!--------------------------------------------------------------------------
subroutine ConstructCoupledModules ()
call ConstructVegetation
end subroutine ConstructCoupledModules
!--------------------------------------------------------------------------
subroutine ModifyBasin ()
call ModifyVegetation
end subroutine ModifyBasin
!--------------------------------------------------------------------------

end module ModuleBasin

!====ModuleVegetation====!
Module ModuleVegetation

implicit none
private

type     T_GrowthDatabase
    real                                        :: PlantType
    !Many other data here
end type T_GrowthDatabase

type T_VegetationType
    integer                                     :: ID
    character(StringLength)                     :: Name
    type (T_GrowthDatabase)                     :: GrowthDatabase
    !Many other data here
    type(T_VegetationType), pointer             :: Next, Prev
end type T_VegetationType

private :: T_Vegetation
type       T_Vegetation
type(T_VegetationType), pointer               :: FirstVegetation
type(T_VegetationType), pointer               :: LastVegetation
type(T_VegetationType), dimension(:), pointer :: VegetationTypes     => null()
integer, dimension(:,:), pointer              :: VegetationID        => null()
integer                                       :: VegetationsNumber
end type   T_Vegetation

type (T_Vegetation), pointer                    :: FirstObjVegetation => null()
type (T_Vegetation), pointer                    :: Me                  => null()

contains

!--------------------------------------------------------------------------
subroutine ConstructVegetation()
    call AllocateVariables
    call ConstructVegetationList
    call ConstructVegetationParameters
    call ConstructVegetationGrids
end subroutine ConstructVegetation
!--------------------------------------------------------------------------
subroutine AllocateVariables
    !ILB, IUB, JLB and JUB are the dimensions of the grid (a 2D matrix)
    allocate(Me%VegetationID (ILB:IUB,JLB:JUB))
    Me%VegetationID (:,:) = FillValueInt
end subroutine AllocateVariables
!--------------------------------------------------------------------------
!--------------------------------------------------------------------------
subroutine ConstructVegetationList ()

    !Local----------------------------------------------------------------
    integer                                 :: AgricPractID

    !---------------------------------------------------------------------
    do
        !There is code here to read the values of "AgricPractID" from file

        !each AgricPractIDScalar read from file is passed to CheckVegetationList
        call CheckVegetationList(AgricPractID)

        !When it finishes reading AgricPractIDScalar, it exits the loop
    enddo

end subroutine ConstructVegetationList
!--------------------------------------------------------------------------
subroutine CheckVegetationList (AgricPractID)

    !Arguments-----------------------------------------------------------
    integer                                 :: AgricPractID
    !Local----------------------------------------------------------------
    type (T_VegetationType), pointer        :: VegetationX, VegetationInList
    logical                                 :: FoundVegetation

    !---------------------------------------------------------------------
    FoundVegetation = .false.

    VegetationInList => Me%FirstVegetation
doV:    do while (associated(VegetationInList))
        if (VegetationInList%ID == AgricPractID) then
            FoundVegetation = .true.
            exit doV
        endif
        VegetationInList => VegetationInList%Next
    enddo doV

    if (.not. FoundVegetation) then
        allocate (VegetationX)
        nullify (VegetationX%Prev,VegetationX%Next)
        VegetationX%ID = AgricPractID
        call AddVegetation(VegetationX)
    endif

end subroutine CheckVegetationList
!--------------------------------------------------------------------------
subroutine AddVegetation(NewVegetation)

    !Arguments-------------------------------------------------------------
    type(T_VegetationType), pointer :: NewVegetation

    !----------------------------------------------------------------------
    if (.not.associated(Me%FirstVegetation)) then
        Me%VegetationsNumber    = 1
        Me%FirstVegetation      => NewVegetation
        Me%LastVegetation       => NewVegetation
    else
        NewVegetation%Prev      => Me%LastVegetation
        Me%LastVegetation%Next => NewVegetation
        Me%LastVegetation       => NewVegetation
        Me%VegetationsNumber    = Me%VegetationsNumber + 1
    end if

end subroutine AddVegetation
!--------------------------------------------------------------------------
subroutine ConstructVegetationParameters ()

    !Local-----------------------------------------------------------------
    type (T_VegetationType), pointer            :: VegetationType
    integer                                     :: ivt, VegetationTypeID

    !----------------------------------------------------------------------
    allocate(Me%VegetationTypes(Me%VegetationsNumber))
    ivt = 0

    VegetationType => Me%FirstVegetation
    do while (associated(VegetationType))
        !Here goes code to find (in a file) info for each "vegetation type"
        !pointed by VegetationType%ID

        ivt = ivt + 1
        Me%VegetationTypes(ivt)%ID   = VegetationType%ID

        !Save other information in Me%VegetationTypes(ivt), like Name
        !and info for the GrowthDatabase.

        !Advance to the next item in the list
        VegetationType => VegetationType%Next
    enddo

end subroutine ConstructVegetationParameters
!--------------------------------------------------------------------------
subroutine ConstructVegetationGrids ()

    !Local-----------------------------------------------------------------
    integer                                 :: i, j, ivt

    !----------------------------------------------------------------------
    !ILB, IUB, JLB and JUB are the dimensions of the grid (a 2D matrix)
    do j = JLB, JUB
    do i = ILB, IUB

        !Code to fill the array AgricPractID goes here

        do ivt = 1, Me%VegetationsNumber
            if (Me%VegetationTypes(ivt)%ID == AgricPractID(i, j)) then
                Me%VegetationID(i, j) = ivt
            endif
        enddo

    enddo
    enddo

end subroutine ConstructVegetationGrids
!--------------------------------------------------------------------------
subroutine ModifyVegetation()
    !Code that uses Me%VegetationID and Me%VegetationTypes,
    !but never change them.
end subroutine ModifyVegetation
!--------------------------------------------------------------------------

end module ModuleVegetation
[/fortran]

For those who prefer, I also put this code on pastebin (http://pastebin.com/457TNRSf).
Now, I think that I must explain the idea of this code and some other information that is missing.

First, the model constructs everything it needs (Construct...), then it runs (Modify...), and then it kills everything at the end (not shown).
As you can see in the code that I show above, there isn't any OpenMP directive anywhere that interacts with the arrays that caused the problem. ModuleVegetation doesn't have ANY OpenMP directive anyway.
The OpenMP directives that exist in other modules are very simple and only parallelize simple loops. If a loop requires a call to a subroutine, then we do not use OpenMP on it.

The "Me" variable that you can see in ModuleVegetation is a module variable (before contains).
All other data used in the module is inside "Me" (or is really local to the subroutines).
Every module has a "Me" and a pointer to the first element in a list. There are routines to create a new instance of the structure pointed by "Me".
Every routine of a module, when called, receives an "id" and the correct "Me" is "loaded". The exception is the "Construct", that returns an ID when a new instance of "Me" is created.

The basic working is like this:

MohidLand calls ConstructBasin (ModuleBasin) -> ConstructBasin (ModuleBasin) calls ConstructVegetation (ModuleVegetation)
For each step, MohidLand calls ModifyBasin (ModuleBasin) -> ModifyBasin (ModuleBasin) calls ModifyVegetation (ModuleVegetation)

When constructing the Vegetation, the order is like this:

1. First the routine AllocateVariables is called, where "VegetationID" is allocated and initialized.

2. Then the routine ConstructVegetationList is called. In this routine, a user file is read. This user file is a 2D array.
Each element of this array contains a value (that is, an ID), which can be unique or not (it can appear more than once in the array).
For each value in the array the routine CheckVegetationList is called (and the ID is passed).

3. In CheckVegetationList, the routine runs through a list to see if the value passed (the ID) already exists in the list.
If the value is not found, then a new instance of the type T_VegetationType is allocated, the ID is saved in it, and the data is added to the list through the AddVegetation routine (the list is a linked list).
When CheckVegetationList finishes, we will have a list where each ID appears only once, and now we know the number of different IDs.

4. ConstructVegetationParameters is called next. Here, "VegetationTypes" is allocated, with a number of elements equal to the number of different IDs.
For each item in the list, the ID stored there is saved in a sequential position of the "VegetationTypes" array, and other information for that ID is also saved in this item.

5. At the end, ConstructVegetationGrids is called, where the VegetationID 2D array is filled with the position of the correct data (item in the VegetationTypes array) for the ID in the user data (position i, j).

In fact, VegetationID is just a "lookup table" to find the information about an ID.
After this "construction", the data is not changed anymore.
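The construction steps above can be sketched in a compact, self-contained form. This is an assumption-laden illustration, not the model's code: it uses a small hard-coded `user_ids` grid instead of a file, a plain array `unique_ids` instead of the linked list, and a simplified `T_VegType`. The final check that every cell holds a valid index is an addition of the sketch and is not present in the posted code.

[fortran]
program lookup_table_sketch
    implicit none

    type T_VegType
        integer :: ID = 0
    end type T_VegType

    integer, parameter :: FillValueInt = -9999999
    integer :: user_ids(2,3)               ! stands in for the 2D user file
    integer :: VegetationID(2,3)           ! lookup table: cell -> index in VegetationTypes
    type(T_VegType), allocatable :: VegetationTypes(:)
    integer :: unique_ids(6), n_unique, i, j, ivt

    user_ids = reshape([12, 7, 7, 30, 12, 7], [2, 3])

    ! Steps 2-3: collect each ID only once
    n_unique = 0
    do j = 1, 3
        do i = 1, 2
            if (.not. any(unique_ids(1:n_unique) == user_ids(i,j))) then
                n_unique = n_unique + 1
                unique_ids(n_unique) = user_ids(i,j)
            end if
        end do
    end do

    ! Step 4: one VegetationTypes entry per distinct ID
    allocate(VegetationTypes(n_unique))
    do ivt = 1, n_unique
        VegetationTypes(ivt)%ID = unique_ids(ivt)
    end do

    ! Step 5: fill the lookup table with positions in VegetationTypes
    VegetationID = FillValueInt
    do j = 1, 3
        do i = 1, 2
            do ivt = 1, n_unique
                if (VegetationTypes(ivt)%ID == user_ids(i,j)) VegetationID(i,j) = ivt
            end do
        end do
    end do

    ! Extra check (not in the original): no cell may keep the fill value
    if (any(VegetationID < 1 .or. VegetationID > n_unique)) stop 'unmapped cell'

    ! The nested access from the failing line, on the sketch's data
    print *, user_ids(2,3), VegetationTypes(VegetationID(2,3))%ID
end program lookup_table_sketch
[/fortran]

A cell that still held FillValueInt would turn the nested access `VegetationTypes(VegetationID(i,j))` into an out-of-bounds read, so a check of this kind (or building with array bounds checking enabled) is a cheap way to rule that possibility out.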

The error (access violation) happens when I try to use VegetationTypes in a specific routine.
We use this exact code in the exact same way in many other routines that are called before and after, without any problems.
I have already debugged the code and it works perfectly. I have used it for 5 years now.

Despite this being far from "efficient" (it was coded by people that did their best, but that have only basic programming skills), I can't see anything "wrong" enough that could cause an access violation error in this code (inside the loop):

if (Me%VegetationTypes(Me%VegetationID(i,j))%GrowthDatabase%PlantType == NotAPlant) cycle

If I put a write (*,*) Me%VegetationTypes(Me%VegetationID(i,j))%GrowthDatabase%PlantType before the instruction above, no error happens and I see all the "PlantType" data (and it is all correct).
In fact, as I said before, anything put before the instruction, even a write (*,*) "test", 'solves' the problem...

This must be something related to the optimizations, but I don't understand why it happens only when OpenMP is on for this module, even though there isn't any OpenMP directive in it.

Recursion is inactive (at least in the project properties; I'm using Visual Studio 2008).
Auto Parallelization is ON, but turning it off does not solve the problem when compiling for 32 bits with OpenMP active.

The two arrays are inside an instance of a T_Vegetation, pointed by Me, allocated in the ConstructVegetation...
I don't know if the SAVE is implicit or not applicable here...

Can I provide any more information?

jimdempseyatthecove · ‎05-17-2013

I see:

type (T_Vegetation), pointer :: Me => null()

and

allocate(Me%VegetationID (ILB:IUB,JLB:JUB))

But I do not see:

allocate(Me)

or

Me => (some T_Vegetation)

Is there some missing code that points Me at a valid object or that allocates Me?

IOW

a) AllocateVariables is missing allocate(Me)
b) Me should be declared as object not pointer to object
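A minimal sketch of option (a), using a hypothetical `T_Holder` type in place of T_Vegetation: the parent pointer has to refer to a real object before any of its components can be allocated.

[fortran]
program allocate_parent_first
    implicit none

    type T_Holder
        integer, dimension(:,:), pointer :: Grid => null()
    end type T_Holder

    type(T_Holder), pointer :: Me => null()

    ! Without this, Me points at nothing and Me%Grid has no storage behind it
    if (.not. associated(Me)) allocate(Me)

    allocate(Me%Grid(2,3))
    Me%Grid = 0
    print *, associated(Me), shape(Me%Grid)

    deallocate(Me%Grid)
    deallocate(Me)
end program allocate_parent_first
[/fortran]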

Jim Dempsey

Jauch · ‎05-17-2013

Hello Jim :)

In fact, there is missing code that points Me to a valid object ;)
I put the code that shows the use of Me in ModuleVegetation

[fortran]!====ModuleVegetation========================================================

Module ModuleVegetation

implicit none
private

type     T_GrowthDatabase
    real                                        :: PlantType
    !Many other data here
end type T_GrowthDatabase

type T_VegetationType
    integer                                     :: ID
    character(StringLength)                     :: Name
    type (T_GrowthDatabase)                     :: GrowthDatabase
    !Many other data here
    type(T_VegetationType), pointer             :: Next, Prev
end type T_VegetationType

private :: T_Vegetation
type       T_Vegetation
type(T_VegetationType), pointer               :: FirstVegetation
type(T_VegetationType), pointer               :: LastVegetation
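type(T_VegetationType), dimension(:), pointer :: VegetationTypes     => null()
integer, dimension(:,:), pointer              :: VegetationID        => null()
integer                                       :: InstanceID
type(T_Vegetation), pointer                   :: Next                => null()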
integer                                       :: VegetationsNumber
end type   T_Vegetation

type (T_Vegetation), pointer                    :: FirstObjVegetation => null()
type (T_Vegetation), pointer                    :: Me                  => null()

contains
!----------------------------------------------------------------------------
subroutine ConstructVegetation(ObjVegetationID, STAT)

    !Arguments---------------------------------------------------------------
    integer                                         :: ObjVegetationID
    integer, optional, intent(OUT)                  :: STAT

    !Local-------------------------------------------------------------------
    integer                                         :: ready_
    integer                                         :: STAT_, STAT_CALL

    !------------------------------------------------------------------------
    STAT_ = UNKNOWN_

    call Ready(ObjVegetationID, ready_)

cd0:    if (ready_ .EQ. OFF_ERR_) then

        call AllocateInstance

        call AllocateVariables
        call ConstructVegetationList
        call ConstructVegetationParameters
        call ConstructVegetationGrids

        ObjVegetationID = Me%InstanceID

        STAT_ = SUCCESS_

    else cd0

        stop 'ModuleVegetation - ConstructVegetation - ERR01'

    end if cd0

    if (present(STAT)) STAT = STAT_

end subroutine ConstructVegetation
!--------------------------------------------------------------------------

!--------------------------------------------------------------------------
subroutine AllocateInstance

    !Local-----------------------------------------------------------------
    type (T_Vegetation), pointer                    :: NewObjVegetation
    type (T_Vegetation), pointer                    :: PreviousObjVegetation

    !----------------------------------------------------------------------
    !Allocates new instance
    allocate (NewObjVegetation)
    nullify (NewObjVegetation%Next)

    !Insert New Instance into list and makes Current point to it
    if (.not. associated(FirstObjVegetation)) then

        FirstObjVegetation         => NewObjVegetation
        Me                         => NewObjVegetation

    else

        PreviousObjVegetation      => FirstObjVegetation
        Me                         => FirstObjVegetation%Next

        do while (associated(Me))
            PreviousObjVegetation => Me
            Me                     => Me%Next
        enddo

        Me                         => NewObjVegetation
        PreviousObjVegetation%Next => NewObjVegetation

    endif

    Me%InstanceID = RegisterNewInstance (mVegetation_)
    !----------------------------------------------------------------------

end subroutine AllocateInstance
!--------------------------------------------------------------------------

!--------------------------------------------------------------------------
subroutine Ready (ObjVegetation_ID, ready_)

    !Arguments-------------------------------------------------------------
    integer                                     :: ObjVegetation_ID
    integer                                     :: ready_

    !----------------------------------------------------------------------
    nullify (Me)

    if (ObjVegetation_ID > 0) then
        call LocateObjVegetation (ObjVegetation_ID)
        ready_ = IDLE_ERR_
    else
        ready_ = OFF_ERR_
    end if
    !----------------------------------------------------------------------

end subroutine Ready
!--------------------------------------------------------------------------

!--------------------------------------------------------------------------
subroutine LocateObjVegetation (ObjVegetationID)

    !Arguments-------------------------------------------------------------
    integer                                     :: ObjVegetationID

    !----------------------------------------------------------------------
    Me => FirstObjVegetation
    do while (associated (Me))
        if (Me%InstanceID == ObjVegetationID) exit
        Me => Me%Next
    enddo

    if (.not. associated(Me)) &
        stop 'ModuleVegetation - LocateObjVegetation - ERR01'
    !----------------------------------------------------------------------

end subroutine LocateObjVegetation
!--------------------------------------------------------------------------

!--------------------------------------------------------------------------
subroutine ModifyVegetation(ObjVegetationID, STAT)

    !Arguments-------------------------------------------------------------
    integer, intent(IN)                            :: ObjVegetationID
    integer, intent(OUT),      optional            :: STAT

    !Local-----------------------------------------------------------------
    integer                                        :: STAT_, ready_
    integer                                        :: STAT_CALL

    !----------------------------------------------------------------------
    STAT_ = UNKNOWN_

    call Ready(ObjVegetationID, ready_)

    if (ready_ .EQ. IDLE_ERR_) then

        !code to run goes here
        STAT_ = SUCCESS_

    else

        STAT_ = ready_

    end if

    if (present(STAT)) STAT = STAT_
    !----------------------------------------------------------------------

end subroutine ModifyVegetation

end module ModuleVegetation

!====END of ModuleVegetation=============================================== [/fortran]

When constructing a new instance, if the ObjectID received is different from ZERO, an error will happen.

When using a module function that needs to relate to a specific instance, the ID previously created when constructing the object must be passed to the routine, as exemplified in the ModifyVegetation Routine.

In the case of T_Vegetation, currently only one instance is created and its ID, returned by the ConstructVegetation subroutine, is stored in the T_Basin object, in module Basin (not shown). But there are many objects that can and have more than one instance at the same time.

AllocateInstance allocates memory for a new T_Vegetation object and then inserts it at the end of a linked list. This list will be used later in LocateObjVegetation when looking for a specific instance of the object.

The "Me" is used only when a routine, like ModifyVegetation, is called. The ID passed to it will be used to find the correct instance and Me will point to it, through the Ready subroutine, that will look in the list of instances for the correct one.

I think that the only routine used above for which I did not put the code is "RegisterNewInstance (mVegetation_)", inside AllocateInstance.
Basically, it receives an object ID (in this case, mVegetation_, which is related to T_Vegetation) and increments the number of instances for the object, returning the new instance ID, which will be bound to the instance through the ID variable of the object (T_Vegetation ID).

Me is a module variable, but in fact it is only used to temporarily point to an instance of an object (of type T_Vegetation).

Eduardo.

jimdempseyatthecove · ‎05-17-2013

Eduardo,

The pointers FirstVegetation and LastVegetation are not constructed expressly with => null().
Elsewhere in the code you have "if (.not.associated(Me%FirstVegetation)) then"
Which requires FirstVegetation be either NULL or valid pointer.

What assurances do you have that FirstVegetation and LastVegetation are properly constructed with => null() or valid pointer?
You cannot assume these pointers will be constructed with NULL by default (not even when you observe it constructed with NULL during debug session).
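One way to get that assurance is default initialization in the type definition itself, so that every object of the type starts with disassociated pointers however it is created. A minimal sketch, with simplified hypothetical types `T_Node` and `T_List` standing in for T_VegetationType and T_Vegetation:

[fortran]
program default_null_components
    implicit none

    type T_Node
        integer               :: ID   = 0
        type(T_Node), pointer :: Next => null()
        type(T_Node), pointer :: Prev => null()
    end type T_Node

    type T_List
        type(T_Node), pointer :: First => null()
        type(T_Node), pointer :: Last  => null()
        integer               :: Count = 0
    end type T_List

    type(T_List), pointer :: list
    type(T_Node), pointer :: node

    allocate(list)    ! default initialization also applies to allocated objects

    ! Well defined here: First is disassociated, not undefined
    if (.not. associated(list%First)) then
        allocate(node)          ! node%Next and node%Prev start disassociated too
        node%ID    = 1
        list%First => node
        list%Last  => node
        list%Count = 1
    end if

    print *, list%Count, associated(list%First%Next)

    deallocate(list%First)
    deallocate(list)
end program default_null_components
[/fortran]

Without the `=> null()` in the type definition, the association status of `First` after `allocate(list)` is undefined, and testing it with `associated` is not valid even if it happens to work in a debug build.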

Jim Dempsey

Jauch · ‎05-17-2013

You are right, Jim.

This is another piece of code that I didn't show...
It's in the ConstructVegetation subroutine:

[fortran]...
!Assures nullification of the global variable
if (.not. ModuleIsRegistered(mVegetation_)) then
    nullify (FirstObjVegetation)
    call RegisterModule (mVegetation_)
endif

call Ready(ObjVegetationID, ready_)

if (ready_ .EQ. OFF_ERR_) then
...[/fortran]

I'm not putting all the code because it would be a mess... There are more than 5000 lines of code in ModuleVegetation alone (not counting blank lines) and it uses at least 50 different subroutines and functions from other modules...

Regarding LastVegetation, it is used only inside the subroutine AddVegetation. The first time it is called, FirstVegetation is null, because of the code in ConstructVegetation that I show above, and LastVegetation is assigned to the new instance. After this, it is used only to add a new instance to the end of the list. Because T_Vegetation always has only one instance, FirstVegetation and LastVegetation always point to the same object instance.

The use of uninitialized variables can happen in the code, of course. Many people work on it, and not everyone pays the attention they should. I did this myself the other day (in exactly this module): a local variable that was used before being initialized. But fortunately I found the bug very fast. We try very hard to avoid this kind of error.

My nightmare usually is with logical errors, that do not cause exceptions, but give wrong results, and OpenMP. But there is no OpenMP on this module or involved in the use of this module. At least I couldn't find anything.

The two arrays involved in the error are being correctly initialized. I went through the code from the initialization of the module up to the first use of the arrays, where the error happens.
It could be something that I'm not seeing, but I don't think so...

Still looking, anyway :)

Eduardo.

jimdempseyatthecove · ‎05-20-2013

Try the following experiment:

Produce the configuration where the code fails as described in the first post.
Then: Debug | Breakpoints | Delete all breakpoints
(do not delete them one at a time; click on the icon that deletes all. If the icon is grayed out, add an arbitrary breakpoint, then "delete all")
Exit VS saving configuration, restart VS, run

Does the error go away?

If not, investigation 2:

Put a breakpoint on the line of the error, run to the break, press continue. Does it fail (first occurrence) or continue (later occurrence)?

If first occurrence, change the statement to:

if (Me%VegetationTypes( ( Me%VegetationID(i,j) ) )%GrowthDatabase%PlantType == NotAPlant) cycle

Essentially adding a set of parens where they shouldn't need to be, but where they are benign. There was a post several (many) months ago similar to this where the % evaluation got mucked up (cannot recall just when or with what version of IVF), perhaps Steve Lionel may recall.

Jim Dempsey

John_Campbell · ‎05-20-2013

Eduardo,

I was looking at your module Vegetation and was puzzled by your definition of type T_VegetationType.
Can an element of this type be "type(T_VegetationType), pointer :: Next, Prev"?
I do not use the F2008 extensions to type definition, but surely this recursion in the type definition is not legal?
I do not know how to interpret this.

John

[fortran]Module ModuleVegetation
implicit none
private
type     T_GrowthDatabase
    real                                        :: PlantType
    !Many other data here
end type T_GrowthDatabase

type T_VegetationType
    integer                                     :: ID
    character(StringLength)                     :: Name
    type (T_GrowthDatabase)                     :: GrowthDatabase
    !Many other data here
    type(T_VegetationType), pointer             :: Next, Prev         ! is this legal ??
end type T_VegetationType
[/fortran]

Jauch · ‎05-21-2013

Hello,
Sorry for the delay in answering.

Jim, I tried to do the tests, but meanwhile the code has changed a lot and I just can't reproduce the error. At least for now (this was the second time this kind of error happened). I was unable to revert all the changes to get the code exactly as it was when the error happened.

For now, I'll assume that the problem is somewhere in the code and that the changes solved it (or made the error less likely to happen).

In the future, if I encounter something like this again, I'll do the tests that you proposed.

John

A pointer is simply a variable that holds a value that represents a memory address, while a Type is just the definition of a data structure, not the data itself.
So there isn't any problem if, inside a data structure (Type), we have the definition of a pointer that will hold the address of an instance of the same type in which it is defined.
In fact, using a pointer (or two, in this case) to the same type is the classical way to define an item of a linked list. Otherwise, you would have to maintain some kind of index using arrays, which would be less practical, I think.
We have used this kind of structure in our program since 2004 (at least), and I have used it in C/C++ for 20 years now. I think all languages that support some kind of "pointer" can benefit from this.

In the end, because the Type isn't the data itself, but just the definition of it, this isn't recursion.

Even if you point Next or Prev at the same instance that holds these pointers, this wouldn't be recursion. You only get a problem if you use these pointers in a loop to search for some specific item of the list other than this instance itself, as in the example below:

[fortran]
type T_Data
    integer               :: ID
    type(T_Data), pointer :: Next => null()
end type T_Data

type(T_Data), pointer :: First => null()
type(T_Data), pointer :: Data => null()
type(T_Data), pointer :: Temp => null()

allocate(Data)
First => Data

Data%ID = 1
Data%Next => Data   !The node points to itself: a circular list

Temp => First
d1: do while (associated(Temp))
    if (Temp%ID .ne. 3) then
        Temp => Temp%Next !Here there is a problem: Temp points
                          !to "Data" and Data%Next also points
                          !to "Data", so Temp => Temp%Next always
                          !yields the same instance of T_Data
                          !(Data itself) and the loop never ends
    else
        exit d1
    endif
enddo d1
[/fortran]

But even the code above (if I didn't make any mistake) will compile without any warning, because the compiler can't know (I think) that the address stored in Data%Next is the address of Data itself.

This is one of the problems with pointers. You have to be very careful when dealing with them.
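
For contrast with the circular case above, here is a small sketch of a list whose traversal does terminate, because the last node's Next pointer stays null. T_Node and the other names are made up for the illustration:

```fortran
program linked_list_demo
    implicit none

    type T_Node
        integer               :: ID = 0
        type(T_Node), pointer :: Next => null()
        type(T_Node), pointer :: Prev => null()
    end type T_Node

    type(T_Node), pointer :: First => null()
    type(T_Node), pointer :: Last  => null()
    type(T_Node), pointer :: Cur   => null()
    integer :: k

    ! Build a three-node list: 1 <-> 2 <-> 3
    do k = 1, 3
        allocate(Cur)
        Cur%ID = k
        if (associated(Last)) then
            Last%Next => Cur
            Cur%Prev  => Last
        else
            First => Cur
        end if
        Last => Cur
    end do

    ! Walk forward; the loop ends because the last node's Next is null.
    Cur => First
    do while (associated(Cur))
        print *, Cur%ID
        Cur => Cur%Next
    end do

    ! Release the nodes, saving the successor before each deallocate.
    Cur => First
    do while (associated(Cur))
        First => Cur%Next
        deallocate(Cur)
        Cur => First
    end do
end program linked_list_demo
```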

Any comments or corrections on this poor explanation are welcome :)

jimdempseyatthecove · ‎05-21-2013

John,

I use pointers a lot. Problems you have to be careful about that are not mentioned above:

Pointers can have three states: NULL, pointer to valid object, trash value.
A trash value can be an invalid address OR a pointer to an object that is no longer "alive".
You've covered the case in your note above where a pointer can point to a valid object that is not the correct object, and/or where the code does not anticipate a given situation (the circular list example you gave).

The trash-value case can easily be created in a flip/flop manner between non-OpenMP and OpenMP builds by the absence of an appropriately placed SAVE (or by the placement of the variable/object). If, for example, the pointee (the object that is pointed to) is locally declared without SAVE in both configurations (non-OpenMP and OpenMP)
*** .AND. ***
this object is required to live after exit from the subroutine, then the programmer erred in not having an appropriately placed SAVE.

In the above example case, where the object must persist across calls, the programmer must add SAVE to locally declared objects .OR. move them to the data portion of a module.
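
A minimal sketch of the SAVE point, with hypothetical names (GetBuffer, Buffer). The Intel compiler documentation describes the OpenMP option as also making local variables automatic (stack-allocated), which is presumably why a missing SAVE can appear harmless in one configuration and fail in the other:

```fortran
module persistent_demo
    implicit none
contains

    subroutine GetBuffer(p)
        real, dimension(:), pointer :: p
        ! Explicit SAVE: Buffer outlives the call. Without SAVE the array
        ! may be placed on the stack, and p would dangle after return.
        real, dimension(4), target, save :: Buffer

        Buffer = 1.0
        p => Buffer
    end subroutine GetBuffer

end module persistent_demo

program demo
    use persistent_demo
    implicit none
    real, dimension(:), pointer :: q => null()

    call GetBuffer(q)
    ! q is still valid here only because Buffer has the SAVE attribute.
    print *, sum(q)
end program demo
```

The alternative mentioned above is to declare Buffer in the specification part of the module instead of inside the subroutine.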

While I have not observed this faux pas in the code you provided, it does not mean it is not present.

An additional observation: the error you observe, "access violation", will not be generated (directly) by dereferencing an object that has gone out of scope. While the object may contain junk, the address of the junk data is valid. Should a pointer contained within the junk object be overwritten (e.g. with a REAL value), then erroneously dereferencing this pointer may cause an "access violation".

A second common cause of "access violation" is a programming error whereby a write with a subscript out of bounds reaches into an adjacent object and overwrites a pointer.
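
One way to catch this class of error is to validate an index before using it as a subscript. The sketch below is only an illustration: IndexIsValid and TypeData are hypothetical, and TypeData merely stands in for an array such as VegetationTypes. Building with run-time bounds checking (/check:bounds on Windows, -check bounds on Linux) is another option that needs no code changes.

```fortran
module index_check
    implicit none
contains

    logical function IndexIsValid(idx, lo, hi)
        integer, intent(in) :: idx, lo, hi
        IndexIsValid = (idx >= lo .and. idx <= hi)
    end function IndexIsValid

end module index_check

program demo
    use index_check
    implicit none
    integer, dimension(3,3) :: VegetationID
    real,    dimension(2)   :: TypeData     ! hypothetical stand-in array
    integer :: i, j

    VegetationID      = 1
    VegetationID(2,2) = 5                   ! deliberately invalid ID
    TypeData          = 0.0

    do j = 1, 3
        do i = 1, 3
            ! Check the ID against the bounds of the array it indexes.
            if (.not. IndexIsValid(VegetationID(i,j), &
                                   lbound(TypeData, 1), ubound(TypeData, 1))) then
                print *, 'bad ID at', i, j, ':', VegetationID(i,j)
                cycle
            end if
            TypeData(VegetationID(i,j)) = 1.0   ! safe: index was checked
        end do
    end do
end program demo
```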

A third cause (low probability, but observed by this writer) is where code gets modified. Subsequent execution of that portion of code then causes the access violation.

The code overwriting error is sometimes difficult to find.

Jim Dempsey

Jauch · ‎05-21-2013

Hi Jim,

Good points.

I think that in our code, when data must survive across subroutine calls, it is declared as a module variable. Usually, everything that must survive this way is ultimately inside a "Me" pointer (a T_Something structure). The code is big and something different may exist somewhere, but I haven't found anything yet. If it must exist between calls, it is inside the "Me" pointer (which is always a module variable).

With OpenMP we are very cautious. In fact, only two or three people use it in the code. We use it only for simple loops, without routine calls, where the private variables are very well identified. The only errors that I have found with OpenMP were related to someone forgetting to make a variable private, or using OpenMP in a loop that needed to change the value in more than one place in the same array.
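
As an illustration of the kind of simple loop meant here (a generic sketch, not code from the model), the temporary tmp has to be listed as private, while each iteration writes only its own array element:

```fortran
program omp_private_demo
    implicit none
    integer, parameter :: n = 1000
    real, dimension(n) :: a, b
    real    :: tmp
    integer :: i

    b = 2.0

    ! The loop index i is private automatically. tmp must be listed
    ! explicitly; otherwise all threads share one copy and race on it.
    !$omp parallel do private(tmp)
    do i = 1, n
        tmp  = b(i) * b(i)
        a(i) = tmp + 1.0      ! each iteration writes only element i
    end do
    !$omp end parallel do

    print *, sum(a)
end program omp_private_demo
```

Compiled without the OpenMP option, the directives are treated as comments and the loop runs serially with the same result.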

Sometimes we find a bug where a pointer, under some specific conditions, is not correctly initialized. Usually this makes the model stop, either because of an access violation or because the calculated values start to fall far outside the valid interval. The latter case is more complicated to track down, but inspecting the code and debugging is usually enough.

I don't think that our code "changes" during execution, unless there is some problem with the OS...
How can this happen?

Anyway, 99% of our errors are input errors that the model can't identify. 0.99% are bad logic, which makes the model give bad results although the execution is OK. And the rest are programming errors that usually prevent the code from working; most of them are easy to spot, but a few take many days... :(

And sometimes, without explanation, after changes that are not related to the error (in any way), the error simply vanishes, like this one... :S