Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Highlighted
Beginner
60 Views

How to correctly setup OpenMP directives/clauses for this kind of loop in Fortran

[Edit] I originally posted in the Linux Fortran Compiler forum a couple of weeks ago, but got no replies.  I am assuming that this is a better forum in which to post this message - LEW

We have a science model that simulates an agricultural field for susceptibility to wind erosion.  The current version assumes the entire field consists of the same soil and that the entire field is treated with the same mgt practices.  We have developed a new version that allows the user to specify multiple soil types and mgt practices for the field.  Thus, rather than a single "subregion" (one soil and one set of mgt practices), we can now represent the field with multiple subregions, e.g. multiple soils and mgt practices.  Of course, the code takes significantly longer to run when multiple subregions are being simulated when compiled and run serially.  As a first step in trying to parallelize this code, we attempted to simulate each individual subregions' calculations on separate threads, or at least that is what we thought we were attempting (we have never done any parallelization of Fortran code before).  Here is a brief description of what we have done to parallelize the do loop that steps through each subregion on a daily basis in the model:

         do am0jd = ijday,lcaljday !step through each day of the simulation

            […]

           !$omp parallel do

           do isr=1,nsubr   ! do multiple subregions' daily simulations    

             call submodels(isr, soil(isr), plants(isr)%plant, plants(isr)%plantIndex, restot(isr), croptot(isr), &

                 biotot(isr), decompfac(isr), mandatbs(isr)%mandate, hstate(isr), h1et(isr), h1bal(isr), wp(isr), manFile(isr))

             […]

           end do

           !$omp end parallel do

           […]

         end do

As I understand, the "isr" variable is automatically private and all (or most?) other variables are shared by default, so we don't (shouldn't) need to state that explicitly in the directive.  The test case we are running contains four subregions. There are a couple of issues we've discovered when running the code (sorry that I don't have the actual wallclock runtimes, but I don't think they are necessary at this point to answer my initial questions):

When run serially, e.g. without the openmp compiler option, with "-O3" we get the expected run time compared to our single subregion code with the same optimization level.

When we run serially without the no optimization, we get much slower runtime, as expected.

When run with the openmp compiler option, with our without the "-O3" option, we get much slower runtimes, on the order of 10 times slower than when the code was compiled with the "-O3" option alone, e.g. much closer to the results we get when compiling with no optimization at all.

Since we saw no speedup, we wanted to see if we were actually getting additional threads active for the openmp compiled code, so we ran "nmon" on our Ubuntu 18.04 system to "see" the activity level of all 8 threads (the CPU is a "Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz", e.g. 4 physical cores with hyperthreading with 8 total threads available.

If we run the code compiled without the openmp cmdline option, we would see one thread hit 100% utilization, as expected.

If we ran the code compiled with the openmp option, we would see all threads becoming active.

If I set the thread limit to 4, we would see only 4 threads active at one time, but not necessarily the same ones.

In some runs, we would generate a warning msg that we were trying to deallocate arrays that had previously been deallocated. The more threads that were active, the higher the chance was that we would encounter these messages.

Here is what I think we have learned:

  • We are getting additional threads active when compiled with the openmp option.
  • We were not getting the individual threads to run the entire sequence of calculations serially for a single subregion (loop index variable) as we originally anticipated. This was essentially confirmed from the deallocate messages as they could only occur from the running code with the same loop index variable (same subregion code).
  • Since we are setting up the parallel do construct for each day of the simulation and tearing it down, it is now obvious to us that we probably can make some coding changes to eliminate a lot of this recurring overhead and do a much better job of parallelizing this part of the code.  So, we have this task to to address in the future, once we get this step in the code to be parallelized correctly.
  • We are not sure if the optimization options are having any effect or not when used in conjunction with the openmp option.  We saw no noticeable (significant) differences in runtimes, but figured we needed to get the subregion code to run as desired when parallelized first before looking into this potential issue.

So, the fundamental question here is what do we need to do to get the individual subregions' code to run in parallel for all (four in this test case) subregions simultaneously on separate threads, but to run each individual subregion's instance serially within the do loop on each of those threads?  I am assuming we need to add additional OMP directives and/or clauses to keep the additional auto-parallelization from occurring in the code, but which directive(s) with what args and where in the code?

Thanks,

LEW

Tags (1)
0 Kudos
12 Replies
Highlighted
60 Views

>>When run with the openmp compiler option, with our without the "-O3" option, we get much slower runtimes, on the order of 10 times slower than when the code was compiled with the "-O3" option alone, e.g. much closer to the results we get when compiling with no optimization at all.

This behavior is generally caused by two circumstances:

a) The called code is calling upon a serializing function such a random number generator.
And/or
b) sever cache line evictions

With b) If your results become incorrect, then you are likely incorrectly sharing variables
With b) If your results are correct, then you may have highly active variables, while different for each thread, may co-reside within the same cache line. In this case, separate these variables.

Note, using VTune will point these issues out for you.

>>In some runs, we would generate a warning msg that we were trying to deallocate arrays that had previously been deallocated.

This is generally a programming error. Note, this can occur when the allocatable array is erroneously shared when it should be private.

>>So, the fundamental question here is what do we need to do to get the individual subregions' code to run in parallel

I see nothing wrong with your !$omp parallel do loop. The conflicting use must reside in submodules and those procedures which it calls.
(as long as  plants(isr)%plant, plants(isr)%plantIndex do not result in a writable object being written).

Additionally, if file writes are being performed, it is assumed (required) that each thread write to a different channel/file presumably name augmented with isr. While thread-safe shared writes can be correctly coded, this would present a serialization section to your code.

Jim Dempsey

0 Kudos
Highlighted
Beginner
60 Views

Thanks for the reply Jim.  I don't fully understand everything you've said at this time though. However, I think I put too much info into the original posting and therefore hid the real issue I am struggling with at this time.

We don't use a random number generator in this code, so that should not be an issue.  I have not looked to see if the results are correct because of the messages I was receiving about attempting to deallocate some temporary arrays that had previously been deallocated, which should never occur, and doesn't, when run serially.  It is possible that we have some file writing issues, but I don't think we should in this section of the code we are attempting to parallelize. If that is an issue, I will address it later.

Anyway, let me see if I can focus on the one issue of the attempted deallocation of arrays that have previously been deallocated as I think it hits upon the real question I need answered first.

As my first attempt to parallelize this do loop, I would like to have a separate thread created for each subregion (which corresponds to each of the do loop integer counter "isr" values). So, I want each value of "isr" to run its instance of the "call submodels()"  routine in parallel for each subregion, but to run the "call submodels()" routine itself serially within each individual thread.

However, that can't be occurring (the "call submodels()" code running serially on each thread) because the "crop_residue" arrays are temporary and are being created and then later destroyed within the very same routine. If that particular code is being run serially, we could not be getting messages about attempting to deallocate an array that has previously been deallocated. This tells me that we are likely getting additional parallelization to occur within the "call submodels()" code, which is not what I am wanting to occur.

So, in other words, I want my test file, which consists of 4 subregions, to parallelize this do loop so that each subregion's code (the "call submodels()" routine) runs serially on each of the subregion's individual threads, e.g. one thread per subregion only in this scenario, even if I have more threads available.  I thought that this was the only level or degree of parallelization that would occur with the OpenMP statements that I used.  However, that doesn't appear to be what is happening.

I hope I have better explained what I was expecting to occur and why I don't think it is in this case. That is why I was wondering if I needed to specify specific clauses to not get the additional parallelization to occur in the code.

LEW

0 Kudos
Highlighted
60 Views

>> because the "crop_residue" arrays are temporary and are being created and then later destroyed within the very same routine

Where are the temporary array descriptors located? (e.g. real, allocatable :: array(:))

What version of Fortran are you using?

Note, in older versions of Fortran, arrays defaulted to SAVE, which would make them shared. When a source file is compiled with -openmp it should make arrays automatic (and/or array descriptors) automatic (stack based). Some versions did not do this, or the user would omit using -openmp on source files that do not contain !$omp ... or use omp_lib thinking that it is not important. Assure that all Fortran sources are compiled with -openmp (including library files). Should this fail, add -auto (or -recursive) to all compilation of Fortran sources.

If you still receive the deallocation error, determine the array that is involved. It is either shared when you assume it is private to the thread, or it has been deallocated unexpectedly. Note, an allocatable array dummy argument without intent(out) or intent(inout) is:

intent(out):  automatically deallocated on entery (if allocated on entry) and remains allocated on exit
intent(in) or intent(inout): entry status left as-was
intent(none) (or no intent): the allocatable array is assumed to be local to the procedure, array descriptor initialized as unallocated, and auto deleted upon exit.

Seems strait forwared except consider: procedure foo with allocatable array calling procedure foobar with said allocatable array .AND. declaring the dummy as allocatable without intent listed.

Jim Dempsey

 

0 Kudos
Highlighted
Black Belt
60 Views

It looks like the basic "mental model" for parallelization of this code is correct, but (as Jim suggested) there are plenty of opportunities for "surprises".    A couple of notes that may help fill out the "mental model":

  • The program will begin execution with a single thread.
  • When the program execution reaches the first OMP PARALLEL region:
    • The runtime will query the environment to determine how many threads are requested, how many processors are available, whether you have requested thread binding, or non-default thread scheduling options, etc.
    • The "master" thread will spawn OMP_NUM_THREADS-1 additional threads (so a total of OMP_NUM_THREADS threads are available).
    • The OpenMP runtime will distribute the "nsubr" iterations of the loop across the OMP_NUM_THREADS threads and set them (and itself) to work.
    • When all the threads have finished, the master thread continues execution of the code after the parallel region, while the remaining threads continue to exist, but stay idle until the next parallel region.
  • For the remainder of the execution, the master thread executes the serial portion of the code, and when parallel regions are reached, the master thread sets the variables that the worker threads are watching to control when they are allowed to work on the next parallel region.
  • IMPORTANT NOTES!
    • If you have not requested that the OpenMP threads be "bound", then the operating system will control where the threads run.  With HyperThreading enabled, the OS is perfectly capable of running 4 threads on 2 cores and leaving the other two cores idle.   Some operating systems are smarter than others, but trusting them usually leads to frustration.   Fortunately, simply setting the environment variable "OMP_PROC_BIND=spread" will cause the OpenMP runtime to make the appropriate operating system calls to bind the threads to Logical Processors that are as far apart as possible.  In your system, 1 thread would be bound to each physical core, and the OS scheduler would not be allowed to move any of the threads to a different core.  (The option OMP_PROC_BIND=close would bind the threads to Logical Processors that are as close together as possible.  This can be useful if you want to test the performance scaling when using both Logical Processors on each physical core, while varying the number of physical cores from 1 to 2 to 3 to 4 -- using 2, 4, 6, and 8 OpenMP threads, respectively.)
    • For codes with this structure, it is often helpful to capture wall-clock time inside the parallel region.  When there is exactly 1 iteration per thread, this gives per-thread timing.   This allows you to compute differences in elapsed time across threads, and differences in thread timing for different days, etc.

If the answers are correct, the low performance does point toward "false sharing".   You should review the code to note which variables are updated by each function call, and then ensure that different threads are working on data in different cache lines.  This can be done in a couple of ways.  For array elements, it is sometimes easiest to simply copy to a private variable before the subroutine call and then copy back to the global variable on return.  Alternately, it may be easier to "pad" the array so that the threads are modifying elements that are at least 8 doubles apart.

"Dr. Bandwidth"
0 Kudos
Highlighted
Beginner
60 Views

Jim,

Where are the temporary arry descriptors located?

I've attached the source file that contains the allocation and deallocation routines. Here is a brief outline of the file contents with one of the array definitions that are part of the "crop_residue" type, etc:

module crop_data_struct_defs

[...]

  type crop_residue

[...]

    real, dimension(:), allocatable :: stemz    ! crop buried stem mass by layer (kg/m^2)

[...]

  end type crop_residue

contains

  function create_crop_residue(nsoillay) result(cropres)
     integer, intent(in) :: nsoillay
     type(crop_residue) :: cropres

[...]  !allocates arrays and initializes the values to zero

  end function create_crop_residue

  subroutine destroy_crop_residue(cropres)
     type(crop_residue), intent(inout) :: cropres

[...] !deallocates the arrays

  end subroutine destroy_crop_residue

end module crop_data_struct_defs

What version of Fortran are you using?

I am using Intel's Linux Fortran compiler version 19.0.0.254 Build 20190417

Statements about using "-openmp" while compiling all files and linking and using "omp_lib", etc.

I am compiling all source files with "-qopenmp" as well as linking with that option. The main routine is using "omp_lib" (and no others) and is the only routine that currently contains any "!$omp" statements at this time. Is that not correct or do I need to also specify "use omp_lib" in other src files as well?  I will try adding "-auto" to all compilations and see what that does. Hmm, performance died as it still has not printed out the first progress status statement (~17% complete) several minutes into the run.... I'll report on this later after it completes (if it does).

Information about using the "intent" statements, etc.

We should be using the "intent" statements as desired, but we may have missed one somewhere. I'll check those more closely too.

Larry Wagner

0 Kudos
Highlighted
60 Views

omp_lib need only to be present in the source files that call upon the OpenMP runtime library routines. These begin with omp_... (e.g. omp_get_wtime()).

Which statement(s) get the error?

Can you please show the important information from the file containing the !$omp... directives. Example

Program YourProgramName  ! or subroutine...
  use crop_data_struct_defs
  use omp_lib
  ...  ! remove unnecessary information
  real :: showNecessaryVariables
  type(...) :: showNecessaryUDTs
  ... ! remove unnecessary code
  !$omp parallel do ! inclusive of all private and shared clauses
  do ...
     { show the complete loop here }
 end do
  !$omp end parallel do

Also, be complete as to how/where the type(crop_residue) types are declared (instantiated). Your posted code shows how you create and destroy the allocatable contents, but do not show how and where the type(crop_residue) types are declared (instantiated).

**** Assuming error occurs in destroy_crop_residue with one of your:

     ! allocate below and above ground arrays
     if (allocated(cropres%stemz)) then                     ! *** is allocated here
        deallocate(cropres%stemz, stat=dealloc_stat) ! *** errors with already deallocated

or with other like cropres%.... arrays

Then this would imply either:

a) multiple threads are using the same cropres
    .or.
b) You called destroy_crop_residue with an undefined cropres objece. Note, undefined does not mean you forgot to call create_crop_residue....

Rather, it means the cropres object passed into destroy_crop_residue had never been initialized. The reference refers to junk data which apparently presents a junk array descriptor that looks both allocated and not allocated.

It would be informative to show code relating to use of the type(crop_residue) objects.

Note, there used to be a compiler error where a

!$omp parallel do private(cropres) ! defined locally but initialized (with unallocated internal arrays)
do I=...

Where the non-main thread local copies did not initialize the array descriptors.
*** This was presumably corrected

The corrective measure at that time was to use firstprivate(cropres) to copy in the contents of the unallocated array descriptors.
*** do not use  firstprivate(cropres) in the event that the main thread had initialized it, let me know so we can use a plan B.

Jim Dempsey

0 Kudos
Highlighted
Black Belt
60 Views

I forgot to mention that you should probably refrain from asking the compiler to perform automatic parallelization while you are trying to understand what is going on here.  I.e., use "-qopenmp", but not "-parallel" in the compilation.  This will ensure that the only loop(s) that are parallelized are the ones with the OpenMP directives that you have added to the source code.

"Dr. Bandwidth"
0 Kudos
Highlighted
Beginner
60 Views

Jim,

Ok, the "use omp_lib" is only speicified in the main src file, which is the only one that has any of the OpenMP runtime routines in it at this time.

Here is the main routine that contains the OpenMP directives (I have provided a copy of the entire routine as an attachment (weps.f95):

     program weps
     use omp_lib
     use weps_main_mod, only: old_run_file, run_rot_cycles, id, im, iy, ld, lm, ly, rootp, &
                               daysim, ijday, ljday, maxper, longest_mgt_rotation, ncycles, &
                               init_loop, calib_loop, report_loop, &
                               max_calib_cycles, calib_cycle, calib_done, &
                               am0ifl, wepsinit, cmdline
     use weps_submodel_mod, only: submodels, erodsubr_update
     [...]
     integer :: am0jd   ! Current julian day of simulation
     integer :: isr           ! This variable holds the subregion index
     type(soil_def), dimension(:), allocatable :: soil             ! structure with soil state and parameters as updated during simulation
                                                                                                    ! This gets allocated in this routine later when the data gets read in
     type(plants_struct), dimension(:), allocatable :: plants      ! array of pointers to structure for all biomaterial
                                                                                                           ! structure also references older biomaterial
                                                                                                           ! This gets allocated in this routine later when the data gets read
     [...]
         do am0jd = ijday,ljday
           [...]
            !$omp parallel do private(isr)
            do isr=1,nsubr   ! do multiple subregion
              call submodels(isr, soil(isr), plants(isr)%plant, plants(isr)%plantIndex, restot(isr), croptot(isr), &
                              biotot(isr), decompfac(isr), mandatbs(isr)%mandate, hstate(isr), h1et(isr), h1bal(isr), wp(isr), &
                              manFile(isr))
            end do
            !$omp end parallel do
           [...]
          end do

The array variables where I am sometimes getting the "attempting to deallocate already deallocated array" msgs occurs in the "destroy_crop_residue" routine. These variables are instantiated, allocated, used and then deallocated in only two routines (crop_mod.f95 and manage_mod.f95 - both provided as attachments). So, they are essentially used as temporary storage within those two routines.

Note then that the two routines that do the allocation and deallocation are not directly called from the main (weps) routine, but from other subroutines/functions.

I had not considered the possibility of the parallelized code trying to unallocate the arrays before they were initially allocated, but if the code was running serially here as I want, that should never occur.  The only routines that interact in any way with "crop_residue" objects are (crop_mod.f95, manage_mod.f95 and crop_data_struct_defs.f95) - all of which are provided as attachments.

Larry Wagner

0 Kudos
Highlighted
Beginner
60 Views

John,

Thanks for the detailed information on how to specify the number of threads to use, how to bind them appropriately to the processors, etc.  Also, the comment on not using "-parallel" in concert with the "-openmp"  compilation flag for my purposes is noted.

Larry Wagner

0 Kudos
Highlighted
60 Views

Have you compiled your code in Debug mode with all runtime checks and all compile time checks enabled? If not, then do so, and make a run. Report any errors during compile and run.

I do not see anything obvious as to the cause of the problem.

The unobvious thing would be if crop_mod or manage_mod were compiled such that the user defined types have implicit SAVE attribute (note this is the version you link with, not necessarily the version you see in your IDE). This would be caused if the linked in module .obj/.o files were compiled without -openmp.

Can you explain this:

manage_mod.f90

        cropres = create_crop_residue(soil%nslay)
        call dooper(manFile)
        ! do groups
        manFile%grp => manFile%oper%grpFirst
        do while ( associated(manFile%grp) )
          if(lastoper(sr)%skip.eq.0) then
            call dogroup(soil, plant, plantIndex, manFile)
            ! do processes
            manFile%proc => manFile%grp%procFirst
            do while ( associated(manFile%proc) )
              call doproc(soil, plant, biotot, mandate, hstate, h1et, manFile)
              ! next process
              manFile%proc => manFile%proc%procNext
            end do
            ! next group
            manFile%grp => manFile%grp%grpNext
          end if
        end do
        ! operation complete
        ! deallocate temporary crop residue structure
        call destroy_crop_residue(cropres)

Cropres is created and destroyed without being used???

In crop_mod, callcrop you do use the cropres object.

As a sanity check, at the top of callcrop and manage, add use omp_lib. then on each subroutine, first statement add

print *, omp_get_thread_num(), loc(cropres), "callcrop" ! or "manage" as the case may be

What you expect to see are different addresses for cropres by thread by subroutine.

If you do not see any output, then your are linking different object files (or using different library containing the different files).

Jim Dempsey

0 Kudos
Highlighted
Beginner
60 Views

Jim,

I have revised the debug compilation setup and now include some additional debug options. Doing so, I have discovered that we have one variable that is either being used when it is not yet initialized, or more likely is due to an invalid computation result that causes the variable to become a NAN.  I will run down this issue and fix it before moving forward with the OpenMP issues.

I will look into whether there are any implicit SAVE occurrences.  We are compiling all src files with the "-openmp" option though.

manage_mod.f90

This code calls dooper(manFile) which needs "cropres" to exist if it needs it.  The "dooper()" routine exists within "manage_mod.f90" (the full file is provided as an attachment named "manage_mod.f95.zip" file in the previous comment) and is located just above the "subroutine manage()" routine that creates/destroys the temporary data variables.

I'll add the print statement to see what threads are active when the "manage" and "callcrop" subroutines are called and in use as you suggest.

Thanks,

Larry Wagner

0 Kudos
Highlighted
60 Views

Additional note, not related to your problem.

If the number of calls to create_crop_residue and destroy_crop_residue is very large (and presents a significant amount of run time) Then consider making cropres a thread private "global" object (per thread) and then perform the cropres%... allocations only when the arrays are either not allocated (first time) or when nsoillay differs (and when, deallocate before allocate). You still need to init the arrays to 0.0. This will eliminate unnecessary allocate and deallocate (assuming nsoillay preponderantly remains the same).

Jim Dempsey

0 Kudos