How to correctly set up OpenMP directives for this kind of loop in Fortran

Larry_Wagner · 08-14-2019

We have a science model that simulates an agricultural field's susceptibility to wind erosion. The current version assumes that the entire field consists of the same soil and is treated with the same management practices. We have developed a new version that allows the user to specify multiple soil types and management practices for the field. Thus, rather than a single "subregion" (one soil and one set of management practices), we can now represent the field with multiple subregions, i.e. multiple soils and management practices. Of course, when compiled and run serially, the code takes significantly longer when multiple subregions are simulated. As a first step in parallelizing this code, we attempted to run each individual subregion's calculations on a separate thread, or at least that is what we thought we were attempting (we have never parallelized Fortran code before). Here is a brief description of what we have done to parallelize the do loop that steps through each subregion on a daily basis in the model:

do am0jd = ijday,lcaljday   ! step through each day of the simulation
  […]
  !$omp parallel do
  do isr=1,nsubr   ! do multiple subregions' daily simulations
    call submodels(isr, soil(isr), plants(isr)%plant, plants(isr)%plantIndex, restot(isr), croptot(isr), &
                   biotot(isr), decompfac(isr), mandatbs(isr)%mandate, hstate(isr), h1et(isr), h1bal(isr), wp(isr), manFile(isr))
    […]
  end do
  !$omp end parallel do
  […]
end do
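For concreteness, here is a minimal standalone sketch of the same loop shape with the data-sharing attributes written out, rather than relying on the defaults. It is not our model code: the program name, the "work" routine, and the "total" array are hypothetical stand-ins, and only "isr" and "nsubr" mirror names from the loop above. It would need to be compiled with the compiler's OpenMP option (for example "-fopenmp" with gfortran or "-qopenmp" with Intel Fortran).

```fortran
! Minimal sketch under stated assumptions: "work" and "total" are
! hypothetical stand-ins for submodels and the per-subregion state.
program explicit_sharing
  implicit none
  integer, parameter :: nsubr = 4      ! four subregions, as in our test case
  integer :: isr
  real :: total(nsubr)                 ! one slot per subregion

  ! default(none) requires every variable used in the construct to be
  ! listed, so the compiler rejects anything whose sharing is unstated.
  !$omp parallel do default(none) private(isr) shared(total)
  do isr = 1, nsubr
    call work(isr, total(isr))         ! each iteration touches only its own slot
  end do
  !$omp end parallel do

  print *, total

contains

  subroutine work(i, t)
    integer, intent(in) :: i
    real, intent(out) :: t
    integer :: k                       ! local without SAVE: one copy per call
    t = 0.0
    do k = 1, 1000
      t = t + real(i)
    end do
  end subroutine work

end program explicit_sharing
```

The point of "default(none)" in this sketch is only diagnostic: it makes the compiler list what the construct touches. It says nothing about variables reached inside the called routine, such as module variables or locals with the SAVE attribute, which are shared among threads unless declared threadprivate.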

As I understand it, the "isr" loop variable is automatically private and all (or most?) other variables are shared by default, so we shouldn't need to state that explicitly in the directive. The test case we are running contains four subregions. There are a couple of issues we've discovered when running the code (sorry that I don't have the actual wall-clock runtimes, but I don't think they are necessary at this point to answer my initial questions):

When run serially, i.e. without the OpenMP compiler option, with "-O3" we get the expected run time compared to our single-subregion code at the same optimization level.
When we run serially with no optimization, we get a much slower runtime, as expected.
When run with the OpenMP compiler option, with or without the "-O3" option, we get much slower runtimes, on the order of 10 times slower than when the code was compiled with the "-O3" option alone, i.e. much closer to the results we get when compiling with no optimization at all.
Since we saw no speedup, we wanted to see if we were actually getting additional threads active for the OpenMP-compiled code, so we ran "nmon" on our Ubuntu 18.04 system to "see" the activity level of all 8 threads (the CPU is an "Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz", i.e. 4 physical cores with hyperthreading, for 8 total threads available):
1. If we run the code compiled without the OpenMP command-line option, we see one thread hit 100% utilization, as expected.
2. If we run the code compiled with the OpenMP option, we see all threads become active.
3. If I set the thread limit to 4, we see only 4 threads active at one time, but not necessarily the same ones.
4. In some runs, we get a warning message that we were trying to deallocate arrays that had previously been deallocated. The more threads that were active, the higher the chance that we would encounter these messages.
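A tool like nmon only shows that threads are busy, not which thread handled which subregion. A small standalone check along the following lines could record that directly. This is a sketch, not our model code: the program and the "owner" array are hypothetical, and "num_threads(4)" is just one assumed way of capping the team at four threads. "omp_get_thread_num" comes from the standard "omp_lib" module, so the program has to be compiled with the OpenMP option.

```fortran
! Minimal sketch: record which OpenMP thread executes each loop index.
! "owner" is a hypothetical array introduced only for this check.
program which_thread
  use omp_lib                          ! provides omp_get_thread_num
  implicit none
  integer, parameter :: nsubr = 4      ! four subregions, as in our test case
  integer :: isr
  integer :: owner(nsubr)

  !$omp parallel do default(none) private(isr) shared(owner) num_threads(4)
  do isr = 1, nsubr
    owner(isr) = omp_get_thread_num()  ! thread id within the current team
  end do
  !$omp end parallel do

  ! Report after the parallel loop so the output is not interleaved.
  do isr = 1, nsubr
    print '(a,i0,a,i0)', 'subregion ', isr, ' ran on thread ', owner(isr)
  end do
end program which_thread
```

Each iteration of a "parallel do" loop is executed by exactly one thread of the team, so every entry of "owner" is written once. Which thread id ends up in which entry depends on the schedule and the number of threads actually granted.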

Here is what we think we have learned:

We are getting additional threads active when compiled with the openmp option.
We are not getting each thread to run the entire sequence of calculations serially for a single subregion (one value of the loop index variable), as we originally anticipated. The deallocate messages essentially confirm this, since they could only arise from code running for the same loop index value (the same subregion).
Since we are setting up and tearing down the parallel do construct for each day of the simulation, it is now obvious to us that we can probably make some coding changes to eliminate a lot of this recurring overhead and do a much better job of parallelizing this part of the code. So, we have this task to address in the future, once we get this step in the code parallelized correctly.
We are not sure if the optimization options are having any effect or not when used in conjunction with the openmp option. We saw no noticeable (significant) differences in runtimes, but figured we needed to get the subregion code to run as desired when parallelized first before looking into this potential issue.
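The kind of restructuring we have in mind for the per-day setup overhead might look like the sketch below: open the parallel region once, outside the day loop, and use a work-sharing "do" inside it. This is a standalone illustration under assumptions, not our model code. The sizes, the "state" array, and the loop body are hypothetical, and whether this actually saves measurable time in our model is untested.

```fortran
! Minimal sketch: one parallel region around the day loop, with the
! subregion loop divided among threads by a work-sharing "do".
program hoisted_region
  implicit none
  integer, parameter :: ndays = 3, nsubr = 4   ! hypothetical sizes
  integer :: day, isr
  real :: state(nsubr)                         ! hypothetical per-subregion state

  state = 0.0

  ! The thread team is created once here and reused for every day.
  !$omp parallel default(none) private(day, isr) shared(state)
  do day = 1, ndays                 ! every thread steps through the days
    !$omp do
    do isr = 1, nsubr               ! subregions are divided among the threads
      state(isr) = state(isr) + real(isr)   ! stand-in for the daily submodel call
    end do
    !$omp end do                    ! implied barrier: the day finishes for all subregions
    ! Any serial per-day work would need "!$omp single" or "!$omp master" here.
  end do
  !$omp end parallel

  print *, state
end program hoisted_region
```

In this layout every thread executes the day loop itself, so anything in the elided per-day code that must run only once would have to be guarded explicitly, which is the main cost of the design.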

So, the fundamental question here is what do we need to do to get the individual subregions' code to run in parallel for all four subregions simultaneously on separate threads, but to run each individual subregion's instance serially on each of those threads? I am assuming we need to add additional OMP directives to keep the additional auto-parallelization from occurring in the code, but which directive(s) with what args and where in the code?

Thanks,

LEW