I have encountered a problem parallelising my code with OpenMP. I need to call an ODE solver (the rkc code of Sommeijer et al.: B.P. Sommeijer, L.F. Shampine and J.G. Verwer, RKC: an Explicit Solver for Parabolic PDEs. Technical Report MAS-R9715, CWI, Amsterdam, 1997) thousands of times each time step. This is agent-based code, and there are thousands of agents each with its own state to be updated.
I would like to be able to parallelise the solver calls for the agents, in fact I have done this, but I have realised that the program results with a single CPU (i.e. all serial) are significantly and consistently different from the results with 4 CPUs, and I have localised the problem to the parallelisation of the loop with the call to the ODE solver (OMP is used elsewhere in the program, but the results do not change when the other sections are made serial.)
The rkc code is pre-Fortran90, but works well. The thing that I suspect is the use of a subroutine (supplied by me) to evaluate the derivatives from t and the vector of variable values. I think this is a standard method used by ODE solvers. (The only reason I suspect this is because everything else seems normal, and I'm not sure how this method is expected to work when the rkc solver is being used in a parallel way.) This subroutine, f, is declared "external f" in the rkc code. I should mention that I have modified the rkc code somewhat to make it a module. I can provide more details if requested.
Thanks in advance.
When we combine a classical explicit integration formula such as RKC with agent-based simulation and/or parallelization, and on top of that we use classical RK step-size adjustment, we have to be careful not to break the rules of the RK formula. For example, when using the RK4 formula we have to make sure that k_2 is calculated over the entire domain of the PDE using f(t_n + h/2, y_n + h.k_1/2), and that the values of k_1, k_2, k_3, k_4 have to be evaluated with the proper arguments at every one of the grid nodes.
When the code is parallelized, we have to take care that one thread does not update a value before that value is no longer required by another thread.
If you are able to share (a condensed version, preferably) of your code, either as an attachment, or as a link to files stored on Dropbox, etc., one of the forum members here may be able to try the code and offer suggestions.
In addition to the consistent time domain concern of mecej4, together with use of old code, the old code may be written under the assumption that local arrays can default to SAVE .or. the code may consist of a collection of procedures that use COMMONs to pass/return values. Both potential multi-threading issues must be corrected for use in a multi-threaded solution. Additionally, if your results require the use of a pseudo random number generator, the parallel version is generally not deterministic.
Dear mecej4 and Jim, thanks very much for your responses.
mecej4: the code is rather complex, and it's not easy to make a small reproducer. I am simulating the growth of multicellular spheroids in vitro (human cancer cells.) The cells consume nutrients (currently 3) and oxygen from the medium, and in response grow and divide (or possibly die of starvation.) The core of the model is the sub-model for consumption of nutrients, growth and proliferation - we call this the "metabolism" model.
The comparison that I'm making between the different ncpu runs is the number of live cells after one day (starting with Ncells = 2000). Pseudo-random number generation is used in several places, therefore no two runs with ncpu > 1 yield exactly the same result, but I have found from multiple runs with different seeds that with ncpu = 1 the final Ncells is in the range 4400-4500, while with ncpu = 4 the range is 3600-3800.
I have located the parallelised section that when not parallelised leads to results consistent with the ncpu = 1 case. The subroutine call in this section computes the rates of consumption by each cell of the various constituents, and updates the intracellular concentrations. The extracellular concentrations are held constant over the time step. The computations for each cell should therefore be independent of each other, and as far as I can see any unwanted interactions must be caused by the way I'm using the rkc code. So far I have been unable to track down the source of the divergence - it's difficult because there are so many moving parts (literally - the cells move as the spheroid grows) and the only obvious measure, Ncells, is a long way downstream. I'm still trying to come up with a way to track it.
I will give a brief account of the modified rkc code I'm using.
The code initially contained one common block:
! RKCDID is a labelled common block that communicates statistics
! about the integration process:
! common /rkcdid/ nfe,nsteps,naccpt,nrejct,nfesig,maxm
I commented out these occurrences, and replaced the common block with a defined type rkc_comm that is passed as a subroutine argument. The beginning of the file now looks like this:
integer, parameter :: nflogg=12
double precision :: rkc_sprad
integer :: nfe,nsteps,naccpt,nrejct,nfesig,maxm
type(rkc_comm) :: comm
integer neqn,info(*),idid, icase
double precision y(neqn),t,tend,rtol,atol(*),work(*)
! uround is set here for IEEE double precision arithmetic.
double precision uround
double precision zero,rmax,rmin
! integer nfe,nsteps,naccpt,nrejct,nfesig,maxm
! common /rkcdid/ nfe,nsteps,naccpt,nrejct,nfesig,maxm
and comm is added as an argument to any subroutine that was using the common block.
I'll keep looking for a way to identify the origin of the divergence. This is an important issue, because some of the runs I want to do involve tens of thousands of cells, and parameter estimation will require a large number of runs. It makes sense to do this taking full advantage of parallelisation.
You are aware that you can set the random number generation to be repeatable? (Check the rotuine random_seed and https://stackoverflow.com/questions/51893720/correctly-setting-random-seeds-for-repeatability)
Thanks. The subroutines that are involved in the section that I've identified as the culprit are not using random numbers. If I don't parallelise this section, while leaving all other parallelisations untouched, the results with 4 cpus are the same as those with one.
It is unclear from your posting as to if the newly introduced type rkc_comm is use to construct a type(rkc_comm) structure that is used as shared or private within your OpenMP parallel regions. As well as if the type needs to be shared or private amongst the threads of a parallel region.
The rkc_comm type that is passed to rkc is not in the OMP parallel region, it is in the subroutine called in the parallel region.
!$omp end parallel do
!$omp parallel do private(cp, ic, ichemo, tstart, dtt, Kin, Kout, ok)
do kcell = 1,nlist
cp => cell_list(kcell)
if (cp%state == DEAD) cycle
tstart = 0
dtt = 2
!$omp end parallel do
integer :: kcell
real(REAL_KIND) :: tstart, dt
logical :: ok
integer :: ichemo, k, ict, neqn, i
real(REAL_KIND) :: t, tend
real(REAL_KIND) :: Cin(NUTS), C_GlnEx
type(cell_type), pointer :: cp
type(metabolism_type), pointer :: mp
! Variables for RKC
integer :: info(4), idid, res
real(REAL_KIND) :: rtol, atol(1)
type(rkc_comm) :: comm_rkc
real(REAL_KIND) :: work_rkc(8+5*NUTS*2)
where f_rkc_OGL is the function that computes the derivatives from Cin(:).
Your $omp loop is calling "CellSolver", your included code snip shows "OLGSolver" are these one and the same?
Your abbreviated code has an $omp loop calling CellSolver with the cell index kcell...
... does CellSolver (and what it calls) make use of cell_list(kcell:nlist)? IOW from kcell through nlist?
If it does, then the slices of cell_list for each thread overlaps with the other threads, and as such, the time states that are assumed (required) not to have advanced, in fact have advanced.
Sorry about that. In this case, CellSolver just calls OGLSolver.
kcell is used to access a cell in the list (cell_list(:) is an array of defined type cell_type)
In no case is a slice of the list accessed, and the computations for each cell are independent of each other, using only the intra- and extra-cellular concentrations for that cell. The extra-cellular concentrations are updated separately.
Many years ago, before the world went crazy, mecej4 helped me with a very large Fortran program which had bad coding, he improved it no end.
I suggest you load the whole code -- trying to reproduce errors in short code with the level of complexity you have would be difficult.
Just a thought
I found my bad coding error, which was in my own code (in f_rkc_OGL), not in the modified rkc code.
I was using a global variable to keep track of which cell was currently being processed. Not recommended even in serial mode, and a real no-no in multi-threaded code. The error showed up when I started writing out a lot of intermediate output, and I saw cell ID numbers that didn't make sense. After fixing that the results are now independent of the number of cpus.