Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Parallelizing existing code using OpenMP directives

qolin
Novice

Am I barking up the wrong tree? Got the wrong end of the stick?

I have an existing program (22k LOC) that I am trying to parallelize using OpenMP.

Most of the CPU time is spent in and below a single routine "B" that is called a few hundred times, and that routine itself calls many others, nesting up to 6 levels of calls. Each call to B is required to solve for one of a number of objects. The calls to B are made from another routine "A" that loops through all the objects, of which (depending on the problem definition) there are typically about 20. A gets called about 15 times by the solver to converge on the solution to the problem.
 
All data describing the problem to be solved, and all the working variables and results, are held in one large structure "Z" that gets passed around as an argument to every routine. There are (almost) no global variables: no common blocks, nothing saved in modules, apart from some input data that never changes during the solve. Z is allocated and populated in the main program, which then calls the solver.
 
I started adding OpenMP directives yesterday. I am assuming I can add directives to A to parallelize its do-loop, approximately thus:
 
!$OMP PARALLEL 
!$OMP DO
 
      do i = 1, size(Z%objects)
        call B ( Z, Z%nwrk%Oorder(i), airow(i), iflag, ierr )
      enddo
 
!$OMP END DO
!$OMP END PARALLEL
 
The idea being that the loop iterations will be divided among separate threads, working on the objects in parallel.
 
Needless to say, I am having problems. I get 4 threads running and they do indeed start working on the different objects, but it quickly goes wrong. However, before I go into any more detail, I want to be sure I am on the right track, and not trying to do something really stupid.
 
  • Is this a reasonable way to parallelize this code? 
  • Have I misunderstood something important concerning threads?
  • Do I need additional directives in B, and any/all of the routines it calls? (there are about 50 of these.)
Every example of directive-driven OpenMP that I have found differs radically from the sort of code layout I have: they all parallelize a simple loop, with no subroutine calls inside it.
 
I can show actual code, but (as always) there is a lot of it, so I'll leave that for now.
 
 
jimdempseyatthecove
Honored Contributor III

While your overall philosophy is correct, there are likely some details getting in the way. The original code was written without provisions for parallel execution. One of the principal tasks in converting such code to run in parallel is ensuring that each thread (conceptually, each i'th iteration) does not adversely interfere with the concurrent operation of a different i'th iteration. While your object data may be separated, you will often find workspace data (e.g. temporary arrays and state variables), in your case held in Z, that all threads use and modify concurrently as if each had exclusive use. It will be your job to identify which, and then devise a means to separate these arrays/variables on a thread-by-thread basis.

In your above sketch, I assume Z%nwrk%Oorder(i) is your object reference, .AND. Z is a reference to some global state information .AND. scratch data (temporary arrays and scalars).

Without examining the code, it is difficult to make the best suggestion. IIF (if and only if) the bulk of the data in Z, exclusive of the object data, is preponderantly scratch data (temporary arrays and scalars), then:

a) If not already, make your object array allocatable
b) Make the global Z an allocatable array of Z_t, allocated to size omp_get_max_threads()
c) Rework the code to something like this:

use omp_lib  ! provides omp_get_max_threads and omp_get_thread_num
type(Z_t), allocatable :: Z(:)
...
allocate(Z(0:omp_get_max_threads()-1))
...
CALL YourInit_Z(Z(0)) ! formerly YourInit_Z(Z)
! copy init-time parameters into each thread's private copy
DO i=1,omp_get_max_threads()-1
  Z(i)%SomeParameter = Z(0)%SomeParameter
  ...
END DO
!$OMP PARALLEL
!$OMP DO
  do i = 1, size(Z(0)%objects)
    call B ( Z(omp_get_thread_num()), Z(0)%nwrk%Oorder(i), airow(i), iflag, ierr )
  enddo
!$OMP END DO
!$OMP END PARALLEL

The above has the advantage that everything in the call tree under B will likely not have to be changed. You can also perform a reduction after the parallel region should you require an accumulative action. The disadvantage is having duplicate parameters.
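
For instance, a minimal sketch of such a post-region reduction, assuming a hypothetical accumulator field Z(i)%total that each thread updates only in its own copy:

! Sketch: combine per-thread partial results after the parallel region.
! Z(i)%total is a hypothetical accumulator field, not from the original code.
total = Z(0)%total
DO i=1,omp_get_max_threads()-1
  total = total + Z(i)%total
END DO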

A cleaner way, though more work and a potential source of bug introduction, is to separate Z into two objects, Zshared and Zworking.

type(Zshared_t) :: Zshared   ! assumed counterpart type for the shared part
type(Zworking_t), allocatable :: Zworking(:)
...
allocate(Zworking(0:omp_get_max_threads()-1))
CALL YourInit_Z(Zshared)
!$OMP PARALLEL
!$OMP DO
  do i = 1, size(Zshared%objects)
    call B ( Zshared, Zworking(omp_get_thread_num()), Zshared%nwrk%Oorder(i), airow(i), iflag, ierr )
  enddo
!$OMP END DO
!$OMP END PARALLEL

Jim Dempsey

qolin
Novice

Thanks Jim. 

The code was, in fact, written with more than half an eye on it being parallelized at some time. Thus all workspace data and state variables are already held separately for each object, as allocatable structures and arrays under each object. The process of solving each object creates and destroys this data during the object solve. When the object solve is complete, the workspace data is deallocated, leaving just the object results, which again are held under the object. Thus it seems to me there is very little scope for 2 or more threads trying to access or write to the same variables... at least as far as Z is concerned.

Of course the solution routines also use local variables and arrays, but none of these are saved or declared in global modules. I was assuming all these would be "automatically" created separately for each thread...  

 

jimdempseyatthecove
Honored Contributor III

>>Of course the solution routines also use local variables and arrays, but none of these are saved or declared in global modules. I was assuming all these would be "automatically" created separately for each thread...  

Then this does not explain "but it quickly goes wrong."

You should be aware that locally scoped arrays (those in subroutines and functions), when not compiled with recursive or openmp options, default to SAVE. Therefore all procedures, and any Fortran libraries linked in, must be built with options that ensure locally declared arrays go on the stack (with OpenMP enabled on all sources). Also ensure that any 3rd-party libraries are thread-safe.
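
For the classic ifort driver, options along these lines should give that behavior (a sketch; consult your compiler version's documentation, and mysource.f90 is a placeholder):

  ifort -qopenmp -recursive mysource.f90     (Linux; /Qopenmp /recursive on Windows)

-auto (/Qauto on Windows) likewise puts local variables on the stack without marking procedures recursive.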

If you can confirm that everything is compiled with the correct options (and linked to thread-safe libraries), then this indicates that your assumption about the dynamically modified data being thread-safe is incorrect.

Jim Dempsey

IanH
Honored Contributor II
call B ( Z, Z%nwrk%Oorder(i), airow(i), iflag, ierr )

Nothing to do with OpenMP really, but does the subroutine modify either of its first two arguments?

Note iflag and ierr are shared, unless you explicitly declare them to be private.
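
For example, a minimal sketch assuming each iteration's status is independent:

!$OMP PARALLEL
!$OMP DO PRIVATE(iflag, ierr)
  do i = 1, size(Z%objects)
    call B ( Z, Z%nwrk%Oorder(i), airow(i), iflag, ierr )
  enddo
!$OMP END DO
!$OMP END PARALLEL

Bear in mind that private copies are undefined after the region, so any status you need afterwards has to be captured inside the loop (into an array element indexed by i, say).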

qolin
Novice

Jim, Thanks again. Yes all routines are compiled for recursive code so locals should all be on (one of) the stack(s). It is linked with multithreaded libraries, and doesn't use 3rd-party code.

Encouraged by your advice yesterday, I spent most of the day working out what is going wrong, and indeed, there are a couple of itsy-bitsy parts of Z that must be shared between the objects; it looks like these are getting muddled by simultaneous read/write from multiple threads. I need to spend a couple more days re-organising things so these can be moved outside the parallel code; hopefully that will then allow it to work reliably. I'll keep you posted.

IanH: no, B doesn't change these args.

jimdempseyatthecove
Honored Contributor III

>>so locals should all be on (one of) the stack(s).

This should read (each of).

>> couple of itsy-bitsy parts of Z

If these are temporary things, consider moving them into the object. This will cost a little more memory, but that is usually not an issue any more. Good luck with your fixup.
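
As a minimal sketch (the type and field names are made up for illustration):

type :: object_t
  ! ... existing per-object results ...
  real, allocatable :: scratch(:)   ! formerly in the shared part of Z, now per object
end type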

Jim Dempsey

IanH
Honored Contributor II

qolin wrote:

...there are a couple of itsy-bitsy parts of Z that must be shared between the objects; looks like these are getting muddled due to simultaneous read/write by multiple threads...

IanH: no, B doesn't change these args.

I'm not sure we are talking about the same arguments... Z is the first actual argument in the call to B, and the first fragment quoted above suggests it is being modified. Just be mindful of Fortran's rules around aliasing between arguments.
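
For illustration, using the call from the original post: the second actual argument is part of the first.

! Aliasing hazard sketch: Oorder(i) is part of Z. If B modifies Z through
! its first dummy argument in storage that overlaps the second, the
! program breaks Fortran's aliasing rules and may be mis-optimized.
call B ( Z, Z%nwrk%Oorder(i), airow(i), iflag, ierr )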

qolin
Novice

Looks like I now have this working.

I reorganised the code to ensure the parts of Z that are shared between the objects are updated outside the parallelised loop. I also had to split that loop into a nested two-loop layout, with only the inner one running in parallel and the outer one handling the shared-data updates. Thanks again Jim.
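
A minimal sketch of that structure (the block bounds and the update routine are illustrative names, not the actual code):

do iblk = 1, nblocks                     ! outer loop, serial
  call update_shared_parts ( Z )         ! shared parts of Z touched here only
!$OMP PARALLEL DO
  do i = blk_start(iblk), blk_end(iblk)  ! inner loop, parallel
    call B ( Z, Z%nwrk%Oorder(i), airow(i), iflag, ierr )
  enddo
!$OMP END PARALLEL DO
enddo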

IanH, yes sorry, you are right. Z, the first argument, is the overall structure, wherein the whole world resides. Yes it gets updated, but not the part that is referenced in the 2nd arg. Anyway, the call looks very different now. Thanks for your questions.

Qolin

 

jimdempseyatthecove
Honored Contributor III

Qolin,

Do not forget the note IanH wrote about the arguments iflag and ierr (as written in #1) being shared. If these values are initialized once outside the parallel region, and then only updated should an error occur, your code will likely run as intended. However, should the code inside the parallel region set "success" as well as "error", as the case may be, then you will have a last-thread-wins situation. This situation may not expose itself during testing.
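
One hedged way to avoid that, assuming zero means success and larger codes mean worse errors (ierr_i is a hypothetical per-iteration variable):

ierr = 0
!$OMP PARALLEL DO PRIVATE(iflag, ierr_i) REDUCTION(MAX:ierr)
do i = 1, size(Z%objects)
  call B ( Z, Z%nwrk%Oorder(i), airow(i), iflag, ierr_i )
  ierr = max(ierr, ierr_i)   ! keep the worst status seen on this thread
enddo
!$OMP END PARALLEL DO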

Jim Dempsey

TimP
Honored Contributor III

If you don't have the time to make your testing cover issues such as those Jim mentioned, the Inspector tool included in the more comprehensive editions of Parallel Studio aims to detect these things automatically.
