Am I barking up the wrong tree? Got the wrong end of the stick?
I have an existing program (22k LOC) that I am trying to parallelize using OpenMP.
- Is this a reasonable way to parallelize this code?
- Have I misunderstood something important concerning threads?
- Do I need additional directives in B, and in any/all of the routines it calls? (There are about 50 of these.)
While your overall philosophy is correct, there are likely some details getting in the way. The original code was written without provisions for parallel execution. One of the principal causes of problems in converting such code to run in parallel is assuring that each thread (conceptually, each i'th iteration) does not adversely interfere with the concurrent operation of a different i'th iteration. While your object data may be separated, you will often find workspace data (e.g. temporary arrays and state variables), in your case held in Z, that are used/modified by all threads concurrently as if each had exclusive use. It will be your job to identify which, and then devise a means to separate these arrays/variables on a thread-by-thread basis.
In your above sketch, I assume Z%nwrk%Oorder(i) is your object reference, .AND. Z is a reference to some global state information .AND. scratch data (temporary arrays and scalars).
Without examining the code, it is difficult to make the best suggestion. IIF (if and only if) the bulk of the data in Z, exclusive of the object data, is preponderantly scratch data (temporary arrays and scalars), then:
a) If not already, make your object array allocatable
b) Make the global Z an allocatable array of type Z_t, allocated to size omp_get_max_threads()
c) rework the code to something like this:
type(Z_t), allocatable :: Z(:)
...
allocate(Z(0:omp_get_max_threads()-1))
...
CALL YourInit_Z(Z(0)) ! formerly YourInit_Z(Z)
! copy init-time parameters
DO i=1,omp_get_max_threads()-1
    Z(i)%SomeParameter = Z(0)%SomeParameter
    ...
END DO
!$OMP PARALLEL
!$OMP DO
do i = 1, size(Z(0)%objects)
    call B ( Z(omp_get_thread_num()), Z(0)%nwrk%Oorder(i), airow(i), iflag, ierr )
enddo
!$OMP END DO
!$OMP END PARALLEL
The above has the advantage that everything in the call tree under B will likely not have to be changed. You can also perform a reduction after the parallel region should you require an accumulative action. The disadvantage is having duplicate parameters.
A cleaner way, but more work and a potential source of new bugs, is to separate Z into two objects, Zshared and Zworking.
type(Zworking_t), allocatable :: Zworking(:)
...
allocate(Zworking(0:omp_get_max_threads()-1))
CALL YourInit_Z(Zshared)
!$OMP PARALLEL
!$OMP DO
do i = 1, size(Zshared%objects)
    call B ( Zshared, Zworking(omp_get_thread_num()), Zshared%nwrk%Oorder(i), airow(i), iflag, ierr )
enddo
!$OMP END DO
!$OMP END PARALLEL
The code was, in fact, written with more than half an eye on it being parallelized at some time. Thus all workspace data and state variables are already held separately for each object, as allocatable structures and arrays under each object. The process of solving each object creates and destroys this data during the object solve. When the object solve is complete, the workspace data is deallocated, leaving just the object results, which again are held under the object. Thus it seems to me there is very little scope for 2 or more threads trying to access or write to the same variables... at least as far as Z is concerned.
Of course the solution routines also use local variables and arrays, but none of these are saved or declared in global modules. I was assuming all these would be "automatically" created separately for each thread...
>>Of course the solution routines also use local variables and arrays, but none of these are saved or declared in global modules. I was assuming all these would be "automatically" created separately for each thread...
Then this does not explain your earlier statement: "but it quickly goes wrong."
You should be aware that locally scoped arrays (those in subroutines and functions), when not compiled with recursive or OpenMP options, may default to SAVE. Therefore, all procedures and Fortran libraries linked must also be built with options that assure that locally declared arrays are on the stack (with OpenMP enabled on all sources). Also assure that any 3rd-party libraries are thread-safe.
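To illustrate the pitfall, here is a minimal sketch (procedure names are invented for illustration). Note that a local array with an initializer is implicitly SAVEd by the Fortran standard, so every thread shares the one copy; an allocatable (or automatic) local gets a fresh copy per invocation:

```fortran
subroutine accumulate_bad(x, total)
  real, intent(in)  :: x(:)
  real, intent(out) :: total
  real :: work(1000) = 0.0   ! initializer => implicit SAVE: one shared
                             ! copy, a data race under OpenMP
  work(1:size(x)) = x * 2.0
  total = sum(work(1:size(x)))
end subroutine accumulate_bad

subroutine accumulate_ok(x, total)
  real, intent(in)  :: x(:)
  real, intent(out) :: total
  real, allocatable :: work(:)  ! allocatable local: each invocation
  allocate(work(size(x)))       ! (hence each thread) gets its own copy
  work = x * 2.0
  total = sum(work)
  deallocate(work)
end subroutine accumulate_ok
```

The same hazard applies to uninitialized locals when the compiler is told to give them static storage, which is why the compile options matter on every source file.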
If you can confirm that everything is compiled with the correct options (and linked to thread-safe libraries), then this indicates that your assumption about the dynamically modified data being thread-safe is incorrect.
call B ( Z, Z%nwrk%Oorder(i), airow(i), iflag, ierr )
Nothing to do with OpenMP really, but does the subroutine modify either of its first two arguments?
Note iflag and ierr are shared, unless you explicitly declare them to be private.
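One way to handle this, sketched under the assumption that iflag and ierr are per-call status codes (ierr_any is an invented name): make them PRIVATE and fold any failure into a shared result with a REDUCTION clause, then inspect it after the loop.

```fortran
ierr_any = 0
!$OMP PARALLEL DO PRIVATE(iflag, ierr) REDUCTION(max:ierr_any)
do i = 1, size(Z(0)%objects)
    call B ( Z(omp_get_thread_num()), Z(0)%nwrk%Oorder(i), airow(i), iflag, ierr )
    if (ierr /= 0) ierr_any = max(ierr_any, ierr)  ! record worst error seen
enddo
!$OMP END PARALLEL DO
if (ierr_any /= 0) then
    ! handle the error serially, after the parallel region
end if
```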
Jim, Thanks again. Yes all routines are compiled for recursive code so locals should all be on (one of) the stack(s). It is linked with multithreaded libraries, and doesn't use 3rd-party code.
Encouraged by your advice yesterday, I spent most of the day working out what is going wrong, and indeed, there are a couple of itsy-bitsy parts of Z that must be shared between the objects; looks like these are getting muddled due to simultaneous read/write by multiple threads. I need to spend a couple more days re-organising things so these can be moved outside the parallel code, hopefully this will then allow it to work reliably. I'll keep you posted.
IanH: no, B doesn't change these args.
>>so locals should all be on (one of) the stack(s).
This should read (each of).
>> couple of itsy-bitsy parts of Z
If these are temporary things, consider moving them into the object. This will cost a little more memory, but that usually is not an issue any more. Good luck with your fixup.
...there are a couple of itsy-bitsy parts of Z that must be shared between the objects; looks like these are getting muddled due to simultaneous read/write by multiple threads...
IanH: no, B doesn't change these args.
I'm not sure we are talking about the same arguments... Z is the first actual argument in the call to B, and the first fragment quoted above suggests it is being modified. Just be mindful of Fortran's rules around aliasing between arguments.
Looks like I now have this working.
I reorganised the code to ensure that the parts of Z shared between the objects are updated outside the parallelised loop. I also had to unroll that loop into a nested two-loop layout, with only the inner loop running in parallel and the outer loop handling the shared-data updates. Thanks again Jim.
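For other readers, the reorganisation described above might look something like this sketch (all identifiers here are guesses, not the actual code): the outer serial loop touches the shared parts of Z, and only the inner loop over independent objects is parallelised.

```fortran
do iblock = 1, nblocks
    call UpdateSharedPartsOfZ(Z, iblock)   ! serial: shared state only
    !$OMP PARALLEL DO PRIVATE(iflag, ierr)
    do i = first(iblock), last(iblock)     ! objects independent within a block
        call B ( Z, Z%nwrk%Oorder(i), airow(i), iflag, ierr )
    enddo
    !$OMP END PARALLEL DO
enddo
```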
IanH, yes sorry you are right. Z, the first argument, is the overall structure, wherein the whole world resides. Yes it gets updated, but not the part that is referenced in the 2nd arg. Anyway the call looks very different now. Thanks for your questions.
Do not forget IanH's note about the arguments iflag and ierr (as written in #1) being shared. If these values are initialized once outside the parallel region and only updated should an error occur, then your code will likely run as intended. However, should the code inside the parallel region set "success" as well as "error", as the case may be, then you will have a "last thread wins" situation. This situation may not expose itself during testing.
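A sketch of a publish-errors-only pattern that avoids the "last thread wins" hazard (ierr_global and the CRITICAL name are illustrative): a thread that succeeds writes nothing, so a later success can never erase an earlier error.

```fortran
ierr_global = 0
!$OMP PARALLEL DO PRIVATE(iflag, ierr)
do i = 1, size(Z(0)%objects)
    call B ( Z(omp_get_thread_num()), Z(0)%nwrk%Oorder(i), airow(i), iflag, ierr )
    if (ierr /= 0) then
!$OMP CRITICAL(record_error)
        ierr_global = ierr      ! only errors are published here
!$OMP END CRITICAL(record_error)
    end if
enddo
!$OMP END PARALLEL DO
```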
If you don't have the time to make your testing cover issues such as Jim mentioned, the Inspector tool included in the more comprehensive editions of Parallel Studio aims to detect these things automatically.