How best to parallelize this code depends on factors that a simple sketch of your code, like the one in your opening post, does not reveal.
1) How many sections of serial code do you have?
2) How many threads are available (i.e. what is the ratio of threads to serial sections)?
3) Of the subroutines being called in your serial loops, which are dependent on which others?
4) Of the subroutines that are dependent on earlier serial loops, which depend only on the same element and which depend on the entire earlier loop completing?
5) Other issues.
Parallelization options that depend on the questions above:
! each subroutine is element-wise independent of other elements
! but may be dependent on sequence A-Z
do t=0,nT
  !$omp parallel do
  do i=1,N
    call A(i)
    call B(i)
    ...
    call Z(i)
  end do
  !$omp end parallel do
end do
! each loop run in parallel but in sequence A, B, ... Z
do t=0,nT
  !$omp parallel do
  do i=1,N
    call A(i)
  end do
  !$omp end parallel do
  !$omp parallel do
  do i=1,N
    call B(i)
  end do
  !$omp end parallel do
  ...
  !$omp parallel do
  do i=1,N
    call Z(i)
  end do
  !$omp end parallel do
end do
! each loop run by one thread
do t=0,nT
  !$omp parallel sections private(i)
  do i=1,N
    call A(i)
  end do
  !$omp section
  do i=1,N
    call B(i)
  end do
  !$omp section
  ...
  !$omp end parallel sections
end do
The first method establishes one team per time step and slices the single fused loop across it, which reduces the number of team starts/stops. The second method (which I assume is what you are currently doing) increases the number of team starts/stops. The third method establishes one team with one thread per loop; its effectiveness depends on the number of loops and the amount of processing in each loop.
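If the loops must stay separate and in sequence, as in the second method, the team start/stop cost can still be reduced by opening the parallel region once and using plain worksharing loops inside it. The following is only a minimal, self-contained sketch: the arrays x and y and the arithmetic in the loop bodies are hypothetical stand-ins for your call A(i) and call B(i), and the second loop is assumed to need the whole first loop finished.

```fortran
program one_team_many_loops
  implicit none
  integer, parameter :: N = 1000, nT = 10   ! hypothetical sizes
  real :: x(N), y(N)                         ! hypothetical shared data
  integer :: t, i

  x = 0.0
  y = 0.0

  !$omp parallel private(t, i)   ! team is created once, outside the t loop
  do t = 0, nT
    !$omp do
    do i = 1, N
      x(i) = x(i) + 1.0          ! stands in for call A(i)
    end do
    !$omp end do                 ! implied barrier: all of "A" is done here
    !$omp do
    do i = 1, N
      y(i) = y(i) + x(N + 1 - i) ! stands in for call B(i); reads another element's result
    end do
    !$omp end do                 ! implied barrier before the next time step
  end do
  !$omp end parallel

  print *, x(1), y(1)
end program one_team_many_loops
```

The implied barrier at each end do is what keeps the A-then-B ordering; if a later loop only depends on the same element, fusing the calls as in the first method is usually the simpler choice.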
A fourth method could be a variation of method 3 where you reorder and do more than one loop within a section.
An additional optimization may be available if you note in your serial code that the early-on subroutines run completely independently of the later-on subroutines. An example might be performing the physics computations up to the point where positional data is updated, and after positional information is updated, you call graphics routines to render the scene. In this situation you would code something like
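A minimal sketch of that fourth method, under stated assumptions: the contained subroutines A, B, C and D and the work arrays wa..wd are hypothetical placeholders, B is assumed to depend on A for the same element, and C and D are assumed independent of everything else. Grouping A and B into one section keeps their order while leaving the other sections free to run on other threads.

```fortran
program grouped_sections
  implicit none
  integer, parameter :: N = 1000, nT = 10   ! hypothetical sizes
  real :: wa(N), wb(N), wc(N), wd(N)         ! hypothetical work arrays
  integer :: t, i

  wa = 0.0; wb = 0.0; wc = 0.0; wd = 0.0

  do t = 0, nT
    !$omp parallel sections private(i)
    !$omp section
    ! two loops grouped in one section: B runs after A on the same thread
    do i = 1, N
      call A(i)
    end do
    do i = 1, N
      call B(i)
    end do
    !$omp section
    do i = 1, N
      call C(i)
    end do
    !$omp section
    do i = 1, N
      call D(i)
    end do
    !$omp end parallel sections
  end do

  print *, wa(1), wb(1), wc(1), wd(1)

contains

  subroutine A(i)
    integer, intent(in) :: i
    wa(i) = wa(i) + 1.0
  end subroutine A

  subroutine B(i)
    integer, intent(in) :: i
    wb(i) = wb(i) + wa(i)      ! depends on A for the same element
  end subroutine B

  subroutine C(i)
    integer, intent(in) :: i
    wc(i) = wc(i) + 3.0
  end subroutine C

  subroutine D(i)
    integer, intent(in) :: i
    wd(i) = wd(i) + 4.0
  end subroutine D

end program grouped_sections
```

How you group the loops is a balancing exercise: ideally each section carries a similar amount of work, and the number of sections is close to the number of threads you have.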
!$omp parallel private(t, i) ! create team outside t loop
do t=0,nT
  !$omp sections ! note removal of parallel
  do i=1,N
    call A(i)
  end do
  !$omp section
  ...
  !$omp end sections
  !$omp barrier
  !$omp do
  do i=1,N
    call AdvancePosition(i)
  end do
  !$omp end do
  !$omp master
  call Render()
  !$omp end master
end do
!$omp end parallel
*** untested sketch above
What the above provides is that the additional threads can begin the next step's physics calculations while the master thread is rendering (assuming rendering is not parallel). This works because the master construct has no implied barrier at its end.
Hopefully this will give you some hints as to where you can go.
Note, add this last step _after_ you get everything else working at top speed.