Hi,
I need some help parallelizing a time-domain simulation program that we implemented in Fortran. The main structure is:
DO t=0,Nt !HAS TO BE IN SERIAL
.
.
.
SOME SERIAL CODE
DO I=1,N
CALL F(I)
END DO
SOME SERIAL CODE
DO I=1,N
CALL G(I)
END DO
SOME SERIAL CODE
.
.
.
END DO
The big DO has to be serial (it's the time incrementing). The inner DOs can be parallelized. N is usually big (8000-9000) and the F(), G(), ... functions are computationally intensive.
I tried parallelizing the inner DO loops with !$OMP PARALLEL DO directives. It runs OK (I solved the data dependencies etc.), BUT it is a lot slower than running in serial!
-Is it because new threads are created and destroyed every time, which introduces overhead?
-Should I start with !$OMP PARALLEL at the beginning, use !$OMP DO / !$OMP END DO for the loops, and protect the serial parts with !$OMP MASTER?
-If yes, where should I put !$OMP PARALLEL? Before or after the big DO loop (the one that needs to be serial)?
Any suggestions?
Thanks people...
2 Replies
How best to parallelize this code depends on several factors that a simple sketch like the one in your opening post does not show:
1) How many sections of serial code do you have?
2) How many threads are available (i.e. what is the ratio of threads to serial sections)?
3) Of the subroutines being called in your loops, which are dependent on which others?
4) Of the subroutines that depend on earlier loops, which depend only on the same element, and which depend on the entire earlier loop completing?
5) other issues.
Parallelization options that depend on the answers to the questions above:
! each subroutine is element-wise independent of other elements
! but may depend on the call sequence A-Z
do t=0,nT
  !$omp parallel do
  do i=1,N
    call A(i)
    call B(i)
    ...
    call Z(i)
  end do
  !$omp end parallel do
end do
! each loop run in parallel but in sequence A, B, ... Z
do t=0,nT
  !$omp parallel do
  do i=1,N
    call A(i)
  end do
  !$omp end parallel do
  !$omp parallel do
  do i=1,N
    call B(i)
  end do
  !$omp end parallel do
  ...
  !$omp parallel do
  do i=1,N
    call Z(i)
  end do
  !$omp end parallel do
end do
! each loop run by one thread
do t=0,nT
  !$omp parallel sections private(i)
  do i=1,N
    call A(i)
  end do
  !$omp section
  do i=1,N
    call B(i)
  end do
  !$omp section
  ...
  !$omp end parallel sections
end do
The first method establishes one team per time step and slices the single fused loop across it. This reduces the number of team starts and stops.
The second method (which I assume is what you are currently doing) increases the number of team starts and stops, one pair for every loop.
The third method establishes one team and gives each loop to a single thread; its effectiveness will depend on the number of loops and the amount of processing in each loop.
A fourth method could be a variation of method 3 where you reorder the loops and run more than one loop within a section.
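On your question about where to put !$OMP PARALLEL: one more variation is to open the parallel region once, before the time loop, and use work-sharing !$omp do for the inner loops with !$omp single around the serial parts. Below is a minimal sketch of that layout, not tested against your program. The subroutines f and g, the array x, the accumulator total, and the sizes are hypothetical stand-ins, not your actual routines. It assumes the serial parts must see the completed previous loop, so it relies on the implicit barriers at !$omp end do and !$omp end single.

```fortran
program persistent_team
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 8000, nt = 10      ! illustrative sizes only
  real(dp) :: x(n), total, expected
  integer :: i, t

  x = 0.0_dp
  total = 0.0_dp

  ! The team is created once; every thread executes the time loop in step.
  !$omp parallel default(shared) private(i, t)
  do t = 1, nt
     !$omp do schedule(static)
     do i = 1, n
        call f(i, x)                 ! hypothetical stand-in for F(I)
     end do
     !$omp end do                    ! implicit barrier: x is fully updated

     !$omp single
     total = total + sum(x)          ! "serial code", run by exactly one thread
     !$omp end single                ! implicit barrier before the next loop

     !$omp do schedule(static)
     do i = 1, n
        call g(i, x)                 ! hypothetical stand-in for G(I)
     end do
     !$omp end do                    ! implicit barrier before the next time step
  end do
  !$omp end parallel

  ! Self-check: with these stand-ins each element equals 2**t - 1 when it is
  ! summed at step t, so total = n * (2**(nt+1) - 2 - nt).
  expected = real(n, dp) * real(2**(nt + 1) - 2 - nt, dp)
  if (abs(total - expected) > 0.5_dp) error stop 'unexpected total'
  print *, 'total =', total

contains

  subroutine f(i, x)
    integer, intent(in) :: i
    real(dp), intent(inout) :: x(:)
    x(i) = x(i) + 1.0_dp             ! touches only element i
  end subroutine f

  subroutine g(i, x)
    integer, intent(in) :: i
    real(dp), intent(inout) :: x(:)
    x(i) = 2.0_dp * x(i)             ! touches only element i
  end subroutine g

end program persistent_team
```

I used !$omp single rather than !$omp master here because single has an implied barrier at its end and master does not; with master you would need an explicit !$omp barrier before any loop that reads what the serial part wrote. Build with your compiler's OpenMP switch (for example -fopenmp with gfortran or -qopenmp with Intel Fortran).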
An additional optimization may be available if you note that, in your serial code, the early subroutines run completely independently of the later subroutines. An example might be performing the physics computations up to the point where positional data is updated and then, after the positional information is updated, calling graphics routines to render the scene. In this situation you would code something like
!$omp parallel private(t, i) ! create team outside t loop
do t=0,nT
  !$omp sections ! note removal of parallel
  do i=1,N
    call A(i)
  end do
  !$omp section
  ...
  !$omp end sections
  !$omp barrier
  !$omp do
  do i=1,N
    call AdvancePosition(i)
  end do
  !$omp end do
  !$omp master
  call Render()
  !$omp end master
end do
!$omp end parallel
*** untested sketch above
What the above provides is a way for the additional threads to begin the next physics calculations while the master thread is rendering (assuming the rendering is not itself parallel).
Hopefully this will give you some hints as to where you can go.
Note, add this last step _after_ you get everything else working at top speed.
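Before fine-tuning, it helps to measure each variation against the serial loop. The following is a rough timing sketch using omp_get_wtime from the standard omp_lib module. The function work, the array y, and the sizes are hypothetical placeholders for your own routines; substitute the loop bodies you actually want to compare.

```fortran
program time_loop
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 8000, nt = 200     ! illustrative sizes only
  real(dp) :: y(n), t0, t_serial, t_parallel, check_serial, check_parallel
  integer :: i, t

  ! Serial reference run.
  t0 = omp_get_wtime()
  do t = 1, nt
     do i = 1, n
        y(i) = work(i + t)
     end do
  end do
  t_serial = omp_get_wtime() - t0
  check_serial = sum(y)

  ! Same work with a team started and stopped on every time step.
  t0 = omp_get_wtime()
  do t = 1, nt
     !$omp parallel do schedule(static)
     do i = 1, n
        y(i) = work(i + t)
     end do
     !$omp end parallel do
  end do
  t_parallel = omp_get_wtime() - t0
  check_parallel = sum(y)

  ! Both runs compute each element the same way, so the checksums must agree.
  if (abs(check_serial - check_parallel) > 1.0e-9_dp) error stop 'results differ'

  print *, 'threads available:', omp_get_max_threads()
  print *, 'serial   time (s):', t_serial
  print *, 'parallel time (s):', t_parallel

contains

  ! Hypothetical compute kernel standing in for F(I) or G(I).
  pure function work(j) result(r)
    integer, intent(in) :: j
    real(dp) :: r
    integer :: k
    r = 0.0_dp
    do k = 1, 200
       r = r + sin(real(j + k, dp))
    end do
  end function work

end program time_loop
```

If the parallel time comes out worse than the serial time even with a heavy kernel, the cause is more likely in how the loop body shares data (for example a critical section or threads writing to neighbouring memory) than in the team start/stop cost alone.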
Jim Dempsey
Thanks! I'll try the variations and try to fine tune them! I'm currently doing the 2nd one!