Follow-up to the "!DEC$ PARALLEL" thread (the method has since changed to OpenMP).
Thanks to help received in the above thread and to a report by Meloni et al (2003)*, I have now obtained a worthwhile improvement in execution time using OpenMP parallel coding (see the "!DEC$ PARALLEL" thread).
With one inner loop parallel coded, the reduction in analysis time is now ~25% when using three threads (execution was limited to 3 threads on the 4 cores so the PC remains responsive). The improvement was achieved by hand coding an equivalent "reduction" process for the accumulation array (illustrated below). Interestingly, Meloni et al also found their hand-coded version more efficient than the standard !$OMP ... REDUCTION coding.
Given that the threaded loop likely contributed only ~50% of the original CPU time, the 25% reduction in total analysis time equates to a scale factor of around 2 on that loop (for 3 threads). If the loop was only 40% of the total workload, the scale factor is closer to 3. By comparison, the 12% improvement obtained previously equates to scale factors of only 1.3 and 1.4 under the same assumptions.
I will parallel code a couple of other inner loops, but the returns will diminish until I shift the parallel coding to the outermost loop. That will require restructuring parts of the program.
Thanks
David
######
The code that worked (illustration only):
Earlier in the program:
nTd = 1
!$ nTd = MIN( OMP_GET_MAX_THREADS(), 3 )
!$ CALL OMP_SET_NUM_THREADS(nTd)
ALLOCATE( Temp(15, nL, nTd) )   ! per-thread accumulation array
In the subject subroutine:
Temp = 0.d0
!$OMP PARALLEL PRIVATE(iTd) SHARED(....)
!$ iTd = OMP_GET_THREAD_NUM() + 1
!$OMP DO PRIVATE(....)
DO i = 1,LargeNum
k = kv(i)
...
DO j = 1,15
....
x = xFUNCTION(....)
p = ... (depends on x, k & j)
Temp(j,iloc(k),iTd) = Temp(j,iloc(k),iTd) + p
END DO
END DO
!$OMP END DO
!$OMP END PARALLEL
Res = SUM( Temp, DIM=3 )
* Meloni et al (2003), "Reduction on arrays: comparison of performances among different algorithms", EWOMP03.
http://www.compunity.org/events/ewomp03/omptalks/Monday/Session3/T21p.pdf