Hi there,
Last month I attended an Intel software seminar about multithreading. I talked to Dr. Keilmann from Intel Germany about techniques for adding multithreading to existing projects, and he told me that OpenMP is a very simple way to do that.
I analysed my current project and identified the routines that take the most CPU time. Those routines contain a lot of DO loops, so I tried to parallelise them with
!$OMP PARALLEL DO
My machine is a single-core P4 system, so I expected no speedup or only a small one. To my surprise, the calculation took 20% longer than before. So I tried my colleague's computer with a Pentium D dual-core CPU, but there the slowdown was over 50%!
The parallelised DO loops don't do much (only one longer multiply-and-add operation); here is an example:
      ix=1
!$OMP PARALLEL DO
      do iy=2,nyStrang
         tt=nint(t2d(ix,iy))
         h2d(ix,iy)=h2d(ix,iy)-cy*wlz(tt,1)
     $      *(2.*t2d(ix,iy)-t2d(ix,iy-1)-t2d(ix,iy+1))
     $      -crx*wlz(tt,1)*(t2d(ix,iy)-t2d(ix-1,iy))
     $      +cryr*aRandSeite*(tempRand(2,iy)-t2d(ix,iy))
      end do
!$OMP END PARALLEL DO
Are these too few operations to be worth parallelising (too much overhead?), or did I forget something else? I use WinXP SP2, VS2003, and IVF 10.1. I enabled OpenMP Conditional Compilation and Process OpenMP Directives: Generate Parallel Code (/Qopenmp).
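One standard way to limit such overhead is OpenMP's IF clause, which keeps the loop serial unless a condition holds. A minimal sketch against the loop above, with the body shortened to one term; the threshold of 5000 is purely illustrative, not a measured value:
!$OMP PARALLEL DO IF(nyStrang > 5000)
      do iy=2,nyStrang
         ! run in parallel only when the trip count can amortize thread startup
         h2d(ix,iy)=h2d(ix,iy)+cryr*aRandSeite*(tempRand(2,iy)-t2d(ix,iy))
      end do
!$OMP END PARALLEL DO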
Furthermore, a lot of "old" Fortran routines my colleagues wrote decades ago won't function properly. For example, a windowconfig struct crashes when I call status=setWindowConfig(wc) unless I initialise every wc%... field with 0.
Thanks in advance,
Markus
The index can be 0; the array is declared with 0 as its first element.
The iy-1 index is applied to t2d, not h2d. Because I modify h2d and not t2d, I can read t2d(ix,iy-1) in my code without getting a race condition.
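A minimal sketch of that distinction, with hypothetical one-dimensional arrays a and b (not from the project code): reading a neighbour of an array that is only read is safe, while reading a neighbour of the array being written creates a loop-carried dependence:
      real :: a(0:100), b(0:100)
      integer :: iy
!$OMP PARALLEL DO
      do iy=1,99
         b(iy)=b(iy)+a(iy-1)   ! safe: a is read-only, each b(iy) has exactly one writer
      end do
!$OMP END PARALLEL DO
!$OMP PARALLEL DO
      do iy=1,99
         a(iy)=a(iy)+a(iy-1)   ! NOT safe: another thread may be updating a(iy-1)
      end do
!$OMP END PARALLEL DO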
One potential problem is adverse cache interaction between the two threads.
Your arrays h2d(ix,iy) and t2d(ix,iy) are referenced with the second index varying fastest in the inner loop. In Fortran, the first index is the one that should vary fastest, because it is contiguous in memory.
If you can redeclare your arrays so that the indexes are transposed, you may find your loops running much faster, and they may parallelize better.
Note: if your code has hundreds or thousands of such references to h2d and/or t2d, it may be impractical to edit them all. Instead, you can use the Fortran preprocessor to reorder the indexes.
Either #define _REORDERXY in a common #include file, or define it in the project settings or on the command line:
! common #include file
#define _REORDERXY
#ifdef _REORDERXY
! use lower case
#define h2d(x,y) H2D(y,x)
#define t2d(x,y) T2D(y,x)
#endif
! module that declares array
! remember to #include common #include file that defines macros for h2d and t2d
#ifdef _REORDERXY
! use upper case
real :: H2D(0:DimY-1,0:DimX-1)
real :: T2D(0:DimY-1,0:DimX-1)
#else
! current way (using lower case)
real :: h2d(0:DimX-1,0:DimY-1)
real :: t2d(0:DimX-1,0:DimY-1)
#endif
Note:
The #define macros must be in a source file #include'd into every file that references h2d and t2d. These macros cannot be defined within a module and then picked up by source files USE-ing the module.
If your old code has references to the upper-case names, change them to the lower-case names. The lower-case names serve as the macro names, which now transpose the indexes. Also, when compiled without _REORDERXY defined, the lower-case names simply refer to the upper-case names as Fortran normally does (case-insensitively), without transposing the indexes.
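To illustrate with the statement from the first post: with _REORDERXY defined, the preprocessor rewrites each reference before the compiler sees it, so the inner iy loop now walks the contiguous first dimension:
! as written in the source
h2d(ix,iy) = h2d(ix,iy) - cy*wlz(tt,1)*(2.*t2d(ix,iy)-t2d(ix,iy-1)-t2d(ix,iy+1))
! what the compiler sees after macro expansion
H2D(iy,ix) = H2D(iy,ix) - cy*wlz(tt,1)*(2.*T2D(iy,ix)-T2D(iy-1,ix)-T2D(iy+1,ix))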
Jim Dempsey
You need to declare tt private:
!$OMP PARALLEL DO PRIVATE(tt)
and possibly take steps to keep the values of the ix variable consistent with serial execution.
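For reference, the posted loop with that clause applied (all names as in the first post); each thread then works on its own copy of tt instead of racing on a shared one:
!$OMP PARALLEL DO PRIVATE(tt)
      do iy=2,nyStrang
         tt=nint(t2d(ix,iy))
         h2d(ix,iy)=h2d(ix,iy)-cy*wlz(tt,1)
     $      *(2.*t2d(ix,iy)-t2d(ix,iy-1)-t2d(ix,iy+1))
     $      -crx*wlz(tt,1)*(t2d(ix,iy)-t2d(ix-1,iy))
     $      +cryr*aRandSeite*(tempRand(2,iy)-t2d(ix,iy))
      end do
!$OMP END PARALLEL DO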
Regards
Thanks for your replies,
The arrays are not very big: ix is 30 and iy is 20 wide. So maybe these are too few elements to get a speedup with OpenMP?
Jim, thanks for the tip that varying the first index of an array is faster than varying the second. But in my code I calculate the "inner" elements of the matrix first and then the outer ones (left, right, top, bottom, and the 4 corners of the matrix). So transposing would be useless, because in 50% of the cases one ordering is better than the other :-)
Reinhold, you seem to live in Germany too. It would be great if you contacted me so we could discuss OpenMP topics via email: onkelhotte at gmx.de
I'm new to this topic, and help in my mother language is easier than in a foreign one.
Markus
Markus,
Regarding reordering the indexing:
It may be worth your time to analyze your calculation requirements. Keep in mind that, in addition to cache interaction issues, the IA32 and EM64T architectures have SIMD capabilities: the instruction set can perform 2 or 4 floating-point operations per instruction, in parallel, per core. With this in mind, an old algorithm written to conserve memory (less temporary usage) or to minimize the number of array accesses may run slower than it could, because it cannot take advantage of SIMD.
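As a sketch of what SIMD-friendly code looks like (hypothetical arrays, not from Markus's project): unit stride and independent iterations let the compiler pack several single-precision operations into one SSE instruction, whereas a loop-carried dependence forces scalar code:
      real :: u(1000), v(1000), w(1000)
      integer :: i
      do i=1,1000
         w(i)=u(i)*v(i)+w(i)   ! independent iterations, unit stride: a SIMD candidate
      end do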
Jim Dempsey
Thanks, Jim, for the hint regarding reordering the indexes. It gave a small performance boost just from swapping x and y.
With a working parallel executable for Pentium III CPUs (built with /O3 /Og /G6 /QaxK /Qparallel /Qopenmp) I measured the calculation time on my machine and on another PC, with a little surprise:
My machine needs 71 s for the calculation. CPU-Z dump:
Processor 1 (ID = 0)
Number of cores: 1
Number of threads: 1 (max 2)
Name: Intel Pentium 4
Codename: Northwood
Specification: Intel Pentium 4 CPU 3.20GHz
Instructions sets: MMX, SSE, SSE2
My colleague's machine needs 86 s for the calculation, even though he is using a dual core:
Processor 1 (ID = 0)
Number of cores: 2
Number of threads: 2 (max 2)
Name: Intel Pentium D 820
Codename: SmithField
Specification: Intel Pentium D CPU 2.80GHz
Instructions sets: MMX, SSE, SSE2, SSE3, EM64T
It is clear that 2 cores won't give double the speed, but it should still be faster. Task Manager says both cores are at 99% CPU time; when linked without /Qparallel and /Qopenmp, both cores sit at 50% (why not one core at 100%???) and the calculation takes 102 s.
In my opinion, 2 cores at 2.8 GHz should be faster than 1 core at 3.2 GHz. Building the executable with /G7 /QaxP /QxP for P4 SSE3 optimization brings the calculation down to 84 s on the dual-core machine, which is not much better. Maybe Steve can say something about this behaviour.
Markus
Markus,
I compiled the program below (max speed settings) and ran it on two systems. Not having any idea of the array sizes or the initial data, I just crammed in junk data and zeroes.
P4 530 (3.0GHz), single processor with HT, sequential: 66.75 sec
P4 530 (3.0GHz), single processor with HT, parallel: 65.125 sec
Opteron 270 (2.0GHz), dual processor, dual core, sequential: 91.25 sec
Opteron 270 (2.0GHz), dual processor, dual core, parallel: 27.50 sec
~3.32x faster with 4 threads
program Marcus
use omp_lib
implicit none
integer :: ix, iy, tt
integer :: nyStrang
integer :: nPasses, nPass
real :: cy, crx, cryr, aRandSeite, tStart, tEnd
real, allocatable :: t2d(:,:)
real, allocatable :: h2d(:,:)
real, allocatable :: wlz(:,:)
real, allocatable :: tempRand(:,:)
! Variables
nPasses = 5000
nyStrang = 100000
allocate( &
 & t2d(0:2,0:nyStrang+1), &
 & h2d(0:2,0:nyStrang+1), &
 & wlz(0:nyStrang+1,0:2), &
 & tempRand(0:2,0:nyStrang+1))
t2d = 0.0
h2d = 0.0
wlz = 0.0
tempRand = 0.0
cy = 1.23
crx = 3.45
cryr = 5.67
aRandSeite = 7.89
ix = 1
tStart = omp_get_wtime()
do nPass=1,nPasses
!$OMP PARALLEL DO default(shared) private(tt)
   do iy=2,nyStrang
      tt = nint(t2d(ix,iy))
      h2d(ix,iy) = h2d(ix,iy) - cy*wlz(tt,1) &
         & *(2.*t2d(ix,iy)-t2d(ix,iy-1)-t2d(ix,iy+1)) &
         & -crx*wlz(tt,1)*(t2d(ix,iy)-t2d(ix-1,iy)) &
         & +cryr*aRandSeite*(tempRand(2,iy)-t2d(ix,iy))
   end do
!$OMP END PARALLEL DO
end do
tEnd = omp_get_wtime()
write(*,*) 'Run time', tEnd-tStart
deallocate(t2d,h2d,wlz,tempRand)
end program Marcus
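When comparing machines like this, it can also be worth printing the actual thread count, so a run that silently fell back to one thread is caught. A small hedged addition using the standard OpenMP API, placed for example just before the timing loop:
!$OMP PARALLEL
!$OMP MASTER
write(*,*) 'running with', omp_get_num_threads(), 'threads'
!$OMP END MASTER
!$OMP END PARALLEL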
Jim Dempsey