Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

OpenMP overhead

onkelhotte
New Contributor II

Hi there,
last month I attended an Intel software seminar about multithreading. I talked to Dr. Keilmann from Intel Germany about techniques for adding multithreading to existing projects, and he told me that OpenMP is a very simple way to do that.

I analysed my current project and identified the routines that take the most CPU time. Those routines contain a lot of DO loops, so I tried to parallelise them with

!$OMP PARALLEL DO

My machine is a single-core P4 system, so I expected little or no performance boost. To my surprise, the calculation took 20% longer than before. So I tried my colleague's computer with a Pentium D dual-core CPU, but there the slowdown was over 50%!

The parallelised DO loops don't do much (only one longer multiply-and-add operation; here is an example):

      ix=1

!$OMP PARALLEL DO
      do iy=2,nyStrang
        tt=nint(t2d(ix,iy))
        h2d(ix,iy)=h2d(ix,iy)-cy*wlz(tt,1)
     $    *(2.*t2d(ix,iy)-t2d(ix,iy-1)-t2d(ix,iy+1))
     $    -crx*wlz(tt,1)*(t2d(ix,iy)-t2d(ix-1,iy))
     $    +cryr*aRandSeite*(tempRand(2,iy)-t2d(ix,iy))
      end do
!$OMP END PARALLEL DO

Are these too few operations to be worth parallelising (too much overhead?), or did I forget
something else? I use WinXP SP2, VS2003, IVF 10.1. I enabled OpenMP Conditional
Compilation and Process OpenMP Directives: Generate Parallel Code (/Qopenmp).
Furthermore, a lot of "old" Fortran routines my colleagues wrote some decades ago won't
function properly. For example, a windowconfig struct crashes when I call
status=setWindowConfig(wc) if I don't initialise every wc%... field with 0.

Thanks in advance,

Markus

anthonyrichards
New Contributor III
I think one of the problems is that the value of h2d(ix,iy) depends on values with index iy-1 and iy+1 as well as iy. By the way, you specify ix=1, and there is an expression t2d(ix-1,iy) with index ix-1, which will have an index of 0.
onkelhotte
New Contributor II

The index can be 0; the array is declared with 0 as its first element.

The iy-1 index is only used with t2d, not h2d. Because I change h2d and not t2d, I can use t2d(ix,iy-1) in my code without getting a race condition.

TimP
Honored Contributor III
Perhaps I missed your statement of the length of this loop. Parallelization seldom gives a big gain until the total length of the loop is about 5000, e.g. nested loops of length 200 with outer loop parallelized and inner loop vectorized. An exception may be where the case is skewed so as not to have good serial optimization, but good opportunities for parallelization.
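One hedged way to act on such a length threshold is OpenMP's IF clause, which falls back to serial execution when the iteration count is too small to amortize the fork/join overhead. A minimal, self-contained sketch (the 5000 threshold is the rule of thumb from this thread, not a measured constant):

```fortran
      program omp_if_demo
      use omp_lib
      implicit none
      integer, parameter :: n = 100000
      integer :: i
      real :: a(n)
      a = 1.0
! The IF clause runs the loop serially when n is below the
! threshold, avoiding thread fork/join overhead on short loops.
!$OMP PARALLEL DO IF(n > 5000)
      do i = 1, n
         a(i) = a(i) * 2.0
      end do
!$OMP END PARALLEL DO
      write(*,*) a(1)
      end program omp_if_demo
```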
jimdempseyatthecove
Honored Contributor III

One potential problem is adverse cache interaction between the two threads.

Your arrays h2d(ix,iy) and t2d(ix,iy) are referenced with the second index varying fastest in the inner loop. Fortran stores arrays column-major, so the first index is the one that should vary fastest.

If you can redeclare your arrays so that the indexes are transposed, you might find your loops running much faster, and they may parallelise better.

Note: if your code has hundreds or thousands of such references to h2d and/or t2d, it may be impractical to make these edits by hand. You can use the Fortran preprocessor to reorder the indexes instead.

Either #define _REORDERXY in a common #include file or define it in the project or as a command line option.

! common #include file
#define _REORDERXY
#ifdef _REORDERXY
! use lower case
#define h2d(x,y) H2D(y,x)
#define t2d(x,y) T2D(y,x)
#endif

! module that declares array
! remember to #include common #include file that defines macros for h2d and t2d
#ifdef _REORDERXY
! use upper case
real :: H2D(0:DimY-1,0:DimX-1)
real :: T2D(0:DimY-1,0:DimX-1)
#else
! current way (using lower case)
real :: h2d(0:DimX-1,0:DimY-1)
real :: t2d(0:DimX-1,0:DimY-1)
#endif

Note:
The #define macros must be in a source file #included into every file that references h2d and t2d.
These macros cannot be defined within a module and then used by source files USE-ing the module.

If your old code has references to the upper-case names, change them to the lower-case names. The lower-case names serve as the macro names, which now transpose the indexes. When compiled without defining _REORDERXY, the lower-case names simply refer to the upper-case declarations (without transposing the indexes), since Fortran is case-insensitive.
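To illustrate why the index order matters, here is a minimal, self-contained sketch (array names and sizes are made up for illustration) timing a sweep in the cache-friendly order; on most hardware this order runs noticeably faster than the transposed one because it walks memory contiguously.

```fortran
      program stride_demo
      use omp_lib
      implicit none
      integer, parameter :: nx = 2000, ny = 2000
      real :: a(nx, ny)
      integer :: i, j
      double precision :: t0, t1

      a = 0.0
      t0 = omp_get_wtime()
! Cache-friendly: the first index (i) varies fastest, matching
! Fortran's column-major storage, so memory is walked contiguously.
      do j = 1, ny
         do i = 1, nx
            a(i, j) = a(i, j) + 1.0
         end do
      end do
      t1 = omp_get_wtime()
      write(*,*) 'column-major sweep took', t1 - t0, 's'
      end program stride_demo
```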

Jim Dempsey

reinhold-bader
New Contributor II
Have you done the analysis required to identify shared vs. private entities in the code segment enclosed by the parallel region? At the very least, you need

!$OMP PARALLEL DO private(tt)

and possibly steps to keep the values of the ix variable consistent with serial execution.
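To make the hazard concrete, here is a minimal sketch (variable names mirror the thread's loop, but the body is invented for illustration): without PRIVATE(tt), all threads share a single tt, so one thread can overwrite it between another thread's assignment and its use.

```fortran
      program private_demo
      implicit none
      integer :: iy, tt
      integer :: out(8)
! tt is a scratch scalar written by every iteration; declaring
! it PRIVATE gives each thread its own copy, avoiding a race.
!$OMP PARALLEL DO PRIVATE(tt)
      do iy = 1, 8
         tt = iy * iy
         out(iy) = tt
      end do
!$OMP END PARALLEL DO
      write(*,*) out
      end program private_demo
```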


Regards
onkelhotte
New Contributor II

Thanks for your replies,
the arrays are not very big: ix runs to 30 and iy to 20. So maybe there are too few elements to get any speed improvement with OpenMP?

Jim, thanks for your tip that varying the first index of an array is faster than varying the last. But in my code I calculate the "inner" elements of the matrix first and then the outer ones (left, right, top, bottom, and the 4 corners of the matrix). So reordering would be useless, because in 50% of the cases one option is better than the other :-)

Reinhold, you seem to live in Germany too. It would be great if you contacted me so we could discuss OpenMP topics via email: onkelhotte at gmx.de
I'm new to this topic, and help in my native language is easier than in a foreign one.

Markus

jimdempseyatthecove
Honored Contributor III

Markus,

Regarding reordering indexing.

It may be worth your time to analyze your calculation requirements, keeping in mind that, in addition to cache-interaction issues, the IA-32 and EM64T architectures have SIMD capabilities. That is, the instruction set can perform 2 or 4 floating-point operations per instruction, in parallel, per core. With this in mind, an old algorithm that was written to conserve memory (less temporary usage) or to reduce the number of array accesses may run slower than necessary because it cannot take advantage of SIMD.
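As a hedged sketch of the kind of loop the compiler's vectorizer likes (array names and values are invented for illustration): a unit-stride loop with no cross-iteration dependences can be mapped directly onto SSE lanes.

```fortran
      program simd_demo
      implicit none
      integer, parameter :: n = 1000
      real :: a(n), b(n), c(n)
      integer :: i
      a = 1.0
      b = 2.0
! Unit-stride accesses and no dependences between iterations:
! the compiler can issue packed SSE operations covering
! several elements per instruction.
      do i = 1, n
         c(i) = a(i) * b(i) + a(i)
      end do
      write(*,*) c(1), c(n)
      end program simd_demo
```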

Jim Dempsey

onkelhotte
New Contributor II

Thanks, Jim, for the hint regarding reordering the indexes. It gave a small performance boost just from swapping x and y.

With a working parallel executable for Pentium III CPUs (built with /O3 /Og /G6 /QaxK /Qparallel /Qopenmp) I measured the calculation time on my machine and on another PC, with a small surprise:

My machine needs 71 s for the calculation. cpu-z dump:
Processor 1 (ID = 0)
Number of cores: 1
Number of threads: 1 (max 2)
Name: Intel Pentium 4
Codename: Northwood
Specification: Intel Pentium 4 CPU 3.20GHz
Instruction sets: MMX, SSE, SSE2

My colleague's machine needs 86 s for the calculation, even though it has a dual core:
Processor 1 (ID = 0)
Number of cores: 2
Number of threads: 2 (max 2)
Name: Intel Pentium D 820
Codename: SmithField
Specification: Intel Pentium D CPU 2.80GHz
Instruction sets: MMX, SSE, SSE2, SSE3, EM64T

It is clear that 2 cores aren't double speed, but it should still be faster. Task Manager says that both cores run at 99% CPU; when linked without /Qparallel and /Qopenmp, both cores run at 50% (why not one core at 100%???) and the calculation takes 102 s.
My opinion is that 2x2.8 GHz cores should be faster than 1x3.2 GHz core. Building the executable with /G7 /QaxP /QxP for P4 SSE3 optimization boosts the calculation to 84 s on the dual-core machine, which is not much better. Maybe Steve can say something about that behaviour.

Markus

TimP
Honored Contributor III
If you run a single thread on 2 cores without setting KMP_AFFINITY or setting affinity in task manager, you will see each core taking 50%. Apparently, on Windows, each time your job is interrupted, it will switch cores if the other core is not busy. You might try the parallel openmp build with one thread (and 2) and affinity set. If your application is dependent on L1 cache locality, setting affinity may speed it up.
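A hedged sketch of how this could be set on Windows before launching the program (the exact affinity values depend on your machine's topology, and myapp.exe is a placeholder name):

```
rem Pin OpenMP threads to cores so the scheduler cannot migrate
rem them between cores and evict the L1 working set.
set OMP_NUM_THREADS=2
set KMP_AFFINITY=granularity=fine,compact,verbose
myapp.exe
```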
jimdempseyatthecove
Honored Contributor III

Markus,

I compiled the program below (max-speed settings) and ran it on two systems. Not having any idea of the array sizes or the initialized data, I just crammed in junk data or zeros.

P4 530 (3.0GHz), single processor with HT, sequential: 66.75 sec.
P4 530 (3.0GHz), single processor with HT, parallel: 65.125 sec.
Opteron 270 (2.0GHz), dual processor, dual core, sequential: 91.25 sec.
Opteron 270 (2.0GHz), dual processor, dual core, parallel: 27.50 sec.

~ 3.32x faster with 4 threads

program Marcus
  use omp_lib
  implicit none
  integer :: ix, iy, tt
  integer :: nyStrang
  integer :: nPasses, nPass
  real :: cy, crx, cryr, aRandSeite, tStart, tEnd
  real, allocatable :: t2d(:,:)
  real, allocatable :: h2d(:,:)
  real, allocatable :: wlz(:,:)
  real, allocatable :: tempRand(:,:)

  ! Variables
  nPasses = 5000
  nyStrang = 100000

  allocate( &
    & t2d(0:2,0:nyStrang+1), &
    & h2d(0:2,0:nyStrang+1), &
    & wlz(0:nyStrang+1,0:2), &
    & tempRand(0:2,0:nyStrang+1))

  t2d = 0.0
  h2d = 0.0
  wlz = 0.0
  tempRand = 0.0

  cy = 1.23
  crx = 3.45
  cryr = 5.67
  aRandSeite = 7.89
  ix = 1

  tStart = omp_get_wtime()
  do nPass=1,nPasses
!$OMP PARALLEL DO default(shared) private(tt)
    do iy=2,nyStrang
      tt = nint(t2d(ix,iy))
      h2d(ix,iy) = h2d(ix,iy) - cy*wlz(tt,1) &
        & *(2.*t2d(ix,iy)-t2d(ix,iy-1)-t2d(ix,iy+1)) &
        & -crx*wlz(tt,1)*(t2d(ix,iy)-t2d(ix-1,iy)) &
        & +cryr*aRandSeite*(tempRand(2,iy)-t2d(ix,iy))
    end do
!$OMP END PARALLEL DO
  end do
  tEnd = omp_get_wtime()

  write(*,*) 'Run time', tEnd-tStart

  deallocate(t2d,h2d,wlz,tempRand)

end program Marcus


Jim Dempsey