Beginner

OpenMP does not improve my program?

I used the OpenMP Fortran API to thread my program and compiled it with Intel Fortran v8 with the OpenMP flag enabled. But it shows performance degradation on my Pentium 4 2.8E (Prescott) processor with Hyper-Threading enabled. This is the significant part of my code:
      COMMON U(N1,N2), V(N1,N2), P(N1,N2)
      PCHECK = 0.0D0
      UCHECK = 0.0D0
      VCHECK = 0.0D0
      ..........

!$OMP PARALLEL DO REDUCTION(+:PCHECK,UCHECK,VCHECK)
!$OMP+SCHEDULE(DYNAMIC,1)
      DO 3500 JCHECK = 1, M
         DO 4500 ICHECK = 1, N
            PCHECK = PCHECK + P(ICHECK,JCHECK)
            UCHECK = UCHECK + U(ICHECK,JCHECK)
            VCHECK = VCHECK + V(ICHECK,JCHECK)
 4500    CONTINUE
         U(JCHECK,JCHECK) = U(JCHECK,JCHECK)
     1                      * ( MOD(JCHECK,100) / 100. )
 3500 CONTINUE
!$OMP END PARALLEL DO
...........
I have avoided false sharing, and I have changed (N1,N2) to a suitably smaller value to match my processor's L1 cache line, but no performance improvement can be gained.
I wonder whether my threading is somehow unsuitable?

4 Replies

Black Belt
Ronny -

What do you mean by "changed (N1,N2) to suitable smaller value to match my processor's l1 cache line"? Are these not arrays that need to be a fixed size or are they temporaries that are used many times by filling them with real data?

What values do M and N have? (I assume these are the same as N1 and N2.) If they are small, you may be spending more time in parallel overhead than you gain from running in parallel. While threads are created only once, some time is needed to "wake" them at each entry to a parallel region and to put them back to sleep at the end of the region.

It also looks like the computation in the loop you've shown takes the same amount of time for each iteration. Dynamic scheduling can achieve better load balance when iterations take different amounts of time, but it carries a cost of its own, so it should be avoided unless there is a good reason to use it (an unknown or irregular workload). For the code shown, you would likely do better with a static schedule.

Without a scheduling clause and using two threads, the iterations of the JCHECK loop would by default be divided in half: one thread would get iterations 1 to M/2 and the second thread would be assigned iterations M/2+1 to M. A schedule of (static,1) would instead assign the odd iterations to one thread and the even iterations to the other. The default schedule lets prefetching assist in loading data into cache, whereas with (static,1) the wrong data would be fetched between JCHECK iterations, forcing the thread to stall while the correct data were read from memory. There is a good chance the same thing is happening with your (dynamic,1) schedule: threads are unlikely to be assigned consecutive iterations, and they pay an additional penalty for each assignment.

Make sure there are enough iterations to make threading worthwhile and try some different scheduling clauses to see if there is any improvement. Also, are you running on multiple processors or at least an HT enabled processor?

-- clay

Employee

Hi Ronny,

Are you able to measure parallel speedup on two physical CPUs? In general, parallel applications that benefit from two physical processors also benefit from Hyper-Threading.

As Clay says in his response, make sure that M and N are large enough to merit parallel computing. Also, static scheduling is probably best for your code.

Henry
Beginner

My guess is that you might be running up against a memory bandwidth limit. The bus has a maximum transfer rate: if it is already saturated feeding one thread/CPU, a second one won't make it go any faster.
Try timing the loop and figuring out the data rate. If it's close to your bus speed, you've found the problem.
Employee

Hi James,

Bus saturation is something that can limit multithreaded performance, but I don't think it's the culprit in this case. I tested Clay's suggestion that there may not be enough work in the parallel region to merit the overhead of threads. For the code below, I don't see a parallel speedup until N = M is well above 1000. However, the code shows nearly perfect speedup when N and M are large enough, so bus saturation is not the limiting factor.

Best regards, Henry

      program ids
      implicit none
      integer, parameter :: N = 1000
      integer, parameter :: M = 1000
      real, dimension(N,M) :: p, u, v
      double precision :: pcheck, ucheck, vcheck
      integer :: jcheck, icheck, start, finish, rate
      real :: seconds
      common p, u, v

      p = 1.0
      u = 1.0
      v = 1.0
      pcheck = 0.d0
      ucheck = 0.d0
      vcheck = 0.d0

      call system_clock (COUNT = start)
c$omp parallel do reduction(+:pcheck, ucheck, vcheck)
c$omp+ private(icheck) schedule(static)
      do jcheck = 1, M
         do icheck = 1, N
            pcheck = pcheck + p(icheck,jcheck)
            ucheck = ucheck + u(icheck,jcheck)
            vcheck = vcheck + v(icheck,jcheck)
         enddo
         u(jcheck,jcheck) = u(jcheck,jcheck) * (mod(jcheck,100)/100.0)
      enddo
      call system_clock (COUNT = finish, COUNT_RATE = rate)

      seconds = float(finish - start) / float(rate)
      print *, 'Time = ', seconds, ' seconds'
      print *
      print *, pcheck, ucheck, vcheck
      end program ids

Message Edited by hagabb on 05-10-2004 03:59 PM
