      PCHECK = 0.0D0
      UCHECK = 0.0D0
      VCHECK = 0.0D0
!$OMP PARALLEL DO REDUCTION(+:PCHECK,UCHECK,VCHECK)
      DO 3500 JCHECK = 1, M
         DO 4500 ICHECK = 1, N
            PCHECK = PCHECK + P(ICHECK,JCHECK)
            UCHECK = UCHECK + U(ICHECK,JCHECK)
            VCHECK = VCHECK + V(ICHECK,JCHECK)
 4500    CONTINUE
         U(JCHECK,JCHECK) = U(JCHECK,JCHECK)
     1        * ( MOD (JCHECK, 100) / 100. )
 3500 CONTINUE
!$OMP END PARALLEL DO
I have avoided false sharing, and I have changed (N1,N2) to a suitable smaller value to match my processor's L1 cache line, but I gain no performance improvement.
I wonder whether there is something unsuitable about my threading?
What do you mean by "changed (N1,N2) to a suitable smaller value to match my processor's L1 cache line"? Are these not arrays that need to be a fixed size, or are they temporaries that are used many times by being filled with real data?
What values do M and N have? (I assume these are the same as N1 and N2.) If they are small, you may be spending more time in parallel overhead than you gain from parallel execution. Although the threads are created only once, there is some time needed to "wake" them at each entry to a parallel region and to put them back to sleep at the end of the region.
It also looks like the computation in the loop you've shown will take the same amount of time for each iteration. When iterations can take different amounts of time, dynamic scheduling can achieve better load balance. However, there is a cost associated with dynamic scheduling, so it should be avoided unless there is a good reason to use it (an unknown or uneven workload). For the code shown, I think you would be better off with a static schedule.
Without a scheduling clause and using two threads, the iterations of the JCHECK loop would by default be divided in half: one thread would get iterations 1 to M/2 and the second thread would get iterations M/2+1 to M. A schedule of (static,1) would instead assign the odd iterations to one thread and the even iterations to the other. The default schedule allows hardware prefetching to help load data into cache, whereas with (static,1) the wrong data would be fetched between consecutive JCHECK iterations, forcing the thread to stall while the correct data were read from memory. There is a good chance the same thing is happening with your (dynamic,1) schedule, where threads are unlikely to be assigned consecutive iterations and also pay a penalty for each assignment.
Make sure there are enough iterations to make threading worthwhile and try some different scheduling clauses to see if there is any improvement. Also, are you running on multiple processors or at least an HT enabled processor?
Are you able to measure parallel speedup on two physical CPUs? In general, parallel applications that benefit from two physical processors also benefit from Hyper-Threading.
As Clay says in his response, make sure that M and N are large enough to merit parallel computing. Also, static scheduling is probably best for your code.
Bus saturation is something that can limit multithreaded performance, but I don't think it's the culprit in this case. I tested Clay's suggestion that there may not be enough work in the parallel region to merit the threading overhead. For the code below, I don't see a parallel speedup until N = M >> 1000. However, the code shows nearly perfect speedup when N and M are large enough, so bus saturation is not the limiting factor.
Best regards, Henry
      program check
      integer, parameter :: M = 1000, N = 1000
      double precision :: pcheck, ucheck, vcheck
      real :: p(N,M), u(N,M), v(N,M)
      real :: seconds
      integer :: icheck, jcheck, start, finish, rate

      p = 1.0
      u = 1.0
      v = 1.0
      pcheck = 0.d0
      ucheck = 0.d0
      vcheck = 0.d0

      call system_clock (start, rate)
c$omp parallel do reduction(+:pcheck, ucheck, vcheck)
c$omp+ private(icheck) schedule(static)
      do jcheck = 1, M
         do icheck = 1, N
            pcheck = pcheck + p(icheck,jcheck)
            ucheck = ucheck + u(icheck,jcheck)
            vcheck = vcheck + v(icheck,jcheck)
         end do
         u(jcheck,jcheck) = u(jcheck,jcheck) * (mod (jcheck,100)/100.0)
      end do
c$omp end parallel do
      call system_clock (finish)

      seconds = float (finish - start) / float (rate)
      print*, 'Time = ', seconds, ' seconds'
      print*, pcheck, ucheck, vcheck
      end program check
Message Edited by hagabb on 05-10-2004 03:59 PM