Hi,
I'm trying to parallelize the following cycle:
...
converged = .false.
r3=gam(2)*2.0
do while (.not. converged)
   !===========================================
   ! This is parallel block #1
   !$omp parallel private (t1,t2,s1,s2,t3,t4,t5,t6,q1,q2,r1,r2,r11,r22)
   !$omp do
   do i=0,nr-1
      do j=0,nt-1
         t1=ru(j,i,1)+run(j,i,1)
         t5=ru(j,i,1)*ru(j,i,1)+run(j,i,1)*run(j,i,1)
         t2=iu(j,i,1)+iun(j,i,1)
         t6=iu(j,i,1)*iu(j,i,1)+iun(j,i,1)*iun(j,i,1)
         q1=-gam(2)*(t5-t6)
         r1=t5+t6
         q2=r3*(iu(j,i,1)*ru(j,i,1)+iun(j,i,1)*run(j,i,1))
         s1=ru(j,i,2)+run(j,i,2)
         r2=ru(j,i,2)*ru(j,i,2)+run(j,i,2)*run(j,i,2)
         s2=iu(j,i,2)+iun(j,i,2)
         r2=r2+iu(j,i,2)*iu(j,i,2)+iun(j,i,2)*iun(j,i,2)
         t3=gam(1)*(t1*s2-t2*s1)
         t4=-gam(1)*(t1*s1+t2*s2)
         r11=(r1+beta1*r2)*alf(1)
         r22=(beta1*r1+r2)*alf(2)
         fru(j,i,1)=t3+r11*t2
         fiu(j,i,1)=t4-r11*t1
         fru(j,i,2)=q2+r22*s2
         fiu(j,i,2)=q1-r22*s1
      end do
   end do
   !$omp end do nowait
   !$omp end parallel
   ! End of parallel block #1
   !===========================================
   !
   ....
   !
   ....
end do
...
Arrays are declared as follows:
double precision, allocatable, dimension(:,:,:):: ru,run,fru
double precision, allocatable, dimension(:,:,:):: iu,iun,fiu
allocate(ru(0:nt,0:nr,2),run(0:nt,0:nr,2),fru(0:nt,0:nr,2))
allocate(iu(0:nt,0:nr,2),iun(0:nt,0:nr,2),fiu(0:nt,0:nr,2))
Here nt=2048 and nr=250.
To estimate the speedup, I created two Thread Profiler activities, one with 1 thread and one with 2 threads. The runs show absolutely no speedup for parallel block #1: 21 s with 1 thread versus 20.9 s with 2 threads, while parallel block #2 gets a speedup of more than 1.6. Am I doing something wrong in the first parallel block?
Thanks in advance
4 Replies
Try
!$omp parallel do private (t1,t2,s1,s2,t3,t4,t5,t6,q1,q2,r1,r2,r11,r22) schedule(static,1)
do i=0,nr-1
   do j=0,nt-1
      ...
   end do
end do
!$omp end parallel do
Jim Dempsey
How many times does the first loop construct execute? What is the average wall time per pass? There is substantial overhead in the thread management, and the convergence loop incurs that overhead on every pass.
It's more work, but you might try creating your thread pool outside the convergence loop.
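
For illustration, here is a minimal sketch of that idea (mine, not code from this thread): the parallel region is opened once, before the DO WHILE, so the thread team is created only once rather than on every iteration. It assumes the serial remainder of each iteration, including the update of the shared converged flag, can be wrapped in an OMP SINGLE block.

converged = .false.
!$omp parallel private (t1,t2,s1,s2,t3,t4,t5,t6,q1,q2,r1,r2,r11,r22)
do while (.not. converged)      ! every thread tests the shared flag
   !$omp do
   do i=0,nr-1
      do j=0,nt-1
         ! ... same loop body as in parallel block #1 ...
      end do
   end do
   !$omp end do                 ! implicit barrier: fru/fiu are complete here
   !$omp single
   ! ... serial remainder of the iteration, which must also set the
   !     shared flag 'converged' ...
   !$omp end single             ! implicit barrier: all threads see 'converged'
end do
!$omp end parallel

The barriers at END DO and END SINGLE take over the role of the barrier you currently get from closing the parallel region, so all threads agree on converged before re-testing the DO WHILE condition.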
Another point: estimate the memory traffic of the calculation and compare it to the memory bandwidth of your system. If one thread already saturates the memory controller and/or the cache <-> register data paths, the loop will run in pretty much the same time with more threads.
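As a rough illustration (my numbers, not from the post above): each (j,i) point of block #1 reads 8 double-precision values (ru, run, iu, iun for both third indices) and writes 4 (fru, fiu for both), about 96 bytes, so one pass over the 2048 x 250 grid moves on the order of 2048*250*96 bytes, roughly 47 MB, per iteration of the convergence loop. If a single thread can already stream that at close to the sustainable bandwidth of the machine, a second thread mostly just shares the same bandwidth.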
Make that shared cache <-> private cache data paths.
