
OpenMP no speedup

misty12
Hi,
I'm trying to parallelize the following loop:
...
converged = .false.
r3=gam(2)*2.0
do while (.not. converged)
!===========================================
! This is parallel block #1
!$omp parallel private (t1,t2,s1,s2,t3,t4,t5,t6,q1,q2,r1,r2,r11,r22)
!$omp do
do i=0,nr-1
do j=0,nt-1

t1=ru(j,i,1)+run(j,i,1)
t5=ru(j,i,1)*ru(j,i,1)+run(j,i,1)*run(j,i,1)
t2=iu(j,i,1)+iun(j,i,1)
t6=iu(j,i,1)*iu(j,i,1)+iun(j,i,1)*iun(j,i,1)
q1=-gam(2)*(t5-t6)
r1=t5+t6
q2=r3*(iu(j,i,1)*ru(j,i,1)+iun(j,i,1)*run(j,i,1))
s1=ru(j,i,2)+run(j,i,2)
r2=ru(j,i,2)*ru(j,i,2)+run(j,i,2)*run(j,i,2)
s2=iu(j,i,2)+iun(j,i,2)
r2=r2+iu(j,i,2)*iu(j,i,2)+iun(j,i,2)*iun(j,i,2)
t3=gam(1)*(t1*s2-t2*s1)
t4=-gam(1)*(t1*s1+t2*s2)
r11=(r1+beta1*r2)*alf(1)
r22=(beta1*r1+r2)*alf(2)

fru(j,i,1)=t3+r11*t2
fiu(j,i,1)=t4-r11*t1
fru(j,i,2)=q2+r22*s2
fiu(j,i,2)=q1-r22*s1

end do
end do
!$omp end do nowait
!$omp end parallel
! End of parallel block #1
!===========================================
!
....
!
....
end do
...
Arrays are declared as follows:
double precision, allocatable, dimension(:,:,:):: ru,run,fru
double precision, allocatable, dimension(:,:,:):: iu,iun,fiu
allocate(ru(0:nt,0:nr,2),run(0:nt,0:nr,2),fru(0:nt,0:nr,2))
allocate(iu(0:nt,0:nr,2),iun(0:nt,0:nr,2),fiu(0:nt,0:nr,2))

nt=2048, nr=250

To estimate the speedup, I created two Thread Profiler activities, with the number of threads set to 1 and 2. The runs show absolutely no speedup for parallel block #1: 21 sec with 1 thread and 20.9 sec with 2 threads, while for parallel block #2 the speedup is more than 1.6. Am I doing something wrong in the first parallel block?

Thanks in advance
jimdempseyatthecove

Try

!$omp parallel do private (t1,t2,s1,s2,t3,t4,t5,t6,q1,q2,r1,r2,r11,r22) schedule(static,1)
do i=0,nr-1
do j=0,nt-1
...
end do
end do
!$omp end parallel do
Jim Dempsey

Steve_Nuchia

How many times does the first loop construct execute? What is the average wall time per pass? There is substantial overhead in thread management, and the convergence loop incurs that overhead on every pass.
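
One way to get the average wall time per pass is to wrap the parallel loop in timer calls. The following is only a minimal sketch, not the original code: the array names `a` and `b`, the sizes `n` and `npass`, and the loop body are stand-ins, and the standard `system_clock` intrinsic is used so that it also builds without OpenMP.

```fortran
program time_per_pass
  implicit none
  ! Hypothetical sizes and arrays; stand-ins for the real convergence loop.
  integer, parameter :: n = 200000, npass = 50
  integer(kind=8) :: c0, c1, rate
  integer :: i, pass
  double precision, allocatable :: a(:), b(:)
  double precision :: total

  allocate(a(n), b(n))
  a = 1.0d0
  total = 0.0d0

  do pass = 1, npass
     call system_clock(c0, rate)          ! start of one pass
     !$omp parallel do
     do i = 1, n
        b(i) = 2.0d0*a(i) + 1.0d0         ! placeholder work
     end do
     !$omp end parallel do
     call system_clock(c1)                ! end of one pass
     total = total + dble(c1 - c0)/dble(rate)
  end do

  ! Print a result element so the loop cannot be optimized away.
  print '(a,i0)',      'passes:               ', npass
  print '(a,es12.4)',  'avg seconds per pass: ', total/dble(npass)
  print '(a,f6.2)',    'check value b(n):     ', b(n)
end program time_per_pass
```

If the time per pass comes out very small, the cost of starting the parallel region on every pass can be a noticeable fraction of it.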

It's more work but you might try creating your thread pool outside the convergence loop.
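
A rough sketch of what that could look like, under assumptions: the arrays `x` and `xnew`, the update rule, and the tolerance are placeholders and not the original computation. The parallel region is opened once, every thread runs the `do while`, the work is shared by `!$omp do`, and one thread updates the shared `converged` flag inside `!$omp single`.

```fortran
program hoisted_parallel_region
  implicit none
  ! Hypothetical problem: names, sizes and update rule are placeholders.
  integer, parameter :: n = 1000
  double precision :: x(0:n-1), xnew(0:n-1), err
  logical :: converged
  integer :: i, iter

  x = 1.0d0
  converged = .false.
  iter = 0

  ! The thread team is created once, outside the convergence loop.
  !$omp parallel default(shared) private(i)
  do while (.not. converged)
     !$omp do
     do i = 0, n-1
        xnew(i) = 0.5d0*x(i)              ! placeholder for the real work
     end do
     !$omp end do
     ! No nowait above: the barrier guarantees that all of xnew is written
     ! and that every thread has already read converged for this pass.

     !$omp single
     err = maxval(abs(xnew - x))
     x = xnew
     iter = iter + 1
     converged = err < 1.0d-6
     !$omp end single
     ! Implicit barrier at end single: all threads see the new flag value.
  end do
  !$omp end parallel

  print '(a,i0)', 'iterations: ', iter
end program hoisted_parallel_region
```

Note that the `nowait` on the work-sharing loop has to go in this layout, because the convergence test depends on the complete result of the loop.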

Steve_Nuchia
Another point: estimate the total memory traffic of the calculation and compare it to the memory bandwidth of your system. If one thread already saturates the memory controller and/or the cache <-> register data paths, the loop will run in pretty much the same time with more threads.
Steve_Nuchia

Make that the shared cache <-> private cache data paths.
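
A back-of-the-envelope version of that estimate for parallel block #1, as a sketch only. Each (j,i) point in the posted loop reads 8 doubles (`ru`, `run`, `iu`, `iun` for both components) and writes 4 (`fru`, `fiu` for both components), which gives about 49 MB of traffic per pass for nt=2048, nr=250. The bandwidth figure `assumed_bw` below is a made-up placeholder, not a measured value; replace it with the number for your own machine. Write-allocate traffic for the stores is ignored, so the real figure may be higher.

```fortran
program traffic_estimate
  implicit none
  ! Loop extents taken from the original post.
  integer, parameter :: nt = 2048, nr = 250
  ! Distinct array elements touched per (j,i) point in block #1.
  integer, parameter :: loads = 8, stores = 4
  ! HYPOTHETICAL sustained memory bandwidth in bytes/s; measure your own.
  double precision, parameter :: assumed_bw = 5.0d9
  double precision :: bytes_per_pass

  ! 8 bytes per double precision value.
  bytes_per_pass = dble(nt)*dble(nr)*dble(loads + stores)*8.0d0

  print '(a,f8.3,a)', 'traffic per pass:              ', &
        bytes_per_pass/1.0d6, ' MB'
  print '(a,f8.3,a)', 'bandwidth-bound time per pass: ', &
        1.0d3*bytes_per_pass/assumed_bw, ' ms'
end program traffic_estimate
```

If the measured time per pass with one thread is already close to the bandwidth-bound time, adding threads is unlikely to help.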
