Hi,
I'm trying to parallelize the following cycle:
...
converged = .false.
r3=gam(2)*2.0
do while (.not. converged)
   !===========================================
   ! This is parallel block #1
   !$omp parallel private (t1,t2,s1,s2,t3,t4,t5,t6,q1,q2,r1,r2,r11,r22)
   !$omp do
   do i=0,nr-1
      do j=0,nt-1
         t1=ru(j,i,1)+run(j,i,1)
         t5=ru(j,i,1)*ru(j,i,1)+run(j,i,1)*run(j,i,1)
         t2=iu(j,i,1)+iun(j,i,1)
         t6=iu(j,i,1)*iu(j,i,1)+iun(j,i,1)*iun(j,i,1)
         q1=-gam(2)*(t5-t6)
         r1=t5+t6
         q2=r3*(iu(j,i,1)*ru(j,i,1)+iun(j,i,1)*run(j,i,1))
         s1=ru(j,i,2)+run(j,i,2)
         r2=ru(j,i,2)*ru(j,i,2)+run(j,i,2)*run(j,i,2)
         s2=iu(j,i,2)+iun(j,i,2)
         r2=r2+iu(j,i,2)*iu(j,i,2)+iun(j,i,2)*iun(j,i,2)
         t3=gam(1)*(t1*s2-t2*s1)
         t4=-gam(1)*(t1*s1+t2*s2)
         r11=(r1+beta1*r2)*alf(1)
         r22=(beta1*r1+r2)*alf(2)
         fru(j,i,1)=t3+r11*t2
         fiu(j,i,1)=t4-r11*t1
         fru(j,i,2)=q2+r22*s2
         fiu(j,i,2)=q1-r22*s1
      end do
   end do
   !$omp end do nowait
   !$omp end parallel
   ! End of parallel block #1
   !===========================================
   !
   ....
   !
   ....
end do
...
Arrays are declared as follows:
double precision, allocatable, dimension(:,:,:):: ru,run,fru
double precision, allocatable, dimension(:,:,:):: iu,iun,fiu
allocate(ru(0:nt,0:nr,2),run(0:nt,0:nr,2),fru(0:nt,0:nr,2))
allocate(iu(0:nt,0:nr,2),iun(0:nt,0:nr,2),fiu(0:nt,0:nr,2))
Here nt=2048 and nr=250.
To estimate the speedup, I created two Thread Profiler activities, one with 1 thread and one with 2 threads. The runs show absolutely no speedup for parallel block #1: 21 s with 1 thread versus 20.9 s with 2 threads, while parallel block #2 gets a speedup of more than 1.6. Am I doing something wrong in the first parallel block?
Thanks in advance
4 Replies
Try
!$omp parallel do private (t1,t2,s1,s2,t3,t4,t5,t6,q1,q2,r1,r2,r11,r22) schedule(static,1)
do i=0,nr-1
   do j=0,nt-1
      ...
   end do
end do
!$omp end parallel do
Jim Dempsey
How many times does the first loop construct execute? What is the average wall time per pass? There is substantial overhead in the thread management, and the convergence loop incurs that overhead on every pass.
It's more work, but you might try creating your thread pool outside the convergence loop.
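
For illustration, here is a minimal sketch of that idea (mine, not code from this thread): the parallel region is opened once, before the DO WHILE, so the thread team is created only once rather than on every iteration. It assumes the serial remainder of each iteration, including the update of the shared converged flag, can be wrapped in an OMP SINGLE block.

converged = .false.
!$omp parallel private (t1,t2,s1,s2,t3,t4,t5,t6,q1,q2,r1,r2,r11,r22)
do while (.not. converged)      ! every thread tests the shared flag
   !$omp do
   do i=0,nr-1
      do j=0,nt-1
         ! ... same loop body as in parallel block #1 ...
      end do
   end do
   !$omp end do                 ! implicit barrier: fru/fiu are complete here
   !$omp single
   ! ... serial remainder of the iteration, which must also set the
   !     shared flag 'converged' ...
   !$omp end single             ! implicit barrier: all threads see 'converged'
end do
!$omp end parallel

The barriers at END DO and END SINGLE take over the role of the barrier you currently get from closing the parallel region, so all threads agree on converged before re-testing the DO WHILE condition.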
Another point: estimate the memory traffic of the calculation and compare it to the memory bandwidth of your system. If one thread already saturates the memory controller and/or the cache <-> register data paths, the loop will run in pretty much the same time with more threads.
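As a rough illustration (my numbers, not from the post above): each (j,i) point of block #1 reads 8 double-precision values (ru, run, iu, iun for both third indices) and writes 4 (fru, fiu for both), about 96 bytes, so one pass over the 2048 x 250 grid moves on the order of 2048*250*96 bytes, roughly 47 MB, per iteration of the convergence loop. If a single thread can already stream that at close to the sustainable bandwidth of the machine, a second thread mostly just shares the same bandwidth.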
Make that shared cache <-> private cache data paths.
