- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This is an unrelated issue but I have seen you have quite a depth in OpenMP architecture.
I am new to OpenMP architecture, and I am having difficulties in parallelizing this small portion of the
FORTRAN code. I have used OMP SINGLE, OMP BARRIER command within the loop where
"call ttrid(a,b,c,d,xx,ist,ifi)" is, to work but it's very inefficient. Can you please help me ?
Thanks.
Swapna
====================================================
do n=2,nmx-1
do m=2,mmx-1
do l=2,lmx-1
do k=2,kmx-1
do j=2,jmx-1
do i=2,imx-1
a(i) = aa(i,j,k,l,m,n)
b(i) =-ee(i,j,k,l,m,n)/alp
c(i) = bb(i,j,k,l,m,n)
d(i) =-cc(i,j,k,l,m,n)*s(i,j-1,k,l,m,n)
enddo
c
call ttrid(a,b,c,d,xx,ist,ifi)
c
do i=ist,ifi
s(i,j,k,l,m,n) = xx(i)
enddo
c
enddo
enddo
enddo
enddo
enddo
c
c---------------------------------------------------------------------
subroutine ttrid(a,b,c,d,xx,ist,ifi)
c
cp(ist) = c(ist)/b(ist)
dp(ist) = d(ist)/b(ist)
c
do 10 i=ist,ifi-1
cp(i+1) = c(i+1)/(b(i+1)-a(i+1)*cp(i))
dp(i+1) = (d(i+1)-a(i+1)*dp(i))/(b(i+1)-a(i+1)*cp(i))
10 continue
c
c--- backward substitution
c
xx(ifi) = dp(ifi)
do 20 i=ifi-1,ist,-1
xx(i) = dp(i) - cp(i)*xx(i+1)
20 continue
c
c---------------------------------------------------------------------
c
return
end
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This is an unrelated issue but I have seen you have quite a depth in OpenMP architecture.
I am new to OpenMP architecture, and I am having difficulties in parallelizing this small portion of the
FORTRAN code. I have used OMP SINGLE, OMP BARRIER command within the loop where
"call ttrid(a,b,c,d,xx,ist,ifi)" is, to work but it's very inefficient. Can you please help me ?
Thanks.
Swapna
First, I would offer;
ai = 1.d0/alp; i1 = imx - 1
do n=2,nmx-1
do m=2,mmx-1
do l=2,lmx-1
do k=2,kmx-1
do j=2,jmx-1
ea(2:i1) = -ee(2:i1,j,k,l,m,n) * ai
call ttrid( aa(2:i1,j,k,l,m,n), ea, &
bb(2:i1,j,k,l,m,n), &
-cc(2:i1,j,k,l,m,n)*s(2:i1,j-1,k,l,m,n), &
s(ist:ifi,j,k,l,m,n), ist, ifi )
enddo
enddo
enddo
enddo
enddo
!--------------------------------------------------------------------
subroutine ttrid(a,b,c,d,xx,ist,ifi)
cpo = 0.d0; dpo = 0.d0
Do i=ist,ifi
bi = 1.d0/(b(i)-a(i)*cpo)
cpo = c(i)*bi; dpo = (d(i)-a(i)*dpo)*bi
cp(i) = cpo; dp(i) = dpo
EndDO
! --- backward substitution
xx(ifi) = dp(ifi)
Do i=ifi-1,ist,-1
xx(i) = dp(i) - cp(i)*xx(i+1)
EndDo
end
The differences aren't only style, as if you do an op count you find a big difference, the fp divides being the most important. Changing the call to ttrid to reference the original array sections avoids two copy operations and is safe given this code environment. But YMMV, as my experience with these changes is based on an old version of the compiler, and everything I have done could be automatic with the latest versions.
To specifically deal with your question, I hope you are using !$OMP DO and !$OMP END DO in the calling routine and not in ttrid. The loops in ttrid will not be threaded by OpenMP, as they reference elements (i) and (i+1) in the same statement. If one of the calling routine loops is labeled with these directives, I believe you should get speed-up, although the amount will strongly depend on many variables of the both the code (eg., array sizes) and the hardware (eg., cache sizes).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I suggest you create thread private copies (or stack/automatic)of your single dimensioned scratch arrays
a(:), b(:), c(:), d(:), xx(:), cp(:), dp(:)
and parallelize at the outer loop n
subroutine YourOriginalName()
!$OMP PARALLEL
call YourOriginalNameParallel()
!$OMP END PARALLEL
end subroutine YourOriginalName
subroutine YourOriginalName()
! stack method
real, automatic :: a(imx), b(imx), c(imx), d(imx), xx(imx), cp(imx), dp(imx)
! *** caution verify correct sizes for cp and dp (I do not know wat ifi is)
!$OMP DO
call YourOriginalNameParallel()
!$OMP END PARALLEL
end subroutine YourOriginalName
You can use thread private storage for the arrays and allocate once if the overhead of allocation/deallocation is too high.
Also, the other suggestion of reducing the number of divisions can be looked at.
For this type of loop, first look at parallelizing, next look at arranging code to vectorize better, next look at removing redundant operations.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This may vectorize better
[cpp]cpPrior = cp(ist) dpPrior = dp(ist) do 10 i=ist+1,ifi-1 cpCurrent = c(i)/(b(i)-a(i)*cpPrior) cp(i) = cpCurrent dpCurrent = (d(i)-a(i)*dpPrior)/(b(i)-a(i)*cpPrior) dp(i) = dpCurrent cpPrior = cpCurrent dpPrior = dpCurrent 10 continue [/cpp]Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
After looking at it again, it likely will not vectorize but it may run faster using the local temps.
Jim
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page