Intel® Fortran Compiler

Optimizing OpenMP code (question from swapnamohanty)

Steven_L_Intel1
Employee
Moved from separate thread
swapnamohanty
Beginner
Hi Jim,

This is an unrelated issue, but I have seen that you have quite a depth of knowledge of OpenMP.

I am new to OpenMP, and I am having difficulty parallelizing this small portion of Fortran code. I have used the OMP SINGLE and OMP BARRIER directives within the loop containing "call ttrid(a,b,c,d,xx,ist,ifi)" to get it to work, but it's very inefficient. Can you please help me?

Thanks.

Swapna

====================================================

      do n=2,nmx-1
       do m=2,mmx-1
        do l=2,lmx-1
         do k=2,kmx-1
          do j=2,jmx-1
           do i=2,imx-1
             a(i) = aa(i,j,k,l,m,n)
             b(i) = -ee(i,j,k,l,m,n)/alp
             c(i) = bb(i,j,k,l,m,n)
             d(i) = -cc(i,j,k,l,m,n)*s(i,j-1,k,l,m,n)
           enddo
c
           call ttrid(a,b,c,d,xx,ist,ifi)
c
           do i=ist,ifi
             s(i,j,k,l,m,n) = xx(i)
           enddo
c
          enddo
         enddo
        enddo
       enddo
      enddo
c
c---------------------------------------------------------------------
      subroutine ttrid(a,b,c,d,xx,ist,ifi)
c
      cp(ist) = c(ist)/b(ist)
      dp(ist) = d(ist)/b(ist)
c
      do 10 i=ist,ifi-1
        cp(i+1) = c(i+1)/(b(i+1)-a(i+1)*cp(i))
        dp(i+1) = (d(i+1)-a(i+1)*dp(i))/(b(i+1)-a(i+1)*cp(i))
   10 continue
c
c--- backward substitution
c
      xx(ifi) = dp(ifi)
      do 20 i=ifi-1,ist,-1
        xx(i) = dp(i) - cp(i)*xx(i+1)
   20 continue
c
c---------------------------------------------------------------------
c
      return
      end

cfrieler
Beginner
Quoting - swapnamohanty

First, I would offer:

ai = 1.d0/alp; i1 = imx - 1
do n=2,nmx-1
 do m=2,mmx-1
  do l=2,lmx-1
   do k=2,kmx-1
    do j=2,jmx-1
      ea(2:i1) = -ee(2:i1,j,k,l,m,n) * ai
      call ttrid( aa(2:i1,j,k,l,m,n), ea,                        &
                  bb(2:i1,j,k,l,m,n),                            &
                  -cc(2:i1,j,k,l,m,n)*s(2:i1,j-1,k,l,m,n),       &
                  s(ist:ifi,j,k,l,m,n), ist, ifi )
    enddo
   enddo
  enddo
 enddo
enddo

!--------------------------------------------------------------------
subroutine ttrid(a,b,c,d,xx,ist,ifi)
  cpo = 0.d0; dpo = 0.d0
  do i=ist,ifi
    bi = 1.d0/(b(i)-a(i)*cpo)
    cpo = c(i)*bi; dpo = (d(i)-a(i)*dpo)*bi
    cp(i) = cpo; dp(i) = dpo
  enddo
! --- backward substitution
  xx(ifi) = dp(ifi)
  do i=ifi-1,ist,-1
    xx(i) = dp(i) - cp(i)*xx(i+1)
  enddo
end

The differences aren't only style: if you do an operation count you will find a big difference, with the floating-point divides being the most important. Changing the call to ttrid to reference the original array sections avoids two copy operations and is safe in this code environment. But YMMV, as my experience with these changes is based on an old version of the compiler, and everything I have done here may be automatic with the latest versions.

To deal specifically with your question, I hope you are using !$OMP DO and !$OMP END DO in the calling routine and not in ttrid. The loops in ttrid will not be threaded by OpenMP, as they reference elements (i) and (i+1) in the same statement. If one of the calling routine's loops is labeled with these directives, I believe you should see a speed-up, although the amount will depend strongly on many characteristics of both the code (e.g., array sizes) and the hardware (e.g., cache sizes).
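As a minimal sketch of that placement (assuming a, b, c, d and xx are dimensioned imx and listed as PRIVATE so each thread gets its own copies, and that cp and dp inside ttrid are local rather than in shared storage), the directives could go on the outermost n loop; note that the j loop reads s(i,j-1,...) written on the previous j iteration, so j itself is not a good candidate:

!$OMP PARALLEL DO PRIVATE(m,l,k,j,i,a,b,c,d,xx)
      do n=2,nmx-1
       do m=2,mmx-1
        do l=2,lmx-1
         do k=2,kmx-1
          do j=2,jmx-1
           do i=2,imx-1
             a(i) = aa(i,j,k,l,m,n)
             b(i) = -ee(i,j,k,l,m,n)/alp
             c(i) = bb(i,j,k,l,m,n)
             d(i) = -cc(i,j,k,l,m,n)*s(i,j-1,k,l,m,n)
           enddo
           call ttrid(a,b,c,d,xx,ist,ifi)
           do i=ist,ifi
             s(i,j,k,l,m,n) = xx(i)
           enddo
          enddo
         enddo
        enddo
       enddo
      enddo
!$OMP END PARALLEL DO

Each n slice of s only reads its own j-1 values, so the outer iterations are independent and nothing beyond the scratch arrays needs to be privatized; aa, bb, cc, ee, s, alp, ist and ifi remain shared.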

jimdempseyatthecove
Honored Contributor III

I suggest you create thread-private copies (or stack/automatic copies) of your single-dimensioned scratch arrays

a(:), b(:), c(:), d(:), xx(:), cp(:), dp(:)

and parallelize at the outer loop n

subroutine YourOriginalName()

!$OMP PARALLEL
call YourOriginalNameParallel()
!$OMP END PARALLEL
end subroutine YourOriginalName

subroutine YourOriginalNameParallel()
! stack method
real, automatic :: a(imx), b(imx), c(imx), d(imx), xx(imx), cp(imx), dp(imx)
! *** caution: verify correct sizes for cp and dp (I do not know what ifi is)
!$OMP DO
do n=2,nmx-1
  ! ... the original loops over m, l, k, j, i, including the calls to ttrid ...
enddo
!$OMP END DO
end subroutine YourOriginalNameParallel


You can use thread private storage for the arrays and allocate once if the overhead of allocation/deallocation is too high.
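A rough sketch of that approach (the module name below is only illustrative, and it assumes imx is visible as a parameter and that your compiler's OpenMP supports THREADPRIVATE allocatable arrays):

module ttrid_scratch                ! illustrative name for the per-thread work arrays
  real, allocatable :: a(:), b(:), c(:), d(:), xx(:), cp(:), dp(:)
!$OMP THREADPRIVATE(a, b, c, d, xx, cp, dp)
end module ttrid_scratch

subroutine YourOriginalName()
  use ttrid_scratch
!$OMP PARALLEL
  ! each thread allocates its own copies on first entry, so the
  ! allocation cost is paid once rather than on every sweep
  if (.not. allocated(a)) then
    allocate(a(imx), b(imx), c(imx), d(imx), xx(imx), cp(imx), dp(imx))
  end if
  call YourOriginalNameParallel()
!$OMP END PARALLEL
end subroutine YourOriginalName

YourOriginalNameParallel would then USE the module instead of declaring its own automatic arrays, with the !$OMP DO still on the n loop.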

Also, the other suggestion of reducing the number of divisions can be looked at.

For this type of loop, first look at parallelizing, next look at arranging code to vectorize better, next look at removing redundant operations.

Jim Dempsey



jimdempseyatthecove
Honored Contributor III

This may vectorize better

[cpp]cpPrior = cp(ist)
dpPrior = dp(ist)
do 10 i=ist+1,ifi
  cpCurrent = c(i)/(b(i)-a(i)*cpPrior)
  cp(i) = cpCurrent
  dpCurrent = (d(i)-a(i)*dpPrior)/(b(i)-a(i)*cpPrior)
  dp(i) = dpCurrent
  cpPrior = cpCurrent
  dpPrior = dpCurrent
10 continue
[/cpp]
Jim Dempsey
jimdempseyatthecove
Honored Contributor III

After looking at it again, it likely will not vectorize, but it may run faster using the local temporaries.

Jim