Intel® Fortran Compiler

Optimizing OpenMP code (question from swapnamohanty)

Steven_L_Intel1
Employee
Moved from separate thread
swapnamohanty
Beginner
Hi Jim,

This is an unrelated issue, but I have seen that you have quite a depth of knowledge of OpenMP.

I am new to OpenMP, and I am having difficulty parallelizing this small portion of Fortran code. I have used the OMP SINGLE and OMP BARRIER directives within the loop containing "call ttrid(a,b,c,d,xx,ist,ifi)" to get it to work, but it's very inefficient. Can you please help me?

Thanks.

Swapna

====================================================

      do n=2,nmx-1
       do m=2,mmx-1
        do l=2,lmx-1
         do k=2,kmx-1
          do j=2,jmx-1
           do i=2,imx-1
             a(i) = aa(i,j,k,l,m,n)
             b(i) = -ee(i,j,k,l,m,n)/alp
             c(i) = bb(i,j,k,l,m,n)
             d(i) = -cc(i,j,k,l,m,n)*s(i,j-1,k,l,m,n)
           enddo
c
           call ttrid(a,b,c,d,xx,ist,ifi)
c
           do i=ist,ifi
             s(i,j,k,l,m,n) = xx(i)
           enddo
c
          enddo
         enddo
        enddo
       enddo
      enddo
c
c---------------------------------------------------------------------
      subroutine ttrid(a,b,c,d,xx,ist,ifi)
c
      cp(ist) = c(ist)/b(ist)
      dp(ist) = d(ist)/b(ist)
c
      do 10 i=ist,ifi-1
        cp(i+1) = c(i+1)/(b(i+1)-a(i+1)*cp(i))
        dp(i+1) = (d(i+1)-a(i+1)*dp(i))/(b(i+1)-a(i+1)*cp(i))
   10 continue
c
c--- backward substitution
c
      xx(ifi) = dp(ifi)
      do 20 i=ifi-1,ist,-1
        xx(i) = dp(i) - cp(i)*xx(i+1)
   20 continue
c
c---------------------------------------------------------------------
c
      return
      end

cfrieler
Beginner
Quoting - swapnamohanty

First, I would offer:

ai = 1.d0/alp; i1 = imx - 1
do n=2,nmx-1
 do m=2,mmx-1
  do l=2,lmx-1
   do k=2,kmx-1
    do j=2,jmx-1
      ea(2:i1) = -ee(2:i1,j,k,l,m,n) * ai
      call ttrid( aa(2:i1,j,k,l,m,n), ea,                        &
                  bb(2:i1,j,k,l,m,n),                            &
                  -cc(2:i1,j,k,l,m,n)*s(2:i1,j-1,k,l,m,n),       &
                  s(ist:ifi,j,k,l,m,n), ist, ifi )
    enddo
   enddo
  enddo
 enddo
enddo

!--------------------------------------------------------------------
subroutine ttrid(a,b,c,d,xx,ist,ifi)
  cpo = 0.d0; dpo = 0.d0
  do i=ist,ifi
    bi = 1.d0/(b(i)-a(i)*cpo)
    cpo = c(i)*bi; dpo = (d(i)-a(i)*dpo)*bi
    cp(i) = cpo; dp(i) = dpo
  enddo
! --- backward substitution
  xx(ifi) = dp(ifi)
  do i=ifi-1,ist,-1
    xx(i) = dp(i) - cp(i)*xx(i+1)
  enddo
end

The differences aren't only style: if you do an operation count you will find a big difference, with the floating-point divides being the most important. Changing the call to ttrid to reference the original array sections avoids two copy operations and is safe in this code environment. But YMMV, as my experience with these changes is based on an old version of the compiler, and everything I have done here may be automatic with the latest versions.

To deal specifically with your question, I hope you are using !$OMP DO and !$OMP END DO in the calling routine and not in ttrid. The loops in ttrid will not be threaded by OpenMP, as they reference elements (i) and (i+1) in the same statement. If one of the calling routine's loops is labeled with these directives, I believe you should see a speed-up, although the amount will depend strongly on many characteristics of both the code (e.g., array sizes) and the hardware (e.g., cache sizes).
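As a minimal sketch of that placement (assuming a, b, c, d and xx are dimensioned imx and listed as PRIVATE so each thread gets its own copies, and that cp and dp inside ttrid are local rather than in shared storage), the directives could go on the outermost n loop; note that the j loop reads s(i,j-1,...) written on the previous j iteration, so j itself is not a good candidate:

!$OMP PARALLEL DO PRIVATE(m,l,k,j,i,a,b,c,d,xx)
      do n=2,nmx-1
       do m=2,mmx-1
        do l=2,lmx-1
         do k=2,kmx-1
          do j=2,jmx-1
           do i=2,imx-1
             a(i) = aa(i,j,k,l,m,n)
             b(i) = -ee(i,j,k,l,m,n)/alp
             c(i) = bb(i,j,k,l,m,n)
             d(i) = -cc(i,j,k,l,m,n)*s(i,j-1,k,l,m,n)
           enddo
           call ttrid(a,b,c,d,xx,ist,ifi)
           do i=ist,ifi
             s(i,j,k,l,m,n) = xx(i)
           enddo
          enddo
         enddo
        enddo
       enddo
      enddo
!$OMP END PARALLEL DO

Each n slice of s only reads its own j-1 values, so the outer iterations are independent and nothing beyond the scratch arrays needs to be privatized; aa, bb, cc, ee, s, alp, ist and ifi remain shared.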

jimdempseyatthecove
Honored Contributor III

I suggest you create thread-private copies (or stack/automatic copies) of your single-dimensioned scratch arrays

a(:), b(:), c(:), d(:), xx(:), cp(:), dp(:)

and parallelize at the outer loop n

subroutine YourOriginalName()

!$OMP PARALLEL
call YourOriginalNameParallel()
!$OMP END PARALLEL
end subroutine YourOriginalName

subroutine YourOriginalNameParallel()
! stack method
real, automatic :: a(imx), b(imx), c(imx), d(imx), xx(imx), cp(imx), dp(imx)
! *** caution: verify correct sizes for cp and dp (I do not know what ifi is)
!$OMP DO
do n=2,nmx-1
  ! ... the original loops over m, l, k, j, i, including the calls to ttrid ...
enddo
!$OMP END DO
end subroutine YourOriginalNameParallel


You can use thread private storage for the arrays and allocate once if the overhead of allocation/deallocation is too high.
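A rough sketch of that approach (the module name below is only illustrative, and it assumes imx is visible as a parameter and that your compiler's OpenMP supports THREADPRIVATE allocatable arrays):

module ttrid_scratch                ! illustrative name for the per-thread work arrays
  real, allocatable :: a(:), b(:), c(:), d(:), xx(:), cp(:), dp(:)
!$OMP THREADPRIVATE(a, b, c, d, xx, cp, dp)
end module ttrid_scratch

subroutine YourOriginalName()
  use ttrid_scratch
!$OMP PARALLEL
  ! each thread allocates its own copies on first entry, so the
  ! allocation cost is paid once rather than on every sweep
  if (.not. allocated(a)) then
    allocate(a(imx), b(imx), c(imx), d(imx), xx(imx), cp(imx), dp(imx))
  end if
  call YourOriginalNameParallel()
!$OMP END PARALLEL
end subroutine YourOriginalName

YourOriginalNameParallel would then USE the module instead of declaring its own automatic arrays, with the !$OMP DO still on the n loop.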

Also, the other suggestion of reducing the number of divisions can be looked at.

For this type of loop, first look at parallelizing, next look at arranging code to vectorize better, next look at removing redundant operations.

Jim Dempsey



jimdempseyatthecove
Honored Contributor III

This may vectorize better

[cpp]cpPrior = cp(ist)
dpPrior = dp(ist)
do 10 i=ist+1,ifi
  cpCurrent = c(i)/(b(i)-a(i)*cpPrior)
  cp(i) = cpCurrent
  dpCurrent = (d(i)-a(i)*dpPrior)/(b(i)-a(i)*cpPrior)
  dp(i) = dpCurrent
  cpPrior = cpCurrent
  dpPrior = dpCurrent
10 continue
[/cpp]
Jim Dempsey
jimdempseyatthecove
Honored Contributor III

After looking at it again, it likely will not vectorize, but it may run faster using the local temporaries.

Jim