Hi all,
Last time I posted a query, the problem was solved by fusing all the OpenMP regions.
This time the problem seems to be OpenMP scalability.
The code doesn't scale well: the speedup is only about 3x on 16 Xeon cores (Intel(R) Xeon(R) E5-2650 v2).
How can I improve scalability?
Here is the code:
do k=1,km-1
  do kk=1,2
    starttime = omp_get_wtime()
    !$OMP PARALLEL PRIVATE(I) DEFAULT(SHARED)
    !$omp do
    do j=1,ny_block
      do i=1,nx_block
        LMASK(i,j) = TLT%K_LEVEL(i,j,bid) == k .and. &
                     TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid) .and. &
                     TLT%ZTW(i,j,bid) == 1
        if ( LMASK(i,j) ) then
          WORK1(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) &
                          * SLX(i,j,kk,kbt,k,bid) * dz(k)
          WORK2(i,j,kk) = c2 * dzwr(k) * ( WORK1(i,j,kk) &
                          - KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) &
                          * dz(k+1) )
          WORK2_NEXT(i,j) = c2 * ( &
                          KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) - &
                          KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) )
          WORK3(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) &
                          * SLY(i,j,kk,kbt,k,bid) * dz(k)
          WORK4(i,j,kk) = c2 * dzwr(k) * ( WORK3(i,j,kk) &
                          - KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) &
                          * dz(k+1) )
          WORK4_NEXT(i,j) = c2 * ( &
                          KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) - &
                          KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) )
        endif
        if( LMASK(i,j) .and. abs( WORK2_NEXT(i,j) ) < abs( WORK2(i,j,kk) ) ) then
          WORK2(i,j,kk) = WORK2_NEXT(i,j)
        endif
        if ( LMASK(i,j) .and. abs( WORK4_NEXT(i,j) ) < abs( WORK4(i,j,kk ) ) ) then
          WORK4(i,j,kk) = WORK4_NEXT(i,j)
        endif
        LMASK(i,j) = TLT%K_LEVEL(i,j,bid) == k .and. &
                     TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid) .and. &
                     TLT%ZTW(i,j,bid) == 2
        if ( LMASK(i,j) ) then
          WORK1(i,j,kk) = KAPPA_THIC(i,j,ktp,k+1,bid) &
                          * SLX(i,j,kk,ktp,k+1,bid)
          WORK2(i,j,kk) = c2 * ( WORK1(i,j,kk) &
                          - ( KAPPA_THIC(i,j,kbt,k+1,bid) &
                          * SLX(i,j,kk,kbt,k+1,bid) ) )
          WORK1(i,j,kk) = WORK1(i,j,kk) * dz(k+1)
          WORK3(i,j,kk) = KAPPA_THIC(i,j,ktp,k+1,bid) &
                          * SLY(i,j,kk,ktp,k+1,bid)
          WORK4(i,j,kk) = c2 * ( WORK3(i,j,kk) &
                          - ( KAPPA_THIC(i,j,kbt,k+1,bid) &
                          * SLY(i,j,kk,kbt,k+1,bid) ) )
          WORK3(i,j,kk) = WORK3(i,j,kk) * dz(k+1)
        endif
        LMASK(i,j) = LMASK(i,j) .and. TLT%K_LEVEL(i,j,bid) + 1 < KMT(i,j,bid)
        if (k.lt.km-1) then ! added to avoid out of bounds access
          if( LMASK(i,j) ) then
            WORK2_NEXT(i,j) = c2 * dzwr(k+1) * ( &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) * dz(k+1) - &
              KAPPA_THIC(i,j,ktp,k+2,bid) * SLX(i,j,kk,ktp,k+2,bid) * dz(k+2))
            WORK4_NEXT(i,j) = c2 * dzwr(k+1) * ( &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) * dz(k+1) - &
              KAPPA_THIC(i,j,ktp,k+2,bid) * SLY(i,j,kk,ktp,k+2,bid) * dz(k+2))
          endif
        end if
        if( LMASK(i,j) .and. abs( WORK2_NEXT(i,j) ) < abs( WORK2(i,j,kk) ) ) &
          WORK2(i,j,kk) = WORK2_NEXT(i,j)
        if( LMASK(i,j) .and. abs(WORK4_NEXT(i,j)) < abs(WORK4(i,j,kk)) ) &
          WORK4(i,j,kk) = WORK4_NEXT(i,j)
      enddo
    enddo
    !$omp end do
    !$OMP END PARALLEL
    endtime = omp_get_wtime()
    total = total + (endtime - starttime)
  enddo
enddo
Also attached is a standalone working version, extracted from a larger code base, so that you can run it on your own machines if you wish.
Aketh, you may find this article interesting:
https://software.intel.com/en-us/articles/peel-the-onion-optimization-techniques
It uses your example code and shows how you can incrementally improve the performance. On an E5-2620 6-core, 12-thread system, the original parallel speedup was a paltry 4.94x over the original serial code. Three optimization steps then take it to 12.16x, 15.24x and 23.13x. There is even room for a little more improvement, which I left to the readers.
Jim Dempsey
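A general note on the posted loop, not necessarily one of the steps in the linked article: the PARALLEL region sits inside the k and kk loops, so the thread team is entered and left 2*(km-1) times. A common first thing to try is hoisting the parallel region outside those loops and keeping only the work-sharing DO inside. The sketch below is a minimal, self-contained illustration of that pattern. The program name, the arrays a and w, the sizes, and the loop body are hypothetical stand-ins and not the original kernel.

```fortran
! Sketch only: a, w, nx, ny, km and the loop body are hypothetical
! stand-ins for the arrays and arithmetic in the posted kernel.
program hoist_parallel_region
  use omp_lib
  implicit none
  integer, parameter :: nx = 256, ny = 256, km = 60
  real(8), allocatable :: a(:,:,:), w(:,:,:)
  real(8) :: t0, t1
  integer :: i, j, k, kk

  allocate(a(nx,ny,km), w(nx,ny,2))
  a = 1.0d0
  w = 0.0d0

  t0 = omp_get_wtime()
  ! One parallel region for the whole k/kk sweep: the thread team is
  ! entered once instead of 2*(km-1) times.
  !$omp parallel default(shared) private(i, j, k, kk)
  do k = 1, km-1          ! every thread runs the k and kk loops itself
    do kk = 1, 2
      ! Work-sharing loop over j. The implicit barrier at END DO keeps
      ! all threads on the same (k, kk) step.
      !$omp do schedule(static)
      do j = 1, ny
        do i = 1, nx
          w(i,j,kk) = w(i,j,kk) + a(i,j,k) - 0.5d0 * a(i,j,k+1)
        end do
      end do
      !$omp end do
    end do
  end do
  !$omp end parallel
  t1 = omp_get_wtime()

  print '(a,f10.6,a)', 'elapsed: ', t1 - t0, ' s'
  print '(a,f8.2)', 'w(1,1,1) = ', w(1,1,1)
end program hoist_parallel_region
```

With gfortran this should build with something like `gfortran -fopenmp hoist.f90` (the file name is arbitrary). In the original code the timing calls sit inside the k/kk loops, so with this layout they would need to move outside the parallel region, as here, or be wrapped in a MASTER or SINGLE construct. Whether this helps on a given machine depends on how large nx_block and ny_block are relative to the region start-up cost, so it is worth measuring before and after.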