Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

problem in scaling the code

aketh_t_
Beginner

Hi all,

Last time I posted a query and the problem was solved by fusing all the OpenMP regions.

However, this time the problem seems to be with OpenMP scalability.

The code doesn't scale well.

It's about 3x on 16 Xeon cores (Intel(R) Xeon(R) E5-2650 v2).

How can I improve scalability?

Here is the code:

do k=1,km-1

        do kk=1,2

          starttime = omp_get_wtime()

          !$OMP PARALLEL PRIVATE(I) DEFAULT(SHARED)
 
          !$omp do  
          do j=1,ny_block
           do i=1,nx_block
 
          LMASK(i,j) = TLT%K_LEVEL(i,j,bid) == k  .and.            &
                       TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid)  .and.  &
                       TLT%ZTW(i,j,bid) == 1


            if ( LMASK(i,j) ) then 

             WORK1(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                           * SLX(i,j,kk,kbt,k,bid) * dz(k)

             WORK2(i,j,kk) = c2 * dzwr(k) * ( WORK1(i,j,kk)            &
              - KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) &
                                            * dz(k+1) )

             WORK2_NEXT(i,j) = c2 * ( &
              KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) - &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) )

             WORK3(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                           * SLY(i,j,kk,kbt,k,bid) * dz(k)

             WORK4(i,j,kk) = c2 * dzwr(k) * ( WORK3(i,j,kk)            &
              - KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) &
                                            * dz(k+1) )

             WORK4_NEXT(i,j) = c2 * ( &
              KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) - &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) )

            endif

            if( LMASK(i,j) .and. abs( WORK2_NEXT(i,j) ) < abs( WORK2(i,j,kk) ) ) then 

             WORK2(i,j,kk) = WORK2_NEXT(i,j)

            endif

           if ( LMASK(i,j) .and. abs( WORK4_NEXT(i,j) ) < abs( WORK4(i,j,kk ) ) ) then 
             WORK4(i,j,kk) = WORK4_NEXT(i,j)
           endif

          LMASK(i,j) = TLT%K_LEVEL(i,j,bid) == k  .and.           &
                       TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid)  .and. &
                       TLT%ZTW(i,j,bid) == 2

          if ( LMASK(i,j) ) then

            WORK1(i,j,kk) =  KAPPA_THIC(i,j,ktp,k+1,bid)     & 
                           * SLX(i,j,kk,ktp,k+1,bid)

            WORK2(i,j,kk) =  c2 * ( WORK1(i,j,kk)                 &
                           - ( KAPPA_THIC(i,j,kbt,k+1,bid)        &
                              * SLX(i,j,kk,kbt,k+1,bid) ) )

            WORK1(i,j,kk) = WORK1(i,j,kk) * dz(k+1)

            WORK3(i,j,kk) =  KAPPA_THIC(i,j,ktp,k+1,bid)     &
                           * SLY(i,j,kk,ktp,k+1,bid)

            WORK4(i,j,kk) =  c2 * ( WORK3(i,j,kk)                 &
                           - ( KAPPA_THIC(i,j,kbt,k+1,bid)        &
                              * SLY(i,j,kk,kbt,k+1,bid) ) )

            WORK3(i,j,kk) = WORK3(i,j,kk) * dz(k+1)

            endif
 
          LMASK(i,j) = LMASK(i,j) .and. TLT%K_LEVEL(i,j,bid) + 1 < KMT(i,j,bid)

          if (k.lt.km-1) then ! added to avoid out of bounds access

            if( LMASK(i,j) ) then

              WORK2_NEXT(i,j) = c2 * dzwr(k+1) * ( &
                KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) * dz(k+1) - &
                KAPPA_THIC(i,j,ktp,k+2,bid) * SLX(i,j,kk,ktp,k+2,bid) * dz(k+2))

              WORK4_NEXT(i,j) = c2 * dzwr(k+1) * ( &
                KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) * dz(k+1) - &
                KAPPA_THIC(i,j,ktp,k+2,bid) * SLY(i,j,kk,ktp,k+2,bid) * dz(k+2))

              endif 

          end if
             
          if( LMASK(i,j) .and. abs( WORK2_NEXT(i,j) ) < abs( WORK2(i,j,kk) ) ) &
            WORK2(i,j,kk) = WORK2_NEXT(i,j)

          if( LMASK(i,j) .and. abs(WORK4_NEXT(i,j)) < abs(WORK4(i,j,kk)) ) &
            WORK4(i,j,kk) = WORK4_NEXT(i,j)

           enddo
          enddo
          !$omp end do

          !$OMP END PARALLEL

          endtime = omp_get_wtime()

          total = total + (endtime - starttime)

        enddo
      enddo

Also attached is a standalone working version I created from the larger code, so that you can run it on your machines if you wish.

aketh_t_
Beginner

Here are the results versus thread count:

Xeon timings (seconds)
Threads   Time
1         1.10E-02
2         8.27E-03
4         6.37E-03
8         3.67E-03
16        2.65E-03

jimdempseyatthecove
Honored Contributor III

This looks like code posted elsewhere.

The runtime of the parallel region is too short to make effective use of parallelization. Your parallel region is inside your do k=1,km-1 and do kk=1,2 loops, i.e. (60-1)*2 = 118 iterations. The time you display is the sum of the runtimes of the enclosed parallel region, which yields per-entry parallel region run times of 1/118th of what is listed:

1  9.32203E-05
2  7.00847E-05
4  5.39831E-05
8  3.11017E-05
16 2.16949E-05

Starting and ending a parallel region has some expense that has to be amortized. Restructure your code to locate the parallel region further out (if possible at the do k= level).

Your code is not incrementing the variable bid.
Your LMASK array can be replaced with a scalar.
Your outputs are being overwritten (missing code or bug).

Jim Dempsey

aketh_t_
Beginner

As I told you, the changed.F90 is part of a huge file which you will not be able to run.

It's just a subroutine of CESM.

At the k level it isn't parallel.

So it isn't at the kk level either.

Look at the updates to WORK1 and you will notice why.

Variables are overwritten?

If you mean that I am changing WORK1(1,1,1) from the current iteration to the next, that is true; the code is meant to work that way.

aketh_t_
Beginner

Last time the issue was with speedup.

Now the issue is with scaling.

You can increase nx_block and ny_block to a large value to increase the code runtime and demonstrate speedup as well.

I was more concerned about false sharing. Isn't it a large overhead in my code, considering all the shared variables?

aketh_t_
Beginner

bid is a constant per entry to the CESM subroutine, so it won't change.

My code may look strange, but I can't explain why; that is how it is.

jimdempseyatthecove
Honored Contributor III

If you replace the LMASK array with an LMASK scalar (possible because each thread references only one cell per iteration), then setting LMASK won't write to memory and won't false-share on roughly the first 8 and last 8 cells of each thread's zone of the LMASK array. Even setting aside the false sharing, this saves the write to RAM and, to a lesser extent, the read-modify-write of the cache line containing the cell for LMASK(i,j).
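
For example, the first mask test could be written along these lines (a minimal sketch; lmask_s is an illustrative name, declared with the routine's other locals and added to the PRIVATE clause of the parallel region):

    logical :: lmask_s   ! declare with the other locals; add lmask_s to PRIVATE(...)

    !$omp do
    do j=1,ny_block
      do i=1,nx_block
        lmask_s = TLT%K_LEVEL(i,j,bid) == k  .and.            &
                  TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid)  .and.  &
                  TLT%ZTW(i,j,bid) == 1
        if ( lmask_s ) then
          ! ... same WORK1/WORK2/WORK2_NEXT/... updates as before ...
        endif
        ! the second mask test and the later comparisons use lmask_s
        ! in place of LMASK(i,j) in the same way
      enddo
    enddo
    !$omp end do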

To reduce the number of parallel region entries and exits (assuming the missing code permits this), consider:

starttime = omp_get_wtime()
!$OMP PARALLEL PRIVATE(I, k, kk) DEFAULT(SHARED)
! note, all threads redundantly executing these two loops
do k=1,km-1
  do kk=1,2
    ! slice the inner range      
    !$omp do  
    do j=1,ny_block
      do i=1,nx_block
... 
      enddo
    enddo
    !$omp end do
    ! The above !$OMP has an implied barrier
    ! this will keep the k and kk loops in sync
  enddo
enddo
!$OMP END PARALLEL
endtime = omp_get_wtime()
total = total + (endtime - starttime)

Jim Dempsey

aketh_t_
Beginner

I did try the LMASK change; I forgot to make it part of the standalone code I provided.

It offered a small benefit.

The code was about 1.2-1.5x faster on top of the existing speedup we got.

I will try the new method mentioned above. It seems good in theory; I must check whether it works well.

jimdempseyatthecove
Honored Contributor III

Although the compiler should optimize this:

            WORK1(i,j,kk) =  KAPPA_THIC(i,j,ktp,k+1,bid)     & 
                           * SLX(i,j,kk,ktp,k+1,bid)

            WORK2(i,j,kk) =  c2 * ( WORK1(i,j,kk)                 &
                           - ( KAPPA_THIC(i,j,kbt,k+1,bid)        &
                              * SLX(i,j,kk,kbt,k+1,bid) ) )

            WORK1(i,j,kk) = WORK1(i,j,kk) * dz(k+1)

consider using this instead:

            WORK1temp =  KAPPA_THIC(i,j,ktp,k+1,bid)     & 
                           * SLX(i,j,kk,ktp,k+1,bid)

            WORK2(i,j,kk) =  c2 * ( WORK1temp                 &
                           - ( KAPPA_THIC(i,j,kbt,k+1,bid)        &
                              * SLX(i,j,kk,kbt,k+1,bid) ) )

            WORK1(i,j,kk) = WORK1temp * dz(k+1)

And make similar use of temporaries where applicable.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

You might want to experiment with changing:

          if (k.lt.km-1) then ! added to avoid out of bounds access
            if( LMASK(i,j) ) then

to

          if (k.eq.km-1) LMASK(i,j) = .false. ! added to avoid out of bounds access (remove endif)
          if( LMASK(i,j) ) then

The reasoning is that whereas the former code might not vectorize, the latter may have a better chance of vectorizing.

Jim Dempsey

aketh_t_
Beginner

I tried combining the parallel regions at the k level as suggested.

Interestingly, there isn't much improvement.

It's still approximately 2.7E-03.

I am checking whether using local variables helps.

aketh_t_
Beginner

Using a local variable like WORKtemp seems to be a bad idea.

It must be because we are not vectorizing then?

Performance has degraded.

jimdempseyatthecove
Honored Contributor III

If WORKtemp knocked out vectorization (versus use of the array), then this is a lost optimization opportunity by the compiler. It may be a case where the major loop under optimization has too many things to keep track of.

Try the following (although I think it is redundant to do this):

    ! slice the inner range      
    !$omp do  
    do j=1,ny_block
      !$omp simd
      do i=1,nx_block

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Or

    ! slice the inner range      
    !$omp do simd collapse(2)
    do j=1,ny_block
      do i=1,nx_block

Jim Dempsey

aketh_t_
Beginner

I meant we could replace the scalar WORKtemp with a small array, e.g. WORKtemp(1:10), and work on vector lengths per iteration.
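
Roughly this kind of thing, assuming a strip length of 8 (VLEN, ib, ie, n and WORK1temp are illustrative names, and the mask handling is left out for brevity):

    integer, parameter :: VLEN = 8
    real :: WORK1temp(VLEN)
    integer :: ib, ie, n

    do ib = 1, nx_block, VLEN
      ie = min(ib+VLEN-1, nx_block)
      n  = ie - ib + 1
      WORK1temp(1:n)    = KAPPA_THIC(ib:ie,j,ktp,k+1,bid) * SLX(ib:ie,j,kk,ktp,k+1,bid)
      WORK2(ib:ie,j,kk) = c2 * ( WORK1temp(1:n)                 &
                        - KAPPA_THIC(ib:ie,j,kbt,k+1,bid)       &
                        * SLX(ib:ie,j,kk,kbt,k+1,bid) )
      WORK1(ib:ie,j,kk) = WORK1temp(1:n) * dz(k+1)
    enddo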

jimdempseyatthecove
Honored Contributor III

Try the simd clause with the scalar temp; then, if necessary, go back to the original array format, still using the simd clause (i.e. try it both ways).
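
Something along these lines (a sketch; note the scalar temporary must be listed PRIVATE on the simd construct):

    !$omp do
    do j=1,ny_block
      !$omp simd private(WORK1temp)
      do i=1,nx_block
        ! ... mask tests as before ...
        WORK1temp     = KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid)
        WORK2(i,j,kk) = c2 * ( WORK1temp                        &
                      - KAPPA_THIC(i,j,kbt,k+1,bid)             &
                      * SLX(i,j,kk,kbt,k+1,bid) )
        WORK1(i,j,kk) = WORK1temp * dz(k+1)
        ! ... remaining body ...
      enddo
    enddo
    !$omp end do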

Jim Dempsey

aketh_t_
Beginner

Not helping either.

It seems like KAPPA_THIC and SLX have unaligned access. We may have to copy them to temporaries and work on those; that should help.

Also, if you notice, WORK1, WORK2 and WORK2_NEXT go hand in hand.

I think splitting the loop to accommodate these dependencies will help.
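
Something like this copy-in (for the unaligned KAPPA_THIC / SLX access) is what I have in mind (a rough sketch; KT_kbt_k and SLX_kbt_k are illustrative names, and the copies would be made once per (k,kk) before the j/i loops):

    real, dimension(nx_block,ny_block) :: KT_kbt_k, SLX_kbt_k   ! contiguous 2-D temporaries

    KT_kbt_k (:,:) = KAPPA_THIC(:,:,kbt,k,bid)
    SLX_kbt_k(:,:) = SLX(:,:,kk,kbt,k,bid)

    ! inside the i/j loops the temporaries are used instead, e.g.
    !   WORK1(i,j,kk) = KT_kbt_k(i,j) * SLX_kbt_k(i,j) * dz(k)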

Anyway, I have been following your QuickThread programming site.

Do you publish papers, or have you published any papers on these models?

jimdempseyatthecove
Honored Contributor III

I haven't done much with the QuickThread threading toolkit since about 2009. I couldn't get enough followers to support continued development, which is surprising seeing that it resolves (resolved) virtually all of the threading issues from small single-CPU multi-core systems to large multi-CPU, many-core (and NUMA) systems. I've even suggested to Intel that they have me incorporate the threading-pool design concept into OpenMP (both C++ and Fortran) as an experimental extension. This could then be used as a test bed to either prove or disprove the merits of the extension. I am a strong proponent of making the pudding (proof of the pudding) and then letting the reviewers have a taste. Intel has experimented with non-standard extensions to OpenMP (e.g. CREW), and I am well familiar with the OpenMP limitations with regard to thread teaming.

Consider a problem similar to yours, only much larger, run on a multi-node NUMA system, and where you know you could gain significant performance provided you could control cache locality. This is extremely hard to do using current OpenMP features. That is especially true when one portion of your code favors one organization of thread teams and a different portion favors a different thread-teaming arrangement, and you do not want to oversubscribe or undersubscribe threads. With current OpenMP features there is simply no way to do this.

**** HYPOTHETICAL OpenMP extension ****

! Outer loop distributed by NUMA node
!$OMP PARALLEL DO TEAM(OneEachM0$) collapse(2) PRIVATE(I, k, kk) DEFAULT(SHARED)
do k=1,km-1
  do kk=1,2
    ! slice the inner range      
    !$omp parallel do TEAM(OneEachL2WithinM0$) ! slice by core within NUMA node
    do j=1,ny_block
      !$omp parallel do TEAM(L2$) ! slice by threads within core
      do i=1,nx_block
... 
      enddo
      !$omp end parallel do
    enddo
    !$omp end parallel do
  enddo
enddo
!$OMP END PARALLEL do

**** end HYPOTHETICAL OpenMP extension ****

Notes:

There is no requirement for any convoluted KMP_AFFINITY or OMP_PLACE_THREADS (which may not be ideal for all portions of the application).

There is no requirement for the application to know anything about the topology. On a small 1 CPU system the outer loop has a team of 1.

Unlike standard OpenMP nested parallelism, the proposed system constitutes a single thread pool: affinity-pinned compute threads plus non-pinned, higher-priority threads for I/O. The single pool is organizable in any manner you choose at the point of call (!$OMP...TEAM(...)), and it does so without creating a new pool (thus avoiding oversubscription). This organization is not constrained by an arrangement set up by way of environment variables (other than those used to constrict the application to a subset of the available logical processors).

As a further extension, assume that at some level of the above nested loop structure you need to output results data. With standard OpenMP, how would you do this without mucking up the running of the nested loop? About the only avenues you have are: a) create an independent thread using pthreads (or some other such thread); b) use an OpenMP TASK, but your nested loop is fully subscribed, so this will either oversubscribe or, most likely, perform the task immediately, effectively blocking the current thread for the duration of any I/O wait; or c) perform the I/O directly in the thread, possibly within a critical section, thus blocking all compute threads wishing to do I/O while another thread's I/O is pending. None of those three "solutions" is particularly appealing.

A better way would be to borrow the QuickThread feature of having an I/O thread or pool of threads.

**** HYPOTHETICAL OpenMP extension ****

! Outer loop distributed by NUMA node
!$OMP PARALLEL DO TEAM(OneEachM0$) PRIVATE(I, k, kk) DEFAULT(SHARED)
do k=1,km-1
  do kk=1,2
    ! slice the inner range      
    !$omp parallel do TEAM(OneEachL2$) ! slice by core 
    do j=1,ny_block
      !$omp parallel do TEAM(L2$) ! slice by threads within core
      do i=1,nx_block
... 
      enddo
      !$omp end parallel do
      !$omp task team(IO$) private(args) ! enqueue to I/O thread
      call yourIOroutine(args)
      !$omp end task
    enddo
    !$omp end parallel do
  enddo
enddo
!$OMP END PARALLEL do

**** end HYPOTHETICAL OpenMP extension ****

The above new statement would enqueue the task into a higher-priority thread queue, whose threads are typically not affinity-pinned (i.e. they will run on any logical processor, preferably a waiting one), and, most importantly, while the higher-priority thread is blocked performing I/O, the lower-priority compute thread that was preempted is permitted to resume. You can also optionally constrict the I/O threads to one or more logical processors (e.g. those of the last core on a Xeon Phi, which are otherwise not used, or on the socket directly connected to an I/O device, or a thread that "owns" the GPU).

Any comments by the community as to the desirability of features like this would be welcome.

Jim Dempsey

TimP
Honored Contributor III

I think we've said it before: the alignments that matter are those of the arrays being stored. If all the work arrays are set up under -align array32byte and the leading dimension is a multiple of 32 bytes, you can assert that those arrays are aligned. It does look as if your scaling is limited by memory bandwidth.
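
For example, something along these lines (a sketch; it assumes the work arrays really were compiled with -align array32byte and that the leading dimension keeps every column 32-byte aligned):

    !dir$ assume_aligned WORK1:32, WORK2:32, WORK3:32, WORK4:32
    !$omp do
    do j=1,ny_block
      do i=1,nx_block
        ! ... body as before; the compiler may now generate aligned stores ...
      enddo
    enddo
    !$omp end do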

jimdempseyatthecove
Honored Contributor III

Experiment with ASSOCIATE. Something like this (be careful):

associate (TLT_K_LEVELa => TLT%K_LEVEL(:,:,bid), &
&          KMTa => KMT(:,:,bid), &
&          TLT_ZTWa => TLT%ZTW(:,:,bid), &
&          ...) ! end of outer level associate
do k=1,km-1
  associate (dza => dz(k), &
&            dza1 => dz(k+1), &
&            ...) ! end of k level associate
  do kk=1,2
    starttime = omp_get_wtime()
    ! be careful of the kbt, ktp, k+1, ... associates
    associate (WORK1a => WORK1(:,:,kk), &
&              WORK2a => WORK2(:,:,kk), &
&              WORK3a => WORK3(:,:,kk), &
&              WORK4a => WORK4(:,:,kk), &
&              KAPPA_THICa => KAPPA_THIC(:,:,kbt,k,bid), &
&              SLXa => SLX(:,:,kk,kbt,k,bid), &
&              SLXa1 => SLX(:,:,kk,ktp,k+1,bid), &
&              ...) ! end of kk level associate
    !$OMP PARALLEL
    !$omp do  
    do j=1,ny_block
      do i=1,nx_block
        LMASK(i,j) = TLT_K_LEVELa(i,j) == k  .and.            &
                     TLT_K_LEVELa(i,j) < KMTa(i,j)  .and.  &
                     TLT_ZTWa(i,j) == 1

        if ( LMASK(i,j) ) then 
          WORK1a(i,j) =  KAPPA_THICa(i,j)  &
                         * SLXa(i,j) * dza
          WORK2a(i,j) = c2 * dzwra * ( WORK1a(i,j)            &
              - KAPPA_THICa1(i,j) * SLXa1(i,j) &
                                       * dza1 )
          ...
        endif
      enddo ! i
    enddo ! j
    !$omp end do
    !$OMP END PARALLEL
    end associate ! kk level
    endtime = omp_get_wtime()
    total = total + (endtime - starttime)
  enddo
  end associate ! k level
enddo
end associate ! outer level

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

Associate may or may not lose alignment attributes. You may need a !DIR$ to re-assert known alignments.

In the case of the arrays, ASSOCIATE will have to construct an appropriate array descriptor (the same thing as if you made a CALL). I cannot attest to whether this would be faster with ASSOCIATE or not; it depends on whether the optimizer had problems reducing the number of subscript references.
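
For example (a sketch; I have not verified whether ifort honors ASSUME_ALIGNED on an associate-name, so the safe form asserts alignment on the parent arrays just before the block):

    !dir$ assume_aligned WORK1:32, WORK2:32
    associate (WORK1a => WORK1(:,:,kk), &
    &          WORK2a => WORK2(:,:,kk))
      ! ... loops using WORK1a / WORK2a ...
    end associate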

Jim Dempsey
