Unexplained speedup: how?

aketh_t_ · ‎07-27-2015

Hi all,

I got a speedup of approximately 4x just by rewriting a few WHERE constructs in my loop as explicit DO loops, as illustrated below.

Unchanged code

          where ( LMASK )

            WORK1(:,:,kk) =  KAPPA_THIC(:,:,kbt,k,bid)  &
                           * SLX(:,:,kk,kbt,k,bid) * dz(k)
            WORK2(:,:,kk) = c2 * dzwr(k) * ( WORK1(:,:,kk)            &
              - KAPPA_THIC(:,:,ktp,k+1,bid) * SLX(:,:,kk,ktp,k+1,bid) &
                                            * dz(k+1) )

            WORK2_NEXT = c2 * ( &
              KAPPA_THIC(:,:,ktp,k+1,bid) * SLX(:,:,kk,ktp,k+1,bid) - &
              KAPPA_THIC(:,:,kbt,k+1,bid) * SLX(:,:,kk,kbt,k+1,bid) )

            WORK3(:,:,kk) =  KAPPA_THIC(:,:,kbt,k,bid)  &
                           * SLY(:,:,kk,kbt,k,bid) * dz(k)
            WORK4(:,:,kk) = c2 * dzwr(k) * ( WORK3(:,:,kk)            &
              - KAPPA_THIC(:,:,ktp,k+1,bid) * SLY(:,:,kk,ktp,k+1,bid) &
                                            * dz(k+1) )

            WORK4_NEXT = c2 * ( &
              KAPPA_THIC(:,:,ktp,k+1,bid) * SLY(:,:,kk,ktp,k+1,bid) - &
              KAPPA_THIC(:,:,kbt,k+1,bid) * SLY(:,:,kk,kbt,k+1,bid) )

          endwhere

Changed code

          do j=1,ny_block
            do i=1,nx_block

            if ( LMASK(i,j) ) then

            WORK1(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                           * SLX(i,j,kk,kbt,k,bid) * dz(k)

            WORK2(i,j,kk) = c2 * dzwr(k) * ( WORK1(i,j,kk)            &
              - KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) &
                                            * dz(k+1) )

            WORK2_NEXT(i,j) = c2 * ( &
              KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) - &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) )

            WORK3(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                           * SLY(i,j,kk,kbt,k,bid) * dz(k)

            WORK4(i,j,kk) = c2 * dzwr(k) * ( WORK3(i,j,kk)            &
              - KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) &
                                            * dz(k+1) )

            WORK4_NEXT(i,j) = c2 * ( &
              KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) - &
              KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) )

            endif

            enddo
          enddo

VTune was of little help: the run took an eternity (about a day), and the results did not explain the large speedup. It showed just a few red bars with the time taken, which agreed with the times reported by the OMP timers added between the loops.

The unchanged code was compiled with -O3 and the changed code with -O2. I can rule out loop fusion as the reason (that is, fusion of the loops for WORK1, WORK2, etc., which are implicit in the ":" array-section notation): according to the opt-report, the loops were fused for the original code.

I can guarantee that no OpenMP parallelization was implemented. All flags are the same in both cases (except for -O3 versus -O2). Is there any explanation for why the speedup is as high as 4x?

UPDATE

Hi, I checked the opt-report and found the following for the unchanged and changed code:

unchanged code

            remark #15448: unmasked aligned unit stride loads: 13
            remark #15449: unmasked aligned unit stride stores: 3
            remark #15450: unmasked unaligned unit stride loads: 3
            remark #15455: masked aligned unit stride stores: 6
            remark #15456: masked unaligned unit stride loads: 16

changed code

      remark #15448: unmasked aligned unit stride loads: 1
      remark #15454: masked aligned unit stride loads: 2
      remark #15455: masked aligned unit stride stores: 6
      remark #15456: masked unaligned unit stride loads: 16

TimP · ‎07-28-2015

It does look like the fusion of your WHERE version is incomplete, possibly as a consequence of the rank-2 array assignments, or possibly because that style is not used so frequently in the performance-critical situations that the compiler has been tuned to optimize. Keep in mind that the semantics of WHERE require the compiler to start by distributing the mask to each individual assignment, so it has a lot more work to do to get back to full sharing of operands through loop fusion.
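To make the point about distribution concrete, here is a minimal sketch of how a WHERE block reads when taken one statement at a time. The arrays a, b, w1, w2, v1, v2 and the sizes nx, ny are made up for illustration and are not taken from the code above. This is only the semantic starting point, not a claim about what the compiler actually generates for your loop.

```fortran
program where_distribution
  implicit none
  ! Toy sizes and arrays, assumed for illustration only.
  integer, parameter :: nx = 4, ny = 3
  logical :: mask(nx, ny)
  real    :: a(nx, ny), b(nx, ny)
  real    :: w1(nx, ny), w2(nx, ny)   ! results of the WHERE form
  real    :: v1(nx, ny), v2(nx, ny)   ! results of the distributed form
  integer :: i, j

  ! Small integer-valued data so the comparison below is exact.
  do j = 1, ny
    do i = 1, nx
      a(i, j)    = real(i + j)
      b(i, j)    = real(i * j)
      mask(i, j) = mod(i + j, 2) == 0
    end do
  end do
  w1 = 0.0; w2 = 0.0; v1 = 0.0; v2 = 0.0

  ! Array-syntax form: one mask, two assignments.
  where (mask)
    w1 = a * b
    w2 = 2.0 * (w1 - a)
  end where

  ! Statement-by-statement reading: the mask is applied to each
  ! assignment separately, giving one full masked pass per statement.
  do j = 1, ny
    do i = 1, nx
      if (mask(i, j)) v1(i, j) = a(i, j) * b(i, j)
    end do
  end do
  do j = 1, ny
    do i = 1, nx
      if (mask(i, j)) v2(i, j) = 2.0 * (v1(i, j) - a(i, j))
    end do
  end do

  ! Both forms must give the same values.
  if (any(w1 /= v1) .or. any(w2 /= v2)) error stop "results differ"
  print *, "WHERE and per-statement masked loops agree"
end program where_distribution
```

The hand-written DO loop version in the original post is the already-fused form of the two loop nests above: one pass, one test of the mask per element, and operands shared between statements.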

jimdempseyatthecove · ‎07-29-2015

What I suspect is happening is that, in the 4x test case, the array LMASK is sparsely (or at least not densely) populated with .true. values.

Under this circumstance, the code may have been performing masked load/store operations where the entire mask is .false..
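One way to check this would be to print the fraction of .true. entries in LMASK just before the loop. A minimal sketch follows, with a made-up mask and made-up block sizes standing in for the real ones. In the real code only the count/size line would be needed.

```fortran
program mask_density
  implicit none
  ! Block sizes and mask contents are assumed for illustration only.
  integer, parameter :: nx_block = 8, ny_block = 4
  logical :: lmask(nx_block, ny_block)
  real    :: density

  ! Hypothetical sparse mask: only a few points are active.
  lmask = .false.
  lmask(1:4, 1) = .true.

  ! count() returns the number of .true. elements, size() the total.
  density = real(count(lmask)) / real(size(lmask))
  print '(a, i0, a, i0)', 'active points: ', count(lmask), ' of ', size(lmask)
  print '(a, f6.3)', 'fraction of .true. entries: ', density
end program mask_density
```

If the fraction turns out to be small, many vector iterations would be running with an all-.false. mask, which would fit the suspicion above.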

Jim Dempsey