aketh_t_

07-27-2015
11:02 PM

Unexplained speedup how?

Hi all,

I got a speedup of approximately 4x just by rewriting a few where statements in my loop as explicit do loops, as illustrated below.

Unchanged code

where ( LMASK )
   WORK1(:,:,kk) = KAPPA_THIC(:,:,kbt,k,bid) &
                 * SLX(:,:,kk,kbt,k,bid) * dz(k)
   WORK2(:,:,kk) = c2 * dzwr(k) * ( WORK1(:,:,kk) &
                 - KAPPA_THIC(:,:,ktp,k+1,bid) * SLX(:,:,kk,ktp,k+1,bid) &
                 * dz(k+1) )
   WORK2_NEXT = c2 * ( &
                 KAPPA_THIC(:,:,ktp,k+1,bid) * SLX(:,:,kk,ktp,k+1,bid) - &
                 KAPPA_THIC(:,:,kbt,k+1,bid) * SLX(:,:,kk,kbt,k+1,bid) )
   WORK3(:,:,kk) = KAPPA_THIC(:,:,kbt,k,bid) &
                 * SLY(:,:,kk,kbt,k,bid) * dz(k)
   WORK4(:,:,kk) = c2 * dzwr(k) * ( WORK3(:,:,kk) &
                 - KAPPA_THIC(:,:,ktp,k+1,bid) * SLY(:,:,kk,ktp,k+1,bid) &
                 * dz(k+1) )
   WORK4_NEXT = c2 * ( &
                 KAPPA_THIC(:,:,ktp,k+1,bid) * SLY(:,:,kk,ktp,k+1,bid) - &
                 KAPPA_THIC(:,:,kbt,k+1,bid) * SLY(:,:,kk,kbt,k+1,bid) )
endwhere

Changed code

do j=1,ny_block
   do i=1,nx_block
      if ( LMASK(i,j) ) then
         WORK1(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) &
                       * SLX(i,j,kk,kbt,k,bid) * dz(k)
         WORK2(i,j,kk) = c2 * dzwr(k) * ( WORK1(i,j,kk) &
                       - KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) &
                       * dz(k+1) )
         WORK2_NEXT(i,j) = c2 * ( &
                       KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) - &
                       KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) )
         WORK3(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) &
                       * SLY(i,j,kk,kbt,k,bid) * dz(k)
         WORK4(i,j,kk) = c2 * dzwr(k) * ( WORK3(i,j,kk) &
                       - KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) &
                       * dz(k+1) )
         WORK4_NEXT(i,j) = c2 * ( &
                       KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) - &
                       KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) )
      endif
   enddo
enddo

We were unable to make use of VTune because the run under it took extremely long, about 1 day. It also gave no information explaining the high speedup, just a few red bars with the time taken. Those times agreed with the times shown by the OMP timers added between the loops.

The unchanged code was compiled with -O3, while the changed code was compiled with -O2. I can clearly rule out loop fusion as the reason (that is, fusion of the WORK1, WORK2, etc. loops, which are hidden in the : array notation): according to the opt-report, the loops were fused for the original code.

I can guarantee that no OpenMP parallelization was used. All flags are the same in both cases (except -O3 versus -O2). Is there any explanation for why the speedup is as high as 4x?

UPDATE

Hi, I checked the optrpt and found the following for the unchanged and changed code:

unchanged code

remark #15448: unmasked aligned unit stride loads: 13

remark #15449: unmasked aligned unit stride stores: 3

remark #15450: unmasked unaligned unit stride loads: 3

remark #15455: masked aligned unit stride stores: 6

remark #15456: masked unaligned unit stride loads: 16

changed code

remark #15448: unmasked aligned unit stride loads: 1

remark #15454: masked aligned unit stride loads: 2

remark #15455: masked aligned unit stride stores: 6

remark #15456: masked unaligned unit stride loads: 16

TimP

07-28-2015
05:15 AM

jimdempseyatthecove

07-29-2015
07:47 AM

What I suspect is happening is that, in the 4x test case, the array LMASK is sparsely (or at least not densely) populated with .true. values.

Under this circumstance, the code may have been performing masked load/store operations in which the entire mask is .false. for a given vector.

Jim Dempsey
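
One way to probe the sparse-mask hypothesis above is a small standalone timing test. The sketch below is not taken from the original code: the program name, array names, the 512 x 512 size, the repetition count, and the roughly 1% mask density are all assumptions chosen for illustration. It applies the same masked update once with a where construct and once with an explicit do/if loop, checks that both give the same result, and prints the two timings. Whether any difference appears, and in which direction, depends on the compiler, its flags, and the mask density.

```fortran
program where_vs_do
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  ! Hypothetical problem size and repetition count, chosen for illustration only
  integer, parameter :: nx = 512, ny = 512, nrep = 200
  real(dp), allocatable :: a(:,:), b(:,:), w_where(:,:), w_do(:,:)
  logical, allocatable :: lmask(:,:)
  integer :: i, j, rep
  integer(i8) :: t0, t1, t2, rate

  allocate(a(nx,ny), b(nx,ny), w_where(nx,ny), w_do(nx,ny), lmask(nx,ny))
  call random_number(a)
  call random_number(b)
  w_where = 0.0_dp
  w_do    = 0.0_dp

  ! Sparse mask: roughly 1% of the points are .true. (assumed density)
  lmask = a < 0.01_dp

  call system_clock(t0, rate)

  ! Version 1: array syntax under a where construct
  do rep = 1, nrep
     where (lmask)
        w_where = a * b + real(rep, dp)
     end where
  end do

  call system_clock(t1)

  ! Version 2: explicit loops with a scalar test of the mask
  do rep = 1, nrep
     do j = 1, ny
        do i = 1, nx
           if (lmask(i,j)) w_do(i,j) = a(i,j) * b(i,j) + real(rep, dp)
        end do
     end do
  end do

  call system_clock(t2)

  ! Both versions should agree up to rounding differences
  if (maxval(abs(w_where - w_do)) > 1.0e-12_dp) error stop 'results differ'

  print '(a,f10.4,a)', 'where construct: ', real(t1 - t0, dp) / real(rate, dp), ' s'
  print '(a,f10.4,a)', 'do/if loops:     ', real(t2 - t1, dp) / real(rate, dp), ' s'
  print '(a,i0,a,i0)', 'true mask points: ', count(lmask), ' of ', nx * ny
end program where_vs_do
```

Changing the threshold in the lmask assignment (for example to 0.5_dp or 0.99_dp) gives a rough picture of how the relative timings respond to mask density, which is the quantity the hypothesis depends on.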

