Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

## Unexplained speedup how?

Hi all,

I got a speedup of approximately 4x just by rewriting a few WHERE statements in my loop as explicit DO loops, as illustrated below.

Unchanged code

```
where ( LMASK )

  WORK1(:,:,kk) =  KAPPA_THIC(:,:,kbt,k,bid)  &
                 * SLX(:,:,kk,kbt,k,bid) * dz(k)
  WORK2(:,:,kk) = c2 * dzwr(k) * ( WORK1(:,:,kk)            &
                 - KAPPA_THIC(:,:,ktp,k+1,bid) * SLX(:,:,kk,ktp,k+1,bid) &
                 * dz(k+1) )

  WORK2_NEXT = c2 * ( &
      KAPPA_THIC(:,:,ktp,k+1,bid) * SLX(:,:,kk,ktp,k+1,bid) - &
      KAPPA_THIC(:,:,kbt,k+1,bid) * SLX(:,:,kk,kbt,k+1,bid) )

  WORK3(:,:,kk) =  KAPPA_THIC(:,:,kbt,k,bid)  &
                 * SLY(:,:,kk,kbt,k,bid) * dz(k)
  WORK4(:,:,kk) = c2 * dzwr(k) * ( WORK3(:,:,kk)            &
                 - KAPPA_THIC(:,:,ktp,k+1,bid) * SLY(:,:,kk,ktp,k+1,bid) &
                 * dz(k+1) )

  WORK4_NEXT = c2 * ( &
      KAPPA_THIC(:,:,ktp,k+1,bid) * SLY(:,:,kk,ktp,k+1,bid) - &
      KAPPA_THIC(:,:,kbt,k+1,bid) * SLY(:,:,kk,kbt,k+1,bid) )

endwhere
```

Changed code

```
do j=1,ny_block
  do i=1,nx_block

    ! Same mask as the WHERE version, tested per element
    if ( LMASK(i,j) ) then

      WORK1(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                     * SLX(i,j,kk,kbt,k,bid) * dz(k)

      WORK2(i,j,kk) = c2 * dzwr(k) * ( WORK1(i,j,kk)            &
                     - KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) &
                     * dz(k+1) )

      WORK2_NEXT(i,j) = c2 * ( &
          KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) - &
          KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) )

      WORK3(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                     * SLY(i,j,kk,kbt,k,bid) * dz(k)

      WORK4(i,j,kk) = c2 * dzwr(k) * ( WORK3(i,j,kk)            &
                     - KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) &
                     * dz(k+1) )

      WORK4_NEXT(i,j) = c2 * ( &
          KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) - &
          KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) )

    endif

  enddo
enddo
```

We were unable to make much use of VTune, as the run took forever (about 1 day). It also gave no information on why the speedup was so high, just a few red bars with the time taken. Those times agreed with the times shown by the OpenMP timers added between the loops.
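For readers who want to reproduce this kind of measurement without a profiler, here is a minimal sketch that times a WHERE statement against the equivalent masked DO loop using the standard `system_clock` intrinsic. The array names, the size `n`, and the single-statement kernel are hypothetical stand-ins, not the code from this post, and a single small run like this is only a rough indication.

```fortran
program time_where_vs_loop
  implicit none
  integer, parameter :: n  = 512                      ! hypothetical array size
  integer, parameter :: i8 = selected_int_kind(18)    ! 64-bit clock counts
  real,    allocatable :: a(:,:), w_where(:,:), w_loop(:,:)
  logical, allocatable :: lmask(:,:)
  integer(i8) :: t0, t1, t2, rate
  integer :: i, j

  allocate(a(n,n), w_where(n,n), w_loop(n,n), lmask(n,n))
  call random_number(a)
  lmask   = a > 0.5          ! roughly half the elements are .true.
  w_where = 0.0
  w_loop  = 0.0

  call system_clock(t0, rate)

  ! Array-syntax version
  where (lmask) w_where = 2.0 * a

  call system_clock(t1)

  ! Explicit loop version with the same mask
  do j = 1, n
    do i = 1, n
      if (lmask(i,j)) w_loop(i,j) = 2.0 * a(i,j)
    end do
  end do

  call system_clock(t2)

  ! Both versions must produce identical results before timings mean anything
  if (any(w_where /= w_loop)) error stop 'results differ'

  print '(a,es10.3,a)', 'where: ', real(t1 - t0) / real(rate), ' s'
  print '(a,es10.3,a)', 'loop:  ', real(t2 - t1) / real(rate), ' s'
end program time_where_vs_loop
```

Compiling both variants with the same optimization level removes one variable from the comparison.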

The unchanged code was compiled with -O3, while the changed code was compiled with -O2. I can clearly rule out loop fusion as a reason (fusion of the loops for WORK1, WORK2, which are hidden in the `:` array syntax): as per the opt-report, the loops were fused for the original code.

I can guarantee that no OpenMP parallelism was implemented. All flags are the same in both cases (except -O3). Is there any explanation for why the speedup is as high as 4x?

UPDATE

Hi, I checked the optrpt and found the following for the unchanged and changed code:

unchanged code

remark #15449: unmasked aligned unit stride stores: 3
remark #15455: masked aligned unit stride stores: 6

changed code

remark #15455: masked aligned unit stride stores: 6

2 Replies

It does look like the fusion of your WHERE version is incomplete, possibly as a consequence of the rank-2 array assignments, or possibly because that style is not so frequently used in the performance-critical situations which the compiler has been trained to optimize. You must recognize that the syntax of WHERE requires the compiler to start out by distributing the conditional to each individual assignment, so there is a lot more work to be done to get back to full sharing of operands by loop fusion.
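To make the point about distributing the conditional concrete, here is a minimal sketch that writes the same two masked assignments three ways: as a WHERE construct, as one masked sweep per statement (roughly what the compiler starts from), and as a single fused loop nest. The arrays `a`, `b`, `c` and the sizes are hypothetical, and this only illustrates the semantics; it does not show what any particular compiler actually generates.

```fortran
program where_semantics
  implicit none
  integer, parameter :: nx = 4, ny = 3      ! hypothetical block size
  real    :: a(nx,ny), b(nx,ny), c(nx,ny)
  real    :: w1(nx,ny), w2(nx,ny)           ! WHERE construct
  real    :: s1(nx,ny), s2(nx,ny)           ! one masked sweep per statement
  real    :: f1(nx,ny), f2(nx,ny)           ! single fused loop nest
  logical :: lmask(nx,ny)
  integer :: i, j

  ! Small integer-valued data so all three forms compare exactly
  do j = 1, ny
    do i = 1, nx
      a(i,j)     = real(i + j)
      lmask(i,j) = mod(i + j, 2) == 0
    end do
  end do
  b = 2.0
  c = 1.0
  w1 = 0.0; w2 = 0.0
  s1 = 0.0; s2 = 0.0
  f1 = 0.0; f2 = 0.0

  ! (1) WHERE: the mask applies to each assignment individually
  where (lmask)
    w1 = a * b
    w2 = w1 - c
  end where

  ! (2) Distributed form: a full masked sweep for every statement
  do j = 1, ny
    do i = 1, nx
      if (lmask(i,j)) s1(i,j) = a(i,j) * b(i,j)
    end do
  end do
  do j = 1, ny
    do i = 1, nx
      if (lmask(i,j)) s2(i,j) = s1(i,j) - c(i,j)
    end do
  end do

  ! (3) Fused form: one sweep, one mask test, operands shared
  do j = 1, ny
    do i = 1, nx
      if (lmask(i,j)) then
        f1(i,j) = a(i,j) * b(i,j)
        f2(i,j) = f1(i,j) - c(i,j)
      end if
    end do
  end do

  if (any(w1 /= s1) .or. any(w2 /= s2)) error stop 'per-statement form differs'
  if (any(w1 /= f1) .or. any(w2 /= f2)) error stop 'fused form differs'
  print '(a)', 'all three forms agree'
end program where_semantics
```

Form (2) reads the mask and re-traverses the arrays once per statement; getting from there to form (3) is the fusion work described above.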

What I suspect is happening is that, in the 4x test case, the array LMASK is sparsely (or at least not densely) populated with .true. values.

Under this circumstance, the code may have been performing masked load/store operations in which the entire mask is .false..

Jim Dempsey 
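
One way to check this hypothesis is to measure how dense the mask is and how many vector-width chunks of it are entirely .false.. The sketch below does that for a made-up mask; the block sizes, the mask pattern, and the chunk width `vlen` (standing in for the SIMD width) are all assumptions, not values from the original code.

```fortran
program mask_density
  implicit none
  integer, parameter :: nx_block = 8, ny_block = 8   ! hypothetical block size
  integer, parameter :: vlen = 4                     ! assumed SIMD width in elements
  logical :: lmask(nx_block, ny_block)
  integer :: i, j, ihi, ntrue, nchunks, nempty

  ! Made-up sparse mask: only the first two points of every row are active
  lmask = .false.
  lmask(1:2, :) = .true.

  ntrue   = count(lmask)
  nchunks = 0
  nempty  = 0

  ! Walk the contiguous (first) dimension in chunks of vlen elements and
  ! count the chunks in which no element of the mask is .true.
  do j = 1, ny_block
    do i = 1, nx_block, vlen
      ihi = min(i + vlen - 1, nx_block)
      nchunks = nchunks + 1
      if (.not. any(lmask(i:ihi, j))) nempty = nempty + 1
    end do
  end do

  print '(a,f5.2)', 'fraction of .true. in LMASK: ', &
        real(ntrue) / real(size(lmask))
  print '(a,i0,a,i0)', 'all-false chunks: ', nempty, ' of ', nchunks
end program mask_density
```

With this mask, a quarter of the elements are .true. and half of the chunks are entirely .false.; a masked vector version still processes those empty chunks, whereas a scalar loop with an `if` can skip them cheaply.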