Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- Software Development Topics
- Intel® Moderncode for Parallel Architectures
- Unexplained speedup how?

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

aketh_t_

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

07-27-2015
11:02 PM

50 Views

Unexplained speedup how?

Hi all,

I have got a speedup of approximately 4x with just rewriting a few where statements in my loop to explicit do loops as illustrated below.

Unchanged code

where ( LMASK ) WORK1(:,:,kk) = KAPPA_THIC(:,:,kbt,k,bid) & * SLX(:,:,kk,kbt,k,bid) * dz(k) WORK2(:,:,kk) = c2 * dzwr(k) * ( WORK1(:,:,kk) & - KAPPA_THIC(:,:,ktp,k+1,bid) * SLX(:,:,kk,ktp,k+1,bid) & * dz(k+1) ) WORK2_NEXT = c2 * ( & KAPPA_THIC(:,:,ktp,k+1,bid) * SLX(:,:,kk,ktp,k+1,bid) - & KAPPA_THIC(:,:,kbt,k+1,bid) * SLX(:,:,kk,kbt,k+1,bid) ) WORK3(:,:,kk) = KAPPA_THIC(:,:,kbt,k,bid) & * SLY(:,:,kk,kbt,k,bid) * dz(k) WORK4(:,:,kk) = c2 * dzwr(k) * ( WORK3(:,:,kk) & - KAPPA_THIC(:,:,ktp,k+1,bid) * SLY(:,:,kk,ktp,k+1,bid) & * dz(k+1) ) WORK4_NEXT = c2 * ( & KAPPA_THIC(:,:,ktp,k+1,bid) * SLY(:,:,kk,ktp,k+1,bid) - & KAPPA_THIC(:,:,kbt,k+1,bid) * SLY(:,:,kk,kbt,k+1,bid) ) endwhere

Changed code

do j=1,ny_block do i=1,nx_block if ( LMASK(i,j) ) then WORK1(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) & * SLX(i,j,kk,kbt,k,bid) * dz(k) WORK2(i,j,kk) = c2 * dzwr(k) * ( WORK1(i,j,kk) & - KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) & * dz(k+1) ) WORK2_NEXT(i,j) = c2 * ( & KAPPA_THIC(i,j,ktp,k+1,bid) * SLX(i,j,kk,ktp,k+1,bid) - & KAPPA_THIC(i,j,kbt,k+1,bid) * SLX(i,j,kk,kbt,k+1,bid) ) WORK3(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) & * SLY(i,j,kk,kbt,k,bid) * dz(k) WORK4(i,j,kk) = c2 * dzwr(k) * ( WORK3(i,j,kk) & - KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) & * dz(k+1) ) WORK4_NEXT(i,j) = c2 * ( & KAPPA_THIC(i,j,ktp,k+1,bid) * SLY(i,j,kk,ktp,k+1,bid) - & KAPPA_THIC(i,j,kbt,k+1,bid) * SLY(i,j,kk,kbt,k+1,bid) ) endif enddo enddo

We are unable to try Vtune as the code ran for eternity, like 1 day. Also no info as to why it showed such high speedup was not explained. Just a few red bars with time taken. They were in agreement to the time showed by omp timers added b/w the loops.

The compilation of unchanged code was with -O3 while changed code was with -O2. I can clearly rule out loop fusion as a reason(fusion of loops of work1, work2 which are hidden in : style language). As per the opt-report loops we fused for original code.

I can guarantee no Openmp was implemented. All flags are similar in both cases(except O3).Any explanation why is speedup as high as 4X.

UPDATE

Hi I did check the optrpt and found the following for unchanged and changed code

unchanged code

remark #15448: unmasked aligned unit stride loads: 13

remark #15449: unmasked aligned unit stride stores: 3

remark #15450: unmasked unaligned unit stride loads: 3

remark #15455: masked aligned unit stride stores: 6

remark #15456: masked unaligned unit stride loads: 16

changed code

remark #15448: unmasked aligned unit stride loads: 1

remark #15454: masked aligned unit stride loads: 2

remark #15455: masked aligned unit stride stores: 6

remark #15456: masked unaligned unit stride loads: 16

Link Copied

2 Replies

TimP

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

07-28-2015
05:15 AM

50 Views

jimdempseyatthecove

Black Belt

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

07-29-2015
07:47 AM

50 Views

What I suspect to be happening is, in the 4x test case, the array LMASK is sparsely (or at least not densely) populated with .true..

Under this circumstance, the code may have been performing masked load/store operations where the entire mask is .false..

Jim Dempsey

For more complete information about compiler optimizations, see our Optimization Notice.