Where statement and Vectorization

aketh_t_ · ‎06-22-2015

I would like to know how the WHERE statement affects vectorization.

My belief is that it's bad for vectorization.

Here is a short part of the original code:

 where ( LMASK )

            WORK1(:,:,kk) =  KAPPA_THIC(:,:,kbt,k,bid)  &
                           * SLX(:,:,kk,kbt,k,bid) * dz(k)
.
.
.
endwhere

Even the optimization report (optrpt) seems to suggest the same.

LOOP BEGIN at /storage/home/aketh/cesm/cases/B_f45_g37/exe/ocn/source/hmix_gm.F90(3641,13)
remark #15389: vectorization support: reference work1 has unaligned access
remark #15389: vectorization support: reference hmix_gm_mp_kappa_thic_ has unaligned access
remark #15389: vectorization support: reference hmix_gm_submeso_share_mp_slx_ has unaligned access
remark #15381: vectorization support: unaligned access used inside loop body
remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
remark #15450: unmasked unaligned unit stride loads: 1
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 23
remark #15477: vector loop cost: 55.000
remark #15478: estimated potential speedup: 0.420
remark #15479: lightweight vector operations: 13
remark #15480: medium-overhead vector operations: 2
remark #15481: heavy-overhead vector operations: 3
remark #15488: --- end vector loop cost summary ---

But at the WHERE statement itself it seemed to show some benefit. What does this mean? That the WHERE loop runs faster?

LOOP BEGIN at /storage/home/aketh/cesm/cases/B_f45_g37/exe/ocn/source/hmix_gm.F90(3639,11)
<Multiversioned v2>
remark #15388: vectorization support: reference 4692 has aligned access
remark #15388: vectorization support: reference lmask has aligned access
remark #15300: LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 1
remark #15449: unmasked aligned unit stride stores: 1
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 4
remark #15477: vector loop cost: 0.750
remark #15478: estimated potential speedup: 5.300
remark #15479: lightweight vector operations: 3
remark #15488: --- end vector loop cost summary ---
LOOP END

I rewrote the code as:

WORK1(:,:,kk) = (1 + LMASK)*WORK1(:,:,kk) + (KAPPA_THIC(:,:,kbt,k,bid) &
* SLX(:,:,kk,kbt,k,bid) * dz(k)) * LMASK * -1

which seemed to work well according to the optrpt:

remark #15478: estimated potential speedup: 1.930
remark #15479: lightweight vector operations: 20
remark #15480: medium-overhead vector operations: 1
remark #15487: type converts: 2
remark #15488: --- end vector loop cost summary ---

However, the final timings per iteration are as follows (Xeon runs only):

Xeon unchanged code 9.1004E-004 seconds per iteration

Xeon changed code 2.0971E-003 seconds per iteration

Am I doing something wrong in optimization here?

jimdempseyatthecove · ‎06-23-2015

The optimization reports seem to indicate that WORK1, KAPPA_THIC, and possibly LMASK were not aligned and/or not attributed as such.

As for your rewritten code, the two statements are not equivalent. In Intel Fortran's default mode, only the least significant bit of a logical indicates .TRUE. or .FALSE.; the remaining bits are undefined. You can specify -fpscomp logicals to ensure 1 and 0 are used. However, this means your code is reliant upon something specified external to the code, and this leads to a potential for error.

Also, your stated replacement is missing ()'s around (LMASK -1).

Instead of LMASK, create an RMASK of the same type as WORK1 and KAPPA_THIC. This then can contain 0.0 or 1.0. This will avoid the conversion from INTEGER to the real type of WORK1/KAPPA_THIC.
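
As a minimal sketch of the RMASK idea above: the module below builds a real mask from a logical one with the standard MERGE intrinsic, then blends old and new values arithmetically. The module and function names, the kind parameter r8, and the rank-1 arrays (used for brevity) are assumptions for illustration, not part of the original code.

```fortran
module rmask_blend_demo
  implicit none
  integer, parameter :: r8 = selected_real_kind(13)  ! assumed double precision kind
contains
  ! Reference version: masked assignment with WHERE.
  pure function update_where(work, lmask, newval) result(res)
    real(r8), intent(in) :: work(:), newval(:)
    logical,  intent(in) :: lmask(:)
    real(r8) :: res(size(work))
    res = work
    where (lmask) res = newval
  end function update_where

  ! Build a 0.0/1.0 mask portably, without relying on the bit pattern of LOGICAL.
  pure function to_rmask(lmask) result(rmask)
    logical, intent(in) :: lmask(:)
    real(r8) :: rmask(size(lmask))
    rmask = merge(1.0_r8, 0.0_r8, lmask)
  end function to_rmask

  ! Arithmetic version: rmask = 1.0 takes newval, rmask = 0.0 keeps work.
  pure function update_rmask(work, rmask, newval) result(res)
    real(r8), intent(in) :: work(:), rmask(:), newval(:)
    real(r8) :: res(size(work))
    res = (1.0_r8 - rmask) * work + rmask * newval
  end function update_rmask
end module rmask_blend_demo
```

Note that the arithmetic form touches every element, so it differs from WHERE if a masked-out element of either operand is Inf or NaN (0.0 times Inf is NaN).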

Jim Dempsey

TimP · ‎06-24-2015

As Jim points out, your way of using arithmetic equivalence between logical and integer is non-portable. First, you have arithmetic expressions on logicals, which are permitted only as a DEC/Intel Fortran extension and so will not work with other compilers. Second, you would need -fpscomp logicals (or -standard-semantics) to make .true. have a numerical value of 1, and .false. 0, in accordance with Fortran 2008. Otherwise, I believe ifort .true. has a value of -1, but you can't rely on it. I have no desire to revert to FPS64 days, when we would use .true. or .false. as a mask which could work on both integer (32 bits used out of a 64-bit maskable group) and real (all 64 bits used), and all sorts of conditional compilation was needed to make it work on other platforms. Part of the value proposition of MIC is the ability to work efficiently with standard source code.

Archaic usage of arithmetic expressions for conditional expression selection was done portably using the SIGN intrinsic. As I've said repeatedly, MIC as well as recent Xeon have blend instructions, which the compiler uses to generate more efficient code for MERGE selection (or possibly for WHERE).

WHERE...ELSEWHERE...ENDWHERE is notoriously inefficient even though WHERE without ELSEWHERE might be satisfactory. WHERE(condition) followed by a separate WHERE(.NOT.condition) may be much faster than ELSEWHERE. It may be a chicken and egg situation; if high profile applications used ELSEWHERE in performance critical context, it might get more attention to optimization. The situation isn't limited to ifort. As Jim said, you seem to want to confuse us by showing examples which aren't equivalent.
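
To make the alternatives concrete, here is a small sketch of the three forms mentioned: WHERE...ELSEWHERE, two separate WHERE statements with complementary masks, and MERGE. The function names and the expressions being selected are hypothetical. All three give the same result; which one the compiler vectorizes best has to be measured.

```fortran
module where_variants_demo
  implicit none
contains
  ! One construct with both branches.
  pure function with_elsewhere(a, b) result(c)
    real, intent(in) :: a(:), b(:)
    real :: c(size(a))
    where (a > 0.0)
      c = a * b
    elsewhere
      c = -b
    end where
  end function with_elsewhere

  ! Two single-branch WHERE statements sharing a precomputed mask.
  pure function with_two_wheres(a, b) result(c)
    real, intent(in) :: a(:), b(:)
    real :: c(size(a))
    logical :: positive(size(a))
    positive = a > 0.0
    where (positive) c = a * b
    where (.not. positive) c = -b
  end function with_two_wheres

  ! MERGE: both candidates may be evaluated for every element, then one is selected.
  pure function with_merge(a, b) result(c)
    real, intent(in) :: a(:), b(:)
    real :: c(size(a))
    c = merge(a * b, -b, a > 0.0)
  end function with_merge
end module where_variants_demo
```

Because MERGE may evaluate both candidate expressions for every element, it is only a safe substitute when neither expression can fault (for example a division by zero) on the elements it is not selected for.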

Alignment in a multi-rank array requires not only array alignment (-align array64byte, __attribute__((aligned(64))), !dir$ aligned...) but also that the leading dimensions be multiples of 64 bytes. You didn't mention whether you were aware of this, nor did you show enough code. Pardon me if you've already seen us harping on this. If the first dimension of your array is fairly small (e.g. < 1000), and the array is not aligned strictly to cache, the option -qopt-assume-safe-padding may help.

aketh_t_ · ‎06-24-2015

Hi,

I thought alignment meant only adding the directive

!dir$ aligned........

but now I realize the leading dimension must also be a multiple of 64 bytes.

That seems to explain why the optrpt said unaligned.

My work1 has dimensions 29 * 33 * 2 and is real(r8).

So you are telling me I must rewrite it as work1(64*33*22)?

jimdempseyatthecove · ‎06-24-2015

The alignment is in bytes, to the size of a cache line (64 bytes). With real(8), the byte alignment requirement (64/8) indicates a leading dimension that is a multiple of 8 real(8)'s, e.g. (32,33,2), with the additional requirement that the array itself is aligned to 64 bytes.
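
The arithmetic above can be sketched as a small helper that rounds a leading extent up to the next multiple of the alignment. The function name padded_extent and its argument names are hypothetical, and it assumes the alignment is a whole multiple of the element size.

```fortran
module padding_demo
  implicit none
contains
  ! Smallest extent >= n whose size in bytes is a multiple of align_bytes.
  ! Assumes align_bytes is a multiple of elem_bytes.
  pure integer function padded_extent(n, elem_bytes, align_bytes) result(np)
    integer, intent(in) :: n, elem_bytes, align_bytes
    integer :: per_chunk
    per_chunk = align_bytes / elem_bytes            ! elements per aligned chunk
    np = ((n + per_chunk - 1) / per_chunk) * per_chunk  ! round up with integer division
  end function padded_extent
end module padding_demo
```

For the 29 x 33 x 2 real(r8) array, padded_extent(29, 8, 64) gives 32, so the array would be declared as (32,33,2) while the computation still runs over 1:29 in the first dimension.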

However, in this case, I think the compiler is smart enough to "fuse" the implied loop over the use of WORK1(:,:,KK) such that the alignment is only required of the first element.

The aligned attribute must be placed where the variable is defined, as well as potentially on the dummy arguments of the routines that are called with this aligned variable (assuming you want that routine to be optimally vectorized).

CAUTION: while you can place the aligned attribute on a dummy argument of an array with unknown provenance, should that array in fact not be aligned, you could experience a nasty segmentation fault (SIGSEGV).

Jim Dempsey

aketh_t_ · ‎06-24-2015

What exactly do you mean by dummy arguments here?

I am using Xeon (AVX), not Phi. Is the cache line 64 bits again?

Does "not aligned" in fact imply that the leading array dimension is not a multiple of 64/32, etc.?

James_C_Intel2 · ‎06-24-2015

What exactly do you mean by dummy arguments here?

The name by which an argument to a subroutine or function is known inside the scope of the subroutine or function. When you call the subroutine or function you pass an "actual argument" which is bound to the dummy argument. This is normal Fortran standardese which has been in use at least since it was FORTRAN. (In C/C++ you'd call it a parameter of the function, but in Fortran a parameter is a named constant...)
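
A minimal illustration of the terminology, with hypothetical names: inside scale_by, x and factor are dummy arguments; at the call site, work and 2.0 are the actual arguments bound to them.

```fortran
module dummy_arg_demo
  implicit none
contains
  ! x and factor are dummy arguments: names that exist only inside scale_by.
  subroutine scale_by(x, factor)
    real, intent(inout) :: x(:)
    real, intent(in)    :: factor
    x = x * factor
  end subroutine scale_by

  function demo() result(work)
    real :: work(3)
    work = [1.0, 2.0, 3.0]
    ! work and 2.0 are the actual arguments associated with x and factor.
    call scale_by(work, 2.0)
  end function demo
end module dummy_arg_demo
```

An alignment directive placed on x would therefore be a promise about every actual argument ever passed to scale_by, not just about work.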

I am using Xeon (AVX), not Phi. Is the cache line 64 bits again?

The cache line is 64 BYTES on both machines.

Michael_S_17 · ‎06-24-2015

WHERE...ELSEWHERE...ENDWHERE is notoriously inefficient even though WHERE without ELSEWHERE might be satisfactory. WHERE(condition) followed by a separate WHERE(.NOT.condition) may be much faster than ELSEWHERE.

Hi, as a WHERE...ELSEWHERE 'fan' I'd like to point to the book by Clerman and Spector, 'Modern Fortran: Style and Usage'. On pages 226/227 they explain some conditions under which WHERE...ELSEWHERE might not perform well, but also show a nice 'compression technique' to improve performance in those cases.

best regards michael

jimdempseyatthecove · ‎06-24-2015

The WHERE statement essentially constructs a filter.

When the condition list is preponderantly populated with .TRUE. .AND. if the code is amenable to vectorization, then it may be best to replace the WHERE with a computational statement that can produce both results (one essentially a NOP and the other with the value you want).

When the condition list is .NOT. preponderantly populated with .TRUE. .OR. if the code is .NOT. amenable to vectorization, then use the WHERE statement.

The tradeoff is the otherwise unnecessary computation and memory fetch/store against the gain of better vectorization.

If only 1 value out of 8 (AVX512), or 1 value out of 4 (AVX256), evenly dispersed, were .TRUE. in the condition array, then the performance of the WHERE versus the expression replacing the WHERE might be comparable.

When the .TRUE. values are not evenly dispersed, then the WHERE may be better.

This is not a case where making a choice in programming style will guarantee better performance.

Jim Dempsey

TimP · ‎06-24-2015

32-byte alignment would be sufficient on host, until AVX512 comes along. On this forum, the assumption is you're optimizing for MIC. You might be getting mis-targeted advice about your particular case if you were running on an old host CPU.

If the compiler collapsed the loops, as Jim points out seems desirable, there should be a notation to that effect in the vector report. Otherwise, 64-byte alignment would require the leading dimension to be a multiple of 16 for 32-bit data types, 8 for 64-bit data.

Vectorization of a conditional assumes that it's worth while to perform speculative evaluation of both the .true. and .false. branches and merge the results.

In my benchmarks at https://github.com/tprince/lcd I show some cases of WHERE and WHERE(condition)...WHERE(.not.condition) performing well, along with cases where MERGE is better, and some cases where plain IF..ELSE blocks are good with alignment directives.

If Jim means that it's not possible to choose, among the various options of Fortran syntax, a programming style which always gives optimum performance, I am in agreement.

aketh_t_ · ‎06-24-2015

We do have Xeon Phis.

But our interest is to vectorize and obtain performance on Xeons first, then later move to the Phis and compare.

aketh_t_ · ‎06-24-2015

Also I would like to point out:

1) I do not pass most of my data as dummy arguments; it comes through modules instead.

My dz, dzwr, etc. are from the grids.F90 file, which I access through a USE statement. So is it sufficient to declare them aligned in the blocks.F90 file?

There are variables like TLT%KLEVEL which are public in the existing module but not local to the subroutine, and which are of derived type.

So am I expected to declare them as aligned with !dir$ aligned 32 for Xeon?

2) Must I increase my array size to 32 * 33 * 2?

jimdempseyatthecove · ‎06-25-2015

>>But our interest is to vectorize and obtain performance on Xeons first, then later move to the Phis and compare.

Aligning to the larger vector size won't hurt performance on the host processor. A little waste of RAM. Note, you can conditionalize the alignment:

! in some module
!DIR$ IF DEFINED(__MIC__)
INTEGER, PARAMETER :: VECTOR_ALIGNMENT = 64
!DIR$ ELSEIF DEFINED(YOUR_HOST_ALIGNMENT)
INTEGER, PARAMETER :: VECTOR_ALIGNMENT = YOUR_HOST_ALIGNMENT
!DIR$ ELSE
INTEGER, PARAMETER :: VECTOR_ALIGNMENT = 32
!DIR$ ENDIF

When building for pre-AVX you could define YOUR_HOST_ALIGNMENT=16

You can then use VECTOR_ALIGNMENT for code both inside and outside offload regions as well as Host-only and MIC-only code.

>>2) Must I increase my array size to 32 * 33 * 2?

Before you change the dimensions, .AND. provided your arrays are .NOT. POINTER (meaning the allocation is known to be contiguous), .AND. your allocations are aligned, then give this a try:

ASSOCIATE (WORK1slice => WORK1(:,:,kk), KAPPA_THICslice => KAPPA_THIC(:,:,kbt,k,bid), &
           SLXslice => SLX(:,:,kk,kbt,k,bid), RMASKslice => RMASK(:,:))
  !DIR$ ASSUME_ALIGNED WORK1slice, KAPPA_THICslice, SLXslice, RMASKslice
  ! RMASK is 1.0 where LMASK is .TRUE., else 0.0: keep the old value where it is 0.0
  WORK1slice = (1.0 - RMASKslice) * WORK1slice &
             + (KAPPA_THICslice * SLXslice * dz(k) * RMASKslice)
END ASSOCIATE

Jim Dempsey

jimdempseyatthecove · ‎06-25-2015

Steve,

Will the above produce 1D array slices?

If not, is there a syntax that does:

Unknown whether this is valid: ASSOCIATE (WORK1slice(:) => WORK1(:,:,kk))

Or perhaps with TRANSFER that does so without copying the data? IOW use different shape on same data.

While a POINTER could be used, it might interfere with vectorization, but then the !DIR$ ASSUME_ALIGNED might compensate for it.
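
One standard way to get a different shape on the same data, offered here only as a sketch and not as an answer about ASSOCIATE, is sequence association: a contiguous 2D section passed to an explicit-shape rank-1 dummy argument is seen as a flat vector of n elements. The names blend_flat and demo_flat and the tiny 2 x 2 x 2 array are hypothetical. Whether the compiler passes the section without a temporary copy is not guaranteed by the standard, although a section like work3d(:,:,2) is contiguous.

```fortran
module flat_view_demo
  implicit none
contains
  ! Explicit-shape dummies: the caller's storage is viewed as n contiguous elements.
  subroutine blend_flat(n, work, rmask, newval)
    integer, intent(in) :: n
    real, intent(inout) :: work(n)
    real, intent(in)    :: rmask(n), newval(n)
    work = (1.0 - rmask) * work + rmask * newval
  end subroutine blend_flat

  function demo_flat() result(work3d)
    real :: work3d(2,2,2), rmask(2,2), newval(2,2)
    work3d = 1.0
    rmask  = reshape([1.0, 0.0, 0.0, 1.0], [2,2])
    newval = 9.0
    ! Rank-2 actual arguments are sequence-associated with the rank-1 dummies.
    call blend_flat(size(rmask), work3d(:,:,2), rmask, newval)
  end function demo_flat
end module flat_view_demo
```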

Jim Dempsey