Software Archive
Read-only legacy content

not vectorizing for no reason

aketh_t_
Beginner
1,306 Views

Hi all,

I have isolated a small section of a loop in my code to vectorize, and to test other kinds of optimization as well (like alignment, etc.).

Here is the actual code.

WORK1(:,:,kk) =  KAPPA_THIC(:,:,kbt,k,bid)  * SLX(:,:,kk,kbt,k,bid) * dz(k)

The optrpt says this:

LOOP BEGIN at loop.F90(91,13)
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
   remark #25436: completely unrolled by 8

   LOOP BEGIN at loop.F90(91,13)
      remark #15388: vectorization support: reference work1 has aligned access
      remark #15388: vectorization support: reference slx has aligned access
      remark #15388: vectorization support: reference kappa_thic has aligned access
      remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
      remark #15399: vectorization support: unroll factor set to 2
      remark #15448: unmasked aligned unit stride loads: 1
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 23
      remark #15477: vector loop cost: 94.500
      remark #15478: estimated potential speedup: 0.480
      remark #15479: lightweight vector operations: 15
      remark #15480: medium-overhead vector operations: 1
      remark #15481: heavy-overhead vector operations: 2
      remark #15488: --- end vector loop cost summary ---
      remark #25436: completely unrolled by 8
   LOOP END

However I re-wrote the loop as 

 !dir$ SIMD
      do i=1,8
       do j=1,8
      WORK1(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid) * SLX(i,j,kk,kbt,k,bid) * dz(k)
       end do
      end do

The optrpt is blank here (i.e., no mention of whether it was vectorized or not).

Also I have got a warning stating:

warning #13379:  was not vectorized with "simd"

Even with the -vec-threshold0 option, it seems to fail.

Any help or ideas on why this is happening?

 

0 Kudos
22 Replies
jimdempseyatthecove
Honored Contributor III

Change your loop levels. IOW, have the leftmost index of your arrays be the innermost loop control variable.

 !dir$ SIMD
      do j=1,8
       do i=1,8
      WORK1(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid) * SLX(i,j,kk,kbt,k,bid) * dz(k)
       end do
      end do

The compiler optimization can at times do this (interchange loop order) for you, but in the presence of !DIR$ SIMD you are directing the compiler that your code is suitable for SIMD, and this may result in less optimization analysis by the compiler. Also, if kk is the immediately next outer loop (with no intervening code that modifies the input sections of the arrays listed on the rhs), then consider moving the !DIR$ SIMD out one loop level.
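If kk is that immediately enclosing loop, the hoisted directive would look something like this (a sketch based on the snippet above; the kk=1,2 bound and the fixed indices kbt, k, and bid are taken from the other snippets in this thread):

```fortran
!dir$ SIMD
      do kk=1,2
        do j=1,8
          do i=1,8          ! leftmost index innermost: stride-1 access
            WORK1(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) * SLX(i,j,kk,kbt,k,bid) * dz(k)
          end do
        end do
      end do
```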

Jim Dempsey

aketh_t_
Beginner

1) What is IOW?

2) How does that help? I would be fetching the first element of every row rather than going down the columns. That would mean unaligned access to data, which is supposed to be non-beneficial, right?

3) Yes, I have a kk loop, within which is a where statement:

kk loop

      where(lmask)

            mycode

      end where

end kk loop

TimP
Honored Contributor III

The quoted report indicates that the rank-2 assignment vectorization was abandoned due to a large predicted loss in performance, possibly due to picking the wrong loop nest order. I wonder why the compiler doesn't handle multiple-rank assignment well in cases like this, which don't appear too difficult.

On such short loops, vectorization might be much improved with collapse, so you might try !$omp simd collapse(2) in place of Jim's simd directive, if the data declarations qualify (e.g. leading array dimensions are 8).  Include the aligned clause if possible.  If you aren't using -align array64byte that could be important, as might -opt-assume-safe-padding.
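Applied to the 8x8 snippet above, the collapsed form might look like this (a sketch; it assumes the leading dimensions really are 8 and that the arrays were compiled with 64-byte alignment, e.g. via -align array64byte; the !$omp simd directive also needs the OpenMP SIMD option enabled, e.g. -qopenmp-simd):

```fortran
!$omp simd collapse(2) aligned(WORK1, KAPPA_THIC, SLX : 64)
      do j=1,8
        do i=1,8
          ! collapse(2) lets the compiler vectorize the fused 64-iteration space
          WORK1(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) * SLX(i,j,kk,kbt,k,bid) * dz(k)
        end do
      end do
```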

jimdempseyatthecove
Honored Contributor III

IOW is short for "in other words"

In Fortran, the leftmost index represents data that is closest together in memory (stride-1 arrays: the elements are adjacent). Adjacent data favors vectorization. IOW, one load instruction can (may) fetch a complete vector of data.

Your original loop structure processed:

every 8th element starting at the first element, for 8 elements,
then every 8th element starting at the second element, for 8 elements,
etc...

The above being performed using scalar instructions.

Changing loop nest order permits:

Process a vector's worth of elements at a time, stepping a vector's width down the array, for the entire array.

The above is a simplistic description and greatly depends on the expression on the right side of the = in your code.

Think of your arrays as a string of nuts, and you are interested in grabbing a handful of nuts. Would it be more efficient to:

a) Pick every 8th nut out of the string of nuts? Or,
b) Work down the string of nuts a handful at a time?

Jim Dempsey

TimP
Honored Contributor III

Ideally, your 8x8 rank 2 array sections would be processed by 4 simd parallel instructions, if single precision, or 8 simd parallel repetitions, if double.  If you succeed in vectorizing with non-sequential access, at best you will get 2 array elements to or from memory per clock cycle on KNC, so it costs at least a factor of 2 in performance.

Frances_R_Intel
Employee

Aketh,

As to why you didn't get a vectorization report from your explicit loop - as Jim Dempsey pointed out, you have a stride of 8. The compiler didn't see any value in vectorizing this code nor did it see any value in unrolling it because of the stride. As to why it didn't interchange the loops, I don't know - perhaps if you upped the optimization level? By default, in the vectorization report, the compiler only tells you about the loops it was able to do something with. If you want information on what the compiler didn't do, you will need to increase the level of the report.

The point Tim makes is very interesting - the optimal way to deal with this loop would be 4 simd parallel instructions if single precision, or 8 simd parallel repetitions if double. Using the 2015 compiler and an ASSUME_ALIGNED directive, the implicit loop vectorizes for me and produces just what Tim says it should - which raises the question: which version of the compiler are you using and why is it not doing the same thing?

aketh_t_
Beginner

Hi,

I am using the Intel 15 set of compilers.

Frances_R_Intel
Employee

Your listing says that it knows your arrays are aligned, but does it change anything if you explicitly tell the compiler that? (!DIR$ ASSUME_ALIGNED var_name:num_bytes — the alignment is specified in bytes)
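For example, with the 64-byte alignment that -align array64byte provides, the directive would be written as follows (a sketch; the :64 argument is the byte alignment, and the loop body is copied from the snippets above):

```fortran
! Assert to the compiler that the first element of each array
! sits on a 64-byte boundary (true under -align array64byte).
!DIR$ ASSUME_ALIGNED WORK1:64, KAPPA_THIC:64, SLX:64
      do j=1,8
        do i=1,8
          WORK1(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) * SLX(i,j,kk,kbt,k,bid) * dz(k)
        end do
      end do
```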

Also, returning to your original question - have you tried increasing the optimization report level to see if you get any information about the loop that the compiler didn't modify?

aketh_t_
Beginner

My qopt-report-phase=5. 

aketh_t_
Beginner

Hi all,

I tried this:

!dir$ SIMD
      do j=1,8
       do i=1,8
      WORK1(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                      * SLX(i,j,kk,kbt,k,bid) * dz(k)
       end do
      end do

Still it says test.F90(80): (col. 7) warning #13379:  was not vectorized with "simd"

I will attach the complete code and the optrpt.

Also attached is the .optrpt, which still has no useful data, even with -qopt-report=5.

I am yet to try !DIR$ ASSUME_ALIGNED var_name:num_bytes. I do not know what alignment value to use.

Any help?

TimP
Honored Contributor III

You've pruned too much from your example, leaving just a bit of dead code, so the compiler eliminates it rather than vectorizing.

aketh_t_
Beginner

Hi, you can replace the test code with this code. (Do not change any variable declarations.)

!dir$ SIMD 
      do k=1,km-1

        do kk=1,2

          LMASK = TLT%K_LEVEL(:,:,bid) == k  .and.            &
                  TLT%K_LEVEL(:,:,bid) < KMT(:,:,bid)  .and.  &
                  TLT%ZTW(:,:,bid) == 1

          where ( LMASK )

            WORK1(:,:,kk) =  KAPPA_THIC(:,:,kbt,k,bid)  &
                           * SLX(:,:,kk,kbt,k,bid) * dz(k)
            WORK2(:,:,kk) = c2 * dzwr(k) * ( WORK1(:,:,kk)            &
              - KAPPA_THIC(:,:,ktp,k+1,bid) * SLX(:,:,kk,ktp,k+1,bid) &
                                            * dz(k+1) )
          end where
         end do
      end do

The optrpt now says 

LOOP BEGIN at test.F90(80,7)
   remark #15336: simd loop was not vectorized: conditional assignment to a scalar   [ test.F90(90,13) ]
   remark #13379: loop was not vectorized with "simd"
TimP
Honored Contributor III

This simd directive asks for vectorization over k, when you should be aiming to vectorize over the first dimension of work1 and work2. There are possibilities, but I'm away from my MIC box for a week plus. The merge intrinsic tends to be more effective than where, and you probably need do loops if simd directives are needed. Vector always can work with array assignment.
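A merge-based version of the first where assignment might look like this (a sketch; merge(tsource, fsource, mask) takes the new value where LMASK is .true. and the existing WORK1 value elsewhere, matching the semantics of the where block):

```fortran
      ! Same result as the where block: elements with LMASK .false.
      ! keep their previous WORK1 value (though they are re-stored).
      WORK1(:,:,kk) = merge( KAPPA_THIC(:,:,kbt,k,bid)             &
                             * SLX(:,:,kk,kbt,k,bid) * dz(k),      &
                             WORK1(:,:,kk), LMASK )
```

Note that, unlike where, this stores to every element of WORK1(:,:,kk); the stored values are unchanged where LMASK is .false., so it is only a difference in memory traffic, not in results.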

aketh_t_
Beginner

Hi guys,

With all your help I vectorized the loops at various levels. (Code attached)

I tried various configurations of the code.

1) !dir$ SIMD at the k loop, but the potential speedup was low, ~1.4. There was also this remark:

remark #15388: vectorization support: reference dz has aligned access   [ test.F90 (109,13) ]

 !dir$ SIMD   
        do k=1,km-1
        do kk=1,2

         do j=1,ny_block
         do i=1,nx_block

           LMASK(i,j) = TLT%K_LEVEL(i,j,bid) == k  .and.      &
                  TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid)  .and. &
                  TLT%ZTW(i,j,bid) == 1          

          !if ( LMASK(i,j) ) then 

            WORK1(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                           *  SLX(i,j,kk,kbt,k,bid) * dz(k)

            WORK3(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                           * SLY(i,j,kk,kbt,k,bid) * dz(k)

         enddo
         enddo

2) SIMD at the kk-level loop. Estimated potential speedup: 2.240.

3) SIMD at the j level. Estimated potential speedup: 0.720.

4) Split the loop across the LMASK and WORK calculations. The speedups were 1.120 and 0.290, respectively:

!dir$ SIMD  
         do j=1,ny_block
         do i=1,nx_block

           LMASK(i,j) = TLT%K_LEVEL(i,j,bid) == k  .and.      &
                  TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid)  .and. &
                  TLT%ZTW(i,j,bid) == 1

         enddo
         enddo

         !dir$ SIMD
         do j=1,ny_block
         do i=1,nx_block
          !if ( LMASK(i,j) ) then 

            WORK1(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                           *  SLX(i,j,kk,kbt,k,bid) * dz(k)

            WORK3(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                           * SLY(i,j,kk,kbt,k,bid) * dz(k)

          !endif


         enddo
         enddo

        enddo
        enddo

5) The best performance was with SIMD at the innermost loop, but (if I am right) that would waste time, as I do SIMD over a small amount of data and loop over it, meaning I actually lose performance for small gains over small data.

 do k=1,km-1
        do kk=1,2

         do j=1,ny_block
         !dir$ SIMD 
         do i=1,nx_block

           LMASK(i,j) = TLT%K_LEVEL(i,j,bid) == k  .and.      &
                  TLT%K_LEVEL(i,j,bid) < KMT(i,j,bid)  .and. &
                  TLT%ZTW(i,j,bid) == 1

         enddo
         enddo

         do j=1,ny_block
         !dir$ SIMD
         do i=1,nx_block
          !if ( LMASK(i,j) ) then 

            WORK1(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                           *  SLX(i,j,kk,kbt,k,bid) * dz(k)

            WORK3(i,j,kk) =  KAPPA_THIC(i,j,kbt,k,bid)  &
                           * SLY(i,j,kk,kbt,k,bid) * dz(k)

          !endif
         enddo
         enddo

        enddo
        enddo

Which of these versions should I use? Are there any other changes that can be made for improvement?

Also attached is the working code with one of the configurations.

jimdempseyatthecove
Honored Contributor III

Why are you producing LMASK and not using it?

Was it to avoid performing your WORK1 and WORK3 calculations, or is it used for something else later on? Does writing WORKn when the LMASK cell position is .false. break your program?

If you do not use LMASK elsewhere, and you do not use it to avoid WORK? calculations, then why produce LMASK (note the compiler might eliminate it as "dead code", but do not rely on it).

In the other forum where I suggested to you to try

if (LMASK(i,j)) WORK1(i,j,kk) = ...
if (LMASK(i,j)) WORK3(i,j,kk) = ...

As opposed to using one if test around the two statements, this was specifically structured to aid the compiler in determining that it can use masked vector stores, and thus stay in compliance with your original code, where the values of WORK1 and WORK3 were not modified in the .false. case of LMASK.
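Put back into the do loops posted above, that per-statement guard looks like this (a sketch; the bounds and indices are taken from the earlier snippets):

```fortran
!dir$ SIMD
      do j=1,ny_block
        do i=1,nx_block
          ! One guard per statement: each compiles to a vector test
          ! plus a masked store, leaving .false. elements untouched.
          if (LMASK(i,j)) WORK1(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) &
                                          * SLX(i,j,kk,kbt,k,bid) * dz(k)
          if (LMASK(i,j)) WORK3(i,j,kk) = KAPPA_THIC(i,j,kbt,k,bid) &
                                          * SLY(i,j,kk,kbt,k,bid) * dz(k)
        end do
      end do
```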

Note: though it seems like you are performing the test twice, each test is in fact performed a vector-width number of elements at a time in the same instruction(s), via the SIMD vector test and masked move. Having a single if statement enclosing the "block" of two statements makes it harder for the compiler to determine whether the code can be vectorized with a masked move.

Jim Dempsey

aketh_t_
Beginner

LMASK is used to perform the calculation of WORK1 and WORK3 conditionally.

I am avoiding it because:

1) Based on LMASK, I can always use the intrinsic merge afterwards and re-correct my data. If the conditional is present, the compiler wouldn't perform SIMD, as I have introduced conditionality.

2) Correct me if I am wrong: I always thought the intrinsic merge was faster than conditional SIMD.

 

TimP
Honored Contributor III

In this context, there may be no difference between if and merge. Either would require a simd or vector aligned directive to resolve "protects exception", as well as asserting sufficient alignment, if your array sizes and compiler options are correct for alignment.

jimdempseyatthecove
Honored Contributor III

Intrinsic merge is not necessarily faster nor better to use than a conditional SIMD masked store, and neither is conditional SIMD necessarily faster or better than intrinsic merge.

Intrinsic merge/store typically requires processing the output twice (after/during the mask being produced): once to produce the new data, and a second time for the merge. When these two operations can be performed in the same pass (via SIMD instructions), that is often (but not always) faster than using an intrinsic merge. The gating factors are the amount of work avoided in producing the intermediary results, and the overhead of parsing the data twice versus performing a masked move. Generally speaking, when the preponderance of the mask is .true. the SIMD route may be faster; when the preponderance of the mask is .false., the intrinsic merge may be faster. You, as the programmer, will have to make this decision based on representative input data.

Jim Dempsey

aketh_t_
Beginner

So,

what do I do next to improve performance?

Should I collapse the outer loops somehow and try OpenMP?

TimP
Honored Contributor III
If you could fuse those loops so that lmask could be a scalar private, an openmp parallel outer loop may be worth trying.