Intel® Fortran Compiler

"Best" code form for ensuring vectorization?

Ioannis_K_
New Contributor I

Hello,

 

I have a (probably basic and very easy) question about effective and efficient code vectorization with the Fortran compiler.

 

I have the following line of code, which computes a vector a from six other vectors b, c, d, e, f and g (all vectors have a dimension of 256):

 

a(:) = b(:)*c(:) + d(:)*e(:) + f(:)*g(:)

 

When I build the code, I see that the optimization report mentions that this line of code has been vectorized. 

My question is: should I restructure my code to get the maximum possible speedup from vectorization? For instance, would I get a better vectorized program if I computed vector a in 3 stages, as follows:

 

a(:) = b(:)*c(:)
a(:) = a(:) + d(:)*e(:)
a(:) = a(:) + f(:)*g(:)

 

Also, does it make any difference if I use array syntax such as "a(:) = b(:)*c(:)", instead of writing an explicit do loop:

do i = 1,256
      a(i) = b(i)*c(i) + d(i)*e(i) + f(i)*g(i)
end do

 

 

Many thanks in advance for the help.

 

24 Replies
TobiasK
Moderator

Hi @Ioannis_K_ 

In addition to what Ron and Jim already said (unaligned loads are not always that bad), I would highly recommend looking at VTune to see what your actual bottleneck is. If it's memory bandwidth, vectorization will not be of much help. A roofline model will show you how far you can get. In your current example you are limited by loading the large arrays; if you see a memory load bandwidth that is close to peak, everything is fine.
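
As a rough illustration with the statement from the original question, a(:) = b(:)*c(:) + d(:)*e(:) + f(:)*g(:), and assuming single precision: each element takes about 5 floating-point operations but 7 memory accesses (6 loads and 1 store), i.e. 28 bytes of traffic for 5 flops. So the loads and stores dominate the arithmetic no matter how well the arithmetic itself is vectorized.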



Looking at your code raises a few questions:

1) Why do you zero f and then perform an update in the subsequent loop? That code does not make much sense; write directly to f, without zeroing and without updating.

2) You access the arrays nx, ny, nz multiple times across both loops. If you can merge the loops, you might be able to keep the values of nx, ny, nz in registers.

3) Similar to 2): if you need the a* arrays only for updating/calculating f, it should be possible to keep those values in registers as well (see the sketch after this list).

4) OMP PARALLEL DO SIMD does not make much sense for Nvec1=128; Ron was referring to just OMP SIMD, without PARALLEL DO.

5) Would it be possible to remove the nvec/nvec1 construct and just let the compiler unroll the inner loop?
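
To illustrate points 2) and 3) in general terms, here is a minimal sketch of merging the work into one loop. The array names, the weights w1..w3 and the loop body are only placeholders, not the actual code from this thread:

subroutine fused(f, nx, ny, nz, w1, w2, w3, n)
implicit none
integer, intent(in) :: n
real, intent(in)  :: nx(n), ny(n), nz(n), w1, w2, w3
real, intent(out) :: f(n)
integer :: i
! Instead of one loop that fills temporaries ax/ay/az and a second loop
! that sums them into f, do everything in one pass: nx(i), ny(i), nz(i)
! and the intermediate products never leave registers, and f is written
! exactly once (no zeroing, no read-modify-write update).
do i = 1, n
  f(i) = nx(i)*w1 + ny(i)*w2 + nz(i)*w3
end do
end subroutine fused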

 

Best
Tobias

 

Ioannis_K_
New Contributor I

Yes, the configuration is already set to x64. 

jimdempseyatthecove
Honored Contributor III

If possible, change your J loop to something like

J2 = 1
DO J=1+iel, Nvec+iel
  nx1(J2) = nx(J) ! eliminate the IEL+J
  J2 = J2 + 1
end do

 

And remove the assume(mod...

 

Jim Dempsey

Ron_Green
Moderator

There may be some assumptions you are making that should be examined.

1) Not all unaligned accesses are a bad thing, and some are necessary for better performance.

 

Consider your example:

 

  1 subroutine vec
  2 implicit none
  3 !dir$ attributes align:64 :: nx1, nx
  4 real, allocatable :: nx1(:), nx(:)
  5 integer :: J
  6 integer :: Nvec1
  7 integer :: iel
  8 
  9 iel = 4
 10 Nvec1 = 128
 11 allocate ( nx1(Nvec1), nx(Nvec1) )
 12 
 13 !dir$ assume_aligned nx:64, nx1:64
 14 do J=1,Nvec1-iel
 15   nx1(J) = nx(iel+J) + 42.0
 16 end do !J
 17 
 18 end subroutine vec

 

 

First, in the OPTRPT you need to ignore the PEEL and REMAINDER loops. Those accesses may be unaligned, especially in the PEEL loop; its intent is to get to vector alignment for the kernel loop.
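
To make the PEEL/KERNEL/REMAINDER split concrete, here is a hand-written sketch of how such a decomposition could look for the loop above. The peel count of 2 and the vector length of 4 are placeholders; the compiler derives both from the actual addresses and the target instruction set:

subroutine peel_kernel_remainder(nx1, nx, n, iel)
implicit none
integer, intent(in) :: n, iel
real, intent(in)  :: nx(n)
real, intent(out) :: nx1(n)
integer :: j, npeel, nkern
integer, parameter :: vl = 4   ! assumed vector length in elements

! PEEL: a few scalar iterations so the stores to nx1 reach an aligned boundary
npeel = 2                      ! placeholder; really derived from the address of nx1
do j = 1, npeel
  nx1(j) = nx(iel+j) + 42.0
end do

! KERNEL: the bulk of the work in blocks of vl elements (the aligned vector loop)
nkern = npeel + ((n - iel - npeel) / vl) * vl
do j = npeel+1, nkern, vl
  nx1(j:j+vl-1) = nx(iel+j:iel+j+vl-1) + 42.0
end do

! REMAINDER: whatever is left over, done scalar (or masked)
do j = nkern+1, n-iel
  nx1(j) = nx(iel+j) + 42.0
end do
end subroutine peel_kernel_remainder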

 

Focus on line 15. You do understand that the assignment to nx1(J) requires a WRITE, and a WRITE is more expensive in cycles than a READ. You know this, correct? So the compiler will favor optimizing the write to NX1.

So it will try to align the accesses to NX1.

Now, if it does this, what about NX(J+iel)? Vectorization transforms the loop into one that increments by the vector length (in elements). To help visualize this, consider an unrolled loop that would look something like this (4-element vectorization):

 

do J=1,Nvec1-iel, 4
  nx1(J) = nx(iel+J) + 42.0
  nx1(J+1) = nx(iel+J+1) + 42.0
  nx1(J+2) = nx(iel+J+2) + 42.0
  nx1(J+3) = nx(iel+J+3) + 42.0
end do

 

Obviously this is NOT the code the compiler creates, but it does run the loop in blocks of the vector length, like 4 above. The loop body is reduced to vector instructions, in this case handling 4 elements at a time in the vector registers.

So we do aligned accesses for NX1. But if we do that, we need 4 elements of NX for the data to assign to those 4 elements of NX1:

   nx(J+iel), nx(J+iel+1), nx(J+iel+2), nx(J+iel+3)

 

It's the 'iel' that is the problem. Unless iel is 0 or a multiple of 4, the accesses to those 4 elements of NX will be split across 2 cache lines, assuming a 4-element cache line (actual vector lengths depend on hardware vectorization support; I'm using 4 elements, i.e. 128-bit vectorization, here as a simple example). And since iel is a VARIABLE, the compiler cannot ASSUME any value for it, so it has to choose an unaligned access. It simply has to.

 

Several things here: the compiler uses a "cost model" that looks at ALL the accesses in your loop. For each one it has an idea of how costly it is to load and/or store that data. It balances those accesses to determine which variable(s) are best accessed aligned, and which others, if they are NOT on the same boundaries, to access unaligned. The cost model picks the best case based on its assumptions. The assumption in this example is that iel is not known beforehand, so the compiler doesn't worry about it and just uses unaligned loads/stores. As Jim mentioned, on any modern Xeon or Core i5/i7/i9 these unaligned loads/stores are really not all that much more expensive than aligned ones. There are probably other things in your code to worry about.

 

Hopefully this helps. Now, if you can make 'iel' a PARAMETER at compile time, that constant will let the compiler understand nx(J+iel) far better, which may change the cost model and hence the vectorization.
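
A minimal sketch of that variation on the example above (the values 4 and 128 are just the same placeholders as before):

subroutine vec_const
implicit none
!dir$ attributes align:64 :: nx1, nx
real, allocatable :: nx1(:), nx(:)
integer :: J
integer, parameter :: iel = 4      ! now a compile-time constant
integer, parameter :: Nvec1 = 128

allocate ( nx1(Nvec1), nx(Nvec1) )

!dir$ assume_aligned nx:64, nx1:64
do J = 1, Nvec1-iel
  nx1(J) = nx(iel+J) + 42.0        ! the offset of nx(iel+J) is now known to the compiler
end do

end subroutine vec_const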

 

It's complicated in real-world code. My takeaway: compiler vectorization experts tune the cost model over many years. These people get paid big dollars to make the compiler as fast as possible, often as fast as assembly language. You should not second-guess the compiler, but you can make its job easier: align your data and use ASSUME_ALIGNED or SIMD ALIGNED at the do loops; use constants for loop bounds and steps if possible; or look into hint directives like !DIR$ LOOP COUNT to tell the compiler how many iterations to expect (average, min or max). LOOP COUNT can prevent serial versions of a loop (loop multi-versioning) from being created in some cases, alignment helps prevent PEEL loop creation, and constants can prevent REMAINDER loops if the loop trip count is a multiple of the vector length. All little helpers for the compiler.
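
And a minimal sketch of those helpers applied to the loop from the original question. The 64-byte alignment is an assumption about how the arrays are allocated, and OMP SIMD needs -qopenmp or -qopenmp-simd when compiling:

subroutine combine(a, b, c, d, e, f, g)
implicit none
real, intent(out) :: a(256)
real, intent(in)  :: b(256), c(256), d(256), e(256), f(256), g(256)
integer :: i

!dir$ loop count (256)                  ! hint: expected trip count
!$omp simd aligned(a,b,c,d,e,f,g:64)    ! promise (assumed here): all arrays are 64-byte aligned
do i = 1, 256
  a(i) = b(i)*c(i) + d(i)*e(i) + f(i)*g(i)
end do
end subroutine combine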
