A loop was not parallelized but vectorized?

AThar2 · ‎03-15-2019

Given the following simple code

!DEC$ ATTRIBUTES FORCEINLINE :: RESET
      elemental subroutine RESET(this )
!DIR$ ATTRIBUTES VECTOR :: RESET
      implicit none

      class(t_sources), intent(inout)        :: this

!---- local
      integer                               :: i

!DIR$ VECTOR ALIGNED
      this% mass_n     = this% mass
!DIR$ VECTOR ALIGNED
      this% mom_n(:,1) = this% mom(:,1)
!DIR$ VECTOR ALIGNED
      this% mom_n(:,2) = this% momentum(:,2) 
!DIR$ VECTOR ALIGNED
      this% mom_n(:,3) = this% mom(:,3)

!---  initialise array
!DIR$ VECTOR ALIGNED
      this% mass      = 0.
!DIR$ VECTOR ALIGNED
      this% mom(:,1) = 0.
!DIR$ VECTOR ALIGNED
      this% mom(:,2) = 0.
!DIR$ VECTOR ALIGNED
      this% mom(:,3) = 0.

I get the following optimisation report (shown for line 16; the same pattern applies to the remaining lines)

LOOP BEGIN at test.f(16,7)
   remark #25399: memcopy generated
   remark #17104: loop was not parallelized: existence of parallel dependence
   remark #17106: parallel dependence: assumed OUTPUT dependence between this(:,2) (16:7) and this(:,2) (16:7)
   remark #17106: parallel dependence: assumed OUTPUT dependence between this(:,2) (16:7) and this(:,2) (16:7)
   remark #15542: loop was not vectorized: inner loop was already vectorized


   LOOP BEGIN at test.f(16,7)
      remark #15388: vectorization support: reference this(:,2) has aligned access
      remark #15388: vectorization support: reference this(:,2) has aligned access
      remark #15305: vectorization support: vector length 4
      remark #15300: LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 1 
      remark #15449: unmasked aligned unit stride stores: 1 
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 4 
      remark #15477: vector cost: 0.750 
      remark #15478: estimated potential speedup: 5.330 
      remark #15488: --- end vector cost summary ---
      remark #25015: Estimate of max trip count of loop=3
   LOOP END

Question A): How should I interpret the report saying that the loop was not parallelised but was vectorized? I am compiling with -parallel; doesn't that enable automatic vectorization? Can somebody please explain the difference between these two reports?

Question B): As you can see, I have the directive `!DIR$ ATTRIBUTES VECTOR :: RESET`. I am not sure when to use it and when not to. I know it means that the function becomes vectorized, but how would that differ from dropping the directive and using `!$omp simd` instead?

Thanks very much in advance

jimdempseyatthecove · ‎03-15-2019

Apparently you enabled auto-parallelization (as opposed to using OpenMP directives). Except in relatively simple programs, auto-parallelization tends to judge poorly whether parallelizing a given loop is worthwhile. If you want parallelization in anything more complex, introduce OpenMP directives at the points in the program where it makes sense to do so.

When you have ArrayX(:,i) = ArrayY(:,j), this copies column j of Y into column i of X...

With auto-parallelization enabled, the compiler will determine if the assignment statement could potentially be parallelized, and if so, insert code to ascertain at runtime (inside parallel version of memcopy) if the copy operation can benefit from parallelization, and if so do so. Note this is performed regardless of any other parallelization (OpenMP) that you have introduced into your code, and as such may be counter-productive.

Parallelization is different from vectorization. Parallelization, for example, partitions an iteration space for execution by multiple software threads (generally each on a different hardware thread). Vectorization processes multiple array elements in each instruction, and it can be used either by the non-parallel code or by each of the parallel threads on its own partition of the iteration space. Both parallelization and vectorization can have loop dependency issues. For parallelization these can occur at the junctures between partitions, and for vectorization at the (potential) confluence of adjacent vectors.
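
As a rough illustration of the distinction described above, the sketch below uses OpenMP directives rather than auto-parallelization. The program name, array names and sizes are hypothetical, and it assumes a compiler with OpenMP enabled (without it the directives are treated as comments and the loops simply run serially).

```fortran
program par_vs_vec
  implicit none
  integer, parameter :: n = 100000      ! hypothetical array length
  real :: a(n), b(n)
  integer :: i

  b = 1.0

  ! Vectorization only: a single thread walks the whole iteration space,
  ! processing several elements per instruction.
  !$omp simd
  do i = 1, n
     a(i) = 2.0 * b(i)
  end do

  ! Parallelization plus vectorization: the iteration space is partitioned
  ! among threads, and each thread vectorizes its own partition.
  !$omp parallel do simd
  do i = 1, n
     a(i) = a(i) + b(i)
  end do

  print *, 'a(1) =', a(1), ' a(n) =', a(n)
end program par_vs_vec
```

Both loops are free of loop-carried dependencies, which is why either transformation is safe here.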

If the compiler cannot establish that there will be no conflict, it will avoid vectorization and/or parallelization.
*** Note, if you explicitly state "do it anyway", the compiler will comply (at your own risk).
*** Note 2, the compiler isn't always correct, but it tends to err on the conservative side (giving you less performance than you might otherwise attain)

Jim Dempsey

TimP · ‎03-16-2019

You may need to tell us specifically which messages you want explained, and possibly refer to the full report from opt_report=4. "Inner loop was already vectorized" normally refers to nested loops, where the message confirms the desired result that only the inner loop is vectorized. The compiler could have found a way to combine loops which you didn't foresee. For example, it appears that you could have used

this% mom = 0

which causes the compiler to choose a single memset library function for the entire array, and not report vectorization.
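
A minimal sketch of the two forms side by side, assuming a stand-in derived type (t_sources_demo and the array shape are hypothetical, not the poster's actual t_sources). It only shows that both forms produce the same result; whether the compiler emits a memset for the whole-array form depends on the compiler and flags.

```fortran
program zero_demo
  implicit none
  type :: t_sources_demo                 ! hypothetical stand-in for t_sources
     real, allocatable :: mom(:,:)
  end type t_sources_demo
  type(t_sources_demo) :: s
  integer :: j

  allocate(s%mom(1000, 3))               ! hypothetical shape

  ! Column by column, as in the original post.
  s%mom = 1.0
  do j = 1, size(s%mom, 2)
     s%mom(:, j) = 0.
  end do
  print *, 'column-wise, all zero: ', all(s%mom == 0.)

  ! Whole-array assignment, as suggested above.
  s%mom = 1.0
  s%mom = 0.
  print *, 'whole-array, all zero: ', all(s%mom == 0.)
end program zero_demo
```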

The compiler appears to have decided that this%mom_n and this%momentum may overlap in an unfavorable way.

As Jim mentioned, apparently 2 copies of that loop were generated, with the overlaps checked at run time, so as to choose between the aggressively optimized and the conservative copy. We can't see from what you show whether there is likely to be an advantage in persuading the compiler not to make the 2 copies.

AThar2 · ‎03-17-2019

Thanks Jim. That cleared it up for me. I initially thought that -parallel enabled auto-vectorisation, not auto-parallelisation. However, I don't want the compiler to auto-parallelise, since parallelism will be handled explicitly by the programmer (with MPI, across both cores/threads and nodes). I have now removed the -parallel flag.

AThar2 · ‎03-17-2019

Hello Tim P.

Thanks for your reply. I set the report level to 5, so that should give all the information.
The reason I did not use 'this% mom = 0' is that the compiler still reported unaligned access even though I had the vec align flag.
What I ended up doing instead is

!DIR$ VECTOR ALIGNED
do i = 1,size_     ! size_ = no. of columns in this% mom
    this% mom(:,i) = 0.
enddo

That removed the unaligned access report. The question is whether this is efficient, now that there are no unaligned accesses and the loop is vectorized.

jimdempseyatthecove · ‎03-17-2019

>>by MPI - and this is both of cores/threads and nodes

When you have time, I suggest you consider programming MPI across nodes and OpenMP within each node. OpenMP has significantly lower latencies between threads within a node than MPI has within a node.

Jim Dempsey

AThar2 · ‎03-17-2019

Hello Jim,

Thanks for your advice.

While you are absolutely right, I have come across many papers, and there seems to be a consensus in the community I work in that OpenMP does not scale well to higher core counts.

Having said this, MPI has "relatively recently" introduced a hybrid approach in which features similar to OpenMP's appear to have been built into MPI: one-sided communication.

I am happy to see/read any other views on this matter

AThar2 · ‎03-17-2019

Okay, after some basic tests I realized a few things.

When I did :

do i = 1,size_ ! size_ = no. of columns in this% mom
!DIR$ VECTOR ALIGNED
  this% mom(:,i) = 0.
enddo

Yes, the compiler reported that mom(:,:) has aligned access. HOWEVER, I get a seg. fault when running it. I suppose this is because my 2D array is not fully aligned: this%mom(:,1) is aligned, but the columns from this%mom(:,2) onward are not.

The confusing bit is that when I allocate with 3 rows (as I initially did, i.e. this% mom(1:3,:)), the compiler seems to struggle to align everything at 32 bytes. However, with 4, 6, 8, 9 or 10 rows it is properly aligned, while with 3, 5 and 11 it is not (I only tested these values). It seems the columns can be aligned when the row count is a multiple of some number, and otherwise not?

In other words, when having 4,6,8,9,10 rows the optimization report says that the array has aligned access but with 3,5,11 says they are unaligned.

jimdempseyatthecove · ‎03-17-2019

>>I allocate with 3 rows (as initially did,i.e. this% mom(1:3,:)) the compiler seems to struggle aligning everything at 32 byte.

Only the first element of the array is aligned.

When you have an array of n groups of 3 (X, Y, Z), then depending on how you use the data, you will either allocate as:

(3,n) or (n,3)

And for the second allocation (n,3) you would allocate to (n+padd,3), where padd ranges from 0 to (number of cells in a cache line) - 1

IOW, with cells = byteSizeInCacheLine / sizeof(elementInArray): padd = mod(cells - mod(n, cells), cells)

In this manner, cells (1,1), (1,2) and (1,3) are aligned, thus permitting you to operate on all the X's, or Y's, or Z's as vectors.
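
The padding rule can be checked with a small sketch like the one below. It assumes 64-byte cache lines and 8-byte reals, and the names (pad_demo, plain, padded, report) and the value of n are hypothetical. It prints the byte offset of each column start relative to element (1,1), modulo the line size; it does not by itself guarantee that (1,1) is aligned, which typically needs a compiler-specific alignment directive or option.

```fortran
program pad_demo
  use iso_c_binding, only: c_loc, c_intptr_t, c_double
  implicit none
  integer, parameter :: line_bytes = 64                 ! assumed cache-line size
  integer, parameter :: elem_bytes = 8                  ! bytes per real(c_double)
  integer, parameter :: cells = line_bytes / elem_bytes ! elements per cache line
  integer, parameter :: n = 1003                        ! hypothetical particle count
  integer :: padd
  real(c_double), allocatable, target :: plain(:,:), padded(:,:)

  padd = mod(cells - mod(n, cells), cells)              ! always in 0 .. cells-1
  allocate(plain(n, 3), padded(n + padd, 3))
  plain  = 0.0_c_double
  padded = 0.0_c_double

  print '(a,i0)', 'padd = ', padd
  call report('unpadded', plain)
  call report('padded  ', padded)

contains

  ! Print each column's start offset from a(1,1), modulo the line size.
  subroutine report(label, a)
    character(*), intent(in) :: label
    real(c_double), intent(in), target :: a(:,:)
    integer(c_intptr_t) :: base, addr
    integer :: j
    base = transfer(c_loc(a(1,1)), base)
    do j = 1, size(a, 2)
       addr = transfer(c_loc(a(1,j)), addr)
       print '(a,a,i0,a,i0)', label, ' column ', j, ': offset mod line = ', &
             mod(addr - base, int(line_bytes, c_intptr_t))
    end do
  end subroutine report
end program pad_demo
```

With n = 1003 the unpadded columns 2 and 3 start at nonzero remainders, while every padded column starts on a multiple of the line size relative to (1,1).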

Jim Dempsey

AThar2 · ‎03-18-2019

Oh I see, thanks Jim. I have always tried to allocate so that components I know are frequently used together sit together in one loop iteration.

For example, I do allocate(uvar(3,ncell)), where uvar holds the velocity components in x, y, z. In C/C++ I would have gone the opposite way, because I was taught that Fortran lays out memory by running through the rows first.

Now when I am optimising my code to target vectorisation I am dealing with the alignment issue and from what I understand from your reply it seems to go the other way around.

What is your opinion on allocating three separate variables (1D arrays) u, v, w instead of uvar(3,ncell)? Would that be the best way to satisfy both concerns?

Since I am quite new to this, I do not even know how much speedup aligned data provides in big applications, although I keep reading in articles that it is a very important element of vectorisation.

jimdempseyatthecove · ‎03-18-2019

When you interact infrequently amongst objects, the order of the indices can favor (3,N). This places X, Y and Z together (in C indexing the proximity order is reversed).

However, in particle simulation, e.g. mass, position, velocity and force or acceleration (charge, etc...), one typically computes one particle versus the remaining particles. For this type of simulation, organizing the data as Pos(N,3) or X(N), Y(N), Z(N), dX(N),... facilitates vectorization, meaning all the particles' X values are contiguous, the Y values are contiguous, ...

Allocating 3 (6, 9) separate arrays eliminates the need to insert a padd, and helps the compiler make its optimization decisions.
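
A minimal sketch of the separate-array layout, assuming hypothetical names (soa_demo, x, y, z, r2) and made-up data: it computes the squared distance from particle 1 to every particle, so the inner loop reads each coordinate array with unit stride.

```fortran
program soa_demo
  implicit none
  integer, parameter :: n = 1024            ! hypothetical particle count
  real, allocatable :: x(:), y(:), z(:), r2(:)
  integer :: i

  allocate(x(n), y(n), z(n), r2(n))

  ! Made-up positions, only so the program has something to compute.
  do i = 1, n
     x(i) = real(i)
     y(i) = 2.0 * real(i)
     z(i) = 0.5 * real(i)
  end do

  ! One particle versus all particles: every array is read contiguously,
  ! so each vector load picks up neighbouring particles' X (or Y, or Z).
  do i = 1, n
     r2(i) = (x(i) - x(1))**2 + (y(i) - y(1))**2 + (z(i) - z(1))**2
  end do

  print *, 'largest squared distance from particle 1:', maxval(r2)
end program soa_demo
```

With uvar(3,ncell) the same loop would read each component with a stride of 3 elements, which is the access pattern the separate arrays avoid.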

See: https://software.intel.com/en-us/articles/peel-the-onion-optimization-techniques

Jim Dempsey