Alignment and qopt-assume-safe-padding

JeffS · 08-05-2023

I'm working with an older Fortran simulation code and trying to modernize it and improve its performance where I can. Right now I am looking at the fact that the program uses one large allocatable vector (call it C) to house ~15 large 2D matrices using indexing offsets. In most places the array segments are passed into subroutines as assumed-size 1D vectors.

My thought is that it would be beneficial to align each array within C using compiler directives and/or flags (!dir$ attributes align:64 ::). My initial idea was to make a type that contains the arrays as allocatables and use the directive to align the first element of each variable to a 64-byte boundary. I think this should be fairly straightforward, other than having to also use the assume aligned directive in any subroutine that the individual variables are passed into. This would align the start of each array correctly.

What about the ability to align the end of the array by adding padding? There is the -qopt-assume-safe-padding compilation flag, but I don't see a clean way to implement it with allocatable 2D arrays.

Typing this out now... To get alignment of the first element and have padding at the end... It seems easiest to keep all of these 2D arrays in the large 1D array, keep the indexing scheme, but add extra padding between variables in C such that:

The first element of each subarray is aligned on a 64-byte boundary.
There are at least 64 bytes of garbage space in the overall array before the next array, so assume-safe-padding can be used.
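A minimal sketch of the layout described in these two points, assuming that the first element of the blob is itself on a 64-byte boundary and that the alignment is a whole multiple of the element size. The module and function names (`blob_layout`, `padded_offsets`) are made up for illustration and are not from the original code.

```fortran
module blob_layout
  implicit none
contains
  ! Start index (1-based) of each subarray inside the blob.
  ! Assumptions (hypothetical helper, not from the original code):
  !   - blob(1) itself sits on an align_bytes boundary
  !   - align_bytes is a multiple of elem_bytes
  pure function padded_offsets(sizes, elem_bytes, align_bytes) result(first)
    integer, intent(in) :: sizes(:)      ! element count of each subarray
    integer, intent(in) :: elem_bytes, align_bytes
    integer :: first(size(sizes) + 1)    ! last entry = total blob length + 1
    integer :: per_line, i, next
    per_line = align_bytes / elem_bytes  ! elements per alignment unit
    first(1) = 1
    do i = 1, size(sizes)
      ! one past the end of subarray i, plus one full unit of guard space
      next = first(i) + sizes(i) + per_line
      ! round the 0-based position up to a multiple of per_line, back to 1-based
      first(i + 1) = ((next - 1 + per_line - 1) / per_line) * per_line + 1
    end do
  end function padded_offsets
end module blob_layout
```

With 4-byte reals and 64-byte alignment, every returned start index is one more than a multiple of 16, and each subarray is followed by at least 16 unused elements. The last entry minus one is the length to allocate for the blob.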

Any thoughts on the above?

Some links I looked over about this...

https://www.intel.com/content/www/us/en/developer/articles/technical/utilizing-full-vectors.html

https://community.intel.com/t5/Intel-Fortran-Compiler/Issue-with-the-alignment-of-the-components-of-derived-types/m-p/1072598

jimdempseyatthecove · 08-08-2023

Fortran allocation is such that an entire allocation is contiguous, be it 1D or 2D (multi-dimensional). Fortran stores as:

array(row,col) ! (minor index, major index): the first index varies fastest

C/C++ allocation of a multi-dimensional array, say 2D, is such that the base pointer points to an array of pointers, each of which can be separately allocated .OR. be a placeholder within a single allocation. C/C++ stores as:

array[row][col] // [major index][minor index]

Therefore, if you desire to have the start of each major index on an aligned boundary then

Fortran: attribute the array for the alignment desired .AND. ensure that the size of the minor index, in bytes, is a multiple of the desired alignment (IOW place the pad here).

C/C++: allocate each major index separately with the alignment desired .OR. aligned-allocate a single blob with a size of (number of minor indices rounded up to the next multiple of the desired alignment) * number of major indices (then build the table of pointers for the major index).

Note, in both cases, the programmer must independently know how many of the minor indices contain valid data.

*** .AND. in the Fortran sections of code you must code with this knowledge. IOW SIZE(array, 1) returns the count of data elements plus pad, if any, as opposed to the count of valid data elements along dimension 1.

This is to say in both languages it is your coding responsibility to attain your desired alignments.
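As a rough illustration of the Fortran case, the sketch below rounds the first extent up so that every column starts on the desired boundary, and loops over the valid rows only. The names (`padded_columns`, `padded_extent`, `column_sums`) are hypothetical, and it assumes the alignment is a whole multiple of the element size.

```fortran
module padded_columns
  implicit none
contains
  ! Smallest extent >= n whose size in bytes is a multiple of align_bytes.
  ! Assumes align_bytes is a multiple of elem_bytes.
  pure integer function padded_extent(n, elem_bytes, align_bytes) result(np)
    integer, intent(in) :: n, elem_bytes, align_bytes
    integer :: per_line
    per_line = align_bytes / elem_bytes
    np = ((n + per_line - 1) / per_line) * per_line
  end function padded_extent

  ! Sum each column over the valid rows only; rows n_used+1 .. size(a,1)
  ! are pad and must not be treated as data.
  pure function column_sums(a, n_used) result(s)
    real, intent(in) :: a(:,:)
    integer, intent(in) :: n_used
    real :: s(size(a, 2))
    integer :: j
    do j = 1, size(a, 2)
      s(j) = sum(a(1:n_used, j))
    end do
  end function column_sums
end module padded_columns
```

The array itself would be allocated with the padded extent, for example allocate(a(padded_extent(n, 4, 64), m)), after giving it the alignment attribute (!dir$ attributes align:64 :: a with the Intel compiler). The valid row count n has to be carried separately.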

Jim Dempsey

JeffS · 08-09-2023

Thanks Jim. That all makes sense.

In regards to the code I am working with that stores ~15 arrays in one large variable... I made a small function which takes in the desired alignment (64 bytes), the bytes per element, and the desired index. It rounds the desired index up to the next aligned index in the overall array. At the point of use I am using the assume aligned directive and it seems to be working. Implementation was pretty easy because of how the code was originally put together with all of the offsets. Tracking down all the locations for the directives will be a bit annoying though...
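A sketch of what the point of use might look like, assuming the Intel compiler's ASSUME_ALIGNED directive and an assumed-size dummy argument. The routine name `scale_segment` and its loop body are placeholders. The directive is a promise to the compiler, so the caller must only pass segments that really start on a 64-byte boundary; compilers that do not recognize the directive treat it as a comment.

```fortran
module aligned_kernels
  implicit none
contains
  ! Placeholder kernel working on one segment of the blob.
  subroutine scale_segment(x, n, a)
    integer, intent(in) :: n
    real, intent(inout) :: x(*)   ! assumed-size segment of the blob
    real, intent(in) :: a
    integer :: i
    ! Promise to the compiler (Intel-specific): x(1) is 64-byte aligned.
    ! This is not checked; passing a misaligned segment is an error.
    !DIR$ ASSUME_ALIGNED x: 64
    do i = 1, n
      x(i) = a * x(i)
    end do
  end subroutine scale_segment
end module aligned_kernels
```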

As far as getting column slices to align, I might look at manipulating the actual number of unknowns when the problem is formulated. Not sure I really have the time to delve into that though. The code is hard to follow.

jimdempseyatthecove · 08-10-2023

Consider using C_LOC together with C_F_POINTER to construct a Fortran array descriptor.

...
use, intrinsic :: iso_c_binding, only: C_LOC, C_F_POINTER
...
! big blob holding several arrays; TARGET is required for C_LOC
real, allocatable, target :: Blob(:)
...
real, pointer :: pSome2Darray(:,:)
! fill the following in:
integer :: offsetSome2Darray, sizeWithPaddDim1Some2Darray, sizeDim2Some2Darray
...
! build a 2D array descriptor over the blob, starting at the given offset
call C_F_POINTER( &
  C_LOC(Blob(offsetSome2Darray)), &
  pSome2Darray, &
  [sizeWithPaddDim1Some2Darray, sizeDim2Some2Darray])
...
val = expr + pSome2Darray(j, i)
...

Be aware that SIZE(pSome2Darray, 1) returns the padded size of dimension 1 and not the used size (you can create a separate variable for holding this value). More likely you will have a user-defined type containing the appropriate values and the pointer. This will make initialization much easier.
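One possible shape for such a user-defined type, as a sketch only. The type and routine names (`padded2d`, `make_view`) are invented here. It assumes default REAL works with C_LOC and C_F_POINTER on the compiler in use, as in the snippet above, and that the actual blob argument has the TARGET attribute so the pointer stays valid after the call.

```fortran
module blob_views
  use, intrinsic :: iso_c_binding, only: c_loc, c_f_pointer
  implicit none
  ! Hypothetical container: the view plus the count that SIZE cannot give.
  type :: padded2d
    real, pointer :: p(:,:) => null()  ! view into the blob, pad rows included
    integer :: n_used = 0              ! valid rows; size(p,1) is the padded count
  end type padded2d
contains
  ! Point v%p at blob(offset:) viewed as an (n_pad x n_cols) array.
  ! The actual argument for blob must itself have the TARGET attribute.
  subroutine make_view(blob, offset, n_used, n_pad, n_cols, v)
    real, target, contiguous, intent(inout) :: blob(:)
    integer, intent(in) :: offset, n_used, n_pad, n_cols
    type(padded2d), intent(out) :: v
    call c_f_pointer(c_loc(blob(offset)), v%p, [n_pad, n_cols])
    v%n_used = n_used
  end subroutine make_view
end module blob_views
```

Loops over the view would then run over 1:v%n_used in the first dimension rather than over SIZE(v%p, 1).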

Jim Dempsey

JeffS · 08-11-2023

I figured out how to change the number of unknowns during the problem setup such that my columns will be byte aligned... I think. I still have to check the end result for consistency and that I haven't ballooned the actual problem size too much. The size I had to find was the next common multiple of my byte alignment factor and another number. In some cases I could see it increasing the unknowns by a lot, but for problems that are already large I don't think it will make too much difference. At the point of use I will use "!DIR$ ASSUME (MOD(<ROWS>,8) .EQ. 0)" to let the compiler know the columns are aligned. I am hoping to see a better looking opt-report file.
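A sketch of that size calculation, reading "next common multiple" as the smallest common multiple of the two factors that is not below the original count. That reading is an assumption, as are the names `problem_sizing` and `next_common_multiple`; the second factor b stands in for the unspecified "another number".

```fortran
module problem_sizing
  implicit none
contains
  ! Greatest common divisor by Euclid's algorithm (a, b > 0).
  pure integer function gcd(a, b) result(g)
    integer, intent(in) :: a, b
    integer :: r, t
    g = a
    r = b
    do while (r /= 0)
      t = mod(g, r)
      g = r
      r = t
    end do
  end function gcd

  ! Smallest common multiple of a and b that is >= n (n, a, b > 0).
  pure integer function next_common_multiple(n, a, b) result(m)
    integer, intent(in) :: n, a, b
    integer :: l
    l = (a / gcd(a, b)) * b          ! least common multiple of a and b
    m = ((n + l - 1) / l) * l        ! round n up to a multiple of it
  end function next_common_multiple
end module problem_sizing
```

For example, with an alignment factor of 8 elements and a second factor of 6, a count of 100 unknowns grows to 120, which shows how the padding can become noticeable when the least common multiple is large relative to the problem size.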

Interesting on the C_LOC and C_F_POINTER. I've never tried to use those before. For this problem I think it might be difficult to implement. Although the blob is storing a lot of "2D" arrays, many of the routines index into them as if they are vectors. The indexing isn't a simple 1:n unfortunately. I would think the column padding would get in the way there. It's definitely a technique I may consider in some hotspots in the code though.

This is all in an attempt to speed up a particularly slow function with 4 nested do loops. I may make a separate post to gather any thoughts on speeding that up.

jimdempseyatthecove · 08-14-2023

>>This is all in an attempt to speed up a particularly slow function with 4 nested do loops. I may make separate post to gather any thoughts on speeding that up.

I had a very nice (comprehensive) article posted on the Intel Articles archive... but it is not available or reachable using the prior linkage. The title had "peel-the-onion". A summary is attached.

The principle to follow is that, upon observing non-optimal code generation in multi-dimensional code, you can regain optimization by peeling the onion (so to say). What this means is: if you have, as an example, three-dimensional arrays indexed using three indices within a loop where the code generation is found to be non-optimal, optimization may be obtained by peeling the rightmost index off of the array(s) by passing a two-dimensional slice of the three-dimensional array(s) to a subroutine, and if that does not yield the optimization, by peeling a one-dimensional slice off the three-dimensional array(s), or off the once-peeled two-dimensional array.

do k=1, nK
  do j=1, nJ
    do i=1, nI
       ...
       result(i,j,k) = nonOptimalExpression(i,j,k)
       ...
    end do
  end do
end do

! becomes:
do k=1, nK
  call peelK(result(:,:,k), someArray(:,:,k), otherArray(:,:,k))
end do
...
contains
subroutine peelK(result, someArray, otherArray)
real, dimension(:,:) :: result, someArray, otherArray
  do j=1, nJ
    do i=1, nI
       ...
       result(i,j) = nonOptimalExpression(i,j)
       ...
    end do
  end do
end subroutine peelK

Or, if necessary, peel off two of the rightmost indices, either in the original main code or within the peelK subroutine.
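For reference, a compilable variation of the peel, as a sketch. It differs from the outline above in that the peeled routine takes explicit-shape dummies with the extents passed in, and the loop body (2*a + b) is only a placeholder for the real expression. The names `peel_demo`, `update3d` and `peel_k` are made up for the example.

```fortran
module peel_demo
  implicit none
contains
  ! Outer routine: keeps only the k loop and hands each 2D slice down.
  subroutine update3d(res, a, b)
    real, intent(out) :: res(:,:,:)
    real, intent(in)  :: a(:,:,:), b(:,:,:)
    integer :: k
    do k = 1, size(res, 3)
      ! a k-slice is contiguous when the caller passed whole arrays
      call peel_k(size(res, 1), size(res, 2), &
                  res(:,:,k), a(:,:,k), b(:,:,k))
    end do
  end subroutine update3d

  ! Peeled routine: sees plain 2D arrays with known extents.
  subroutine peel_k(ni, nj, res, a, b)
    integer, intent(in) :: ni, nj
    real, intent(out) :: res(ni, nj)
    real, intent(in)  :: a(ni, nj), b(ni, nj)
    integer :: i, j
    do j = 1, nj
      do i = 1, ni
        res(i, j) = 2.0 * a(i, j) + b(i, j)   ! placeholder expression
      end do
    end do
  end subroutine peel_k
end module peel_demo
```

Whether this form vectorizes better than the original three-level loop depends on the compiler and the real expression, so the optimization report is the thing to check.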

Sometimes the compiler needs a little assistance.

*** Please note, the attached excerpt covers many more optimization considerations in the program changes...

... as well as the much larger gains that were attained from them.

I suggest you start with the above, then consider what is presented in the attached document.

Jim Dempsey