Using pointer to pass 2D array as 1D reordered array

scspiegel · ‎06-21-2011

I'm working with two CFD codes and I am trying to integrate the newer code into the older code. However, the older code has its arrays structured

array(1:npts,1) = density
array(1:npts,2) = pressure
etc.

and the newer code is set up opposite in the rank ordering as

array(1,1:npts) = temperature
array(2,1:npts) = energy
etc.

The variable definitions (density, temperature, etc.) are different just to reinforce that they are not necessarily the same between codes. What I am trying to do is create a pointer to easily pass a reordered version of the array in the new code to an established procedure from the old code with its data ordering. As an example (I know this won't compile but it should illustrate the general idea)

pointer => (/ newarray(2,1:npts),newarray(5,1:npts),newarray(1,1:npts),etc. /)
call procedure(npts,pointer) !(assume number of variables is known in this procedure)
!
subroutine procedure(npts,oldarray)
real, dimension(npts,nvar), intent(inout) :: oldarray

I could reorder the variable definitions of the new code to match the old (would be a big pain), but due to the necessary algorithms of the new code, I cannot change the rank ordering. I know you can create a derived datatype pointer array but I'm pretty sure passing that to either an implicit or explicit interfaced procedure won't work as desired.

The whole goal here is to be able to elegantly and efficiently pass our array to the old procedure without having to allocate a temporary array just to reorder our data (our only current option which seems computationally and memory-ily(?) inefficient).

jimdempseyatthecove · ‎06-21-2011

In FORTRAN, the left most array index represents the closest data. IOW array(i,j) and array(i+1,j) represent adjacent data (unless explicity declared with a stride other than 1).

Hardware vectorization (SSE and now AVX) works only when the data is adjacent. Depending on how your code is written it may vectorize better one way or the other. If performance is an issue then it is recommended to arrange your data such that it easily vectorizes.

You will have to look at your code to see if your code predominantly has the inner most loop index running through the 1:npts or running through the other index. If 1:npts then I would suggest you rearange your code.

Note, you can use the Fortran PreProcessor to create a macro to swap the indicies on the array

#define array_swapped(x,y) array(y,x)

Depending on assumptions made in your code the #define may or may not work for you.

If that doesn't work then you will have to copy your data to the other storage format (and back if necessary).

Jim Dempsey

John4 · ‎06-21-2011

What Jim Dempey's trying to tell you is that the older code is actually more efficient than the newer code (more so when it comes to vectorization).

Fortran differs from C-like programming languages in that arrays are contiguous column-wise (instead of C's row-wise). That means array(1,1) and array(2,1) are contiguous but array(1,1) and array(1,2) are npts apart (i.e., there are npts-1 entries between them). Without some sort of optimization, the following

[fortran]do i = 1, npts
   do j = 1, nvar
      array(i, j) = ...
   enddo
enddo
[/fortran]

Is slower than the following

[fortran]do j = 1, nvar
   do i = 1, npts
      array(i, j) = ...
   enddo
enddo
[/fortran]

So sticking to the older array is actually the "elegant" and "efficient" solution... And the necessary algorithms will benefit from adjusting to the ordering of the older code.

However, if you still insist, keep in mind that in your example, pointer is pointing to an expression, and as an expression it doesn't have the TARGET attribute ---so you'd need to store the expression in a variable with the TARGET attribute or creating a function whose result is a pointer to a target of the expression... Something like:

[fortran]implicit none
integer :: i
integer, target :: a(10)
integer, pointer :: p(:)

a = [(i, i = 1, 10)]
p => dummy()
print *, p

contains
    function dummy() result(leaky)
        integer, pointer :: leaky(:)
        allocate (leaky(4))
        leaky = [a(3:4), a(7:8)]
    end function        
end
[/fortran]

That, because of the allocation involved, still means you'd need a temporary array... Using the TRANSPOSE intrinsic would be faster and have the same effect (including the use of a temporary array).

scspiegel · ‎06-22-2011

The reason I can't switch the ranks on the new code is because the structure of the new code is completely different. The old code generally performsa specific loop for each specific variable making its ordering the most efficient for its algorithms. However in the new code, most routines look somewhat like

do loop over faces
!
iel = left element on face
ier = right element on face
!
densityL = unkno(1,iel)
uvelL = unkno(2,iel)
vvelL = unkno(3,iel)
wvelL = unkno(4,iel)
energyL = unkno(5,iel)
!
densityR = unkno(1,ier)
uvelR = unkno(2,ier)
vvelR = unkno(3,ier)
wvelR = unkno(4,ier)
energyR = unkno(5,ier)
!
compute fluxes for face from variables
!
end of loop over faces

This ordering is necessary because the new code is for unstructured grids, where the most efficient way to access data and compute fluxes is over the faces. The old code is for structured grids and can take advantage of the ordered nature of the grid.

Sobasically our code is set up so the inner most rank is the one accessed in the inner most loop (although here it isn't really a loop). Also for help with memory access, the ordering of the elements and faces for the unstructured gridis optimized toavoid cache misses.

The main reason I am trying to prevent creating this temporary array is because it might be neededmany times over numerous routines for each time step. I could create a not-so-temporary array to use throughout but we are talking millions of double precision elements * 20+ variables. That redundant array just doesn't seem very appealing to me when it starts to get into the GB range.

We do have some pre-processing statements, mostly for the selection of a serial or parallel implementation, but it (the old code)currently uses cpp. Could you explain a little more about the macro (no experience here) and would that work with passing the array to the older procedures?

John4 · ‎06-22-2011

The Fortran preprocessor is similar to cpp. Usually, you add the -fpp switch to the compile line (or use a different file extension) and change any relevant INCLUDE to cpp-style "#include". In your code, you need to change the occurrences of "array" to "array_swapped", but only the ones where you want the macro to be applied. For example:

[fortran]#define array_swapped(x,y) array(y,x)

implicit none

integer :: i, j, array(5,5)

do i = 1, 5
    array_swapped(i,:i-1) = 0
    do j = i, 5
        array_swapped(i,j) = i * j
    enddo
enddo

print '(5I3)', (array(i,:), i = 1, 5)
end
[/fortran]

To see what the preprocessor does, use the -save-temps switch. For further details, refer to the Preprocessing section in the compiler's documentation.

jimdempseyatthecove · ‎06-22-2011

Quoting scspiegel

do loop over faces
!
iel = left element on face
ier = right element on face
!
densityL = unkno(1,iel)
uvelL = unkno(2,iel)
vvelL = unkno(3,iel)
wvelL = unkno(4,iel)
energyL = unkno(5,iel)
!
densityR = unkno(1,ier)
uvelR = unkno(2,ier)
vvelR = unkno(3,ier)
wvelR = unkno(4,ier)
energyR = unkno(5,ier)
!
compute fluxes for face from variables
!
end of loop over faces

The above code (if the preponderance of your run time) would indicate that you indeed want to reorganized the data.

*** However ***

Look at the indexing arrangement of iel and ier and the code in the comput fluxes.

By using unkno(iel,n) and unkno(ier,n) format where n has constant (1,2,3,4,5) values, then the loop variable indexes iel and ier will represent adjacent data as the loop(s) increment assuming sequential increments (code not shown). And in which case unkno(iel,n) and unkno(ier,n) may vectorize quite nicely on x64.

*** if iel and ier follow non-incrementing sequences then unkno(n,iel) and unkno(n,ier) "might" be better. You will have to run some tests to determine which way is better.

Not showing the loop indexing and array references makes it difficult for us to offer constructive advice.

Jim Dempsey

jimdempseyatthecove · ‎06-22-2011

I forgot to add. Your code sketch indicates you are processing 5 properties for two faces per loop iteration. This requies at least 10 variables (high register pressure amonst 16 available). You will find that if you can structure

do loop over faces
process some of 5 properties for two faces
end do
do loop over faces
process remainder of 5 properties for two faces
end do

That the above code reduces register pressure and improves cache hit ratios

It will add the trivial compuation overhead of incrementing integer register variables one more time.

Jim Dempsey

scspiegel · ‎06-23-2011

I used 5 properties per cell just as a simple example. In reality, there are n number of chemical species, a summation of the species densities, 3 velocities, energy, temperature, pressure, enthalpy, and speed of sound. However, that is just for a simple inviscid flow and viscosity adds another two properties and turbulence another three on top of those. So depending on the number of species required, we are generally talking 15-20 propertiesper cell, if not more.

Also, the cells attached to a face are not necessarily consecutive, in fact, rarely will they be. If this was a structured grid, it would form a nice ordered sparse matrix that would definitely be better if the ranks were switched but with an unstructured grid the sparse matrix is very unordered.

Here are two sparse matrices from two very small simple 2D unstructured grids to illustrate the cells that are adjacent to a particular cell.

Remember these are 2D and a 3D grid would be a more realistic problem and be much more complex.

I haven't ever come across the term register pressure before. I did a quick search and this seems to be something that could have a significant affect on performance. Does that mean that I should try to limit (as much as possible) the number of variables being accessed in one specific loop to 16 integers and 16 doubles?

jimdempseyatthecove · ‎06-23-2011

Register pressure (rough explanation)

The CPU has a small number of registers. On Intel64 platform there are 16 integer registers, some reserved (instruction pointer, stack pointer,...), 8 legacy FPU floating point registers, and 16 SSE registers (128-bits wide for 4-wide floats, or 2-wide doubles, and can also be used for bytes, words, dwords, qwords), and if/when you upgrade to Sandy (or Ivy)Bridge with AVX the AVX registers are 256-bits wide (8-wide floats, 4-wide doubles, etc...)

What the wideness provides for you is if you can organize your program such that same operations can be performed on adjacent data, and if the compiler can recognize this,then the compiler will generate code to perform 2, 4, 8 arithmatic operations per instruction. It is very important to pay attention to laying out your data to take advantage of vectorization.

Back to register pressure. The compiler optimizations can generate code that will reassign local variables and redundant expressions into available registers (then reassign different local variables and/or redundant expressions to available registers). Depending on the data type the assignments are to the integer, FPU, SSE/AVX registers.

If you have a loop that manipulates a large number of data points per iteration then the register pressurebecomes which of these points and redundant expressions are to be placed into which register and then which get reassigned later in the loop. Too many points too few registers.

If you can identify data (and expressions manipulating those data) in the loop that are independent of other data and expressions (or very loosly dependent), then the loop can bedivide into two loops, each to be perform part of the computation. Then those two loops can each be examined to see if they can be divided,... Each reduction of code in the loop decreases the register pressure. N.B. I should mention that compiler optimizations has come a long way, and for some (a few) loops it will automaticaly breakup one loop into multiple loops.

When loop division is possible, then consider how you lay out your data such that it becomes suitable for vectorization.

Caution -- The decision for layout for vectorization cannot be made by looking at a single loop. You must look at the overall effect on the program as improving vectorization in one area can decrease vectorization in other areas. (no simple solution to this conflict of interests.)

For vectorization and register pressure consider

do i=1,n
do j=1,m
a(j,i) = b(j,i) * c(j,i) ! inner index first
end do
end do

In FORTRANadjacent indecies on theleft most index are usually in adjacent memory and therefore loops passing over those arrays in that manner (left most varying fastest)are good candidates for vectorization. IOW, if m=1000 and the array types were REAL(4) then the compiler may generate code to perform the inner statement on 4 variables at once and reduce the inner loop iteration count to 250 times. Then the compiler may unroll the loop, not to reduce the code expense of incrimenting and testing the loop counter, but to overlap the memory fetches and stores with multiplication. To do this bit of magic the compiler will use registers (SSE or AVX). pseudo code:

loop:
xmm0 = b(index) ! fetch 4 floats from b
xmm1 =c(index) !while above is fetching, fetch from c
xmm2 = b(index+4) ! fetch 4 floats from b
xmm3 =c(index+4) !while above is fetching, fetch from c
xmm4 = b(index+8) ! fetch 4 floats from b
xmm5 =c(index+8) !while above is fetching, fetch from c
xmm6 = b(index+12) ! fetch 4 floats from b
xmm7 =c(index+12) !while above is fetching, fetch from c
xmm0 = xmm0 * xmm1 !first fetch likely completeby now (data ready in xmm0 and xmm1)
! above is 4 multiplications at one time
xmm2= xmm2 * xmm3 ! while the above multiplication is working start the next
a(index) = xmm0 ! interleave store with multiply
xmm4 = xmm4 * xmm5 ! next multiply
a(index+4) = xmm2
xmm6 = xmm6 * xmm7
a(index+8) = xmm4
a(index+12) = xmm6
index = index + 16
if(index .lt. limit) goto loop

Something like the above will be generated.

Note that the FORTRAN do loop did not indicate use ofregisters but the compiler must generate code touse the registers. The above simple loop (pseudo code)was using 1/2 of the available SSE registers. Generally speaking, the more code in the loop, theharder it is to efficiently use the registers in a manner that overlays memory(cache) fetch and stores with arithmatic operations.

Paying attention to theseissues now (something you may not be eager to do now) could have an impact of 2x or more on performance of code. So you have to ask yourself: How many hours will you or your users be waiting for results? (Through the lifetime of this program.) Now weigh this againstyour efforts to do the program well.

Jim Dempsey

TimP · ‎06-23-2011

In addition to the question of register pressure for loops which combine many operations, there is the question of how many simultaneous read and write streams (distinct sequences of cache lines) are supported by the hardware. Write performance begins to degrade beyond 8 streams (from Woodcrest on). If you store at most 8 results per loop trip, with sequential storage (first subscript incrementing by 1), there would be no problem. Beyond that, you should pay closer attention.