array(1:npts,1) = density
array(1:npts,2) = pressure
etc.
and the newer code is set up with the opposite rank ordering, as
array(1,1:npts) = temperature
array(2,1:npts) = energy
etc.
The variable definitions (density, temperature, etc.) are different just to reinforce that they are not necessarily the same between codes. What I am trying to do is create a pointer so I can easily pass a reordered version of the new code's array to an established procedure from the old code, which expects the old data ordering. As an example (I know this won't compile, but it should illustrate the general idea):
pointer => (/ newarray(2,1:npts), newarray(5,1:npts), newarray(1,1:npts), etc. /)
call procedure(npts, pointer) ! (assume the number of variables is known in this procedure)
!
subroutine procedure(npts, oldarray)
real, dimension(npts,nvar), intent(inout) :: oldarray
I could reorder the variable definitions of the new code to match the old (which would be a big pain), but because of the algorithms the new code needs, I cannot change the rank ordering. I know you can create an array of derived-type pointers, but I'm pretty sure passing that to a procedure with either an implicit or an explicit interface won't work as desired.
The whole goal here is to pass our array to the old procedure elegantly and efficiently, without having to allocate a temporary array just to reorder our data (our only current option, which seems inefficient both computationally and memory-wise).
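For completeness, the fallback I'm trying to avoid looks roughly like this (the variable-to-row mapping here is made up): gather into a temporary with the old ordering, call the old procedure, then scatter the results back:
[fortran]
real, allocatable :: tmp(:,:)
allocate (tmp(npts, nvar))        ! temporary in the old code's ordering
tmp(:,1) = newarray(2, 1:npts)    ! gather one variable per column
tmp(:,2) = newarray(5, 1:npts)
tmp(:,3) = newarray(1, 1:npts)
! ... remaining variables ...
call procedure(npts, tmp)
newarray(2, 1:npts) = tmp(:,1)    ! scatter results back
newarray(5, 1:npts) = tmp(:,2)
newarray(1, 1:npts) = tmp(:,3)
[/fortran]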
Hardware vectorization (SSE, and now AVX) works only when the data are adjacent in memory. Depending on how your code is written, it may vectorize better one way or the other. If performance is an issue, it is recommended to arrange your data so that it vectorizes easily.
You will have to look at your code to see whether the innermost loop index predominantly runs through 1:npts or through the other index. If it is 1:npts, then I would suggest you rearrange your code.
Note that you can use the Fortran preprocessor to create a macro that swaps the indices on the array:
#define array_swapped(x,y) array(y,x)
Depending on assumptions made in your code the #define may or may not work for you.
If that doesn't work then you will have to copy your data to the other storage format (and back if necessary).
Jim Dempsey
What Jim Dempsey is trying to tell you is that the older code's layout is actually more efficient than the newer one's (especially when it comes to vectorization).
Fortran differs from C-like programming languages in that arrays are contiguous column-wise (instead of C's row-wise). That means array(1,1) and array(2,1) are adjacent in memory, but array(1,1) and array(1,2) are npts elements apart (i.e., there are npts-1 entries between them). Without some sort of optimization, the following
[fortran]
do i = 1, npts
  do j = 1, nvar
    array(i, j) = ...
  enddo
enddo
[/fortran]
is slower than the following
[fortran]
do j = 1, nvar
  do i = 1, npts
    array(i, j) = ...
  enddo
enddo
[/fortran]
So sticking to the older array layout is actually the "elegant" and "efficient" solution... and the necessary algorithms will benefit from adjusting to the ordering of the older code.
However, if you still insist: keep in mind that in your example, pointer points to an expression, and an expression doesn't have the TARGET attribute, so you would need to either store the expression in a variable with the TARGET attribute or create a function whose result is a pointer to a copy of the expression's value... Something like:
[fortran]
implicit none
integer :: i
integer, target :: a(10)
integer, pointer :: p(:)
a = [(i, i = 1, 10)]
p => dummy()
print *, p
contains
  function dummy() result(leaky)
    integer, pointer :: leaky(:)
    allocate (leaky(4))
    leaky = [a(3:4), a(7:8)]
  end function
end
[/fortran]
That, because of the allocation involved, still means you'd need a temporary array... Using the TRANSPOSE intrinsic would be faster and have the same effect (including the use of a temporary array).
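A minimal sketch of the TRANSPOSE route (procedure name and sizes are made up for illustration). Note that because the old routine's dummy argument is intent(inout), the TRANSPOSE expression itself cannot be passed directly, so the temporary appears explicitly and the results must be transposed back:
[fortran]
program demo
  implicit none
  integer, parameter :: npts = 4, nvar = 3
  real :: newarray(nvar, npts), tmp(npts, nvar)
  newarray = 1.0
  tmp = transpose(newarray)    ! temporary in the old ordering
  call old_procedure(npts, tmp)
  newarray = transpose(tmp)    ! copy results back
contains
  subroutine old_procedure(npts, oldarray)
    integer, intent(in) :: npts
    real, dimension(npts, nvar), intent(inout) :: oldarray
    oldarray = oldarray * 2.0
  end subroutine
end program
[/fortran]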
The reason I can't switch the ranks in the new code is that its structure is completely different. The old code generally performs a specific loop for each specific variable, making its ordering the most efficient for its algorithms. In the new code, however, most routines look somewhat like
do loop over faces
!
iel = left element on face
ier = right element on face
!
densityL = unkno(1,iel)
uvelL = unkno(2,iel)
vvelL = unkno(3,iel)
wvelL = unkno(4,iel)
energyL = unkno(5,iel)
!
densityR = unkno(1,ier)
uvelR = unkno(2,ier)
vvelR = unkno(3,ier)
wvelR = unkno(4,ier)
energyR = unkno(5,ier)
!
compute fluxes for face from variables
!
end of loop over faces
This ordering is necessary because the new code is for unstructured grids, where the most efficient way to access data and compute fluxes is over the faces. The old code is for structured grids and can take advantage of the ordered nature of the grid.
So basically our code is set up so that the innermost rank is the one accessed in the innermost loop (although here it isn't really a loop). Also, to help with memory access, the ordering of the elements and faces of the unstructured grid is optimized to avoid cache misses.
The main reason I am trying to avoid creating this temporary array is that it might be needed many times over numerous routines for each time step. I could create a not-so-temporary array to use throughout, but we are talking millions of double-precision elements times 20+ variables. That redundant array just doesn't seem very appealing once it gets into the GB range.
We do have some preprocessing statements, mostly for selecting a serial or parallel implementation, but the old code currently uses cpp. Could you explain a little more about the macro (I have no experience here), and would it work when passing the array to the older procedures?
The Fortran preprocessor is similar to cpp. Usually, you add the -fpp switch to the compile line (or use a different file extension) and change any relevant INCLUDE to cpp-style "#include". In your code, you need to change the occurrences of "array" to "array_swapped", but only the ones where you want the macro to be applied. For example:
[fortran]
#define array_swapped(x,y) array(y,x)
implicit none
integer :: i, j, array(5,5)
do i = 1, 5
  array_swapped(i,:i-1) = 0
  do j = i, 5
    array_swapped(i,j) = i * j
  enddo
enddo
print '(5I3)', (array(i,:), i = 1, 5)
end
[/fortran]
To see what the preprocessor does, use the -save-temps switch. For further details, refer to the Preprocessing section in the compiler's documentation.
do loop over faces
!
iel = left element on face
ier = right element on face
!
densityL = unkno(1,iel)
uvelL = unkno(2,iel)
vvelL = unkno(3,iel)
wvelL = unkno(4,iel)
energyL = unkno(5,iel)
!
densityR = unkno(1,ier)
uvelR = unkno(2,ier)
vvelR = unkno(3,ier)
wvelR = unkno(4,ier)
energyR = unkno(5,ier)
!
compute fluxes for face from variables
!
end of loop over faces
The above code (if it accounts for the preponderance of your run time) would indicate that you do indeed want to reorganize the data.
*** However ***
Look at the indexing arrangement of iel and ier, and at the code that computes the fluxes.
By using the unkno(iel,n) and unkno(ier,n) format, where n takes constant values (1,2,3,4,5), the loop indexes iel and ier will reference adjacent data as the loop(s) increment, assuming sequential increments (code not shown). In that case unkno(iel,n) and unkno(ier,n) may vectorize quite nicely on x64.
*** If iel and ier follow non-incrementing sequences, then unkno(n,iel) and unkno(n,ier) "might" be better. You will have to run some tests to determine which way is faster.
Not showing the loop indexing and array references makes it difficult for us to offer constructive advice.
Jim Dempsey
do loop over faces
process some of 5 properties for two faces
end do
do loop over faces
process remainder of 5 properties for two faces
end do
The above code can reduce register pressure and improve cache hit ratios.
It adds only the trivial computational overhead of incrementing the integer loop variables one more time.
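As a sketch of the idea (the sums and array names here are made up), fissioning one loop with many live values into loops that each keep fewer values live:
[fortran]
! fused: s1, s2 and four array streams are live in every iteration
do i = 1, n
  s1 = s1 + a(i) * b(i)
  s2 = s2 + c(i) * d(i)
end do
! fissioned: each loop touches fewer values, easing register pressure
do i = 1, n
  s1 = s1 + a(i) * b(i)
end do
do i = 1, n
  s2 = s2 + c(i) * d(i)
end do
[/fortran]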
Jim Dempsey
Also, the cells attached to a face are not necessarily consecutive; in fact, they rarely will be. If this were a structured grid, it would form a nicely ordered sparse matrix that would definitely be better with the ranks switched, but with an unstructured grid the sparse matrix is very unordered.
Here are two sparse matrices from two very small simple 2D unstructured grids to illustrate the cells that are adjacent to a particular cell.
Remember these are 2D and a 3D grid would be a more realistic problem and be much more complex.
I haven't ever come across the term register pressure before. A quick search suggests it is something that could have a significant effect on performance. Does that mean I should try to limit (as much as possible) the number of variables accessed in one specific loop to 16 integers and 16 doubles?
The CPU has a small number of registers. On the Intel64 platform there are 16 integer registers (some reserved: instruction pointer, stack pointer, ...), 8 legacy FPU floating-point registers, and 16 SSE registers (128 bits wide: 4-wide floats or 2-wide doubles, also usable for bytes, words, dwords, and qwords). If/when you upgrade to Sandy (or Ivy) Bridge with AVX, the AVX registers are 256 bits wide (8-wide floats, 4-wide doubles, etc.).
What this width provides is that, if you can organize your program so that the same operations are performed on adjacent data, and if the compiler can recognize this, then the compiler will generate code that performs 2, 4, or 8 arithmetic operations per instruction. It is very important to pay attention to laying out your data to take advantage of vectorization.
Back to register pressure. The compiler's optimizations can generate code that assigns local variables and redundant expressions to available registers (and later reassigns different local variables and/or redundant expressions to those registers). Depending on the data type, the assignments go to the integer, FPU, or SSE/AVX registers.
If you have a loop that manipulates a large number of data items per iteration, then the register pressure question becomes which of those items and redundant expressions are placed into which register, and which get reassigned later in the loop. Too many items, too few registers.
If you can identify data (and the expressions manipulating those data) in the loop that are independent of the other data and expressions (or only loosely dependent), then the loop can be divided into two loops, each performing part of the computation. Those two loops can each be examined to see whether they can be divided further, and so on. Each reduction of code in the loop decreases the register pressure. N.B. compiler optimization has come a long way, and for some (a few) loops the compiler will automatically break one loop into multiple loops.
When loop division is possible, also consider how you lay out your data so that it becomes suitable for vectorization.
Caution: the layout decision for vectorization cannot be made by looking at a single loop. You must look at the overall effect on the program, since improving vectorization in one area can decrease it in another. (There is no simple solution to this conflict of interests.)
For vectorization and register pressure consider
do i=1,n
do j=1,m
a(j,i) = b(j,i) * c(j,i) ! inner index first
end do
end do
In Fortran, adjacent indices in the leftmost position are usually adjacent in memory, so loops that pass over arrays in that manner (leftmost index varying fastest) are good candidates for vectorization. IOW, if m=1000 and the array type is REAL(4), the compiler may generate code that performs the inner statement on 4 elements at once, reducing the inner loop iteration count to 250. The compiler may then unroll the loop, not to reduce the cost of incrementing and testing the loop counter, but to overlap the memory fetches and stores with the multiplications. To do this bit of magic the compiler uses registers (SSE or AVX). Pseudo code:
loop:
  xmm0 = b(index)      ! fetch 4 floats from b
  xmm1 = c(index)      ! while above is fetching, fetch from c
  xmm2 = b(index+4)    ! fetch next 4 floats from b
  xmm3 = c(index+4)    ! while above is fetching, fetch from c
  xmm4 = b(index+8)
  xmm5 = c(index+8)
  xmm6 = b(index+12)
  xmm7 = c(index+12)
  xmm0 = xmm0 * xmm1   ! first fetch likely complete by now (data ready in xmm0 and xmm1)
  ! above is 4 multiplications at one time
  xmm2 = xmm2 * xmm3   ! while the above multiplication is working, start the next
  a(index) = xmm0      ! interleave store with multiply
  xmm4 = xmm4 * xmm5   ! next multiply
  a(index+4) = xmm2
  xmm6 = xmm6 * xmm7
  a(index+8) = xmm4
  a(index+12) = xmm6
  index = index + 16
  if (index .lt. limit) goto loop
Something like the above will be generated.
Note that the Fortran do loop did not specify the use of registers, but the compiler must generate code that uses them. The simple loop above (pseudo code) was using half of the available SSE registers. Generally speaking, the more code in the loop, the harder it is to use the registers efficiently in a manner that overlaps memory (cache) fetches and stores with arithmetic operations.
Paying attention to these issues now (something you may not be eager to do) could have an impact of 2x or more on the performance of the code. So ask yourself: how many hours will you or your users spend waiting for results over the lifetime of this program? Now weigh that against your effort to do the program well.
Jim Dempsey