I suppose the first thing I would ask is why you are converting the code. If it ain't broke, don't fix it. Often, people who "modernize" Fortran code into what they think is good F90 misuse the language, adding assumed-shape arrays and array operations with abandon and then wondering why things are slower. Part of the problem is that what you THINK is a direct translation isn't. For example, the array assignment:
A = A + B
where A and B are arrays, is not semantically equivalent to:
DO I = 1, N
  A(I) = A(I) + B(I)
END DO
Instead, it is:
! Create temporary array TEMP
DO I = 1, N
  TEMP(I) = A(I) + B(I)
END DO
DO I = 1, N
  A(I) = TEMP(I)
END DO
! Destroy TEMP
Compilers have to recognize where these loops can be fused and the temporary discarded - sometimes it is easy, sometimes not.
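The case where the temporary really matters is when the two sides overlap. A minimal sketch (hypothetical program, not from the original post) of an overlapping section assignment:

```fortran
program overlap_demo
  implicit none
  real :: a(5) = [1.0, 2.0, 3.0, 4.0, 5.0]
  ! Array-assignment semantics: the whole right-hand side is evaluated
  ! before any element of the left-hand side is stored, so this shift
  ! works correctly even though the sections overlap.
  a(2:5) = a(1:4)
  print *, a   ! prints 1.0 1.0 2.0 3.0 4.0
  ! A naive left-to-right loop "DO I=2,5; A(I)=A(I-1); END DO" would
  ! instead propagate a(1) into every element.
end program overlap_demo
```

When the compiler can prove there is no overlap (as for distinct arrays A and B), it is free to fuse the loops and discard the temporary.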
Assumed-shape arrays give you flexibility you didn't have in F77, but there's no law saying you must use them. Use them where it makes sense, and not where it doesn't. In the case of arguments, you (the programmer) need to know where they are used, as an explicit interface is required; otherwise the program will not work. However, the compiler is generally quite good at optimizing in the presence of pointers, and I've seen quite a few performance comparisons showing at most a 2-5% penalty for using assumed-shape arrays, so this should not be a matter of great concern.
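The usual way to get that explicit interface is to put the procedure in a module. A minimal sketch (module and routine names are hypothetical):

```fortran
module vec_ops            ! hypothetical module name
  implicit none
contains
  subroutine add_in_place(a, b)
    real, intent(inout) :: a(:)   ! assumed shape: explicit interface required
    real, intent(in)    :: b(:)
    a = a + b
  end subroutine add_in_place
end module vec_ops

program demo
  use vec_ops             ! the USE statement makes the interface explicit
  implicit none
  real :: x(3) = [1.0, 2.0, 3.0], y(3) = [10.0, 20.0, 30.0]
  call add_in_place(x, y)
  print *, x              ! prints 11.0 22.0 33.0
end program demo
```

Calling such a routine without an explicit interface (e.g., as a bare external) is an error the compiler cannot always catch.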
I will agree that some things in the Fortran standard pose significant performance penalties, such as the Fortran 2003 way of doing array assignments with allocatable arrays, which requires the compiler to detect whether the left-hand side is already allocated to the correct shape and to reallocate it if not. The object-oriented features are also likely to be slow.
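A small sketch of what that F2003 semantics implies at run time (hypothetical example; if I recall correctly, Intel's compiler lets you toggle this behavior with -assume [no]realloc_lhs):

```fortran
program realloc_lhs_demo
  implicit none
  real, allocatable :: a(:)
  allocate(a(2))
  a = 0.0
  ! F2003 semantics: the shapes differ, so the compiler must notice this
  ! at run time, deallocate a, and reallocate it with the new shape.
  a = [1.0, 2.0, 3.0, 4.0]
  print *, size(a)   ! prints 4
end program realloc_lhs_demo
```

Every allocatable-LHS assignment carries this hidden shape check, even when the shapes never actually differ.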
I'm not saying that these issues couldn't be resolved, but doing so would inevitably increase the complexity of assumed-shape arrays. The committee instead decided to go with the simplest specification and TKR compatibility rule, hoping that compilers would eventually resolve the problems.
They have made significant progress. Out of curiosity, I recently developed a special code for calculating propeller performance in two parallel versions: one with assumed-shape arrays, and one with assumed-size and explicit-shape arrays. The code used many arrays of up to 5 dimensions. Eventually I gave up on the assumed-shape version: I needed to use slices of the arrays, and using storage association for the assumed-size version got quite clunky. But the core computation already worked, and I was quite surprised that with Intel at -O3 -ip -whateverElseIdontRemember there was no measurable performance penalty for the assumed-shape version.
Yes, assumed-shape arrays, due to their runtime inquiry of dimensions, imply some "abstraction" penalty. Your case, which can generally be called "tiny leading dimension", is among the worst. In the time I have used F95, I have gathered that the following might help:
1. When writing new code, consider storing and passing the arrays "big dimension first", i.e. (n,3) instead of (3,n), or even in the "matlabic" style xp(:), yp(:), zp(:) instead of p(:,:). You may be surprised that this is often an equally sensible approach (or more so when, e.g., z-coordinates are not used at all), and it enables you to use vector operations more heavily.
2. When you have an array p(:,:) and you know that size(p,1) == 3, write expressions such as
p(:,i) = 2*p(:,i) - p(:,j)
like this:
p(1:3,i) = 2*p(1:3,i) - p(1:3,j)
(often p(1:3,i) = 2*p(:,i) - p(:,j) suffices, but it looks asymmetric to me)
so that the compiler gets the information that it is a three-iteration loop and usually unrolls it completely. I have seen cases where this alone reduced the performance penalty from 130% to 20%.
3. Consider copying the parts of the arrays that are operated on in inner loops to local automatic variables.
4. Consider writing a "type wrapper" for the small vector, such as
TYPE vec3d
  REAL :: x(3)
END TYPE
and then use
type(vec3d), intent(inout) :: p(:)
p(i)%x = 2*p(i)%x - p(j)%x
This is usually an almost zero-penalty approach; the only drawback is that it prevents using p as a 2-D array (say, in a matrix multiplication). On the other hand, you may often find that there is more to add to the type than coordinates, which improves locality of reference.
5. If you don't need to pass various slices around, consider avoiding assumed-shape arrays altogether. Note that explicit-shape and assumed-size arrays are by no means obsolete features (assumed size was in F95, but this was corrected in F2003), and neither is the DO loop.
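The type-wrapper idea from tip 4 can be put together into a complete unit roughly like this (a minimal sketch; all module, type, and routine names are hypothetical):

```fortran
module vec_mod                      ! hypothetical names throughout
  implicit none
  type :: vec3d
    real :: x(3)
  end type vec3d
contains
  subroutine reflect(p, i, j)
    type(vec3d), intent(inout) :: p(:)
    integer, intent(in) :: i, j
    ! The compiler knows each %x has exactly 3 elements, so this
    ! tiny inner operation can be fully unrolled.
    p(i)%x = 2.0*p(i)%x - p(j)%x
  end subroutine reflect
end module vec_mod

program wrapper_demo
  use vec_mod
  implicit none
  type(vec3d) :: pts(2)
  pts(1)%x = [1.0, 1.0, 1.0]
  pts(2)%x = [0.0, 1.0, 2.0]
  call reflect(pts, 1, 2)
  print *, pts(1)%x                 ! prints 2.0 1.0 0.0
end program wrapper_demo
```

The fixed extent of x(3) is visible in the type itself, so it survives across the assumed-shape dummy argument p(:).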
Hope this helps
Jaroslav
I think Jaroslav diverged from the original topic, but I have to comment on the question of p(3,n) vs p(n,3). It is true, evidently, that the latter provides for more efficient vectorization on Xeon platforms. This is likely to be the most important consideration. However, assuming all 3 components are used together in a loop, it consumes 3 read- or write-combining buffers, which could become a problem when several such arrays are in use on a Xeon platform.
Certain legacy codes actually did give some thought to this, and may have been designed to perform well on platforms with extremely small caches (by present standards). So the considerations that were in force over 10 years ago differ from those of the present.
I realize that the flexibility offered by dynamic arrays loses a lot of the optimization opportunities you get with static dimensions. There can also be a huge performance hit due to aliasing if POINTER or TARGET attributes are used. These are two of the main things that made F77 faster than C.
In the example you give of A(:)=A(:)+B(:) versus a DO loop version, I don't see the distinction. I would expect a modern optimizer to produce the same result in either case, which may or may not include a temporary array. In other words, a vectorized DO loop seems just as likely to generate a temporary array.
As for my question about not being able to give a 2-dimensional allocatable a static first dimension, such as (3,:), it seems that there were good reasons to avoid the extra complication. However, you can always pass the array through an assumed-size subroutine interface, where the subroutine can have a fixed first dimension.
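That workaround might look like the following sketch (hypothetical names; the subroutine sees the allocatable (3,n) array through an implicit, assumed-size interface with the first dimension fixed at 3):

```fortran
program fixed_dim_demo
  implicit none
  real, allocatable :: p(:,:)
  real :: c(3)
  allocate(p(3,2))
  p(:,1) = [0.0, 0.0, 0.0]
  p(:,2) = [2.0, 4.0, 6.0]
  ! The allocatable array is passed by sequence association; inside
  ! centroid the first dimension is a compile-time constant.
  call centroid(p, 2, c)
  print *, c            ! prints 1.0 2.0 3.0
end program fixed_dim_demo

subroutine centroid(p, n, c)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: p(3,*)   ! fixed first dimension, assumed size
  real, intent(out)   :: c(3)
  integer :: i
  c = 0.0
  do i = 1, n
    c = c + p(1:3,i)
  end do
  c = c / real(n)
end subroutine centroid
```

Inside the subroutine the compiler can unroll the length-3 operations, because the leading extent is no longer a run-time quantity.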
Overall, I try to avoid the unnecessary dynamic arrays and complexity that are easy to introduce in F90. I suppose what I'm really wondering is which things have the largest effect. Am I avoiding some things that are not very significant, while missing something else that is?
Of course, it is all very compiler-dependent, and likely to change as F2003 and F2008 features are added. Maybe the thing to do is to start putting together some benchmark code that tries multiple forms of the same thing. This would also make it easy to assess how optimization affects them.
! Array-assignment version
subroutine test1(a, b, n)
  integer :: n
  real :: a(n), b(n)
  a = a + b
end subroutine test1

! Explicit DO-loop version
subroutine test2(a, b, n)
  integer :: n
  real :: a(n), b(n)
  integer :: i
  do i = 1, n
    a(i) = a(i) + b(i)
  end do
end subroutine test2
It seems obvious that most of the new array features are targeted at more "Matlab-like" array programming, where an occasional temporary is something you can live with and the innermost loops are mostly hidden in vector operations. This is, IMHO, why the intrinsics are array-valued functions rather than subroutines with output arguments, and why assumed-shape arrays are simple yet cause performance trouble in "low-level" code. The reward is that you can treat arrays much like scalars and let the compiler do its heavy optimizations. I suppose this can also make Fortran attractive for commercial compiler vendors.
The cool thing is that great care has been taken to make new-style coding compatible with the old style, so that there are no inevitable performance losses. Most experts recommend taking advantage of modern features to write clear code, and then using profilers and other tools to find the hot spots and optimize those (or replace them with tuned subprograms).
One of the deficiencies (IMHO) of F95/F2003 is that it is _impossible_ to tell whether an assumed-shape array is contiguous (or partially contiguous) and to write specialized code for that case, should it prove necessary. But note that most compiler vendors (including Intel) offer extensions like the LOC intrinsic that can be used to work around this. Intel is, in particular, great at optimizing modern Fortran code; nevertheless, it offers a plethora of extensions to make code faster should one feel smarter than the compiler.
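For what it's worth, F2008 later addressed exactly this gap with the CONTIGUOUS attribute and the IS_CONTIGUOUS inquiry function. A minimal sketch (requires an F2008 compiler, so it is not an F95/F2003 solution):

```fortran
program contig_demo
  implicit none
  real :: a(10)
  ! IS_CONTIGUOUS (F2008) reports whether an array or section occupies
  ! a single linear block of storage; strided sections do not.
  print *, is_contiguous(a(1:10))     ! prints T
  print *, is_contiguous(a(1:10:2))   ! prints F
end program contig_demo
```

With this inquiry, code can branch to a fast contiguous path and fall back to a general strided path otherwise.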
Many thanks to Intel for their Fortran compiler!
Jaroslav