Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Why is Fortran9x slower than Fortran77?

joseph-krahn
New Contributor I
1,271 Views
Fortran Standards have recently added some sane features that start to make it useful as a real programming language, not just a number cruncher. However, I am always surprised that converting ugly old F77 code to decent F9x almost always makes it slower, even in cases where the F77 logic looks rather convoluted.

Some of this comes from fixed array dimensions being faster than variable sized arrays. It seems obvious that there should have been a dynamic array size that only varies in the last dimension, so that xyz coordinates can be dimension(3,:). You can work around this using a procedure with assumed size dimension(3,*). Sometimes this makes things faster.

Another thing that I wonder about is possible overhead from generating array descriptors. Depending on the types of actual and dummy arguments, a descriptor might be required instead of a simple memory reference. How much overhead can this add? Maybe it would be good to have a compiler flag that warns when this occurs, like the array temp warning. It may not have a big effect in most cases, but it could add up for a procedure called many times.

Are there other places where F9x adds overhead? I hope that sub-classed types will be implemented with speed in mind; they could potentially add a lot of overhead as well.
0 Kudos
8 Replies
joseph-krahn
New Contributor I
AARGH: I used the plain-text editor to avoid Firefox/Java conflicts with the fancy editor interface. It jammed multiple paragraphs into one big paragraph. I'll try a different editor interface next time; there are several.
Steven_L_Intel1
Employee

I suppose the first thing I would ask is why are you converting the code? If it ain't broke, don't fix it. Often, people who "modernize" Fortran code into what they think is good F90 misuse the language, adding assumed-shape arrays and array operations with abandon and then wondering why things are slower. Part of the problem is that what you THINK is a direct translation isn't. For example, the array assignment:

A = A + B

where A and B are arrays, is not semantically equivalent to:

DO I=1,N
  A(I) = A(I)+B(I)
END DO

instead, it is:

Create temporary array TEMP
DO I=1,N
  TEMP(I) = A(I)+B(I)
END DO
DO I=1,N
  A(I) = TEMP(I)
END DO
Destroy TEMP

Compilers have to recognize where these loops can be fused and the temporary discarded - sometimes it is easy, sometimes not.
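A hypothetical case (not from the original post) where the temporary genuinely matters is an assignment whose right-hand side overlaps the left-hand side, for example a shifted slice of the same array:

```fortran
! Sketch: the RHS reads elements that the LHS overwrites, so the
! array-assignment rule "evaluate the RHS fully before storing"
! actually changes the answer. A naive element-by-element loop
! would pick up already-updated values of a.
program overlap_demo
  implicit none
  real :: a(5) = [1.0, 2.0, 3.0, 4.0, 5.0]
  ! Correct semantics: a becomes 1 3 5 7 9.
  ! A naive in-place loop would instead give 1 3 6 10 15.
  a(2:5) = a(2:5) + a(1:4)
  print *, a
end program overlap_demo
```

Here the compiler cannot fuse the loops and drop the temporary (or must reverse the traversal), which is exactly the kind of analysis the reply above describes.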

Assumed-shape arrays give you flexibility you didn't have in F77, but there's no law saying you must use them. Use them where it makes sense, and not where it doesn't. In the case of arguments, you (the programmer) need to know where they are used, as an explicit interface is required; otherwise the program will not work. However, the compiler is pretty good in general at optimizing in the presence of pointers, and I've seen quite a few performance comparisons showing at most a 2-5% penalty for using assumed-shape arrays, so this should not be a matter of great concern.
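A minimal sketch of the explicit-interface requirement mentioned above (the module and procedure names here are illustrative): putting the procedure in a module supplies the interface automatically.

```fortran
! Sketch: an assumed-shape dummy argument requires an explicit
! interface. The easiest way to provide one is a module procedure;
! USE association then gives every caller the interface.
module array_ops
  implicit none
contains
  subroutine add_arrays(a, b)
    real, intent(inout) :: a(:)   ! assumed shape: bounds travel in a descriptor
    real, intent(in)    :: b(:)
    a = a + b
  end subroutine add_arrays
end module array_ops

program demo
  use array_ops                   ! without this, the call would be invalid
  implicit none
  real :: x(4) = [1.0, 2.0, 3.0, 4.0]
  real :: y(4) = [1.0, 1.0, 1.0, 1.0]
  call add_arrays(x, y)
  print *, x
end program demo
```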

I will agree that some of the things in the Fortran standard pose significant performance penalties, such as the Fortran 2003 way of doing array assignments with allocatable arrays, which requires the compiler to detect whether the left-hand side is already allocated to the correct size and to reallocate it if it is not. The object-oriented features are also likely to be slow.
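The F2003 (re)allocation-on-assignment behavior described above can be sketched as follows (note that compiler support and default flags for this feature vary; with ifort of that era it was enabled via an option such as -assume realloc_lhs):

```fortran
! Sketch of F2003 (re)allocation on assignment: every assignment to
! an allocatable LHS implies a hidden run-time shape check, and a
! deallocate/reallocate when the shapes differ.
program realloc_demo
  implicit none
  real, allocatable :: a(:)
  a = [1.0, 2.0, 3.0]   ! LHS unallocated: allocate with size 3
  a = a * 2.0           ! shapes match: no reallocation needed
  a = [4.0, 5.0]        ! shape differs: deallocate and reallocate
  print *, size(a)      ! now 2
end program realloc_demo
```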

hajek
Beginner
The reason why assumed-shape arrays work the way they do (all dimensions deferred) is primarily simplicity. Allowing explicit and assumed shape to be intermixed could give more information to the compiler; however, it would raise a number of complications: what can be passed to a dummy argument specified as dimension(3,:)? Only something dimensioned x(3,:), or can it be x(n,m) where n _happens_ to be 3? Can you use dimension(3,:,m,:)? etc.
I'm not saying that these issues couldn't be resolved, but they would inevitably increase the complexity of assumed-shape arrays. The committee instead decided to go with the simplest specification and the TKR compatibility rule, hoping that compilers would eventually resolve the problems.
They have made significant progress. I recently (out of curiosity) developed a special code for calculating propeller performance in two equivalent versions - one with assumed-shape arrays, and one with assumed-size and explicit-shape arrays. The code used many arrays of up to 5 dimensions. Eventually I gave up on the assumed-shape version - I needed to use slices of the arrays, and using storage association for the assumed-size version got quite clunky. But the core computation already worked, and I was quite surprised that with Intel at -O3 -ip -whateverElseIdontRemember there was no measurable performance penalty for the assumed-shape version.
Yes, assumed-shape arrays, due to their runtime inquiry of dimensions, imply some "abstraction" penalty. Your case, which can generally be called "tiny leading dimension", is one of the worst. In the time I have used F95, I have gathered that the following might help:
1. when writing new code, consider storing and passing the arrays "big dimension first", i.e. (n,3) instead of (3,n), or even in the "matlabic" style xp(:), yp(:), zp(:) instead of p(:,:). You might be surprised that this is often an equally (or more, when, e.g., z-coordinates are not used at all) sensible approach, and it enables you to use vector operations more heavily.
2. when you have an array p(:,:), and you know that size(p,1) == 3, write expressions such as
p(:,i) = 2*p(:,i) - p(:,j)
like this
p(1:3,i) = 2*p(1:3,i) - p(1:3,j)
(or p(1:3,i) = 2*p(:,i) - p(:,j) often suffices, but looks asymmetric to me)
so that the compiler gets the information that it is a 3-iteration loop and usually unrolls it completely. I have seen cases where this alone reduced the performance penalty from 130% to 20%.
3. consider copying the parts of the arrays that are operated on in inner loops to local automatic variables.

4. consider writing a "type wrapper" for the small vector, such as
TYPE vec3d
real:: x(3)
END TYPE
and then, use
type(vec3d),intent(inout):: p(:)
p(i)%x = 2*p(i)%x - p(j)%x
this is usually an almost zero-penalty approach; the only drawback is that it prevents using p as a 2-d array (say, in a matrix multiplication). On the other hand, you may often find that there is more to add to the type than coordinates, which improves locality of reference.


5. If you don't need to pass various slices around, consider avoiding assumed-shape arrays altogether. Note that explicit-shape and assumed-size arrays are by no means obsolete features (assumed size was in F95, but this was corrected in F2003), and neither is the DO loop.
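Tips 2 and 4 above can be sketched together (the program and variable names here are made up for illustration):

```fortran
! Sketch of tips 2 and 4: spell out the constant extent so the
! compiler sees a fixed 3-iteration loop, or wrap the small vector
! in a derived type so the extent 3 is part of the type itself.
program tips_demo
  implicit none
  type vec3d
    real :: x(3)
  end type vec3d
  real :: p(3, 2)
  type(vec3d) :: q(2)
  integer :: i, j

  p = 1.0
  i = 1; j = 2
  ! Tip 2: p(1:3,i) instead of p(:,i) - the extent is now a
  ! compile-time constant, so the loop can be fully unrolled.
  p(1:3,i) = 2*p(1:3,i) - p(1:3,j)

  ! Tip 4: the type wrapper fixes the extent by construction.
  q(1)%x = [1.0, 2.0, 3.0]
  q(2)%x = [0.5, 0.5, 0.5]
  q(i)%x = 2*q(i)%x - q(j)%x
  print *, p(:,1), q(1)%x
end program tips_demo
```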

Hope this helps
Jaroslav
TimP
Honored Contributor III

I think Jaroslav diverged from the original topic, but I have to comment on the question of p(3,n) vs p(n,3). It is true, evidently, that the latter provides for more efficient vectorization on Xeon platforms. This is likely to be the most important consideration. However, assuming all 3 components are used together in a loop, it consumes 3 read- or write-combine buffers, which could become a problem when several such arrays are in use on a Xeon platform.

Certain legacy codes actually did give some thought to this, and may have been designed to perform well on platforms with extremely small caches (by present standards). So, the considerations that were in force over 10 years ago are different from those of the present.

joseph-krahn
New Contributor I
There are a few different reasons to 'modernize' code. The most obvious is to get rid of hacks like the tricks used to obtain dynamic memory allocation in F77, which are not very portable. There is also a big benefit in cleaning up ugly F77 code that you want to build on and keep maintainable, especially if the F77 code is already buggy.

I realize that the flexibility offered by dynamic arrays loses a lot of the optimization opportunities that you get from static dimensions. There can also be a huge performance hit due to aliasing if pointer or target attributes are used. These are the two main things that made F77 faster than C.

In the example you give of "A(:)=A(:)+B(:)" versus a DO loop version, I don't see the distinction. I would expect a modern optimizer to produce the same result in either case, which may or may not include a temporary array. In other words, a vectorized DO loop seems just as likely to generate a temporary array.

As for my question about not having a static first dimension for a 2-dimensional allocatable, such as '(3,:)', it seems that there were good reasons to avoid the extra complication. However, you can always pass the array through an assumed-size subroutine interface, where the subroutine can have a fixed first dimension.
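The workaround described above might be sketched like this (the routine and variable names are illustrative): the allocatable (3,n) array is passed by sequence association to a dummy whose first extent is fixed at compile time.

```fortran
! Sketch: a (3,n) allocatable array passed through an assumed-size
! dummy with a fixed first dimension, so the callee sees the
! compile-time extent 3 rather than a deferred bound.
subroutine scale_coords(p, n, f)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: f
  real, intent(inout) :: p(3, *)   ! first dimension fixed, last assumed
  integer :: i
  do i = 1, n
    p(1:3, i) = f * p(1:3, i)
  end do
end subroutine scale_coords

program workaround_demo
  implicit none
  real, allocatable :: xyz(:,:)
  allocate(xyz(3, 4))
  xyz = 1.0
  ! Sequence association: only an address is passed, no descriptor,
  ! and no explicit interface is required for an assumed-size dummy.
  call scale_coords(xyz, 4, 2.0)
  print *, xyz(:, 1)
end program workaround_demo
```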

Overall, I try to avoid unnecessary dynamic arrays and complexity that are easy in F90. I suppose what I'm really wondering is which things have the largest effect. Am I avoiding some things that are not very significant, while missing something else that is significant?

Of course, it is all very compiler dependent, and likely to change when F2003 and F2008 features are added. Maybe the thing to do is to start putting together some benchmark code that tries multiple forms of the same thing. This would also make it easy to assess how optimization affects them.

joseph-krahn
New Contributor I
To test the difference between array syntax and a DO loop, I compiled the routines below to assembly with ifort. With only default optimization, both routines produce exactly the same assembly.

subroutine test1(a, b, n)
  integer :: n
  real :: a(n), b(n)
  a = a + b
end subroutine test1

subroutine test2(a, b, n)
  integer :: n
  real :: a(n), b(n)
  integer :: i
  do i = 1, n
    a(i) = a(i) + b(i)
  end do
end subroutine test2

Steven_L_Intel1
Employee
I said there's a semantic difference, and there is. The compiler goes through a lot of effort to recognize when the extra loop and the temp aren't needed.
hajek
Beginner
I think that the Fortran committee long ago abandoned the idea of Fortran being a language for writing super-fast core code that squeezes every tiny bit of performance out of the processor. That's what C, the "portable assembler", is for - after all, Fortran is not a good language for manual performance tuning, precisely because it is so strongly amenable to compiler optimizations. It is true that a static array can offer a small performance advantage - but that is rarely the main reason for performance losses, because once the array is passed into a subroutine it does not matter any more. Repeated allocation and deallocation (e.g., in inner loops) hurts performance in any language, and it's better to avoid it.

It seems obvious that most of the new array stuff is targeted at more "Matlab-like" array programming, where an occasional temporary is something you can live with, and the innermost loops are mostly hidden in vector operations. This is IMHO why the intrinsics are array-valued functions rather than subroutines with output arguments, and why assumed-shape arrays are simple yet cause performance trouble in "low-level" code. The reward is that you can treat arrays quite like scalars, and let the compiler do its heavy optimizations. I suppose that this can also make Fortran attractive for commercial compiler vendors.
The cool thing is that great care has been taken to make new-style coding compatible with the old style, so that there are no inevitable performance losses. Most experts recommend taking advantage of modern features to write clear code, and then using profilers and other tools to find the hot spots and optimize those (or replace them with tuned subprograms).
One of the deficiencies (IMHO) of F95/F2003 is that it is _impossible_ to tell whether an assumed-shape array is contiguous (or partially contiguous), and to write specialized code for that case if it happens to be necessary. But note that most compiler vendors (including Intel) offer extensions like the LOC intrinsic that can be used to work around this. Intel is, in particular, great at optimizing modern Fortran code; nevertheless, it offers a plethora of extensions to make code faster should one feel smarter than the compiler.
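The LOC-based workaround mentioned above might be sketched like this (LOC is a vendor extension rather than standard Fortran, and the stride of 4 assumes default 4-byte reals; this is illustrative only):

```fortran
! Sketch: test whether an assumed-shape 1-D array is contiguous by
! comparing the addresses of consecutive elements. LOC is a common
! vendor extension (supported by ifort); 4 bytes is assumed for a
! default real, so adjust the constant for other kinds.
logical function is_contig(a)
  implicit none
  real, intent(in) :: a(:)
  if (size(a) < 2) then
    is_contig = .true.
  else
    is_contig = (loc(a(2)) - loc(a(1)) == 4)
  end if
end function is_contig
```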
Many thanks to Intel for their Fortran compiler!
Jaroslav