From FORTRAN 77 to C?

nh2 · ‎12-26-2003

We have a mathematical model written in FORTRAN 77. This model consists for the most part of small pure functions that use FPU math library functions heavily (a**b and exp, acos, asin, tanh, cos, sin, divisions). It has been decided that we should re-implement this model, and we wonder whether it would be a good idea to translate the code from FORTRAN 77 to C. Would the translation from FORTRAN 77 to C improve the efficiency of the program? Does version 8.0 of the Intel C/C++ compiler generate any faster code than version 8.0 of the Intel Visual Fortran compiler? Or would this translation be a waste of time?

Lars Petter Endresen

TimP · ‎12-26-2003

With the 8.0 compiler (IMO), a concerted effort has been made to fix the few situations where the IA32 C compiler has generated more efficient code than Fortran. Those generally involve mixtures of integer and floating point. C, by design, doesn't have the potential of Fortran to optimize exponentiation operations like Fortran a**b. For the other math functions you quote, there is little inherent difference between C and Fortran, but you are likely to need to specify explicitly the float or double versions of the functions, until C99 tgmath has become more widely and efficiently implemented. gcc has better facilities for in-lining math functions than g77, but this difference doesn't carry over to Intel Fortran or to Xeon. The 8.0 IA64 C compiler still has some distance to go to acquire all feasible optimizations already performed by Fortran.

Those math functions depend on parallelization for full efficiency, either the -xN vectorization for Xeon, or the SWP for Itanium. Considerations for in-lining and optimizing them are much the same in C and Fortran.

nh2 · ‎12-26-2003

Thanks for the very interesting information.

"but you are likely to need to specify explicitly the float or double versions of the functions"

By inspecting the number of clock cycles spent in math functions (using MS C++ 7.1), and by inspecting the assembly dump, it seems to me that:

1) the number of clock cycles is independent of whether the variable is a double or a float, and

2) the assembly call to the math function is not changed if the variable is changed from a double to a float.

Does the Intel version 8.0 of the compilers have math functions that are more efficient for single than double precision?

"C, by design, doesn't have the potential of Fortran to optimize exponentiation operations like Fortran a**b."

I have noticed that rewriting a**0.25 to sqrt(sqrt(a)) had a dramatic effect on the computational time.

How does Fortran optimize exponents? Is it possible to calculate the N'th root a**(1/N) as efficiently as sqrt(a)?

TimP · ‎12-27-2003

When you talked about switching to C for better optimization, I had no idea you would choose MSC. I don't believe it has any automatic vectorization or SWP for math functions. It may not have SSE versions of the math functions, in case you are working on IA32 processors. That certainly is not an inherent characteristic of C, but if you are looking for all C compilers to have more processor-specific optimization than most Fortrans, you will be disappointed. It is certainly valid for a compiler to promote math function arguments to double, in the K&R C fashion, but Intel compilers take advantage of declared precision. That was among the reasons for the introduction of the C standard 14 years ago.

As you noticed, you can't count on Fortran compilers to recognize fractional exponents and optimize them, although it may happen sometimes. **1.5 is another common case where substitution of sqrt() is likely to gain. In-line expansion of small integer powers is expected of good Fortran compilers, although not mandated by any standard.

If you want efficient coding of Nth root, you will likely have to write it out yourself. Cube root is provided, as an extension to the standard, by a few Fortran math libraries.

nh2 · ‎12-27-2003

Thanks again for the very useful help. This is indeed very interesting!

We are now using CVF 6.1 and we will switch to Intel Fortran or Intel C++ depending on what you recommend.

What you write about the math library function for cube root is very interesting, as the functions in our code that use the most CPU time involve frequent calls to a**(1.0/3.0)!! Is this cube root a part of the Intel Fortran compiler? Or is it possible to link it from a library from another compiler? The square root is very efficient: it uses something like 38 clock cycles on my Pentium 4 computer, and is thus nearly 10 times faster than a**b (or pow(a,b) in C), which used more than 280 clock cycles. So we have replaced, for example, a**2.5 with a*a*sqrt(a) and are very happy with the reduced CPU time. Do you know if the cube root is as efficient as the square root?

How much faster would the single precision version of the math library function pow(a,b) be than the corresponding double precision version? Two times faster? Would this involve calls to the Intel Approximate Math Library for SSE/SSE2? Unfortunately our code is scalar and it is not so easy to vectorize due to the many if-then-else statements everywhere in the code, so SSE/SSE2 may not be so easy to apply to our code.

TimP · ‎12-28-2003

I would think that one of the current Intel Fortran compilers ought to be satisfactory. 7.1 is more mature than 8.0; 8.0 has more complete CVF compatibility. Both versions work with the same license, and both have continued support.

For a cube root function, I think you will have to rely on a Google search or the like. Apparently, there is one in ACM TOMS archive, and the version posted at http://www.worldserver.com/turk/opensource/CubeRoot.c.txt looks like a possibility. As there is no direct hardware support for cube root, my guess is it would take about 3 times as long as sqrt().

I doubt there's a great potential time saving in general over the C pow() function, unless you can engage SSE short vector coding. The advantage Fortran has is that a constant integer exponent can be replaced with in-line expansion. As you mentioned, you would have to dictate in-line expansion when it can be done efficiently by some combination of sqrt(). The way you wrote a**2.5 should be getting some benefit from parallelism, by calculating a*a and sqrt(a) in parallel. SSE could give you some speed advantage on sqrt(), even without vectorization.

nh2 · ‎12-28-2003

I am most grateful for the truly useful recommendations!

:-)

I would recommend that Intel implement the cube root in the next generation of Pentium 5 processors!

:-)

I will try the CubeRoot function in C and also check if we can gain a little if we write it directly in inline assembly. This routine has 24 bits accuracy. Would it be easy to extend it to 53 and 64 bit accuracy also, by adding additional lines like this:

r = (double)(2.0/3.0) * r + (double)(1.0/3.0) * x / (r * r); /* 12 bits of precision */
r = (double)(2.0/3.0) * r + (double)(1.0/3.0) * x / (r * r); /* 24 bits of precision */
r = (double)(2.0/3.0) * r + (double)(1.0/3.0) * x / (r * r); /* 36 bits of precision? */
r = (double)(2.0/3.0) * r + (double)(1.0/3.0) * x / (r * r); /* 48 bits of precision? */

Please forgive me if the following question seems a little strange, but I am only a beginner in this field, and I am thus still a little confused about precision. Aren't all the following FPU instructions always calculated with a 10 byte floating point representation?

FADD: Addition
FMUL: Multiplication
FDIV: Division
FDIVR: Division
FSIN: Sine (uses radians)
FCOS: Cosine (uses radians)
FSQRT: Square Root
FSUB: Subtraction
FABS: Absolute Value

If we write, for example, A = COS(B) in FORTRAN 77, wouldn't that calculation always be carried out with the 10 byte representation internally in the FPU, whether the variable is single or double precision? (I know that the extended precision of the ST(0)..ST(7) registers will be lost when the variable is written to RAM.) As it is not so easy to determine whether single or double precision is required to have good enough accuracy in the implementation of our mathematical model, it would be nice to be able to switch between single, double and extended precision as easily as possible. Would the best procedure be to implement the model in double precision, but switch back and forth between single, double and extended precision using:

_control87(0x00020000, 0x00030000) (24 bits)
_control87(0x00010000, 0x00030000) (53 bits)
_control87(0x00000000, 0x00030000) (64 bits)

The _control87 routine may be called once in the beginning of the program to set the accuracy of the FPU. Will this affect the accuracy and efficiency of all the FPU instructions listed above?

Best regards from Lars Petter Endresen

TimP · ‎12-28-2003

Apparently, cbrt() and cbrtf() are already supported in the libraries for the linux C compilers, as well as in newlib for gcc on Windows x87.

http://www.intel.com/software/products/opensource/whats_new.htm

shows that Intel expended some effort on behalf of Itanium. It would not be too difficult to look up the public sources (all asm, I believe) and modify them to be directly callable by Fortran. You could submit a premier.intel.com feature request issue, or do it yourself.

In the C source code Newton iteration version, adding iterations would increase the precision. 2 more iterations ought to give well beyond double precision, if you assure that the operations are done in double (SSE2 or x87). For the rational approximation, you would have to study the methods for fitting them, or check it has already been done in the cbrt() source.
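A minimal sketch of the Newton refinement in C, written as a loop instead of repeated lines. The helper names and the idea of passing in a starting guess are assumptions for illustration; the posted CubeRoot.c obtains its starting value differently. Newton's method converges quadratically, so each step roughly doubles the number of correct bits once the guess is close.

```c
/* One Newton step for r*r*r = x:
   r <- (2*r + x/(r*r)) / 3, the same update as in the posted lines. */
static double cbrt_newton_step(double r, double x) {
    return (2.0 * r + x / (r * r)) / 3.0;
}

/* Hypothetical helper: refine a starting guess r0 with a fixed number
   of Newton steps. Assumes x > 0 and r0 > 0, all arithmetic in double. */
static double cbrt_newton(double x, double r0, int steps) {
    double r = r0;
    for (int i = 0; i < steps; ++i) {
        r = cbrt_newton_step(r, x);
    }
    return r;
}
```

For example, cbrt_newton(27.0, 3.1, 6) starts from a guess that is off by about 3% and ends within rounding error of 3.0. Note that every step costs a division, which is usually the most expensive operation in the loop.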

I suppose that you would not gain the full accuracy potential of x87 code, given that Windows compilers set 53-bit precision mode. To answer your question fully, you would have to study the manuals, but I believe all the instructions you mention are subject to the precision mode setting, except fsin and fcos.

You could change the precision mode at run time yourself, as you suggest, but you may find unexpected inconsistencies due to partial implementation of extended precision. CVF includes such a library call; I haven't checked whether it carries over into ifort 8.0. I have never heard of anyone embedding precision mode changes into compiler generated code; the cost in performance would be far too high, and SSE/SSE2 support 24-bit and 53-bit precision well. Intel and Microsoft Windows compilers don't adequately support extended precision, and I don't expect that to change, with the increasing importance of SSE/SSE2.

nh2 · ‎12-29-2003

Thanks for the links to newlib. I think we will implement the cube root in assembly, using the single precision quadric rational polynomial approximation, which according to:

http://www.worldserver.com/turk/computergraphics/CubeRoot.pdf

involves 16 FPU multiplications and 1 FPU division. If the code is written to take maximum advantage of the Pentium 4 out of order execution core, maybe one could manage to do some of the 16 multiplications in parallel? What do you think? In this case the number of clock cycles for the routine would be around or less than 100, which would be quite nice indeed! As the Intel Visual Fortran compiler does not allow inline assembly, we would have to make a tiny library cbrt.lib that will contain one function only (FORTRAN 77 syntax):

double precision function CBRT(double precision x)

(it would be nice to have a double precision interface despite the fact that the accuracy of the routine is single precision only!) The routines I found at gcc newlib were written in low level C and contained so many FPU divisions that a call to the pow() routine would probably be more efficient.

I want to apologize somewhat for all my strange questions about precision. However, it is not so easy to convince the quality assurance department here that the accuracy is sufficient. But now I think that I have enough information about the issue to convince them.

Regarding premier.intel.com, would it be possible to ask them for some advice on how to implement the quadric rational polynomial approximation in assembly to gain maximum performance from the Pentium 4 out of order execution core?

This, I think, is the tricky part of the programming; getting the name decoration right, i.e. _cbrt or _cbrt@8, should be quite easy. How does one post questions to premier.intel.com?

Best regards from Lars Petter Endresen

TimP · ‎12-29-2003

The integer divides are likely to be even more costly than the fp division. You will need SSE operations to carry out moves between integer and floating point registers efficiently. C compilers may insist that the & (address-taking) operation requires going through memory, so there you may wish to use lower-level equivalent coding.

The numerator and denominator evaluation should be performed effectively in parallel. On an in order machine you would have to interleave the operations to achieve that; it should occur automatically with out of order. This should not be difficult for you to check experimentally, if you wish to take the time. Start with the code generated by your compiler, try variations on it, checking correctness and performance.
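To illustrate what independent numerator and denominator evaluation looks like in source form, here is a small sketch in C. It does not use the cube root coefficients from the paper; as a stand-in it uses the well-known [2/2] Padé approximant of exp(x) near zero, (1 + x/2 + x^2/12) / (1 - x/2 + x^2/12), and the function name is hypothetical.

```c
/* Illustrative rational approximation (NOT the cube root fit from the
   thread): [2/2] Pade approximant of exp(x), accurate only near x = 0. */
static double rational_exp(double x) {
    const double c2 = 1.0 / 12.0;

    /* Two Horner chains. Neither uses the other's result, so an
       out-of-order core can overlap them without manual interleaving. */
    double num = (c2 * x + 0.5) * x + 1.0;
    double den = (c2 * x - 0.5) * x + 1.0;

    /* A single division at the end joins the two chains. */
    return num / den;
}
```

On an in-order machine the same effect would need the multiplies of the two chains written alternately; here the compiler and hardware are left to schedule them, which is something to verify in the generated assembly.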

I'm not totally convinced that you will gain much with asm coding, compared to a high quality SSE/SSE2 compiler. If you don't want to use Intel C, you could try translating the whole thing to Fortran.

As the architecture-specific forums were discontinued a few months ago, I can't suggest any better Intel-hosted way to discuss this. If you do go with Intel C, you might find more low-level coding expertise on the C forum.

If you wish to post an issue on premier.intel.com, use the account which you are invited to open when you install the compiler. Select All Products, Intel Fortran (or C, if you are using icc) for Windows. Fill in the requested version information. Feature request might be appropriate. Upload example source code which demonstrates what you are trying to do and the performance improvements you expect.

durisinm · ‎12-29-2003

Can someone educate me on what SSE/SSE2 is?

Mike D.

Steven_L_Intel1 · ‎12-29-2003

These are the Single Instruction Multiple Data (vector) instructions added to the Pentium III (SSE) and Pentium 4 (SSE2). SSE is Intel's name and stands for Streaming SIMD Extension. The "processor code-named Prescott" (I still have no idea what it will really be called, except that it WON'T be "Pentium 5") has yet another set of these, known as SSE3.

Common to SSE is a set of vector registers that are separate from the normal IA-32 registers. SSE was single-precision only, SSE2 was double-precision. The Intel compilers know how to vectorize code to make effective use of the SSE feature. As Tim says, the SSE operations do not have "modes" - single-precision is 24-bit, double is 53.
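The 24-bit and 53-bit figures can be checked from C. The following is a small sketch assuming IEEE 754 float and double, as on IA-32 with SSE/SSE2; the helper names are hypothetical. The volatile stores are there so that each result is rounded to its declared type even if the compiler evaluates in a wider x87 register.

```c
#include <float.h>
#include <math.h>

/* Returns 1 if 1.0f + 2^-bits is distinguishable from 1.0f once the
   sum has been stored as a float, otherwise 0. */
static int float_resolves(int bits) {
    volatile float one = 1.0f;
    volatile float sum = one + ldexpf(1.0f, -bits);
    return sum != one;
}

/* Same check for double. */
static int double_resolves(int bits) {
    volatile double one = 1.0;
    volatile double sum = one + ldexp(1.0, -bits);
    return sum != one;
}
```

With a 24-bit significand (FLT_MANT_DIG), float_resolves(23) is 1 and float_resolves(24) is 0; with a 53-bit significand (DBL_MANT_DIG), double_resolves(52) is 1 and double_resolves(53) is 0.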

nh2 · ‎12-29-2003

I am very grateful for all the useful comments to my questions. I would like to wish you a happy new year!

:-)

invariant-invariant · ‎01-30-2004

(Note: this message was also posted to the Intel C++ forum!)

Thanks a lot for the help. I have also received some comments directly from Intel regarding the cbrt function in C++. Here is the FORTRAN interface to that function. This interface works for any calling convention: STDCALL, CDECL and FASTCALL. We have found that this function is at least four times faster than writing X**0.3333333333333333 in FORTRAN.

Question: Would this interface result in an inlined cbrt in FORTRAN? Or is it possible to write an interface which is more efficient? The interface should be invariant with calling convention. We are using Intel Visual Fortran 8.0.

DOUBLE PRECISION FUNCTION CBRTC(X)
DOUBLE PRECISION X
INTERFACE
  DOUBLE PRECISION FUNCTION CBRT(Y)
  DOUBLE PRECISION Y
!DEC$ ATTRIBUTES C, ALIAS:'_cbrt' :: CBRT
  END FUNCTION CBRT
END INTERFACE
CBRTC = CBRT(%VAL(X))
END

Steven_L_Intel1 · ‎01-30-2004

You would not get inlining with this, but it seems like a good solution.

invariant-invariant · ‎02-03-2004

The following information may be useful for this forum.

A mathematical model implemented in FORTRAN 77 has been compiled in two different ways:

1) Directly, using Intel Visual Fortran 8.0.

2) Indirectly, by translating the code from FORTRAN 77 to C using F2C, and then compiling using Intel C++ 8.0.

We have checked that the two codes produce exactly the same numerical results, and that the translated C code (from F2C.EXE) looks very good. As the code contains only arithmetic, logical and math functions (no character strings or I/O at all), the F2C translated code was nearly identical to the original FORTRAN 77 code.

The result is that Intel's Fortran compiler generates more efficient code than Intel's C++ compiler for all test cases. In some cases the Fortran code is up to 50% faster.

No reason to go for C instead of Fortran!

Best wishes

Lars Petter Endresen