Intel® Fortran Compiler

Speeding up execution of REAL*16 programs

Simon_C
Novice

Hi Intel, Anyone,

I've been running an application in extended precision (REAL*16) for more years than I like to count and, finally, the time has come when I need it to execute faster. My impression is that all I can do is reduce, as far as possible, the number of extended precision multiply and divide operations. (This is what VTune tells me is taking the most time.) I can certainly do this, but is there more I should know?

Extended precision and its needs seem to be a neglected area of development. Two things caused me to write this post. (1) I received an email flyer for an Intel HPC Code Modernisation Workshop that offered, among other things, to "Learn how to modernize your legacy code or develop brand-new code in order to maximize software performance on current and future Intel Xeon and Xeon Phi processors". I asked if there was anything in the workshop relevant to extended precision programs. Answer: "No". (2) I found that multiplying two REAL*16 variables that are both equal to zero seemed to take a long time - perhaps as much as for non-zero values. That isn't what I would have guessed (not being a computer scientist), and it makes me wonder if I should test whether values are equal to zero *before* multiplying them (to avoid the cost of that operation in terms of time). This is just one possible question of many.

If there is a good guide to optimisation of programs running in extended precision, please point me to it. It would be nice if someone at Intel would write that guide if it doesn't exist. Dr Fortran, perhaps? 

Thanks,

Simon C.

mecej4
Honored Contributor III

it makes me wonder if I should test whether values are equal to zero *before* multiplying them

Probably not a good idea. There is no special bypass treatment available to handle the case where one or both multiplicands are zero. Doing a comparison, examining the MXCSR register and taking a branch instruction to skip multiplication of zeroes would not only consume CPU cycles but also adversely affect vectorization.

The biggest speed increase should be expected from doing away with REAL*16 arithmetic as much as possible. Since this is not a native type implemented in the processor, the arithmetic has to be simulated. A multiplication of two REAL*16 numbers would require four REAL*8 multiplications, three REAL*8 additions, and additional work to handle carries and overflows.
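To make the cost concrete, here is the classic Veltkamp/Dekker "two-product" on which double-double arithmetic is built - a sketch for illustration only, not either compiler's actual library code. Even one exact real(8) product with its error term already takes this many native operations, and a full simulated quad multiply needs several of these plus renormalization:

! Veltkamp split + Dekker two-product: hi + lo == a*b exactly
! (barring overflow). True IEEE binary128 emulation costs still more.
subroutine two_prod(a, b, hi, lo)
   implicit none
   real(8), intent(in)  :: a, b
   real(8), intent(out) :: hi, lo
   real(8), parameter :: split = 134217729.0d0   ! 2**27 + 1
   real(8) :: a1, a2, b1, b2, t
   hi = a*b
   t  = split*a
   a1 = t - (t - a)                              ! high ~26 bits of a
   a2 = a - a1                                   ! remaining low bits
   t  = split*b
   b1 = t - (t - b)
   b2 = b - b1
   lo = ((a1*b1 - hi) + a1*b2 + a2*b1) + a2*b2   ! exact rounding error of a*b
end subroutine two_prod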

TimP
Honored Contributor III

As there is no vectorization, nor much instruction-level parallelism, for real(16), legacy programs which check whether an operand is 0 before running a loop using that operand are likely doing the right thing. As mecej4 hinted, such a check would not be a good idea for real(8).

I suppose a real(16) multiply might be accomplished with 3 integer(8) multiplies, plus some shifts and adds, but it still has to take several times as long as a non-vector real(8) multiply.

If you have little hope of vectorizing your application by moving to real(8), you might consider OpenMP, since you are likely running on a multi-core CPU.
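A minimal sketch of what that might look like (array name and size are placeholders, not Simon's code; compile with the compiler's OpenMP flag, e.g. -qopenmp for recent Intel compilers):

program omp_qsum
   implicit none
   integer, parameter :: n = 1000000
   real(16), allocatable :: x(:)
   real(16) :: total
   integer :: k
   allocate(x(n))
   call random_number(x)
   total = 0.0_16
   ! Each thread accumulates a private partial sum; the reduction
   ! clause combines them at the end. This works for real(16) too.
   !$omp parallel do reduction(+:total)
   do k = 1, n
      total = total + x(k)*x(k)
   end do
   !$omp end parallel do
   print *, 'sum of squares = ', total
end program omp_qsum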

Martyn_C_Intel
Employee

128 bit precision doesn't get much attention because of the perception that there is not much demand (e.g. compared to the demand for more cores or more efficient memory access). I don't know of a guide, but general works on numerical programming may be relevant. The C standard now supports 128 bit precision, so visibility is increasing.

In typical programs, the step that requires additional precision is accumulation (summation of many terms). Lesser precision is often sufficient for the evaluation of individual terms. You might investigate whether this could apply to you. Also look for cancellations - subtractions of two quantities that are systematically nearly equal. Sometimes such expressions can be rewritten to reduce the numerical uncertainties caused by rounding, and so avoid the need for higher precision.
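For example (an illustrative sketch, not from any particular application):

program cancel
   implicit none
   real(8) :: x
   x = 1.0d-12
   ! sqrt(1+x)-1 subtracts two nearly equal numbers and loses most digits;
   ! the algebraically equivalent form avoids the subtraction entirely.
   print *, sqrt(1.0d0 + x) - 1.0d0         ! only a few digits correct
   print *, x/(sqrt(1.0d0 + x) + 1.0d0)     ! full precision, ~ x/2
end program cancel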

There also exist summation methods that can give high precision results without the use of high precision native arithmetic. Even converting a linear sum of very many elements to a tree algorithm can increase precision. (To a degree, this happens automatically when you thread and/or vectorize a summation.)
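One well-known example is compensated (Kahan) summation, sketched below in real(8). Note that value-unsafe optimization can legally discard the compensation, so a value-safe flag such as -fp-model precise is needed with Intel compilers:

pure function kahan_sum(x) result(s)
   implicit none
   real(8), intent(in) :: x(:)
   real(8) :: s, c, y, t
   integer :: i
   s = 0.0d0
   c = 0.0d0                ! running compensation for lost low-order bits
   do i = 1, size(x)
      y = x(i) - c          ! apply the compensation to the next term
      t = s + y             ! low-order digits of y are lost here...
      c = (t - s) - y       ! ...and recovered algebraically
      s = t
   end do
end function kahan_sum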

If zero values of quad precision variables are frequent, testing for them might be worthwhile; you'd need to do the measurement. For double precision or lower, when vectorization is possible, it's not a good idea.

jimdempseyatthecove
Honored Contributor III

Can you show us your most heavily used function/subroutine? Maybe there is something we can point out that will speed you up.

Jim Dempsey

Simon_C
Novice

Everyone,

Thank you for your responses so far. I'll reply to the suggestions and questions, starting from the top:


(1) mecej4 and Martyn Corden: removing REAL*16 arithmetic where possible, and use techniques to avoid loss of precision (in summations for example).

The nature of the problem precludes this. It is a minimisation of a function with multiple variables "x". The contribution of each x(i) to the function value is roughly proportional to the value of the variable (i.e., each x(i)). Thus if, at the solution to the problem, the range of the x values from the smallest to the largest covers many orders of magnitude - over 20, say - the minimum of the function has to be located to well over 20 digits of precision to ensure an accurate solution. Of course there are alternative ways of formulating the problem (chemical equilibrium in a system with many species). However, the method my program uses is very robust, and very general. Other ways, which would only require REAL*8 arithmetic, all have their own problems.

I suspect that users of REAL*16, like me, are more interested in increased precision (EPSILON) than in the ability to handle extremely large or small numbers (TINY and HUGE). I think the compiler of the Numerical Algorithms Group (NAG) recognises this and uses a different approach from Intel's in its implementation of extended precision. It is faster. Here are the outputs of the corresponding inquiry functions:

Intel, double precision
 digits =     53
 base   =      2
 epsilon   = 0.2220446049250313080847263336181640625E-0015
 biggest   = 0.1797693134862315708145274237317043568E+0309
 smallest  = 0.2225073858507201383090232717332404064E-0307

Intel, extended precision

 digits =    113
 base   =      2
 epsilon   = 0.1925929944387235853055977942584927319E-0033
 biggest   = 0.1189731495357231765085759326628007016E+4933
 smallest  = 0.3362103143112093506262677817321752603E-4931


NAG, double precision

 digits =     53
 base   =      2
 epsilon   = 0.2220446049250313080847263336181640625E-0015
 biggest   = 0.1797693134862315708145274237317043568E+0309
 smallest  = 0.2225073858507201383090232717332404064E-0307

NAG, extended precision

 digits =    106
 base   =      2
 epsilon   = 0.2465190328815661891911651766508706968E-0031
 biggest   = 0.8988465674311579538646525953945012877E+0308
 smallest  = 0.2004168360008972777996108051350162048E-0291
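
(For reference, a minimal sketch of the kind of inquiry program that produces listings like those above, using the standard intrinsics:)

program probe
   implicit none
   integer, parameter :: qp = selected_real_kind(30)   ! quad/extended kind
   real(qp) :: x
   print *, ' digits    = ', digits(x)
   print *, ' base      = ', radix(x)
   print *, ' epsilon   = ', epsilon(x)
   print *, ' biggest   = ', huge(x)
   print *, ' smallest  = ', tiny(x)
end program probe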

I work with both compilers, but the Intel one is more widely used, which is a big advantage.

(2) Tim Prince: I think I will follow your suggestion about checking the operand. On your second paragraph about real*16 multiplies: the speed penalty of running with REAL*16 arithmetic, for the program as a whole, is a factor of between 5 and 10. Running on a multi-core CPU is something (thank goodness) I don't think I'll need unless my program is embedded within a larger program and called hundreds or thousands of times. Following on from your second paragraph, I suppose I should rank the common operations carried out in the most heavily used parts of the code according to the time they take. Am I right in thinking that is as simple as writing a short program including the relevant operations and using the system clock to time them (repeating each operation thousands of times, because the system clock has a resolution of milliseconds rather than microseconds or shorter)?
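
Something like the following sketch is what I have in mind (kind, values and repeat count chosen arbitrarily; compiled without optimisation so the loop isn't removed):

program time_qmul
   implicit none
   integer, parameter :: qp = selected_real_kind(30)
   integer, parameter :: nrep = 10**8
   integer(8) :: t0, t1, rate
   integer :: i
   real(qp) :: x, y, z
   x = 1.1_qp
   y = 0.9_qp
   z = 0.0_qp
   call system_clock(t0, rate)
   do i = 1, nrep
      z = z + x*y                 ! accumulate so the work can't be discarded
   end do
   call system_clock(t1)
   print *, 'ns per multiply-add: ', 1.0d9*real(t1 - t0, 8)/(real(rate, 8)*nrep)
   print *, z                     ! use the result so the loop is kept
end program time_qmul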


(3) Jim Dempsey. The most heavily used routines have multiple, nested (to 3 or 4 levels) summations in which several quantities are multiplied together and then added to the total. At present these routines are run once for each variable x(i) in each iteration of the minimization procedure. However, many of the nested summations are common, or very similar, for all x and need only be calculated once per iteration instead of n times (where n is the number of variables). This is a straightforward improvement to implement, and the one I'm focused on, and it doesn't really merit the attention of forum participants.

My post was really to draw Intel's attention to this use of their compiler, and to ask if there is anything else I should know. I don't want to have to return to the code in a month or a year because I wasn't aware of something that would have enabled me to get further significant speed increases now. Here is a typical summation:

     
      SUMUnca = 0.D0
      DO 37 K=1,NNiter
        SUMca = 0.D0
        DO 17 I=1,NCiter
          SUMA = 0.D0
          DO 18 J=1,NAiter
             SUMA = SUMA + 2*xAN(J)*(CATchrg(I)+ANchrg(J))**2 / (CATchrg(I)*ANchrg(J))*Unca(K,I,J)
18         CONTINUE
          DUM = (CATchrg(I)+zX)**2 / (zX*CATchrg(I))*Unca(K,I,nSELECT)
          SUMca = SUMca + xCAT(I)*(DUM - SUMA)
17      CONTINUE
        SUMUnca = SUMUnca + xNEUT(K)*SUMca
37    CONTINUE
 

There are no function references in the code fragment above, just real arrays and variables. There are many, many sets of summations like this. (Despite the 'old' appearance of this code the program is now entirely modular, in the Fortran sense, and uses no 'include' files or common blocks. I'm rather proud of that!)

Simon C.

andrew_4619
Honored Contributor II

What might the sizes of NNiter, NCiter and NAiter be? The calculation in the innermost loop may be reduced by creating some intermediate arrays outside, as sketched below. BTW the multiplication by 2 may as well occur once after the 18 loop rather than NAiter times.
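
Something along these lines, for instance (CHRGFAC is a hypothetical work array; the other names are from Simon's fragment):

C     Computed once per problem, since the charges are constants:
      DO 27 I = 1, NCiter
        DO 28 J = 1, NAiter
          CHRGFAC(I,J) = (CATchrg(I)+ANchrg(J))**2
     &                   / (CATchrg(I)*ANchrg(J))
28      CONTINUE
27    CONTINUE

C     ...then the 18 loop needs two multiplies per term, and the
C     factor of 2 is applied once, after the loop:
          SUMA = 0.D0
          DO 18 J=1,NAiter
            SUMA = SUMA + xAN(J)*CHRGFAC(I,J)*Unca(K,I,J)
18        CONTINUE
          SUMA = 2.D0*SUMA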

FortranFan
Honored Contributor II

Simon Clegg wrote:

... Of course there are alternative ways of formulating the problem (chemical equilibrium in a system with many species). However, the method my program uses is very robust, and very general. Other ways, which would only require REAL*8 arithmetic, all have their own problems.

..

  1. Have you published your work in any peer-reviewed journal? I would be interested in studying your method further. Every package I know seems to compute chemical equilibrium robustly and reliably with double precision arithmetic, and where there are issues they are due more to gaps in the speciation formulation, and to uncertainties and inaccuracies in the underlying thermodynamic data, than to numerical precision. So your comments on the need for quadruple precision in such calculations are surprising.
  2. Since your computations involve a lot of summations using DO loops and your interest is in improving CPU performance, you may want to investigate approaches using vectorization/parallelization/multiprocessing, etc., and see if they can help (e.g., OpenMP). Take a look at this forum topic: https://software.intel.com/en-us/forums/topic/539969 and see how Jim improves simple loop performance using !DIR$ IVDEP and !DIR$ VECTOR NONTEMPORAL (Quote #17) - perhaps there are some options along these lines that can help you.
  3. Your code snippet for summation shows mixed-mode arithmetic in the J loop (DO 18). As shown, it is no big deal; but the question remains whether it is representative of a coding style, and whether mixed-mode arithmetic is prevalent elsewhere in the code, where it can cause imprecision.
Martyn_C_Intel
Employee

First, to repeat a comment made earlier: real(16) arithmetic cannot be vectorized. There are no 128 bit floating-point instructions (either scalar or vector); real(16) operations have to be simulated in software, which is why they are slow.

Second, the NAG compiler looks to have a "double double" implementation of extended precision, using a pair of real(8) variables. This is faster but less precise and does not conform to the IEEE standard for 128 bit floating-point arithmetic. The Intel compiler has chosen to be compatible with the IEEE standard, which in some contexts may make code more portable.

Steven_L_Intel1
Employee

IBM also used the "double double" implementation in their AIX compilers. This does extend precision but not range. Our implementation uses integer arithmetic in library routines.

jimdempseyatthecove
Honored Contributor III

What are the precision requirements of the variables (not the expression) on the right side of:

SUMA = SUMA + 2*xAN(J)*(CATchrg(I)+ANchrg(J))**2 / (CATchrg(I)*ANchrg(J))*Unca(K,I,J)

Are your xxxchrg(.) values quantizable (can be expressed as an integral multiple of some value)?

If so, can you calculate the part to the left of the /, and the part to the right of the /, using integer(8)?
If so, do so, then convert the two integer(8) results to real(16), and perform the / and + operations.
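
For the charge factor in the loop above, that might look like this - a sketch only, taking just the charge parts (the integral ones); ICAT and IAN are hypothetical integer(8) copies of CATchrg and ANchrg:

      INTEGER(8) NUM, DEN
C     Both products are exact in 64-bit integer arithmetic:
      NUM = (ICAT(I) + IAN(J))**2
      DEN = ICAT(I)*IAN(J)
C     Convert once; only the / and the accumulation run in REAL(16):
      SUMA = SUMA + 2.0_16*xAN(J)*Unca(K,I,J)
     &            * (REAL(NUM,16)/REAL(DEN,16))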

Jim Dempsey

Simon_C
Novice

Everyone,

Thank you for your comments and questions. I'll reply below, starting with the post following my last one.

(1)  app4619, and Jim Dempsey (questions about the code fragment):
NNiter, NCiter, NAiter are generally small (less than 10). I appreciate your point about the "2". CATchrg and ANchrg, though REAL variables, actually have integer values. In any given problem they are constant so I intend to calculate the complete quantity "(CATchrg(I)+ANchrg(J))**2 / (CATchrg(I)*ANchrg(J))" for all I and J and store the results in an array for use where needed. Your point about using integer arithmetic where possible is well taken. Thanks!

(2)  FortranFan:
The reason for using extended precision is given in item (1) of my second post, above. A complete description of the way the problem is formulated - and perhaps a better explanation of the need for extended precision - can be found in A. S. Wexler and S. L. Clegg (2002) J. Geophys. Research, Vol. 107, No. D14, 4207, 10.1029/2001JD000451. The problem to be solved is a linearly constrained minimisation of the Gibbs energy in which gradients are calculated analytically. The solver (a commercial one) uses a sequential quadratic programming method and is based upon work by Gill and others at Stanford.

(3)  Martyn Corden and Steve Lionel:
(3)  Martyn Corden and Steve Lionel:
No vectorisation, and I assume no prospect of it. If Intel puts out questionnaires to its users to gauge future needs, I hope it will include a few questions about REAL*16 usage. I wonder if we are as small a minority as we appear; it would be good to know. I'd forgotten that IBM did double double. I've never used any of their software products though, except OS/2. (And I thought that was great!)

(4) Everyone:
I will do some simple speed tests of different operations in REAL*16 and post the results here over the next several days.

Simon C.

mecej4
Honored Contributor III

I would be interested in trying out some things on a small to medium size program with the property that the results are definitely inaccurate unless quadruple precision is used. Can you provide one in source form?

FortranFan
Honored Contributor II

Simon Clegg wrote:

... A complete description of the way the problem is formulated - and perhaps a better explanation of the need for extended precision - can be found in A. S. Wexler and S. L. Clegg (2002) J. Geophys. Research, Vol. 107, No. D14, 4207, 10.1029/2001JD000451. The problem to be solved is a linearly constrained minimisation of the Gibbs energy in which gradients are calculated analytically. ..

Thanks, I've procured the article you coauthored and reviewed it quickly. The article doesn't elaborate on the specific need for quadruple precision arithmetic, and it doesn't give specific examples of where approaches based on double precision prove inadequate and of how your approach helps.

The article briefly indicates that the main motivation to use quadruple precision is to be able to solve for the extremely low volatility species in the gas phase. You must be familiar with the work by Gordon and McBride (e.g., NASA Reference Publication 1311, 1996), Gautam and Seider (AIChE J., Vol. 25, No. 6, 991-1016 (1979)), Smith and Missen (Chemical Reaction Equilibrium Analysis, 1982), etc., to name a few "open literature" sources, and with commercial packages such as ChemSage, Aspen PLUS, OLI, etc. Generally, all of these approaches work reasonably well for even more complex problems using double precision arithmetic only; solving for very dilute concentrations of species (including polymers, biomolecules, ionic liquids, etc.) in particular phases is not where the challenges lie. Using suitable formulations of the Gibbs free energy expression (equation 23 in your article), solving for transformed quantities (natural logarithms or other transcendental functions of the mole numbers), and applying the calculus of chemical potentials at infinite dilution, they provide reliable solutions for low-concentration species in phases, including aspects such as the equilibrium of H2SO4 = H2O + SO3 in the gas phase.

As I mentioned earlier, the challenges in this field lie not in computing trace species concentrations, but in being able to determine and utilize more accurate characterizations of the Gibbs free energy of the phases. The article simplifies the problem greatly by assuming that gas-phase chemical potentials are given by partial pressures (thereby ignoring the highly non-linear relationship between mole fraction and fugacity coefficients) and seems to employ similar reductions for the concentration dependence of liquid-phase activities. If one considers the general applicability of the computational approach in the article, these aspects of the overall Gibbs free energy characterization of the system are of more concern than the "math approximations" made in solving for dilute species concentrations by other researchers, who strive harder for accurate thermodynamic characterization.

The idea of "all computations with the models are carried out in extended (quadruple) precision, which enables G to be calculated to a precision of greater than 30 digits" seems like using a sledgehammer to crack a nut. In fact, since the article is in a peer-reviewed journal, the reviewers (had they been astute enough) could justifiably have challenged it. Consider this: at room temperature and ambient pressure, the assumption that fugacity is equal to partial pressure for H2SO4 can carry more than a couple of percent error, since the fugacity coefficient of H2SO4 in air containing about 0.1 mole-% H2O is on the order of 0.97; it is not the unity that the assumption implies.

Nonetheless, if you continue to investigate quadruple precision arithmetic and report back on this forum, those with a genuine need for it might benefit.  It's unclear if it truly helps with aerosol models, at least in the manner as described in the article.

jimdempseyatthecove
Honored Contributor III

The purpose of using integer(8) was not speed-up for small quantizable numbers. Rather, it was for use with larger quantizable numbers whose products, on the left and right of the /, would fit within a 64-bit signed integer but would not necessarily fit within the real(8) 52-bit mantissa. This would amount to:

4 integer(8) operations to the left of the /
3 integer(8) operations to the right of the /
2 integer(8)-to-real(16) conversions
one real(16) / operation (providing the higher precision at the point required)
one real(16) + operation

Note: if the remainder of your program can use real(8), and if the values left and right of the / fit within the 52-bit mantissa, then it might not be advantageous to use integer(8), as the conversion time will be additional overhead.

Jim Dempsey 

Simon_C
Novice

To FortranFan: I will reply to you directly. I think, from what you have written, that there are still some misunderstandings that need clearing up. The method I use is certainly not the only one, but it's a good one: the problems to be solved are easy to formulate, and the method is robust and uses an easily available minimisation routine. Use of REAL*16 arithmetic is a small price to pay for these benefits and, I emphasise, it is *not* used in order to calculate thermodynamic quantities to some absurd level of precision. I will explain shortly.  
 

Everyone,

I have done some simple speed tests of different arithmetic operations on REAL*16 variables. The operations were run in loops, each repeated 10^9 times, and compiled with optimisation disabled (-Od) so that the compiler doesn't spot that the operation is simply repeated (and so hoist it out of the loop and execute it once). The 'cost' of running the loop itself appears to be about 10% of that of the addition operation; this cost was subtracted from all results but is really too small to make much of a difference.

If these tests are too simple and naive to be realistic, please let me know. Here they are:

----------------------------------------------------------------------------------
Relative time taken for different arithmetic operations
on two REAL*16 variables, scaled to unity for the addition
of the two variables. Intel Fortran compiler, v13.1.
----------------------------------------------------------------------------------

 Add      :  1.00
 Multiply :  1.82
 Log      :  5.31
 Exp      :  4.60
 Multiply by Zero (a):  0.76
 Multiply by Zero (b):  0.37
 Multiply by Zero (c):  0.78

(a) This is for the operation x3 = x1 * x2 where x1 is zero, and x2 is nonzero.

(b) This is for the operation:

  IF(X1 .NE. 0.E0) THEN
    x3 = x1 * x2
  ELSE
    x3 = 0.E0
  ENDIF

where x1 is zero, and x2 is nonzero.

(c) This is for the operation:

  IF(X1.NE.0.E0 .AND. X2.NE.0.E0) THEN
    x3 = x1 * x2
  ELSE
    x3 = 0.E0
  ENDIF

where x1 is zero and x2 is non-zero.

I conclude from the results for the last three cases that there is a potential time saving if only one of the two variables is tested for being non-zero, but not both.

----------------------------------------------------------------------------------
Relative time taken for different arithmetic operations
on two REAL*16 variables, scaled to unity for the addition
of the two variables. NAG Fortran compiler, v6
----------------------------------------------------------------------------------

 Add      :   1.00
 Multiply :   1.76
 Log      : 120.0
 Exp      :  93.7
 MultZero :   1.98

For items (a-c), which are not shown, the compiler seemed to be able to spot that I was repeating a single operation, despite optimisation being turned off, so scaling to the time taken for addition doesn't yield meaningful results.

------------------------------------------------------------------------------------
Comparison of absolute times taken by the two compilers
------------------------------------------------------------------------------------

                         NAG      Intel
 Add              :    0.6331     1.66
 Multiply         :    1.116      3.03
 Log              :   75.9        8.86
 Exp              :   59.3        7.68
 Multiply by zero :    1.25       1.27

So, under the conditions of my tests, the NAG "double double" approach does additions and multiplications faster, but there may be something weird going on with LOG and EXP. That seems a very long time to take.

Simon C.

jimdempseyatthecove
Honored Contributor III

Simon,

Thanks for the stats. However, I'd like to make this suggestion.

You produced results relative to Add of REAL*16 (and you did not include division, by the way).

I think it would be beneficial to compare:

Add16 relative to Add8
Multiply16 relative to Multiply8
...

And when comparing the performance of the two compilers, you must clearly state the precision differences if any. The two implementations are not necessarily equal in terms of precision.

Jim Dempsey

mecej4
Honored Contributor III

Simon, I second Jim's comments about using Add16 to scale the results. I note something really curious: according to your results, the NAG version takes more time to multiply by zero than by other numbers!

jimdempseyatthecove
Honored Contributor III

Also, for add, it wouldn't hurt to show a value added to itself (A = A + A).

And an add (SUM = SUM + A) where SUM is very large and A is very small (but where SUM isn't so large as to force A out of the precision). The reason for this is that an add requires shifting of the mantissa, and the amount of shift is proportional to the magnitude difference.

An additional metric to produce is the conversion time from REAL(8) to REAL(16), and from INTEGER(8) to REAL(16). Having these figures might come in handy in determining beforehand whether some tweaking might be beneficial (IOW, if using INT8 on both sides of the /, then converting to REAL16 before the division).

It might also be worth determining whether

   REAL16 / INT8

has been optimized by the Intel compiler writers. I would venture to guess that this would be low-hanging fruit for optimization (same as /2, /4, /8, ...).

Jim Dempsey

FortranFan
Honored Contributor II

jimdempseyatthecove wrote:

Simon,

Thanks for the stats. However, I'd like to make this suggestion.

You produced results relative to Add of REAL*16 (and you did not include division, by the way).

I think it would be beneficial to compare:

Add16 relative to Add8
Multiply16 relative to Multiply8
...

And when comparing the performance of the two compilers, you must clearly state the precision differences if any. The two implementations are not necessarily equal in terms of precision.

Jim Dempsey

I can recall even a simple addition operation in quadruple precision being significantly slower than in double precision not too long ago; the test below suggests the difference is an order of magnitude now, a huge penalty indicating that quadruple precision arithmetic should be employed very judiciously and selectively.

 Add: double precision
      Range:  307
  Precision:  15
   Trial  1
 CPU Time:    0.078 seconds.
   Trial  2
 CPU Time:    0.078 seconds.
   Trial  3
 CPU Time:    0.078 seconds.
   Trial  4
 CPU Time:    0.062 seconds.
   Trial  5
 CPU Time:    0.062 seconds.
 Average:     0.072 seconds.

 Check:
  first row: ((a+b)/c - 1) =  0.00000000000000
   last row: ((a+b)/c - 1) =  0.00000000000000

 Add: quadruple precision
      Range:  4931
  Precision:  33
   Trial  1
 CPU Time:    0.718 seconds.
   Trial  2
 CPU Time:    0.718 seconds.
   Trial  3
 CPU Time:    0.718 seconds.
   Trial  4
 CPU Time:    0.718 seconds.
   Trial  5
 CPU Time:    0.718 seconds.
 Average:     0.718 seconds.

 Check:
  first row: ((a+b)/c - 1) =  0.00000000000000000000000000000000
   last row: ((a+b)/c - 1) =  0.00000000000000000000000000000000
Press any key to continue . . .

 

Here's the code:

program p

   use, intrinsic :: iso_fortran_env, only : i4 => int32, dp => real64, qp => real128

   !..
   implicit none

   !..
   integer(i4), parameter :: n = 2**25
   integer(i4), parameter :: MAXREPEAT = 5
   integer(i4) :: counter
   integer(i4) :: istat
   real(dp), allocatable :: a_dp(:)
   real(dp), allocatable :: b_dp(:)
   real(dp), allocatable :: c_dp(:)
   real(qp), allocatable :: a_qp(:)
   real(qp), allocatable :: b_qp(:)
   real(qp), allocatable :: c_qp(:)
   real(dp) :: Start_Time = 0.0_dp
   real(dp) :: End_Time = 0.0_dp
   real(dp) :: CpuTimes_Add_dp(MAXREPEAT)
   real(dp) :: CpuTimes_Add_qp(MAXREPEAT)
   character(len=*), parameter :: FMT_CPU = "(A, T12, F8.3, A)"
   character(len=2048) :: ErrorAlloc

   !..
   call alloc_data()

   !..
   CpuTimes_Add_dp = 0.0_dp

   print *, "Add: double precision"
   print *, "     Range: ", range(a_dp(1))
   print *, " Precision: ", precision(a_dp(1))

   Loop_Add_dp: do Counter = 1, MAXREPEAT

      write(*,'(A,I2)') "   Trial ", Counter

      !.. Initialize the array using random numbers
      call random_number(a_dp)
      call random_number(b_dp)

      !..
      call cpu_time(Start_Time)

      !..
      c_dp(:) = a_dp(:) + b_dp(:)

      call cpu_time(End_Time)

      !..
      CpuTimes_Add_dp(Counter) = (End_Time - Start_Time)

      write(*, fmt=FMT_CPU) " CPU Time: ", CpuTimes_Add_dp(Counter), " seconds."

   end do Loop_Add_dp

   !..
   write(*, fmt=FMT_CPU) " Average: ",                                  &
      sum(CpuTimes_Add_dp)/real(MAXREPEAT, kind=DP),  " seconds."

   print *
   print *, "Check:"
   print *, " first row: ((a+b)/c - 1) = ", (a_dp(1)+b_dp(1))/c_dp(1) - 1.0_dp
   print *, "  last row: ((a+b)/c - 1) = ", (a_dp(n)+b_dp(n))/c_dp(n) - 1.0_dp
   print *

   !..
   CpuTimes_Add_qp = 0.0_dp

   print *, "Add: quadruple precision"
   print *, "     Range: ", range(a_qp(1))
   print *, " Precision: ", precision(a_qp(1))


   Loop_Add_qp: do Counter = 1, MAXREPEAT

      write(*,'(A,I2)') "   Trial ", Counter

      !.. Initialize the array using random numbers
      call random_number(a_qp)
      call random_number(b_qp)

      !..
      call cpu_time(Start_Time)

      !..
      c_qp(:) = a_qp(:) + b_qp(:)

      call cpu_time(End_Time)

      !..
      CpuTimes_Add_qp(Counter) = (End_Time - Start_Time)

      write(*, fmt=FMT_CPU) " CPU Time: ", CpuTimes_Add_qp(Counter), " seconds."

   end do Loop_Add_qp

   !..
   write(*, fmt=FMT_CPU) " Average: ",                                  &
      sum(CpuTimes_Add_qp)/real(MAXREPEAT, kind=DP),  " seconds."

   print *
   print *, "Check:"
   print *, " first row: ((a+b)/c - 1) = ", (a_qp(1)+b_qp(1))/c_qp(1) - 1.0_qp
   print *, "  last row: ((a+b)/c - 1) = ", (a_qp(n)+b_qp(n))/c_qp(n) - 1.0_qp

   !..
   call dealloc_data()

   !..
   stop

contains

   subroutine alloc_data()

      allocate(a_dp(n), SOURCE=0.0_dp, stat=Istat, errmsg=ErrorAlloc)
      if (Istat /= 0) then
         print *, " Allocation of a_dp failed: ", ErrorAlloc(1:len_trim(ErrorAlloc))
         stop
      end if

      allocate(b_dp(n), SOURCE=0.0_dp, stat=Istat, errmsg=ErrorAlloc)
      if (Istat /= 0) then
         print *, " Allocation of b_dp failed: ", ErrorAlloc(1:len_trim(ErrorAlloc))
         stop
      end if

      allocate(c_dp(n), SOURCE=0.0_dp, stat=Istat, errmsg=ErrorAlloc)
      if (Istat /= 0) then
         print *, " Allocation of c_dp failed: ", ErrorAlloc(1:len_trim(ErrorAlloc))
         stop
      end if

      allocate(a_qp(n), SOURCE=0.0_qp, stat=Istat, errmsg=ErrorAlloc)
      if (Istat /= 0) then
         print *, " Allocation of a_qp failed: ", ErrorAlloc(1:len_trim(ErrorAlloc))
         stop
      end if

      allocate(b_qp(n), SOURCE=0.0_qp, stat=Istat, errmsg=ErrorAlloc)
      if (Istat /= 0) then
         print *, " Allocation of b_qp failed: ", ErrorAlloc(1:len_trim(ErrorAlloc))
         stop
      end if

      allocate(c_qp(n), SOURCE=0.0_qp, stat=Istat, errmsg=ErrorAlloc)
      if (Istat /= 0) then
         print *, " Allocation of c_qp failed: ", ErrorAlloc(1:len_trim(ErrorAlloc))
         stop
      end if

      !..
      return

   end subroutine alloc_data

   subroutine dealloc_data()

      deallocate(a_dp, stat=Istat, errmsg=ErrorAlloc)
      if (Istat /= 0) then
         print *, " Deallocation of a_dp failed. ", ErrorAlloc(1:len_trim(ErrorAlloc))
      end if

      deallocate(b_dp, stat=Istat, errmsg=ErrorAlloc)
      if (Istat /= 0) then
         print *, " Deallocation of b_dp failed. ", ErrorAlloc(1:len_trim(ErrorAlloc))
      end if

      deallocate(c_dp, stat=Istat, errmsg=ErrorAlloc)
      if (Istat /= 0) then
         print *, " Deallocation of c_dp failed. ", ErrorAlloc(1:len_trim(ErrorAlloc))
      end if

      deallocate(a_qp, stat=Istat, errmsg=ErrorAlloc)
      if (Istat /= 0) then
         print *, " Deallocation of a_qp failed. ", ErrorAlloc(1:len_trim(ErrorAlloc))
      end if

      deallocate(b_qp, stat=Istat, errmsg=ErrorAlloc)
      if (Istat /= 0) then
         print *, " Deallocation of b_qp failed. ", ErrorAlloc(1:len_trim(ErrorAlloc))
      end if

      deallocate(c_qp, stat=Istat, errmsg=ErrorAlloc)
      if (Istat /= 0) then
         print *, " Deallocation of c_qp failed. ", ErrorAlloc(1:len_trim(ErrorAlloc))
      end if

      !..
      return

   end subroutine dealloc_data

end program p

 

TimP
Honored Contributor III

Compilers typically do consider the replacement a*2 => a+a. Intel compilers make optimizations like a/2. => a*.5 dependent on -no-prec-div.

The Harris 6000 series implemented some instructions for operations between real*6 and integer*3, but that evidently didn't ensure the survival of the platform.
