Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Speeding up execution of REAL*16 programs

Simon_C
Novice
1,453 Views

Hi Intel, Anyone,

I've been running an application in extended precision (REAL*16) for more years than I like to count and, finally, the time has come when I need it to execute faster. My impression is that all I can do is reduce, as far as possible, the number of extended precision multiply and divide operations. (This is what VTune tells me is what is taking the most time.) I can certainly do this, but is there more I should know?

Extended precision and its needs seem to be a neglected area of development. Two things caused me to write this post. (1) I received an email flyer for an Intel HPC Code Modernisation Workshop that offered, among other things, to "Learn how to modernize your legacy code or develop brand-new code in order to maximize software performance on current and future Intel Xeon and Xeon Phi processors". I asked if there was anything in the workshop relevant to extended precision programs. Answer: "No". (2) I found that multiplying two REAL*16 variables that are both equal to zero seemed to take a long time - perhaps as much as for non-zero values. That isn't what I would have guessed (not being a computer scientist), and it makes me wonder if I should test whether values are equal to zero *before* multiplying them (to avoid the cost of that operation in terms of time). This is just one possible question of many.
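To illustrate, here is a minimal sketch of the kind of timing test I have in mind (the loop count and values are arbitrary, and selected_real_kind(30) is assumed to map to REAL*16 on the compiler in use):

```fortran
program time_quad_multiply
  implicit none
  integer, parameter :: qp = selected_real_kind(30)  ! assumed to map to REAL*16
  integer, parameter :: n = 10000000
  integer(8) :: t0, t1, rate
  integer :: i
  real(qp) :: a, b, s

  a = 0.0_qp            ! change to a non-zero value to compare timings
  b = 1.234567890123456789_qp
  s = 0.0_qp
  call system_clock(t0, rate)
  do i = 1, n
     s = s + a*b        ! accumulating into s keeps the multiply "live"
  end do
  call system_clock(t1)
  print *, 'sum =', s, ' seconds =', real(t1 - t0)/real(rate)
end program time_quad_multiply
```

Run once with a = 0.0_qp and once with a non-zero a, and compare the reported times.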

If there is a good guide to optimisation of programs running in extended precision, please point me to it. It would be nice if someone at Intel would write that guide if it doesn't exist. Dr Fortran, perhaps? 

Thanks,

Simon C.

0 Kudos
32 Replies
Simon_C
Novice
439 Views

Jim, FortranFan,

I will do the suggested tests, and thanks also for those results above. I omitted to mention that my program has always been easily altered to run using either REAL*8 & INTEGER*4, or REAL*16 & INTEGER*8. (In the former case I have to make sure that the problems being solved don't require the extended precision, of course.) Looking back at old notes I see that the speed penalty of running with REAL*16 and INTEGER*8 numbers appears to be about an order of magnitude, perhaps a little worse. It will be interesting to see if that is consistent with what we see from testing individual operations.

I'll try and post again next week. I've got to do some work!

Simon C.

0 Kudos
TimP
Honored Contributor III
439 Views

FortranFan appears to compare vectorized double against non-vectorized quad precision, so the big performance difference isn't surprising.

In the dot product quoted earlier in the thread, vectorization ought to save 2 or 3 bits of precision in the double case, so the NAG double-double could give "only" 12 more bits of precision; less if the poorer underflow behavior comes into play.

0 Kudos
FortranFan
Honored Contributor II
439 Views

Tim Prince wrote:

FortranFan appears to compare vectorized double against non-vectorized quad precision, so the big performance difference isn't surprising.

..

No, I explicitly specified to disable vectorization: my compilation options included /Qvec- /O3 /heaparrays0 /standard-semantics.  So unless /O3 overrides /Qvec- (does it?), I figure vectorization did not come into play. 

0 Kudos
andrew_4619
Honored Contributor II
439 Views

I have no need for quad precision but I am finding this discussion interesting. Isn't comparing vectorised double against non-vectorised quad precision a reasonable thing to do, since the problems being discussed can have a lot of vectorisation (in double) if restructured a little, and this is part of the penalty of quad?

0 Kudos
jimdempseyatthecove
Honored Contributor III
439 Views

app4619,

You are correct with respect to arrays of quad precision. One needs the metrics on hand to judge which components to keep in real(8), and when, and whether the overhead of conversion from real(8) to real(16) is worth incurring. In other words: compute what you can using the lesser precision, then convert at the last moment.
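As a sketch of that idea (the variable names are illustrative, and selected_real_kind(15)/(30) are assumed to give REAL*8 and REAL*16 respectively):

```fortran
program mixed_precision_sketch
  implicit none
  integer, parameter :: dp = selected_real_kind(15)  ! REAL*8
  integer, parameter :: qp = selected_real_kind(30)  ! assumed REAL*16
  integer, parameter :: n = 1000
  real(dp) :: x(n), partial
  real(qp) :: total
  integer :: i

  x = [(1.0_dp/real(i, dp), i = 1, n)]
  partial = 0.0_dp
  do i = 1, n
     partial = partial + x(i)*x(i)   ! bulk of the work in cheap real(8)
  end do
  total = real(partial, qp)          ! convert at the last moment
  print *, total
end program mixed_precision_sketch
```

Whether this is acceptable depends, of course, on whether the intermediate quantities actually need the extra precision.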

Jim Dempsey

0 Kudos
Simon_C
Novice
439 Views

Everyone,

I have put all my comparisons into the attached file. There are results for 32 and 64 bit executables, using the Intel, NAG, and Lahey compilers. (The Lahey one is a few years old, and only produces 32 bit executables.) The calculations were done in a rather simple way, but I hope they are useful as a guide to code optimisation - i.e., making programs with REAL*16 real numbers go faster. That is going to be my use for them.

A few notes:

* For the Intel compiler (producing a 64 bit executable) addition, subtraction, and multiplication of two real numbers is about 7x slower for REAL*16 than REAL*8. (See section 3 of the table)

* Multiplication is preferred over division, which is much slower.

* When carrying out the arithmetic operations multiple times in a single statement, there is little increase in execution time for up to about 8 REAL*8 variables, but for REAL*16 variables the execution time goes up with each extra operation in the statement.

* Testing for zero before doing a multiplication (in which one of the variables being multiplied is zero) is likely to be useful. That is to say, the cost of the test is less than that of the multiplication. The NAG compiler seems to handle this best, then Intel, (Lahey not so well).

* 64 bit executables are faster than 32 bit for REAL*16 variables, but there is no appreciable difference for REAL*8.

* I didn't subtract the time taken to run the DO loops, and the timer, from the results. This number is given as "assign" in the file. Making this subtraction causes the speed differences between REAL*8 and REAL*16 to appear larger, especially for single operations.
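The "assign" baseline in the last note was measured along these lines (a sketch; the loop count is arbitrary, and it should be compiled at -O0, as in the tests above, or the loops may be removed):

```fortran
program assign_baseline
  implicit none
  integer, parameter :: qp = selected_real_kind(30)  ! assumed REAL*16
  integer, parameter :: n = 10000000
  integer(8) :: t0, t1, rate
  integer :: i
  real(qp) :: a, b, s
  real :: t_assign, t_mult

  a = 1.1_qp
  b = 2.2_qp

  call system_clock(t0, rate)          ! loop + assignment only
  do i = 1, n
     s = a
  end do
  call system_clock(t1)
  t_assign = real(t1 - t0)/real(rate)

  call system_clock(t0)                ! loop + assignment + multiply
  do i = 1, n
     s = a*b
  end do
  call system_clock(t1)
  t_mult = real(t1 - t0)/real(rate)

  print *, s                           ! use s so it is not optimised away
  print *, 'multiply alone ~', t_mult - t_assign, 'seconds'
end program assign_baseline
```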

Simon C.

0 Kudos
mecej4
Honored Contributor III
439 Views

I am puzzled by your use of -O0 or /Od, implying "do not optimize", when the whole purpose of the exercise is to compare performance.

What is the usefulness in comparing perhaps the worst performance timings for the compilers involved, possibly using obsolete X87 instructions on a modern CPU with SSE2 capabilities that may go unutilized because of the options used?

0 Kudos
FortranFan
Honored Contributor II
439 Views

Simon Clegg wrote:

Everyone,

I have put all my comparisons into the attached file. There are results for 32 and 64 bit executables, using Intel, NAG, and  Lahey compilers. ..

It'll be better if the following is considered:

  • As suggested by Jim, REAL*16 is non-standard, and its range and precision can vary significantly between compilers.  It is better to employ explicitly and clearly defined real kinds using SELECTED_REAL_KIND, or named constants from ISO_FORTRAN_ENV, and to report their precision and range.
  • As suggested by mecej4, the comparison studies should include optimization as done by Polyhedron.
  • As suggested by Tim Prince and app4619, vectorization and no vectorization cases should be evaluated.
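On the first point, a minimal sketch of declaring the kinds via ISO_FORTRAN_ENV and reporting their characteristics (real128 availability varies by compiler):

```fortran
program report_kinds
  use, intrinsic :: iso_fortran_env, only: real64, real128
  implicit none
  ! PRECISION and RANGE report the decimal precision and exponent range
  ! of each kind, so results can be compared across compilers.
  print *, 'real64  : precision =', precision(1.0_real64),  ' range =', range(1.0_real64)
  print *, 'real128 : precision =', precision(1.0_real128), ' range =', range(1.0_real128)
end program report_kinds
```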
0 Kudos
jimdempseyatthecove
Honored Contributor III
439 Views

Simon,

That is a great report. Thanks for taking the time to write the app, run the tests, and produce the report.

A surprise in the report is that, in relative terms, division is faster than subtraction (in absolute terms it is not). This might lead one to assume that internally fewer bits of the divisor are used. Or the choice of numbers used in the test program permitted fewer bits to be used in the divisor.

Jim Dempsey

0 Kudos
Simon_C
Novice
439 Views

In response to the point above "I am puzzled by your use of -O0 or /Od, implying "do not optimize", when the whole purpose of the exercise is to compare performance."

Two reasons for doing so, I think. First, the purpose of using -O0 is to understand the time taken by the basic arithmetic operations and nothing else. In fact my simple code would not even work without optimisation being switched off (because the assignment statement gets removed from the loop, and just executed once). Second, as has been pointed out earlier in the thread, many optimisations that can be used for double precision don't work for REAL*16. (Quote from one of the messages above: "As there is no vectorization nor much instruction level parallelism for real(16)...".) This isn't true of all the optimizations the compiler can do, certainly, as I have seen improvements in program speed going "up" the optimisation levels (-O0 -> -O1, and so on).
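If the tests are ever rerun at higher optimisation levels, one way around the removed-assignment problem is to make the result feed an accumulator and then use the final value (a sketch; whether a particular compiler still transforms the loop is not guaranteed):

```fortran
program keep_loop_live
  implicit none
  integer, parameter :: qp = selected_real_kind(30)  ! assumed REAL*16
  integer, parameter :: n = 10000000
  integer :: i
  real(qp) :: a, b, s

  a = 1.000000001_qp
  b = 0.999999999_qp
  s = 0.0_qp
  do i = 1, n
     s = s + a*b          ! the result feeds an accumulator ...
     a = a + 1.0e-20_qp   ! ... and an operand changes each iteration
  end do
  print *, s              ! using s stops dead-code elimination
end program keep_loop_live
```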

Simon C.

0 Kudos
jimdempseyatthecove
Honored Contributor III
439 Views

Simon,

I agree with you. It is important to know the scalar comparative performance, as it is then relatively easy to make an educated guess (an approximation) at the impact of vectorization with real(8) and real(4).

Jim Dempsey

0 Kudos
Simon_C
Novice
439 Views

Everyone,

To finish, I will describe what I did as a result of the speed tests above:

(1) The execution time of my program was dominated (roughly 75%) by the evaluation of an objective function that contained expressions that were mostly nested summations - loops within loops - with many multiply and divide operations. There were many repeated summations and code segments, as the original coding exactly reproduced a set of equations that were written for simplicity and not efficiency.

(2) Before recoding, I completely re-wrote the equations, mainly to eliminate repetition and thereby reduce the number of arithmetical operations.

(3) I then coded the new equations, paying attention to the results of the speed tests. This meant reducing the number of divide, log, and exp operations; and also testing for zero values. (Where there are zeros the calculation need not be done. In quad precision, multiplying two values one of which is zero takes just as long as multiplying two non-zero values.)

(4) Results: my re-written code was faster than the old code by a factor of 6, and the program faster by about a factor of 3. Success! And my re-written code is even easy to understand (not always the case with optimised code).
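The zero test in step (3) amounts to something like this (the names are illustrative, not from my actual code):

```fortran
! Guarded multiply inside a summation: skip the quad-precision work
! when a coefficient is exactly zero.
program guarded_sum
  implicit none
  integer, parameter :: qp = selected_real_kind(30)  ! assumed REAL*16
  integer, parameter :: m = 5
  real(qp) :: coef(m), term(m), acc
  integer :: j

  coef = [1.0_qp, 0.0_qp, 2.0_qp, 0.0_qp, 3.0_qp]
  term = [(real(j, qp), j = 1, m)]
  acc  = 0.0_qp
  do j = 1, m
     if (coef(j) /= 0.0_qp) acc = acc + coef(j)*term(j)
  end do
  print *, acc   ! 1*1 + 2*3 + 3*5 = 22
end program guarded_sum
```

The test pays off when zeros are common enough that the skipped quad operations outweigh the cost of the comparisons.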

Thank you all for your assistance.

Simon C.

0 Kudos