Edited subroutines and functions that ran very well in CVF to make them suitable for Intel Fortran, a later version of the software. Altered continuation lines to put the ampersand (&) at the end of the current line rather than at the start of the subsequent line. Working with REAL*16. Left IF blocks with .LE. in lieu of <=, for example. Debugging is enabled. An execution yields almost exactly the same result in the Intel version as was achieved in the CVF version. The problem is that it takes an ENORMOUS amount of time to run. Any ideas? Thanks in advance.
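For reference, the continuation change described above looks like this in free-form source (a minimal, self-contained sketch; the arithmetic is just a stand-in):

```fortran
! Minimal sketch of free-form continuation: the ampersand goes at the
! END of the line being continued, not at the start of the next line.
program continuation_demo
  implicit none
  real :: total
  total = 1.0 + 2.0 + &
          3.0 + 4.0
  print *, total
end program continuation_demo
```

Fixed-form CVF-era sources instead marked continuation with a character in column 6 of the following line, which is why a mechanical edit of this kind comes up when moving old code to free form.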
REAL(16) carries a huge overhead compared to REAL(8). Do you actually need REAL(16)? That is very unusual. It is hard to make any comments given the lack of information.
One aspect of editing a working program that merits suspicion is that a mistaken change may have altered the complexity of the main algorithm. The impaired algorithm still converges and gets home, but in a limping way.
For instance, if you have a quick-sort subroutine, an intended cosmetic change (moving comments, changing variable names, source format, etc.) may cause an unintended change to the algorithm. Instead of O(N lg N), the complexity becomes O(N²). For N = 10⁶, the sorting part of the program will run 50,000 times slower.
If you want a more specific answer, you will need to show the code or a simplified version of it.
The newer compiler does more runtime debugging checks. If you want performance, turn off the runtime checks.
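As a concrete illustration (flag spellings assumed from the Linux ifort command line; the Windows equivalents use the /check, /Od, /O2 forms, and project properties set these for you):

```shell
# Debug-style build: runtime checks enabled, optimizer off (slow)
ifort -O0 -g -check all -traceback myprog.f90 -o myprog_debug

# Release-style build: checks off, optimizer on (fast)
ifort -O2 myprog.f90 -o myprog_release
```

Bounds, pointer, and uninitialized-variable checks in particular can multiply the run time of tight numerical loops, so comparing timings between a Debug and a Release configuration is the first thing to try.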
The editing of the source that you describe should not have been required.
Yesterday, manipulating switches through the project properties, I decided to go back to REAL*8, as the program was set in CVF. Sure enough, the execution time was reduced dramatically: from 24 minutes to 26 seconds. Optimization further reduced the execution time to 15 seconds. Apparently, holding the increased accuracy requires further calculations due to hardware characteristics.
Guaglardi, Paul wrote:
Apparently, holding the increased accuracy requires further calculations due to hardware characteristics.
"Hardware characteristics" as in "It ain't there!". REAL*16 and COMPLEX*32 arithmetic has to be simulated in software, since almost no mainstream processors today have 128-bit floating-point arithmetic. Intel's software version of 128-bit floating point prioritizes precision over speed.
Don't use REAL*16 without a thorough assessment of whether it is needed and the performance hit that it entails.
Generally, one does not need REAL*16 throughout an entire program (if at all). Try restricting the use to where you absolutely require that precision.
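A sketch of that idea, assuming the compiler provides a 128-bit kind through SELECTED_REAL_KIND: keep storage and most arithmetic in double precision and promote only the sensitive step. Here a quad-precision accumulator rescues a cancellation-prone sum (the data and variable names are invented for illustration):

```fortran
program mixed_precision
  implicit none
  integer, parameter :: dp = selected_real_kind(15)   ! 64-bit, hardware
  integer, parameter :: qp = selected_real_kind(30)   ! 128-bit, software-emulated
  real(dp) :: x(4) = [1.0e16_dp, 1.0_dp, -1.0e16_dp, 1.0_dp]
  real(qp) :: acc
  integer  :: i

  ! Only the accumulation runs in quad precision; the array stays REAL(dp),
  ! so the emulation cost is confined to this one loop.
  acc = 0.0_qp
  do i = 1, size(x)
     acc = acc + real(x(i), qp)
  end do
  print *, real(acc, dp)   ! a pure REAL(dp) sum would lose the two 1.0 terms
end program mixed_precision
```

The same pattern applies to dot products, residual evaluations, and similar spots where double precision demonstrably loses digits; everything else can stay at hardware speed.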
Guaglardi, Paul wrote:
Optimization reduced the execution time to 15 seconds. Apparently, holding the increased accuracy requires further calculations due to hardware characteristics.
There is no hardware acceleration (CPU register based) for the REAL*16 primitive data type. When you choose this data type, all the relevant calculations will be emulated in software, probably operating on stack-located variables, and that means a lot of load/store operations throughout the whole program execution.
Steve Lionel (Ret.) wrote:
Or run the program under a profiler such as Intel VTune Amplifier XE and see where it is spending its time.
If REAL*16 is emulated in software, VTune will probably show a lot of time spent on load, store, and floating-point arithmetic operations.
I'm not sure how ifort will store 128-bit numbers. Will it use the lower part(s) of a YMM/ZMM register to load a single quad-precision number?
No, they are not treated as “double double values”. I don’t know what the internals of the Intel quad-precision library look like, but I have extensive experience with DEC’s and I very much doubt vector registers are used at all, nor floating point instructions.