Intel® Fortran Compiler

Runtime overhead of using CLASS and TYPE-BOUND PROCEDURE

ojacquet
Novice

1      Introduction

There are very few sources of information on whether object-oriented programming in the Fortran language induces significant performance degradation. The question is, however, often asked in forums, and the answers from the "most expert" users often consist of saying that object-oriented programming:

  • is not correctly mastered and implemented,
  • does not enormously penalize the calculations,
  • has many other benefits that compensate for performance degradation.

These answers are not really convincing for those who place a high value on computation time, especially developers of scientific computing programs. Moreover, these questions do not seem to arise with the C++ language. This raises the question of whether object-oriented programming in Fortran has been used extensively in real scientific computing codes, and whether it has been kept out of the parts of the code that are most critical for computation time.

In 2011, the user thomas_boehme created a topic on this forum: Runtime overhead of using CLASS dummy arguments. He asked for help with the performance degradation caused by the use of classes and type-bound procedures. The problem observed was demonstrated by a simple test program.

Based on this test program, two more comprehensive Fortran programs have been written.

The version of the Intel Fortran compiler used for these tests is as follows: ifort (IFORT) 2021.10.0 20230609

 

2      Conducting Tests

2.1     Fortran Test Program No. 1

2.1.1    Description

The program, given in file test1.f90, is based on the MyType type, defined with a single Val field of type real(8) and several attached procedures (type-bound procedures):

  • ClassArgAdd,
  • ClassArgEncapsTypeArgAdd,
  • ClassArgEncapsClassArgAdd,
  • ClassArgEncapsTbpClassArgAdd (only when the preprocessor group1 macro is greater than or equal to 4 or when the preprocessor calc macro is 112 or 212).

 

Different routines are defined, calculating the sum of two variables:

 

  • TypeArgAdd: has arguments A and B of type: type(MyType) and calculates the sum of the Val fields of the two variables:

A%Val = A%Val + B%Val

 

  • ClassArgAdd: has arguments A and B of type: class(MyType) and calculates the sum of the Val fields of the two variables:

A%Val = A%Val + B%Val

 

  • TypeArgEncapsTypeArgAdd: has arguments A and B of type: type(MyType) and calls the TypeArgAdd routine in the standard way:

CALL TypeArgAdd(A,B)

 

  • TypeArgEncapsClassArgAdd: has arguments A and B of type: type(MyType) and calls the ClassArgAdd routine in the standard way:

CALL ClassArgAdd(A,B)

 

  • TypeArgEncapsTbpClassArgAdd: has arguments A and B of type: type(MyType) and calls the ClassArgAdd routine as a type-bound procedure:

CALL A%ClassArgAdd(B)

 

  • ClassArgEncapsTypeArgAdd: has arguments A and B of type: class(MyType) and calls the TypeArgAdd routine in the standard way:

CALL TypeArgAdd(A,B)

 

  • ClassArgEncapsClassArgAdd: has arguments A and B of type: class(MyType) and calls the ClassArgAdd routine in the standard way:

CALL ClassArgAdd(A,B)

 

  • ClassArgEncapsTbpClassArgAdd: has arguments A and B of type: class(MyType) and calls the ClassArgAdd routine as a type-bound procedure:

CALL A%ClassArgAdd(B)

The latter routine is defined only when the preprocessor's group1 macro is greater than or equal to 2 or when the preprocessor's calc macro is 111, 112, 211, or 212.
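
For reference, here is a minimal sketch of what the type and routine definitions in test1.f90 might look like, reconstructed from the description above and from the excerpts quoted later in this thread (the actual file also carries preprocessor guards around the optional routine and binding):

MODULE MyModule
  IMPLICIT NONE

  TYPE :: MyType
    REAL(8) :: Val
  CONTAINS
    PROCEDURE :: ClassArgAdd
    PROCEDURE :: ClassArgEncapsTypeArgAdd
    PROCEDURE :: ClassArgEncapsClassArgAdd
    ! ClassArgEncapsTbpClassArgAdd is bound here only when group1 >= 4
    ! or when calc is 112 or 212
  END TYPE MyType

CONTAINS

  SUBROUTINE TypeArgAdd(A,B)
    TYPE (MyType) :: A
    TYPE (MyType) :: B
    A%Val = A%Val + B%Val
  END SUBROUTINE

  SUBROUTINE ClassArgAdd(A,B)
    CLASS (MyType) :: A
    CLASS (MyType) :: B
    A%Val = A%Val + B%Val
  END SUBROUTINE

  SUBROUTINE ClassArgEncapsTypeArgAdd(A,B)
    CLASS (MyType) :: A
    CLASS (MyType) :: B
    CALL TypeArgAdd(A,B)       ! standard call to the type-argument routine
  END SUBROUTINE

  SUBROUTINE ClassArgEncapsClassArgAdd(A,B)
    CLASS (MyType) :: A
    CLASS (MyType) :: B
    CALL ClassArgAdd(A,B)      ! standard call to the class-argument routine
  END SUBROUTINE

END MODULE MyModule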

 

26 theoretically equivalent elementary calculations are defined, each consisting of 10^10 sums performed as follows:

  • 100 and 200: direct summation of the Val fields of the two variables,
  • 101 and 201: call the TypeArgAdd routine,
  • 102 and 202: call the ClassArgAdd routine in the standard way,
  • 103 and 203: call the ClassArgAdd routine as a type-bound procedure,
  • 104 and 204: call the TypeArgEncapsTypeArgAdd routine,
  • 105 and 205: call the TypeArgEncapsClassArgAdd routine,
  • 106 and 206: call the TypeArgEncapsTbpClassArgAdd routine,
  • 107 and 207: call the ClassArgEncapsTypeArgAdd routine in the standard way,
  • 108 and 208: call the ClassArgEncapsTypeArgAdd routine as a type-bound procedure,
  • 109 and 209: call the ClassArgEncapsClassArgAdd routine in the standard way,
  • 110 and 210: call the ClassArgEncapsClassArgAdd routine as a type-bound procedure,
  • 111 and 211: call the ClassArgEncapsTbpClassArgAdd routine in the standard way,
  • 112 and 212: call the ClassArgEncapsTbpClassArgAdd routine as a type-bound procedure.

 

Elementary calculations 100 to 112 operate on two variables declared with the type: type(MyType).

Elementary calculations 200 to 212 operate on two variables declared with the type: class(MyType), allocatable, and allocated with the type: type(MyType).
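
As an illustration, an elementary calculation in the 200 range might be structured as follows. This is a sketch only: the variable names and output are assumptions, and CPU_TIME is used for timing as in the excerpts quoted later in the thread.

CLASS (MyType), ALLOCATABLE :: A2, B2
REAL(8) :: TStart, TEnd
INTEGER(8) :: I, NB

NB = 10_8**10
ALLOCATE (MyType :: A2)    ! declared class(MyType), allocated with type MyType
ALLOCATE (MyType :: B2)
A2%Val = 0
B2%Val = 1

CALL CPU_TIME(TStart)
DO I = 1,NB
  CALL A2%ClassArgAdd(B2)  ! elementary calculation 203: type-bound call
END DO
CALL CPU_TIME(TEnd)
WRITE (*,*) 'calc 203:', TEnd-TStart, A2%Val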

 

2.1.2    Elementary calculations

Initially, each elementary calculation is carried out separately.

 

To do this, the test1.f90 file is compiled:

  • by assigning the calc macro used by the preprocessor the id of the elementary calculation (from 100 to 112 and from 200 to 212): -fpp -Dcalc=id (an example command is shown after this list),
  • with each of the following four compilation option combinations:
    • -O2
    • -O2 -ip
    • -O2 -ipo
    • -O2 -ipo -ip
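
For example, building and running elementary calculation 101 with the third combination might look like this on Linux (the executable name is chosen here purely for illustration):

ifort -fpp -Dcalc=101 -O2 -ipo test1.f90 -o test1_101
./test1_101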

 

The CPU times of each elementary calculation are given in Table 1 in an arbitrary time unit: the computation times were divided by the lowest time obtained.

 

It appears that:

  • The vast majority of elementary calculations, regardless of the compilation option used, have a performance equivalent to the most direct calculation.
  • Compiling with the -O2 option alone gives the worst performance on some elementary calculations.
  • The -ip and -ipo compilation options, combined with the -O2 option, significantly improve performance, the -ipo option being the more advantageous of the two. The combination of the -ip and -ipo options does not provide any improvement over the -ipo option.
  • Calculations 112 and 212, both of which call the ClassArgEncapsTbpClassArgAdd routine as a type-bound procedure, are significantly slower than the others. This routine itself calls the ClassArgAdd routine as a type-bound procedure.

Table 1 : Test No. 1. Computational time of separate elementary calculations (arbitrary unit)

 

ojacquet_5-1698773310348.png

 

2.1.3    Grouped calculations 1

In a second step, the elementary calculations are carried out in a grouped manner.

 

To do this, the test1.f90 file is compiled:

  • by setting the group1 macro used by the preprocessor to a value from 1 to 5; as the value increases, more calculations are included:
    • -fpp -Dgroup1=1: elementary calculations 100 to 110 and 200 to 210,
    • -fpp -Dgroup1=2: elementary calculations 100 to 111 and 200 to 210 – addition of calculation 111 and of the definition of the ClassArgEncapsTbpClassArgAdd routine,
    • -fpp -Dgroup1=3: elementary calculations 100 to 111 and 200 to 211 – addition of calculation 211,
    • -fpp -Dgroup1=4: elementary calculations 100 to 111 and 200 to 211 – binding of the ClassArgEncapsTbpClassArgAdd routine to the MyType type,
    • -fpp -Dgroup1=5: elementary calculations 100 to 112 and 200 to 212 – addition of calculations 112 and 212.
  • with each of the following four compilation option combinations:
    • -O2
    • -O2 -ip
    • -O2 -ipo
    • -O2 -ipo -ip

 

For the 5 * 4 = 20 grouped calculations, the CPU times of each elementary calculation are given in Table 2 in an arbitrary time unit: the computation times were divided by the lowest time obtained.

 

It appears that:

 

  • The performance of elementary calculations, when grouped together in a single calculation, can be degraded compared to the separate calculations.

 

  • With the -O2 compilation option, the degradation due to the grouping of calculations affects all but the most direct calculations (elementary calculations 100 and 200).

 

  • With the -O2 and -ip compilation options combined, the degradation due to the grouping of calculations occurs from group1=3 onward (i.e. when elementary calculation 211 is added) and concerns elementary calculations 201 to 210 (with variables declared class(MyType)), whose computation time increases from 1 to between 1.6 and 2.3.

 

  • With the -O2 and -ipo compilation options combined, the degradation due to the grouping of calculations occurs from group1=4 onward (i.e. when the ClassArgEncapsTbpClassArgAdd routine is bound to the MyType type) and involves:
    • elementary calculation 111 (with variables declared type(MyType)), whose computation time increases from 1 to 7.5,
    • elementary calculations 201 to 211 (with variables declared class(MyType)), whose computation time increases from 1 to between 1.6 and 2.4.

 

In summary, the facts that are of concern are the following:

  • Adding some calculations may degrade the performance of other calculations.
  • Attaching a procedure to a type can degrade the performance of computations that use that type but do not use that procedure.

 

Table 2 : Test No. 1. Computational time of grouped elementary calculations 1 (arbitrary unit)

 

ojacquet_6-1698773492646.png

 

2.1.4    Grouped calculations 2

In a third step, the elementary calculations are again performed in a grouped manner, but only with the -O2 compilation option.

 

To do this, the test1.f90 file is compiled:

  • by assigning the group2 macro used by the preprocessor a value from 1 to 6, each value selecting a different set of calculations in a specific order:
    • -fpp -Dgroup2=1: elementary calculations 100, 102 to 105, 109, 110, 200, 202 to 205, 209 and 210,
    • -fpp -Dgroup2=2: elementary calculations 100 to 105, 109, 110, 200 to 205, 209 and 210 – addition of calculations 101 and 201 to calculation 1,
    • -fpp -Dgroup2=3: elementary calculations 100, 102 to 106, 109, 110, 200, 202 to 206, 209 and 210 – addition of calculations 106 and 206 to calculation 1,
    • -fpp -Dgroup2=4: elementary calculations 100, 102 to 104, 109, 110, 105, 106, 200, 202 to 204, 209, 210, 205 and 206 – calculations 105, 106, 205 and 206 moved relative to calculation 2,
    • -fpp -Dgroup2=5: elementary calculations 100 to 104, 109, 110, 105, 106, 200, 202 to 204, 209, 210, 205 and 206 – addition of calculation 101 to calculation 4,
    • -fpp -Dgroup2=6: elementary calculations 100 to 104, 109, 110, 105, 106, 200 to 204, 209, 210, 205 and 206 – addition of calculation 201 to calculation 5,
  • only with the -O2 compilation option.

 

For the 6 grouped calculations, the CPU times of each elementary calculation are given in Table 3 in an arbitrary time unit: the computation times were divided by the lowest time obtained.

 

It appears that:

 

  • Calculation 2: Adding calculations 101 and 201 to calculation 1 results in an increase in the time of calculations 105 and 205.

 

  • Calculation 3: Adding calculations 106 and 206 to calculation 1 also results in an increase in the time of calculations 105 and 205.

 

  • Calculation 4: Moving calculations 105 and 106 after calculation 110, and calculations 205 and 206 after calculation 210, results in a decrease in the time of calculations 106 and 206.

 

  • Calculation 5: Adding calculation 101 to calculation 4 does not result in any changes.

 

  • Calculation 6: The addition of calculation 201 to calculation 5 results in a very large increase in the time of calculations 109 and 110 and a significant increase in the time of calculation 209.

 

 

In summary, the facts that are of concern are the following:

  • Adding some calculations may degrade the performance of other calculations.
  • Changing the order of some calculations can degrade or improve the performance of other calculations.

 

Table 3 : Test No. 1. Computational time of grouped elementary calculations 2 (arbitrary unit)

 

ojacquet_7-1698773562811.png

 

2.2     Fortran Test Program No. 2

2.2.1    Description

The program, given in file test2.f90, aims to test a situation closer to the cases encountered in practice: a first type with attached procedures is used to define a field in a second type, whose own attached procedures call the procedures bound to the first type.

 

As in test program 1, the first type is the MyType type, defined with a single Val field of type real(8) and two attached procedures (type-bound procedures):

  • ClassArgAdd,
  • ClassArgEncapsClassArgAdd.

 

Five wrapper types are defined, each containing a single Data field:

  • WrapTType, whose Data field is type: type(MyType),
  • WrapTPType, whose Data field is of type: type(MyType), pointer,
  • WrapTAType, whose Data field is of type: type(MyType), allocatable,
  • WrapCPType, whose Data field is of type: class(MyType), pointer,
  • WrapCAType, whose Data field is of type: class(MyType), allocatable.

 

In what follows, the letter X stands for T, TP, TA, CP or CA in the generic names of types and procedures.

 

Each of these 5 wrapper types WrapXType has 5 type-bound procedures (a declaration sketch is given after the list):

  • WrapDirectAdd pointing to the WrapXDirectAdd routine,
  • WrapClassArgAdd pointing to the WrapXClassArgAdd routine,
  • WrapTbpClassArgAdd pointing to the WrapXTbpClassArgAdd routine (only when the preprocessor's wraptbp macro is set),
  • WrapClassArgEncapsClassArgAdd pointing to the WrapXClassArgEncapsClassArgAdd routine,
  • WrapTbpClassArgEncapsClassArgAdd pointing to the WrapXTbpClassArgEncapsClassArgAdd routine (only when the preprocessor's wraptbp macro is set).
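
As a declaration sketch, the class(MyType), allocatable variant might look as follows, assuming generic bindings that point to specific WrapCA* routines (the binding names and fpp guards are reconstructed from the description, not copied from test2.f90):

TYPE :: WrapCAType
  CLASS (MyType), ALLOCATABLE :: Data
CONTAINS
  PROCEDURE :: WrapDirectAdd => WrapCADirectAdd
  PROCEDURE :: WrapClassArgAdd => WrapCAClassArgAdd
  PROCEDURE :: WrapClassArgEncapsClassArgAdd => WrapCAClassArgEncapsClassArgAdd
#ifdef wraptbp
  PROCEDURE :: WrapTbpClassArgAdd => WrapCATbpClassArgAdd
  PROCEDURE :: WrapTbpClassArgEncapsClassArgAdd => WrapCATbpClassArgEncapsClassArgAdd
#endif
END TYPE WrapCAType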

 

Different routines are defined, calculating the sum of two variables:

 

  • ClassArgAdd: has arguments A and B of type: class(MyType) and calculates the sum of the Val fields of the two variables:

A%Val = A%Val + B%Val

 

  • ClassArgEncapsClassArgAdd: has arguments A and B of type: class(MyType) and calls the ClassArgAdd routine:
    • as a type-bound procedure:

CALL A%ClassArgAdd(B)

    • or in the standard way:

CALL ClassArgAdd(A,B)

depending on whether the preprocessor's encapstbp macro is set or not (a sketch of this switch is given after the list).

 

  • WrapXDirectAdd: has arguments A and B of type: class(WrapXType) and calculates the sum of the Val fields of the Data fields of the two variables:

A%Data%Val = A%Data%Val + B%Data%Val

 

  • WrapXClassArgAdd has arguments A and B of type: class(WrapXType) and calls the ClassArgAdd routine in the standard way:

CALL ClassArgAdd(A%Data,B%Data)

 

  • WrapXTbpClassArgAdd has arguments A and B of type: class(WrapXType) and calls the ClassArgAdd routine as a type-bound procedure:

CALL A%Data%ClassArgAdd(B%Data)

 

  • WrapXClassArgEncapsClassArgAdd has arguments A and B of type: class(WrapXType) and calls the ClassArgEncapsClassArgAdd routine in the standard way:

CALL ClassArgEncapsClassArgAdd(A%Data,B%Data)

 

  • WrapXTbpClassArgEncapsClassArgAdd has arguments A and B of type: class(WrapXType) and calls the ClassArgEncapsClassArgAdd routine as a type-bound procedure:

CALL A%Data%ClassArgEncapsClassArgAdd(B%Data)
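
A sketch of how the encapstbp switch in ClassArgEncapsClassArgAdd might be written with fpp (the exact guard in test2.f90 may differ):

SUBROUTINE ClassArgEncapsClassArgAdd(A,B)
  CLASS (MyType) :: A
  CLASS (MyType) :: B
#ifdef encapstbp
  CALL A%ClassArgAdd(B)      ! type-bound call
#else
  CALL ClassArgAdd(A,B)      ! standard call
#endif
END SUBROUTINE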

 

 

125 theoretically equivalent elementary calculations are defined, each consisting of 10^9 sums performed as follows:

  • xy1: call the WrapDirectAdd routine as a type-bound procedure,
  • xy2: call the WrapClassArgAdd routine as a type-bound procedure,
  • xy3: call the WrapTbpClassArgAdd routine as a type-bound procedure,
  • xy4: call the WrapClassArgEncapsClassArgAdd routine as a type-bound procedure,
  • xy5: call the WrapTbpClassArgEncapsClassArgAdd routine as a type-bound procedure.

 

Where:

  • x takes the values 1 to 5:
    • 1: the type tested is WrapXType = WrapTType,
    • 2: the type tested is WrapXType = WrapTPType,
    • 3: the type tested is WrapXType = WrapTAType,
    • 4: the type tested is WrapXType = WrapCPType,
    • 5: the type tested is WrapXType = WrapCAType.
  • y takes the values 1 to 5:
    • 1: the variables used are of type: type(WrapXType),
    • 2: the variables used are of type: type(WrapXType), pointer,
    • 3: the variables used are of type: type(WrapXType), allocatable,
    • 4: the variables used are of type: class(WrapXType), pointer,
    • 5: the variables used are of type: class(WrapXType), allocatable.
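
For example, calculation 55z would operate on variables declared and allocated roughly as follows (a sketch; the allocation statements are assumptions based on the description):

CLASS (WrapCAType), ALLOCATABLE :: A, B   ! y = 5

ALLOCATE (WrapCAType :: A)                ! x = 5
ALLOCATE (MyType :: A%Data)               ! Data is itself class(MyType), allocatable
! B is set up the same way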

 

2.2.2    Elementary calculations

Initially, each elementary calculation is carried out separately.

 

To do this, the test2.f90 file is compiled:

  • by assigning the calc macro used by the preprocessor the identifier of the elementary calculation (xyz, where x, y and z vary from 1 to 5): -fpp -Dcalc=id
  • by defining or not defining the encapstbp macro used by the preprocessor, which modifies the definition of the ClassArgEncapsClassArgAdd routine (calling the ClassArgAdd routine as a type-bound procedure or in the standard way),
  • with only one combination of compilation options: -O2 -ipo -ip

 

The CPU times of each elementary calculation are given in Table 4 in an arbitrary time unit: the computation times were divided by the lowest time obtained.

 

It appears that:

  • the vast majority of elementary calculations have a performance equivalent to the most direct calculation,
  • calculations involving Data fields declared class (pointer or allocatable) have seriously degraded performance (times from 10.5 to 13.5) when routines are used that call other routines as type-bound procedures,
  • calculations involving Data fields declared type have degraded performance (times of the order of 1.6) when the variables used are themselves declared type (calculations 112 to 115),
  • calculations based on the ClassArgEncapsClassArgAdd routine have very degraded performance (times of the order of 7.5 for x=1,2,3) or degraded performance (times of the order of 2.3 for x=4,5) when it calls the ClassArgAdd routine as a type-bound procedure (calculations xy4 and xy5 with encapstbp=1).

 

Table 4 : Test No. 2. Computational time of separate elementary calculations (arbitrary unit)

 

ojacquet_8-1698773784382.png

 

2.2.3    Grouped calculations

In a second step, the elementary calculations are carried out in a grouped manner.

 

To do this, the test2.f90 file is compiled:

  • by defining or not defining the encapstbp macro used by the preprocessor, which modifies the definition of the ClassArgEncapsClassArgAdd routine (calling the ClassArgAdd routine as a type-bound procedure or in the standard way),
  • by defining or not defining the wraptbp macro used by the preprocessor (which keeps or removes the definition of the WrapXTbpClassArgAdd and WrapXTbpClassArgEncapsClassArgAdd routines, as well as the calculations that call them; the particularity of these routines is that they call the ClassArgAdd and ClassArgEncapsClassArgAdd routines, respectively, as type-bound procedures),
  • with only one combination of compilation options: -O2 -ipo -ip

 

For the 4 grouped calculations, the CPU times of each elementary calculation are given in Table 5 in an arbitrary time unit: the computation times were divided by the lowest time obtained.

 

It appears that:

 

  • The performance of elementary calculations, when grouped together in the same calculation, can be degraded or even very degraded compared to the separate calculations.

 

  • Depending on the type of the variables used and the type of the field, the behaviors are very diverse; this complexity is problematic, at times even inconsistent, and therefore suspicious.

 

  • The use of type-bound procedures within routines that are themselves called type-bound procedures is very penalizing and can even impact the performance of calculations that do not use these type-bound procedures.

 

  • Using type-bound procedures for a field within a type-bound procedure is very penalizing. This seems contradictory to the philosophy of object-oriented programming.

 

Table 5 : Test No. 2. Computational time of grouped elementary calculations (arbitrary unit)

 

ojacquet_9-1698773840910.png

 

3      Conclusion

At the conclusion of the tests carried out, the unresolved questions are:

  • Is it acceptable for the performance of one part of a program to be modified by the addition or relocation of another part that a priori has no direct link with the first (the only link being that they use the same type)?
  • What is the benefit of a type-bound procedure if performance is seriously degraded when it is used inside another type-bound procedure? Isn't this at odds with the philosophy of object-oriented programming?
  • Are all these behaviors consistent with the standard?
  • Are they due to an anomaly or a design flaw in the compiler? Can the compiler be improved?

 

 

JohnNichols
Valued Contributor III

You referred to an old post on this topic.

If I look at the code from the old post, then:

SUBROUTINE Add(this, Original)
CLASS (MyType) :: this
CLASS (MyType) :: Original
  this%Val = this%Val + Original%Val 
END SUBROUTINE

SUBROUTINE AddDirect(this, Original)
TYPE (MyTypeExt) :: this
TYPE (MyTypeExt) :: Original
  this%Val = this%Val + Original%Val 
END SUBROUTINE

Screenshot 2023-10-31 155714.png

All of the difference comes down to these two routines, with AddDirect about 5 times faster. Why you would use a class is beyond my skill set, but just use types. Look at the assembler to see why there is a difference.

JohnNichols
Valued Contributor III

If I run your code in 64-bit debug on a Dell Precision Core i7 with a lot of memory, I get

Screenshot 2023-10-31 161519.png

Screenshot 2023-10-31 161549.png

Screenshot 2023-10-31 161620.png

and then in 64 bit release

Screenshot 2023-10-31 161707.png

a second time

Screenshot 2023-10-31 161737.png

a third time

Screenshot 2023-10-31 161904.png

then 32 bit release.

 

Screenshot 2023-10-31 161928.png

 

With statistics you need to set the context: so my father's mother runs slower than my mother, one is 95 and one is 56.

If I say my grandmother runs slower than my mother, there is better context.

You have a 20% COV in the code; the debug build slows it up a lot. The slowness of the B%AA_type is the only thing worth looking at, but it is too hard at this hour.

 

JohnNichols
Valued Contributor III

My apologies for the abruptness at the end of the last post.

I read a lot of papers on a lot of topics, but I found the presentation very hard to follow. And the speed tests need means, standard deviations and your test limitations to really make sense of them, basic physics 101.

Anyone who reviews this sort of stuff is going to do basic tests and then, in an anonymous review, tear apart the paper. Trust me, it has happened to me many times, and I have sometimes returned the favour.

How did you do the tests is the first question.  

 

 

ojacquet
Novice

I will give some clarifications that are intended to help reproduce the results of tables 1 to 5 of the first post.

Next, we need to meditate on the values contained in these tables, which ideally should all be equal to 1.

The fact that this is not the case should be a cause for concern.

If an expert can explain all these differences, so much the better!

In any case, it is very disturbing and troublesome for a non-expert, who wants the concepts to be simple enough and the behavior of a program to be predictable.

So the question is whether all of these behaviors conform to the standard or whether they reveal anomalies in the compiler, with classes and type-bound procedures.

That said, some behaviors don't really seem to be consistent, including:

  • When elementary calculations are performed in a grouped manner (i.e. in the same program) the performance of elementary calculations may be degraded.
  • This can even be the case when changing the order in which the calculations are performed within the program.

 

The tests.tar.gz attachment file contains:

 

  • The test1.f90 file corresponding to the first test program

 

  • The compile_run_test1_sep.sh file to perform the basic calculations contained in the test1.f90 file separately:
    • with different combinations of compiler optimization options
  • The results files produced by running the compile_run_test1_sep.sh script (the exploitation of these files allows the construction of table 1 of the first post)

 

  • The compile_run_test1_group1.sh file to perform the basic calculations contained in the test1.f90 file in a grouped manner:
    • with different combinations of compiler optimization options
    • with different values of the group1 macro for the preprocessor (corresponding to different groupings of elementary calculations)
  • The results files produced by running the compile_run_test1_group1.sh script (the exploitation of these files allows the construction of table 2 of the first post)

 

  • The compile_run_test1_group2.sh file to perform the basic calculations contained in the test1.f90 file in a grouped manner:
    • with a single compiler optimization option
    • with different values of the group2 macro for the preprocessor (corresponding to different groupings of elementary calculations or even to a different order)
  • The results files produced by running the compile_run_test1_group2.sh script (the exploitation of these files allows the construction of table 3 of the first post)

 

  • The test2.f90 file corresponding to the second test program

 

  • The compile_run_test2_sep.sh file to perform the basic calculations contained in the test2.f90 file separately:
    • with a single combination of compiler optimization options
  • the results files produced by running the compile_run_test2_sep.sh script (the exploitation of these files allows the construction of table 4 of the first post)

 

  • The compile_run_test2_group.sh file to perform the basic calculations contained in the test2.f90 file in a grouped manner:
    • with a single combination of compiler optimization options
    • with different values of the encapstbp and wraptbp macro for the preprocessor
  • the results files produced by running the compile_run_test2_group.sh script (the exploitation of these files allows the construction of table 5 of the first post)

 

I work in a Linux environment, so the build and launch files are only valid for that environment. Either way, they explain how the calculations were made.

 

The result files give the CPU computation times of the elementary calculations.

These are not the times that are shown in tables 1 to 5 of the first post because the times have been divided by the minimum calculation time.

A value of 1 corresponds to the fastest calculations. The higher the value, the slower the calculation.

My goal is to draw attention to calculations that have values greater than 1.5 up to almost 20...

However, all these calculations should theoretically take the same amount of time!

I would like to point out that, given the large performance differences observed, there is no need for statistics (i.e. for reproducing these calculations several times).

 

Thank you very much for your attention.

 

Yours sincerely,

Olivier

JohnNichols
Valued Contributor III

One is taught the correct way to do things from a statistical viewpoint and one always does it that way, because the influential stats professor that taught you is still a God in your eyes and it is the right thing to do, cutting corners means you make mistakes somewhere where it is important.  

Your calculation lives inside a system, and the system affects the results, so expecting two things that look the same to take the same time in a computer system is just unrealistic. I take measurements day in and day out, and we record as a matter of course the time it takes to do each calculation; if it is beyond a certain level it means we missed a record: if you have 99 records and record 54 takes twice as long as the others, then 54 is missing and it is really 55 or 56. Then you can do the stats; we look for changes that are measured in parts per billion.

Expecting the same performance from "similar" code is not allowing for the people who wrote the compiler, who do neat things, who spill their coffee and miss something, who pat the dog and put in a wasted step.  

I understand what you are trying to do, but there is too much explanation; make it as simple as one paragraph stating the problem plus the shortest code, and then we can look at it together.

We are all busy people; the fact you got responses means some people have shown an interest, but they may not have a lot of time. KISS is a good principle, she always serves you well.

JohnNichols
Valued Contributor III

Screenshot 2023-11-09 100829.png

This is the first Fortran program with the Windows solution, run with IFX in release mode. Your ifdefs do not work in this compiler. Now I can see the results as a list; what are the problems, i.e. why is 202 faster than 208?

jimdempseyatthecove
Honored Contributor III

I haven't looked at the code; one of the causes of 202 vs 208 is that when register pressure is exceeded during compiler optimization, intermediary address calculations are lost (iow must be recalculated).

If (when) you are on a newer CPU and taking advantage of it (/QxHost, or another option enabling AVX512), then consider

 

/Qopt-zmm-usage:<keyword>
    Specifies the level of zmm register usage. You can specify one of
    the following:
        low  - Tells the compiler that the compiled program is unlikely to
               benefit from zmm register usage. It specifies that the
               compiler should avoid using zmm registers unless it can
               prove the gain from their usage.
        high - Tells the compiler to generate zmm code without restrictions.

 

I believe the default is low

 

*** Note: do not blindly set this, as in some cases it results in additional overhead from saving and restoring registers. If it does improve performance for this example, then consider using the feature on a source-file by source-file basis (iow set the property only on fileName.f90 and not on the Project).

John, you can give this a try on your setup.

 

Jim Dempsey

JohnNichols
Valued Contributor III

First option - new data on the right

Screenshot 2023-11-09 135815.png

 

Second set, new data on the right

Screenshot 2023-11-09 140047.png

Some minor changes only. 

jimdempseyatthecove
Honored Contributor III

Don't know; I would have to dissect the disassembly.

 

It would appear (I am guessing) that the compiler optimization isn't efficient enough to catch an opportunity to perform the class identification at compile time.

 

Adding Inter-procedural Optimizations (multi-file) "may" help, but only if the relevant levels of functions can be inlined (such that no other class may enter that code section).

 

Jim Dempsey

JohnNichols
Valued Contributor III

There are three or four procedures that have the same speed, suggesting they generate the minimal assembler - ClassArgAdd:

 

SUBROUTINE ClassArgAdd(A,B)
  CLASS (MyType) :: A
  CLASS (MyType) :: B
  A%Val = A%Val + B%Val
END SUBROUTINE

 

this one is the worst; no wonder, it is multiple levels deep. Who said syntactic sugar is good?

 

 A2%Val = 0
 CALL CPU_TIME(TStart212)
 DO I=1,NB
   CALL A2%ClassArgEncapsTbpClassArgAdd(B2)
 END DO
 CALL CPU_TIME(TEnd212)
 WRITE (*,20) A2%Val


#if (group1 >= 2 || calc == 111 || calc == 112 || calc == 211 || calc == 212)
SUBROUTINE ClassArgEncapsTbpClassArgAdd(A,B)
  CLASS (MyType) :: A
  CLASS (MyType) :: B
  CALL A%ClassArgAdd(B)
END SUBROUTINE


SUBROUTINE ClassArgAdd(A,B)
  CLASS (MyType) :: A
  CLASS (MyType) :: B
  A%Val = A%Val + B%Val
END SUBROUTINE

 

this is the worst by a factor of 5, with calls at the 4th line, then the 11th, then the 14th, then the 18th and then the 21st - you have your five levels of depth.

andrew_4619
Honored Contributor III

"who said Syntactic Sugar is good" I don't think dismissing OOP as Syntactic Sugar is at all fair comment. It is clear that there is  more work because  there is more data management under the hood than using plain old data (POD). It is down to the compiler vendors to keep improving optimisation but most of time the list of things that are most important about my code has speed at the bottom of the pile. Speed only moves up the pile when there is some specific big problem  in that area and after the more important things such as function,  robustness and clarity have been met.

jimdempseyatthecove
Honored Contributor III

I ran your test program with group1=5 and group2=5, and NB set to your NB*10 (to run 10x longer),

and with two different optimizations, one without /Qipo and one with /Qipo.

The one without is on top:

jimdempseyatthecove_1-1699629824929.png

 

/Qipo helps a lot, except in 111 and 112 with the type variables,

and improves or degrades with the class variables (the degradations outweigh the improvements)??

Curiously, 111 and 112 with the type variables are slower than 111 and 112 with the class variables, both with and without /Qipo; I am not sure how to explain that.

 

This might be a good example for the compiler team to use for optimizing class/type procedures.

 

Jim Dempsey

 

 

JohnNichols
Valued Contributor III

Andrew:

I was not trying to annoy anyone, merely looking at the complexity of the different structures for doing what is essentially the same thing. In LISP you learn simplicity and brevity. LISP has no speed, and Python has no speed for the sorts of stuff we do.

Some of us can go for elegance and some of us are stuck with simplicity and speed. I do a lot of engineering where the turnaround time on the Fortran code is a weekend, so whilst I am mindful of beauty, speed and simplicity are my curse words. The people on this site have taught me a lot: @mecej4 teaches you how to keep it simple and modular without waste. A human calculator of few words, always worth reading.

Jim shows the deep thinking side, interesting and fun, but when you have to solve a Monte Carlo analysis problem of some depth, you all provide helpful answers, I assure you I do not get elsewhere.  

If you wonder whether I annoy a lot of people, the answer is no, mainly because I keep to myself, although my 16-year-old daughter called me a loser this morning as I dropped her at school; it is her way of saying Dad, I love you, but shut up.

Anyway, it is all just part of the human experience.  

John

 

JohnNichols
Valued Contributor III

Jim:

What is the algorithmic difference between the slowest and the quickest in real terms?

John

jimdempseyatthecove
Honored Contributor III

I assume/presume class procedures are polymorphic; type procedures are not.

A polymorphic procedure needs to disambiguate the incoming class objects. Compiler optimization may or may not be able to perform the disambiguation at compile time. If it can, then the result ought to be equivalent to a type procedure.

When a class or type nests into (calls) another class procedure, see preceding paragraph.

From your test program, it is apparent that, if (when) at all possible, you should use type procedures as opposed to class procedures.

 

NOTE: pre-class, Fortran supported, and still does, generic interfaces (a sketch follows). While you may need to write a little more code when you include/expand additional types, IMHO there is little difference in the amount of work. For a generic interface, it is mostly copy, paste and replace. For a class, you insert a class is (...) section in a select type construct and copy and paste. In both cases, the code that uses the class/(generic) type procedures is the same.
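
For readers unfamiliar with this pre-class mechanism, a generic interface resolves the call at compile time from the declared argument types. A minimal sketch, not taken from the test programs:

MODULE AddModule
  IMPLICIT NONE

  TYPE :: MyType
    REAL(8) :: Val
  END TYPE

  TYPE :: MyOtherType
    REAL(8) :: Val
  END TYPE

  INTERFACE Add                 ! generic name, resolved at compile time
    MODULE PROCEDURE AddMyType, AddMyOtherType
  END INTERFACE

CONTAINS

  SUBROUTINE AddMyType(A,B)
    TYPE (MyType) :: A, B
    A%Val = A%Val + B%Val
  END SUBROUTINE

  SUBROUTINE AddMyOtherType(A,B)
    TYPE (MyOtherType) :: A, B
    A%Val = A%Val + B%Val
  END SUBROUTINE

END MODULE AddModule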

 

My preference for future Fortran language evolution is to implement templates (but do them better than C++ templates). An issue I have with C++ templates is more an issue with the linker (at least Microsoft's linker): the (bad) experience I've had is that if (when) you do not make the template inline, and the same expansion is performed in different source files, the linker complains about duplicate named objects. This has to be fixed.

Jim Dempsey

ojacquet
Novice

Thank you very much for your investigations.

I am aware that it takes courage and merit to address a problem that appears to be poorly presented.

That said, I, too, have spent a lot of time developing these tests, and trying to synthesize all the phenomena observed.

I could have created multiple small programs to illustrate each of the phenomena.

I preferred to keep everything in one program and define macros to select this or that part of the program using the preprocessor.

It is therefore essential to be able to use the preprocessor. Without it, you'll only see one side of the problems.

 

In particular, could you turn your attention to what I have called separate calculations?

That is, compile the test1.f90 program giving the calc macro the value of the identifier of a calculation (101, 102, etc.) so that only calculation 101 is performed, or only calculation 102, etc.

You will be able to see that the separate calculations all have optimal performance (with the ipo option), except for calculations 112 and 212, whereas they can be degraded within the grouped calculations (macro group1 or group2).

We can also see that even without the ipo compilation option (only O2), we can get optimal performance for a large number of calculations.

Is there an explanation for this? Is this normal? Doesn't this reflect poor handling by the compiler in the case of "grouped" calculations?

 

Of course, you can try to reproduce these calculations yourself or look directly at Tables 1 and 2 (from my first post).

Table 1 gives the performance of separate calculations with different combinations of compilation options.

Table 2 shows the performance of grouped calculations with different combinations of compilation options.

To facilitate comparisons, the time unit has been changed so that the fastest calculation has a time of 1.

A time of 7 therefore means a calculation 7 times longer than the fastest calculation...

This is obviously intolerable for someone (as is my case) who develops calculations that can run over several hours or days.

 

Then, to push a little further, the group2 macro of the preprocessor allows you to do grouped calculations but with a reduced set of calculations.

We're interested in the only O2 compilation option.

You can of course try to reproduce these calculations yourself or look directly at table 3 (from the first post).

In particular, it can be seen that merely changing the order of the calculations modifies their performance. Compare the case group2=3 with the case group2=4: calculations 105 and 106 are performed after calculation 110. The computation time of calculation 106 goes from 3.4 to 1!

Do you think this is normal? Don't you see this as a compiler problem?

 

JohnNichols
Valued Contributor III

It is therefore essential to be able to use the preprocessor.

How?

ojacquet
Novice

John,

Do you work on Windows? I do not know precisely how, but I found this:

"You can explicitly run FPP in these ways:

  • On the ifort command line, use the ifort command with the -fpp (Linux and Mac OS) or /fpp (Windows) option. By default, the specified files are then compiled and linked. To retain the intermediate (.i or .i90) file, specify the -save-temps (Linux and Mac OS) or /Qsave_temps (Windows) option.

  • On the command line, use the FPP command. In this case, the compiler is not invoked. When using the FPP command line, you need to specify the input file and the intermediate (.i or .i90) output file. For more information, type FPP/HELP on the command line.

  • In the Microsoft Visual Studio* IDE, set the Preprocess Source File option to Yes in the Fortran Preprocessor Option Category.  To retain the intermediate files, add /Qsave_temps to Additional Options in the Fortran Command Line Category."

Olivier

jimdempseyatthecove
Honored Contributor III

>>A time of 7 therefore means a calculation 7 times longer than the fastest calculation...

>>This is obviously intolerable for someone (as is my case) who develops calculations that can run over several hours or days.

Then this may be an incentive to weigh the advantages vs disadvantages of using class driven procedures in exchange for tolerable performance.

 

FWIW As an optimization helper...

If (when) you have a section of code (using class procedures) that is performing poorly, and the code makes some number of these same class procedure calls, see if the code section is amenable to the "peel the onion" technique. This is where you strip off the outer layers of the onion (the nested classes) and pass the inner data as arguments to a contained procedure containing the same algorithm, but with the outer class layers removed.
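
A sketch of the idea, reusing the type names from test program 2 (the hot loop itself is hypothetical; the point is that the SELECT TYPE disambiguation is paid once, outside the loop, and the contained worker sees only non-polymorphic data):

SUBROUTINE HotLoop(A,B,NB)
  CLASS (WrapCAType) :: A, B
  INTEGER(8) :: NB
  ! Peel the polymorphic layers once, outside the loop
  SELECT TYPE (AD => A%Data)
  TYPE IS (MyType)
    SELECT TYPE (BD => B%Data)
    TYPE IS (MyType)
      CALL Worker(AD,BD,NB)
    END SELECT
  END SELECT
CONTAINS
  SUBROUTINE Worker(X,Y,N)
    TYPE (MyType) :: X, Y     ! non-polymorphic: the compiler can optimize freely
    INTEGER(8) :: N, I
    DO I = 1,N
      X%Val = X%Val + Y%Val
    END DO
  END SUBROUTINE
END SUBROUTINE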

 

An alternative that can work in some cases is to use ASSOCIATE to construct a reference to an inner layer of the nested class.
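
A sketch of the ASSOCIATE variant, again with names borrowed from the test programs: the reference to the nested component is established once, before the loop.

ASSOCIATE (AVal => A%Data%Val, BVal => B%Data%Val)
  DO I = 1,NB
    AVal = AVal + BVal
  END DO
END ASSOCIATE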

 

Barring the above, wait for the compiler development to correct the issue.

 

I tend to be impatient.

 

Jim Dempsey
