Speed loss using object-oriented features

Fahim_F_ · 06-28-2013

I am wondering how much efficiency (speed, CPU time) is lost by using object-oriented programming in Fortran.

Does introducing long hierarchies of classes make the execution of code slower?

Is using polymorphic pointers costly in terms of slowing down code execution?

Here I made a very simple example (please see the attached file).

Consider a simple subroutine which takes four double precision numbers and multiplies them. I would like to run this subroutine 10**7 times, where the subroutine is introduced differently (in different hierarchies of classes, etc.), and compare the CPU time for this process. Consider 5 cases:

1- The subroutine is introduced in main.f90, which is the main project file.

2- I make a class named A and introduce the subroutine as a type-bound procedure of A.

3- I extend class A to AA and introduce the subroutine as a type-bound procedure of AA.

4- I introduce a class B which has a polymorphic pointer that points to AA.

5- I introduce the subroutine as a type-bound procedure of B as well (this is similar to 2).
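The attached file is not reproduced in this thread. As a rough sketch only, cases 1 to 4 might be laid out as below. The module, type, and procedure names (a_class, mult_a, obj_b, and so on) and the extra res output argument are assumptions made for illustration, not the original code. Case 5 is left out because it mirrors case 2.

```fortran
! Illustrative sketch only: all names and the "res" argument are assumptions.
module a_class
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  type :: A
  contains
    procedure :: mult => mult_a          ! case 2: type-bound procedure of A
  end type A
contains
  subroutine mult_a(self, w, x, y, z, res)
    class(A), intent(in)  :: self
    real(dp), intent(in)  :: w, x, y, z
    real(dp), intent(out) :: res
    res = w * x * y * z
  end subroutine mult_a
end module a_class

module aa_class
  use a_class
  implicit none
  type, extends(A) :: AA
  contains
    procedure :: mult => mult_aa         ! case 3: binding overridden in AA
  end type AA
contains
  subroutine mult_aa(self, w, x, y, z, res)
    class(AA), intent(in) :: self
    real(dp), intent(in)  :: w, x, y, z
    real(dp), intent(out) :: res
    res = w * x * y * z
  end subroutine mult_aa
end module aa_class

module b_class
  use aa_class
  implicit none
  type :: B
    class(A), pointer :: p => null()     ! case 4: polymorphic pointer component
  end type B
end module b_class

program cases_sketch
  use b_class
  implicit none
  type(A)          :: obj_a
  type(AA), target :: obj_aa
  type(B)          :: obj_b
  real(dp) :: r1, r2, r3, r4

  call mult(1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp, r1)           ! case 1: plain call
  call obj_a%mult(1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp, r2)     ! case 2
  call obj_aa%mult(1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp, r3)    ! case 3
  obj_b%p => obj_aa
  call obj_b%p%mult(1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp, r4)   ! case 4: dynamic dispatch
  print *, r1, r2, r3, r4
contains
  subroutine mult(w, x, y, z, res)       ! case 1: procedure in the main file
    real(dp), intent(in)  :: w, x, y, z
    real(dp), intent(out) :: res
    res = w * x * y * z
  end subroutine mult
end program cases_sketch
```

Only the call through obj_b%p has to be resolved at run time, because p is declared class(A) and may point to any extension of A; obj_a and obj_aa are declared with type(...), so their dynamic type is known at compile time.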

I ran the code for the above cases and measured the execution time of a loop which calls the subroutine 10**7 times.

Here are the results (on my PC, Core 2 Duo, 32-bit):

1- 2.67 seconds

2- 4.26 seconds

3- 4.27 seconds

4- 3.45 seconds

5- 4.26 seconds

Clearly, defining the subroutine as a type-bound procedure slows down execution. Is there a way around this cost? My other question is why case 4 is faster than cases 2, 3, and 5.

Thank you for your input.

FortranFan · 06-28-2013

Fahim,

Very interesting results.

Have you tried the same tests in another language, say C++, and checked how the results compare?

The conventional wisdom is indeed that object orientation adds overhead and slows down CPU performance, so the fact that 2, 3, 4, and 5 are slower than 1 is not surprising. But there are no "rules of thumb" as far as I know on the performance deterioration, so it is unclear whether the factor of 1.6 you notice is par for the course, better, or worse. It would be useful to know whether a Fortran processor (compiler) is similar to (or better than) any other in handling certain OO features, type-bound procedures (or class methods) in your case.

Your results for 4 are quite surprising - it would have made sense either if it took longer than 2, 3, and 5 because of the polymorphic class element, or if it took about the same time as 2, 3, and 5, given that it is a pointer whose address doesn't change during your test and the optimization was smart enough to resolve it up front.

TimP · 06-28-2013

I would think that if you used such features you would want to allow a compiler to remove dead code, but in your case there's nothing left: I suppose you used some compiler other than ifort and disabled optimization.

Windows being what it is, I'll have to reboot in order to quote ifort about your dead code. It looks like a spell checker should have been used before committing the messages:

<C:\source1.f90;-1:-1;IPO DEAD STATIC FUNCTION ELIMINATION;TEST_ip_TIMETEST;0>
DEAD STATIC FUNCTION ELIMINATION:
(TEST_ip_TIMETEST)
Routine was called explicity

DELETED: MODULE_B_CLASS_mp_TIMETEST_B(6) (isz = 4) (sz = 17 (5+12))

DELETED: MODULE_AA_CLASS_mp_TIMETEST_AA(4) (isz = 4) (sz = 17 (5+12))

DELETED: MODULE_A_CLASS_mp_TIMETEST_A(2) (isz = 4) (sz = 17 (5+12))

DELETED: TEST_ip_TIMETEST(8) (isz = 5) (sz = 16 (5+11))

So, if your question is on whether these constructs prevent the compiler from deleting dead code, you have the answer. You should be able to find options such as disabling interprocedural optimization which avoid shortcutting all of them at otherwise normal compiler settings.

Fahim_F_ · 06-28-2013

Tim,

For the above test I used ifort, but optimization was disabled automagically, and I could not enable it even though I tried disabling debug mode, etc.

However, in a different test (with the same concept) I could turn optimization on, and the results were similar to this one: although optimization significantly reduced the run time of each case (by about 50%), case 1 was still two to three times faster than cases 2, 3, and 5.

Can you comment on why case 4 is in fact faster than 2, 3, and 5?

Also, is there a way to get performance similar to case 1 in the other cases, where the function is a type-bound procedure?

Thank you very much.

FortranFan,

This test wasn't meant to compare C++ and Fortran. However, as you mentioned, it would be very interesting. I am using Fortran for a project, and I was hesitant to use these advanced features because of their overhead cost. Do you have any suggestions (or solutions) in mind to reduce the overhead of an object-oriented implementation in Fortran code?

Regards

jimdempseyatthecove · 06-28-2013

A couple comments on your code.

1) Add a 5th argument "res" to the enclosed functions. Accumulate the value into different caller's variables, then print out the results. The purpose of this is to coerce the compiler, with optimizations enabled, to .not. eliminate the function because its results are unused. Note: inspect the executable to ensure that the computation is performed. Do not run your performance test with /O0, as this is not a proper test of what a real application will observe.

2) More of an issue may be the argument of SOA over AOS. There is no one best technique.

Jim Dempsey
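A minimal sketch of the first suggestion, assuming a hypothetical module mult_mod with a subroutine mult4 (neither name comes from the attached file). The result of every call is accumulated and printed after the loop, so an optimizing compiler cannot treat the calls as dead code, and one argument depends on the loop index so the call cannot simply be hoisted out of the loop. The same pattern would be repeated with a separate accumulator for each variant being timed.

```fortran
! Sketch only: mult_mod and mult4 are illustrative names, not the original code.
module mult_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine mult4(w, x, y, z, res)
    real(dp), intent(in)  :: w, x, y, z
    real(dp), intent(out) :: res         ! the added 5th argument
    res = w * x * y * z
  end subroutine mult4
end module mult_mod

program timed_loop
  use mult_mod
  implicit none
  integer, parameter :: n = 10**7
  integer  :: i
  real(dp) :: res, total, t0, t1

  total = 0.0_dp
  call cpu_time(t0)
  do i = 1, n
    ! The first argument varies with i, so each call does distinct work.
    call mult4(real(i, dp), 0.5_dp, 2.0_dp, 1.0_dp, res)
    total = total + res                  ! accumulate so the result is used
  end do
  call cpu_time(t1)

  ! Printing the sum keeps the computation observable.
  print *, 'sum =', total, ' cpu seconds =', t1 - t0
end program timed_loop
```

Even with the result printed, an optimizer may still inline or vectorize the loop, which is why inspecting the generated code remains worthwhile.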

Bernard · 06-29-2013

I cannot speak to Fortran object orientation, but in Java, when you compare arithmetic operations on primitive types with the same operations on objects that represent, for example, the double data type, there is significant overhead. This comes from accessing the objects on the heap, loading registers with the values of member fields, performing the operations on those values, and writing back the results. I think this example is too extreme to be taken into account.

IanH · 06-29-2013

If you apply the changes that Jim mentioned to make the test useful in the face of any reasonable level of optimisation, then for "standard" optimization levels I see no material difference between a direct procedure call and a call through a binding of a non-polymorphic object. I see a difference with calls through a binding of a polymorphic object - but there's a fundamental difference in capability there - with the ability for the procedure to be dynamically selected based on the type of the object.

With interprocedural optimization enabled I see no material difference between any of the methods I tried.

Outcomes will depend on the specific situation, but for something similar to the OP's example (simple arithmetic operation inside a tight loop) I see nothing to worry about.

FortranFan · 07-01-2013

IanH,

Cool piece of code!

With Modern Fortran, you show there are so many options and possibilities of procedure invocations with little performance penalty under ordinary situations.

Do you have any comments on when (and why) you use these different approaches?

If you are not already, you should consider writing a book on applications of Modern Fortran! :-)

Regards,

IanH · 07-01-2013

To be honest, the results surprised me a little. I wonder if I'm doing something silly with the test. Anyway, I'll cop out by emphasising the "outcomes will depend on the specific situation" part. Because the call is in a tight loop I think the overhead associated with the dynamically dispatched call becomes irrelevant - for the first call there might be a significant delay while the CPU tries to work out where on the planet it should get the next instruction from, but after that I suspect that caching and branch prediction logic and all that jazz mean that subsequent calls within a test are all but zero cost. This is at a hardware level that is rather beyond my level of knowledge.

After posting I had a look at the assembly generated by the compiler. This made me think that a) assembly is also rather beyond my level of knowledge, and b) the optimiser is smarter than me - it may have seen through some of my attempts to obfuscate things at compile time. As far as I can tell, with ipo it has pretty much inlined everything bar the procedure references through the polymorphic objects. What's interesting about this is that my use of a little dispatch table was an attempt to hand roll the way that I think the compiler implements bindings for polymorphic objects - the descriptor for the polymorphic object contains a pointer to a "vtable" for the dynamic type of the object that has the addresses of the specific procedures that implement the relevant bindings (for older compiler versions if the object is a polymorphic component then the descriptor should also contain a pointer to the premier support website...). The optimiser saw through and inlined my hand rolled dispatch table, but it didn't do the same for the compiler generated one.
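The test code itself is not shown in this thread, so the following is only a guess at what a hand-rolled dispatch table could look like: an array of derived-type entries, each holding a procedure pointer, with the entry chosen by index at run time. The names dispatch_sketch, table_entry, op_iface, add_op, and mul_op are all hypothetical, and nothing here should be read as a description of how ifort lays out its own descriptors or vtables.

```fortran
! Sketch only: a hand-rolled table of procedure pointers; all names are hypothetical.
module dispatch_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  abstract interface
    function op_iface(x, y) result(r)
      import :: dp
      real(dp), intent(in) :: x, y
      real(dp) :: r
    end function op_iface
  end interface

  ! One slot of the table: a pointer to a procedure with the shared interface.
  type :: table_entry
    procedure(op_iface), pointer, nopass :: fn => null()
  end type table_entry

contains
  function add_op(x, y) result(r)
    real(dp), intent(in) :: x, y
    real(dp) :: r
    r = x + y
  end function add_op

  function mul_op(x, y) result(r)
    real(dp), intent(in) :: x, y
    real(dp) :: r
    r = x * y
  end function mul_op
end module dispatch_sketch

program dispatch_demo
  use dispatch_sketch
  implicit none
  type(table_entry) :: table(2)
  integer :: k

  table(1)%fn => add_op
  table(2)%fn => mul_op

  ! The procedure that runs is selected through the table index.
  do k = 1, size(table)
    print *, k, table(k)%fn(3.0_dp, 4.0_dp)   ! values are 7 and 12
  end do
end program dispatch_demo
```

Because the table is filled with constant targets in the same program unit, an optimizer can in principle see which procedure each slot refers to, which would be consistent with the observation that the hand-rolled table was inlined.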

Anyway - the OP needs to decide whether the possibility of a speed reduction is likely to be significant once they've enabled optimisation (measuring speed without some optimization is pointless) and if they are still worried about it, they need to test on a representative sample of their code. My test might not be relevant. My inclination is strongly to err on the side of code clarity and worry about optimization later, if at all, but that has got me into trouble occasionally. Now, speaking of trouble, I left the door to the office ajar this morning and the chickens got inside. Please excuse me - I have some cleaning to do.