My application contains a sparse linear system solver which relies heavily on the fabs() function. One of the biggest surprises I have gotten from VTune on Linux is how significant a fraction of my runtime is spent in __fabs. In a recent profile run I found that 1 billion calls were made to this function, with a Self Time of 126 million. This is about as much time as is being spent doing everything else in the linear solve, which I find outrageous.
I should point out that I have profiled this code on many other platforms, e.g. SGI, and __fabs has never popped on my radar. My only recollection is that a colleague once told me that he had seen fabs show up on an NT, but I don't remember how bad it was.
I am using GCC 3.2 and my app is C++.
Can anyone help me understand this? Why is __fabs even showing up as a function call? Would it be unreasonable to assume that this could be in-lined? If my NT recollection is correct, then this raises the question of whether this is an x86 thing. Do MIPS chips have some kind of fabs hardware which Intel lacks?
If you read the gcc documentation (info gcc), you will see that fabs() is included among the built-in functions, and there are ways to enable or disable recognition of built-ins. __builtin_fabs is implemented both for x87 code and for SSE2 (gcc -march=pentium4 -mfpmath=sse). 'info gcc' doesn't cover the likely possibility that your include files are broken.
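As a rough illustration of the difference between the library spelling and the built-in spelling, here is a minimal sketch. The function names abs_sum_library and abs_sum_builtin are made up for this example and are not from the original code. Whether the first version ends up as a call to __fabs or as a single instruction depends on the compiler version, the flags, and the headers in use, so treat it as something to test rather than a guaranteed result.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Library spelling. Depending on compiler, flags, and headers, this may
// compile to an out-of-line call or to an inline instruction.
double abs_sum_library(const double* x, std::size_t n) {
    assert(x != 0 || n == 0);
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        s += std::fabs(x[i]);
    }
    return s;
}

// Explicit built-in spelling (a GCC extension, so not portable to every
// compiler). This asks the compiler directly for its own fabs expansion.
double abs_sum_builtin(const double* x, std::size_t n) {
    assert(x != 0 || n == 0);
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        s += __builtin_fabs(x[i]);
    }
    return s;
}
```

Both functions should return the same value for the same input; if profiling shows a difference in time between them, that would suggest the library spelling is not being recognized as a built-in in your setup.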
The usual glibc setup has <math.h> invoking a file of in-line definitions under certain circumstances, including the level of optimization. The usual location of this file is /usr/include/bits/mathinline.h. In there, you will find macros for fabs(). If invoked, those could override the recognition of built-ins. Many Linux distros carry incorrect versions, presumably for reasons of historical continuity.
Among the reasons for not in-lining fabs(), which you will see alluded to in the documentation, are that C90 was the first to treat fabs as a reserved identifier, and the (slight, in this case) possibility that a programmer might attempt errno checking.
So, in a long-winded way, I've given the usual advice: if all else fails, examine your pre-processed code and consult the documentation.
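One way to follow that advice is with a tiny probe file, sketched below. The file name fabs_probe.cpp and the function probe_abs are hypothetical, and the flags in the comments are only a guess at a typical build line, so substitute the ones your project actually uses.

```cpp
// fabs_probe.cpp (hypothetical file name)
//
// Preprocess only, then look at what fabs has turned into:
//   g++ -E -O2 -ansi -march=pentium4 fabs_probe.cpp | grep -n fabs
//
// Emit assembly, then check whether probe_abs contains a call
// instruction or an inline absolute-value instruction:
//   g++ -S -O2 -ansi -march=pentium4 fabs_probe.cpp -o fabs_probe.s
#include <math.h>

// The smallest possible use of fabs, so the output is easy to read.
double probe_abs(double x) {
    return fabs(x);
}
```

Comparing the output with and without -ansi, or at different optimization levels, should show which setting changes how fabs is handled.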
Please help me understand some of the terminology. For maximum performance, is "built-in" a desirable thing? If I see __fabs in VTune, does this mean that I have *not* gotten the more desirable __builtin_fabs? I am compiling with -march=pentium4, on RedHat 9.
I am compiling with -ansi. Is that what is biting me?
In math.h I see two interesting defines: __NO_MATH_INLINES and __USE_EXTERN_INLINES. Could you give me an idea of the difference between math-inlines and extern-inlines? Which is going to give me maximum performance? I am not concerned about errno, because according to the man page and common sense, no errors can occur in fabs.