Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Inlining on Linux

ericrobert
Beginner
675 Views
Hello,

I have a situation where the compiler does not inline trivial functions, causing a major slowdown. I am using version 12.1 of the 32-bit Linux compiler.
When I run VTune, I can see the unoptimized code, e.g. simple things that look like this:
template <class T>
class Vector {
    T* m;
public:
    T operator()(int i) { return m[i]; } // should be inlined
};
If I compile the faulty code as a single-file sample with the same compiler options, I get much better performance and I can see in VTune that everything is inlined.
I tried -O3, -inline-forceinline, -ip, -ipo and most options in the 'inlining' section of the compiler help.
I also tried tagging functions with __inline and using #pragma forceinline recursive.
I'm running out of ideas ;)
Any suggestions?
Thanks!
Éric

9 Replies

ericrobert
Beginner
Just to add more information:

I generated the optimization report (-opt-report 3) and there seems to be a pattern. All calls to functions that are not inlined (for example _ZN6VectorIdLi2EEC1Edd) get something like:
-> _ZN6VectorIdLi2EEC1Edd(458) (isz = 6) (sz = 15 (5+10))
[[ Callee not marked with inlining directive or pragma ]]
For reference, I think (from the mangled name) this is a trivial constructor (with 2 assignments) defined directly within the class body. I added the 'inline' keyword to be sure. I would expect it to be inlined.
I attached the report if it helps.
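
For anyone who wants to confirm what a mangled name from the report refers to, here is a minimal sketch. It assumes a toolchain that ships the Itanium C++ ABI header <cxxabi.h> (GCC, Clang, and compilers that use the GNU C++ library on Linux); the helper name demangle is hypothetical and not part of the report tooling.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (Itanium C++ ABI support library)
#include <cstdlib>    // std::free
#include <string>

// Hypothetical helper: returns the demangled form of a symbol name,
// or the input unchanged if demangling fails.
std::string demangle(const char* mangled) {
    int status = 0;
    // With null buffer arguments the function allocates the result with malloc.
    char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);
    return result;
}
```

Calling demangle("_ZN6VectorIdLi2EEC1Edd") should describe a constructor of a Vector specialization on double that takes two double arguments, which matches the guess above.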
Thanks!
Éric

Georg_Z_Intel
Employee
Hello Éric,

There are cases where inlining won't be done. The compiler decides, based on various data, whether inlining can be beneficial or not. This involves thresholds that, once exceeded, tell the compiler not to inline certain portions of code. At that point it ignores requests for additional inlining. In your case you might want to increase the thresholds.
Before I point you to the corresponding options, I'd like to give a warning: changing the inlining behavior, and especially forcing it, can have (negative) side effects on
- register pressure
- caching and
- code size

Hence, changing the thresholds has an impact on every aspect mentioned above. In some cases you'll see improved performance, but in (most) other cases you'll see massive slowdowns.

All inlining thresholds are globally increased (or decreased) using this option:
Linux:
-inline-factor=n

Windows:
/Qinline-factor=n

Use n > 100 to increase all inline thresholds (e.g. -inline-factor=150). Start with this option and adjust it until you see the desired effect.

Once you have seen the desired effect (as a proof), you might omit it and fine-tune the individual thresholds using some of these options (see the latest manual: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/win/copts/common_content/options_ref_bk_inlining.htm):

Linux:
-inline-max-per-compile=n
-inline-max-per-routine=n
-inline-max-size=n
-inline-max-total-size=n
(-inline-min-size=n)

Windows:
/Qinline-max-per-compile=n
/Qinline-max-per-routine=n
/Qinline-max-size=n
/Qinline-max-total-size=n
(/Qinline-min-size=n)

All mentioned options are only documented for Compiler XE.

I hope this works for you.

Best regards,

Georg Zitzlsberger

jimdempseyatthecove
Honored Contributor III
Georg,

You are picking a stock answer for inlining issues without reading Eric's post.
Eric's template function returns an array element; it is not a loop (which is what your stock answer relates to).

Jim Dempsey

jimdempseyatthecove
Honored Contributor III
Eric,

Perhaps you could try:

T& operator()(int i) { return m[i]; }

Jim Dempsey

Georg_Z_Intel
Employee
Hello Jim,

The options I've listed above are not related to loops and do not mention them at all.
Éric wants "T operator()(int i) { return m[i]; }" to be inlined, but it seems that the inlining threshold was exceeded. With the options listed above, the threshold(s) can be tweaked to force inlining, though.

Best regards,

Georg Zitzlsberger

jimdempseyatthecove
Honored Contributor III
Georg,

// Éric wants "T operator()(int i) { return m[i]; }" \\

In most cases returning the reference (T&) is just as functional. Éric will have to be the judge of this.

When T is a fundamental type (char, short, int, float, double, void*, ...) then

"T operator()(int i) { return m[i]; }"

will likely get inlined.

However, when T is a struct or class then

"T operator()(int i) { return m[i]; }"

often creates code to execute a copy operator (either default or user supplied).

When T is a large struct/class, the relative overhead of the non-inlined function call would be low.
When T is a small struct/class, the relative overhead of the non-inlined function call would be high.

Additionally, in some/many situations the copy operation may be unnecessary but done anyway.

Whereas "T& operator()(int i) { return m[i]; }" returns a reference (pointer) to m[i], which can always be inlined; the compiler can then decide to use it directly (when no copy is required) or as an argument to the copy operator (when a copy is required).
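
To make the copy argument concrete, here is a small sketch that counts copy-constructor calls for both accessor styles. The element type Counted and the member names by_value and by_ref are hypothetical and only illustrate the point; this is not Eric's actual class.

```cpp
// Hypothetical element type that counts how often it is copy-constructed.
struct Counted {
    static int copies;
    double value;
    Counted() : value(0.0) {}
    Counted(const Counted& other) : value(other.value) { ++copies; }
};
int Counted::copies = 0;

template <class T>
class Vector {
    T* m;
public:
    explicit Vector(T* data) : m(data) {}
    T  by_value(int i) { return m[i]; }  // must copy-construct the result from m[i]
    T& by_ref(int i)   { return m[i]; }  // hands back the stored element itself
};

// Reads one element through the by-value accessor and reports the copies made.
int copies_when_reading_by_value() {
    Counted storage[2];
    Vector<Counted> v(storage);
    Counted::copies = 0;
    double x = v.by_value(1).value;
    (void)x;
    return Counted::copies;
}

// Same read through the by-reference accessor.
int copies_when_reading_by_ref() {
    Counted storage[2];
    Vector<Counted> v(storage);
    Counted::copies = 0;
    double x = v.by_ref(1).value;
    (void)x;
    return Counted::copies;
}
```

Because this copy constructor has a visible side effect and the source is a stored element rather than a local object, the by-value read has to perform the copy, while the by-reference read performs none.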

Éric could comment on this.

Jim Dempsey

airborne18th
Beginner
Jim,

I am new to the Intel compiler, so I am still trying to get a grip on the optimizations.

But your point on the copy operation brings me to this.

Whether T operator()(int i) generates an implied call to a copy constructor or copy operator has typically been a function of the compiler's optimization level. I would expect the compiler to generate a copy operation by default. However, I would also expect that, at a higher optimization level, the compiler should inline the method and determine that the copy is not needed.


Brandon_H_Intel
Employee
Hi all,

__forceinline on that operator function definition should do what you want.
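
As a rough sketch of what that looks like: the FORCE_INLINE macro below is hypothetical and only there so the same source also builds with GCC or Clang, where the equivalent request is assumed to be the always_inline attribute. With the Intel or Microsoft compiler the keyword can be written directly on the definition.

```cpp
// Assumption: __forceinline is accepted by the Intel and Microsoft compilers;
// GCC and Clang express the same request with an always_inline attribute.
#if defined(__INTEL_COMPILER) || defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

template <class T>
class Vector {
    T* m;
public:
    explicit Vector(T* data) : m(data) {}
    // The request is attached to the operator definition itself.
    FORCE_INLINE T operator()(int i) { return m[i]; }
};

// Hypothetical caller: sums n elements through the accessor.
double sum_elements(double* data, int n) {
    Vector<double> v(data);
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        total += v(i);
    }
    return total;
}
```

Whether the call is actually inlined still has to be checked in the optimization report or in VTune.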

That being said, I'm not sure what's meant by saying that it inlines "when it's compiled in a single file". That class definition should be in an included header file, which is always part of every single file where it would be inlined. Perhaps I'm missing something in the description there.

Georg_Z_Intel
Employee

Hi,

Staying with the return-by-value version, a copy of the value is returned:
If "T" is a POD (Plain Old Data) type, the copy-ctor is trivial: returning a copy is simple for the compiler and the resulting code is very likely to be inlined.
If "T" is a class/struct type, the C++ standard requires calling its copy-ctor whenever a copy of it is returned. Depending on the complexity of the underlying structure (inheritance hierarchy, number of data members), the copy-ctor can be quite complex and can even call other copy-ctors. Inlining such copy-ctors might negatively affect performance, in which case the compiler will decide not to do it. If there's no negative effect, the compiler will inline.
However, if you still want to force inlining and the compiler does not do what you want, you have to increase the thresholds. Pragmas, options and specifiers (e.g. __inline) might only work if the thresholds aren't reached yet.

As Jim already pointed out, you can also change the semantics to return by reference or even by address to avoid possible copy-ctor overhead. The downside is that the remaining code needs to be aware of that: operating on a copy versus on the original value stored in a container can behave quite differently. Bottom line, that's a design change with all its side effects.
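
A minimal sketch of that difference in semantics follows. The class names ValueVector and RefVector and the Point type are hypothetical, chosen only to put the two accessor styles side by side.

```cpp
struct Point {
    double x;
    double y;
};

// Accessor returns a copy of the stored element.
template <class T>
class ValueVector {
    T* m;
public:
    explicit ValueVector(T* data) : m(data) {}
    T operator()(int i) { return m[i]; }
};

// Accessor returns a reference to the stored element.
template <class T>
class RefVector {
    T* m;
public:
    explicit RefVector(T* data) : m(data) {}
    T& operator()(int i) { return m[i]; }
};

// Writing to the returned copy leaves the stored element untouched.
double stored_x_after_write_via_value() {
    Point storage[1] = {{1.0, 2.0}};
    ValueVector<Point> v(storage);
    Point p = v(0);
    p.x = 10.0;            // modifies the local copy only
    return storage[0].x;
}

// Writing through the returned reference changes the stored element.
double stored_x_after_write_via_ref() {
    Point storage[1] = {{1.0, 2.0}};
    RefVector<Point> v(storage);
    v(0).x = 10.0;         // modifies the element inside the container
    return storage[0].x;
}
```

Callers that relied on getting an independent copy would need to make that copy explicitly after such a change.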

Best regards,

Georg Zitzlsberger
