Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

"force inline" doesn't?

Patrik_Jonsson
Beginner
702 Views
Hi,

I'm trying to optimize C++ expression template code, and for the compiler to be able to optimize away the expression templates, they need to be inlined. At O3, the compiler will by default punt on inlining complicated expressions, leaving some intermediate function calls, and killing performance.
As I understand it, the function-specific directive "__forceinline" and the statement-specific "#pragma forceinline recursive" should force the compiler to inline the function call, but I've tried both and it still leaves the function calls. (It does read them, because if I intentionally misspell them, I get an error.) Using the compiler option "-inline-forceinline" does work, so the compiler is technically capable of inlining the call, but this of course applies to the entire code, which is not usable in practice.
Can anyone give any hints as to what might be preventing the inlining of the calls? This is with icpc 11.1.
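For reference, here is a minimal sketch of the two directives discussed above, not the actual Blitz++ code. The FORCE_INLINE macro and the functions scale, offset and transform are hypothetical names made up for illustration. The pragma is guarded so that non-Intel compilers never see it, and it is placed directly before an assignment statement on the assumption that it applies to the calls in the statement that follows.

```cpp
// Hypothetical portability macro (not part of any library).
#if defined(__INTEL_COMPILER) || defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

// Function-level request: applies to every call of scale().
FORCE_INLINE double scale(double x, double factor) { return x * factor; }

// Plain inline: the compiler decides using its own heuristics.
inline double offset(double x, double delta) { return x + delta; }

double transform(double x) {
    double result;
#if defined(__INTEL_COMPILER)
    // Statement-level request (Intel compilers only): asks that the calls
    // in the next statement, and the calls they make, be inlined.
#pragma forceinline recursive
#endif
    result = offset(scale(x, 2.0), 1.0);
    return result;
}
```

Whether either request is honored can only be confirmed by inspecting the generated code or an inlining report; the sketch shows the syntax, not a guarantee.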
Thanks,
/Patrik
9 Replies
Dale_S_Intel
Employee
I have a few questions for you. You mention -O3, do you mean that it inlines better at -O2?


As you allude, there are times when __forceinline may not result in an inlined function, but I wouldn't expect that to be affected by -inline-forceinline, which is supposed to treat "inline" suggestions as "__forceinline"s, as I understand it at least.

Is the particular function that gets inlined when you throw the switch actually a __forceinline function, or a function that is called by that __forceinline function? I.e. I'm trying to understand how the #pragma fits in here.

In any case, is it feasible for you to try the latest compiler (12.0 aka Composer 2011 XE) and see if it works better for you?
Of course, a reproducible test case would be the most helpful, that way we could determine for sure if it's a bug and try to get a fix for it.

Thanks!
Dale
Patrik_Jonsson
Beginner
Hi Dale,
Thanks for responding.
>You mention -O3, do you mean that it inlines better at -O2?
Sorry for being unclear. No, I meant at -O3 but without any further specific inlining options. I didn't try lower optimization levels as I didn't think that would improve anything.
>Is the particular function that gets inlined when you throw the switch actually a __forceinline function,
> or a function that is called by that __forceinline function? I.e. I'm trying to understand how the
> #pragma fits in here.
So I tried both. I declared the function __forceinline, or I used the #pragma at the call site (and higher up in the call tree, using "recursive"). Neither seemed to inline the function.
>is it feasible for you to try the latest compiler (12.0
I did actually do that, but the problem is that on that platform they have an old version of gdb that makes reading the assembly code all but impossible (it insists on printing the function name on each line, which with the heavily templated expression template functions makes it unreadable). Incidentally, idb does the same thing, so if you could file a bug report about that, that would be great. The output from disas looks like this:
[bash]0x000000000041a760 ::T_array>::T_expr,blitz::asExpr<:ARRAY>::T_array>::T_expr,blitz::Multiply<:_BZ_ARRAYEXPR><:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype,blitz::_bz_ArrayExpr<:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype> > > >::T_expr,blitz::asExpr<:_BZ_ARRAYEXPR><:_BZ_ARRAYEXPRBINARYOP><:ASEXPR><:ARRAY>::T_array>::T_expr,blitz::asExpr<:ARRAY>::T_array>::T_expr,blitz::Multiply<:_BZ_ARRAYEXPR><:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype,blitz::_bz_ArrayExpr<:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype> > > >::T_expr,blitz::Add<:_BZ_ARRAYEXPR><:_BZ_ARRAYEXPRBINARYOP><:ASEXPR><:ARRAY>::T_array>::T_expr,blitz::asExpr<:ARRAY>::T_array>::T_expr,blitz::Multiply<:_BZ_ARRAYEXPR><:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype,blitz::_bz_ArrayExpr<:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype> > >::T_optype,blitz::_bz_ArrayExpr<:_BZ_ARRAYEXPRBINARYOP><:ASEXPR><:ARRAY>::T_array>::T_expr,blitz::asExpr<:ARRAY>::T_array>::T_expr,blitz::Multiply<:_BZ_ARRAYEXPR><:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype,blitz::_bz_ArrayExpr<:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype> > >::T_optype> > > >&)>:       pushq  %r14
0x000000000041a762 ::T_array>::T_expr,blitz::asExpr<:ARRAY>::T_array>::T_expr,blitz::Multiply<:_BZ_ARRAYEXPR><:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype,blitz::_bz_ArrayExpr<:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype> > > >::T_expr,blitz::asExpr<:_BZ_ARRAYEXPR><:_BZ_ARRAYEXPRBINARYOP><:ASEXPR><:ARRAY>::T_array>::T_expr,blitz::asExpr<:ARRAY>::T_array>::T_expr,blitz::Multiply<:_BZ_ARRAYEXPR><:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype,blitz::_bz_ArrayExpr<:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype> > > >::T_expr,blitz::Add<:_BZ_ARRAYEXPR><:_BZ_ARRAYEXPRBINARYOP><:ASEXPR><:ARRAY>::T_array>::T_expr,blitz::asExpr<:ARRAY>::T_array>::T_expr,blitz::Multiply<:_BZ_ARRAYEXPR><:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype,blitz::_bz_ArrayExpr<:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype> > >::T_optype,blitz::_bz_ArrayExpr<:_BZ_ARRAYEXPRBINARYOP><:ASEXPR><:ARRAY>::T_array>::T_expr,blitz::asExpr<:ARRAY>::T_array>::T_expr,blitz::Multiply<:_BZ_ARRAYEXPR><:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype,blitz::_bz_ArrayExpr<:FASTARRAYITERATOR><:ARRAY>::T_numtype,1> >::T_optype> > >::T_optype> > > >&)+2>:     sub $0x90, %rsp
[/bash]
(incidentally, that is the name of the function that should be inlined but isn't. In less complicated cases, it does get inlined, so it's not trivial to construct an isolated test case.)
So I just tried it with 12.0.3, and this is what happens:
At the default -O3, the above operator= is not inlined. Adding __forceinline to this function does inline it, but since it is just a forwarding function, this only leads to another immediate call to "_bz_evaluate". If I now try to make the compiler inline this function, too, either by declaring _bz_evaluate __forceinline or by using #pragma forceinline recursive at the call site in operator=, it reverts to not inlining operator=. In the original case, when operator= was not inlined, _bz_evaluate was inlined. So it's like the compiler is hell-bent on not inlining one of the functions.

Regards,
/Patrik
Greeshma_Y_Intel
Employee
Hi Patrik,

Thanks for the clarification. I would like some more information about the above case.

When you declared _bz_evaluate __forceinline, was __forceinline added to the operator= function as well?

Also, can you generate and send an opt-report using the following options, depending on your platform:


On Windows:

-Qopt-report:3 -Qopt-report-phase:ipo_inl

On Linux:

-opt-report 3 -opt-report-phase=ipo_inl

Thanks
Greeshma

Patrik_Jonsson
Beginner
Hi Greeshma,
Sorry for the delay, I got sidetracked.
Yes, with operator= declared __forceinline, it was inlined. When __forceinline was added to _bz_evaluate, which is called by operator=, operator= was un-inlined even though it was still declared __forceinline.
I've attached the opt-report from a simplified example to make it easier to sift through. Making the program less complicated changed the inlining behavior slightly. This is now the situation:
operator= calls _bz_evaluate calls _bz_evaluator<1>::evaluateWithStackTraversal.
  1. __forceinline on the first two inlines them correctly, but not the third which is just declared plain inline. This is inline-report1.txt
  2. *Adding* __forceinline on the third un-inlines operator=. This is inline-report2.txt.
  3. Declaring the second and third plain inline but setting #pragma forceinline recursive on the call to _bz_evaluate in operator= results in _bz_evaluate inlined but not evaluateWithStackTraversal. This is inline-report3.txt. In this case, there are also several un-inlined calls deeper in the call stack, which are even more important to inline, so it's not just a single anomalous case.
  4. Using compilation option -inline-forceinline on situation 1 results in a completely inlined evaluation with no remaining calls. This is inline-report4.txt.
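To make the shape of the call chain in the four situations easier to picture, here is a heavily simplified sketch. It is not the Blitz++ source: Vec3, Sum, Evaluator and the FORCE_INLINE macro are hypothetical stand-ins, and the sketch corresponds to situation 1 (forced inlining requested on the first two levels, plain inline on the third).

```cpp
#include <cstddef>

// Hypothetical portability macro (not part of any library).
#if defined(__INTEL_COMPILER) || defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

struct Evaluator {
    // Level 3: the loop that does the real work (plain inline here).
    template <typename Dest, typename Expr>
    static inline void evaluateWithStackTraversal(Dest& dest, const Expr& expr) {
        for (std::size_t i = 0; i < dest.size(); ++i) dest[i] = expr[i];
    }
};

// Level 2: a forwarding function, analogous in role to _bz_evaluate.
template <typename Dest, typename Expr>
FORCE_INLINE void evaluate(Dest& dest, const Expr& expr) {
    Evaluator::evaluateWithStackTraversal(dest, expr);
}

// A tiny expression node: element-wise sum of two operands.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

struct Vec3 {
    double data[3];
    std::size_t size() const { return 3; }
    double& operator[](std::size_t i) { return data[i]; }
    double operator[](std::size_t i) const { return data[i]; }

    // Level 1: assignment from an expression just forwards to evaluate().
    template <typename Expr>
    FORCE_INLINE Vec3& operator=(const Expr& expr) {
        evaluate(*this, expr);
        return *this;
    }
};

// Builds the expression node; no arithmetic happens until assignment.
inline Sum<Vec3, Vec3> operator+(const Vec3& a, const Vec3& b) {
    return Sum<Vec3, Vec3>{a, b};
}
```

With this layout, an assignment such as c = a + b only collapses into a single loop if all three levels are inlined into the caller, which is why one remaining call matters so much for performance.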
Just so you can find them, the complete mangled names of these three functions are:
_ZN5blitz5ArrayIdLi1EEaSINS_13_bz_ArrayExprINS_21_bz_ArrayExprBinaryOpINS3_INS4_INS3_INS_17FastArrayIteratorIdLi1EEEEES7_NS_8MultiplyIddEEEEEESB_NS_3AddIddEEEEEEEERS1_RKNS_6ETBaseIT_EE
_ZN5blitz12_bz_evaluateINS_5ArrayIdLi1EEENS_13_bz_ArrayExprINS_21_bz_ArrayExprBinaryOpINS3_INS4_INS3_INS_17FastArrayIteratorIdLi1EEEEES7_NS_8MultiplyIddEEEEEESB_NS_3AddIddEEEEEENS_10_bz_updateIddEEEEvRT_T0_T1_
_ZN5blitz13_bz_evaluatorILi1EE26evaluateWithStackTraversalINS_5ArrayIdLi1EEENS_13_bz_ArrayExprINS_21_bz_ArrayExprBinaryOpINS5_INS6_INS5_INS_17FastArrayIteratorIdLi1EEEEES9_NS_8MultiplyIddEEEEEESD_NS_3AddIddEEEEEENS_10_bz_updateIddEEEEvRT_T0_T1_
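The names above appear to be in the Itanium C++ ABI mangling used by GCC-compatible toolchains on Linux. As a rough sketch, assuming a toolchain that ships <cxxabi.h>, they can be turned back into readable signatures with abi::__cxa_demangle. The helper name demangle is made up for illustration.

```cpp
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the demangled form of an Itanium-ABI symbol name, or the input
// unchanged if the demangler reports a failure.
std::string demangle(const char* mangled) {
    int status = 0;
    // With null buffer arguments the result is allocated with malloc
    // and must be released with free.
    char* text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) return mangled;
    std::string result(text);
    std::free(text);
    return result;
}
```

Passing one of the three names above to demangle should yield the full templated signature, which is easier to search for in an inlining report.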
Regards,
/Patrik
Brandon_H_Intel
Employee
I'll let Dale and Greeshma handle this further, but I wanted to just add for clarification that #pragma forceinline recursive was only introduced in the 12.0 / C++ Composer XE 2011 compiler, so 11.1 would not recognize it regardless.
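If the same source has to build with both compiler versions, one possible approach is to guard the pragma on the compiler version. This is only a sketch: it assumes the __INTEL_COMPILER macro reports 1200 for 12.0, and HAS_FORCEINLINE_PRAGMA, square and sum_of_squares are hypothetical names used for illustration.

```cpp
// Assumption: __INTEL_COMPILER is 1200 or higher for 12.0 and later.
#if defined(__INTEL_COMPILER) && (__INTEL_COMPILER >= 1200)
#  define HAS_FORCEINLINE_PRAGMA 1
#else
#  define HAS_FORCEINLINE_PRAGMA 0
#endif

inline int square(int x) { return x * x; }

int sum_of_squares(int a, int b) {
    int total;
#if HAS_FORCEINLINE_PRAGMA
    // Only compilers that know the pragma ever see it, so older
    // versions neither warn about it nor silently ignore it.
#pragma forceinline recursive
#endif
    total = square(a) + square(b);
    return total;
}
```

On a compiler without the pragma the code still builds and behaves the same; it simply falls back to the default inlining heuristics.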
Patrik_Jonsson
Beginner
Ok, well that's good to know. And yes, the above files were generated by 11.1. I'm surprised it doesn't warn about unrecognized pragma, but now that I tried it, I can write pragma whatever and it doesn't complain, even on -Wall.
Brandon_H_Intel
Employee
I'm not sure what the problem is here, Patrik. If I compile a simple file with that pragma with 11.1, I get:

[hello]$ icpc -c hello.cpp

hello.cpp(15): warning #161: unrecognized #pragma

#pragma forceinline recursive

^

Dale_S_Intel
Employee
That's odd. Do you have access to a 12.0 compiler to try?
Dale
Patrik_Jonsson
Beginner
Hi Dale,
I do have access to 12.0.3, and I can try to replicate the test with that.
However, I'm having more serious issues. 12.0 seems to take a lot longer to compile, to the point of effectively failing to compile some of the more complicated examples of this template machinery. With 11.1.046, the more complicated example takes 30 s to compile. With 12.0.3, I killed the compilation after 18 hours, as it still hadn't completed. Are there some known issues with 12.0 and heavy template use? There are no messages whatsoever from the compiler (and while it was using 33 GB of memory, the machine has plenty, so it was not swapping).