AndrewC
New Contributor I
102 Views

Composer XE 2013.0 Update 1 generates slower code than 12.1

After switching from 12.1 to Composer XE 2013 (Update 1, Windows 64-bit) I am seeing a consistent 10-15% slowdown across the board (code is built and benchmarked on a quad-core Xeon). The C++ code is compiled with /O3, no auto-parallelization.

Is this a known issue to be fixed in an update?

32 Replies
AndrewC
New Contributor I
99 Views

I have run Amplifier XE 2013 and profiled the code. It appears XE 2013 is NOT inlining a simple function that 12.1 inlined. Compiler option is /Ob2

[cpp]

template <class T>
inline T Matrix::operator()(int i, int j) const
{
    return data()[i*rowstep+j*colstep];
}
[/cpp]

AndrewC
New Contributor I
99 Views

Let me rephrase this: compiler 13.0 is not inlining the function when operator()(int, int) is used many times in a complex expression, while 12.1 seemingly does inline it. In the following expression (this is auto-generated code), 13.0 seems to be generating function calls for operator()(int, int) rather than in-line code, even at /O3 /Ob2.

[cpp]

        result(0,0)=(-(Z(2-1,4-1)*Z(3-1,3-1)*Z(4-1,2-1)) + Z(2-1,3-1)*Z(3-1,4-1)*Z(4-1,2-1) + Z(2-1,4-1)*Z(3-1,2-1)*Z(4-1,3-1) - Z(2-1,2-1)*Z(3-1,4-1)*Z(4-1,3-1) - Z(2-1,3-1)*Z(3-1,2-1)*Z(4-1,4-1) +
        Z(2-1,2-1)*Z(3-1,3-1)*Z(4-1,4-1))*tmp;

[/cpp]
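For readers who want to reproduce the pattern, here is a minimal, self-contained sketch of this kind of accessor and expression. The class name `StridedMatrix`, its constructor, and the helper `cofactor00` are hypothetical stand-ins, not code from the poster's project; only the indexing formula and the shape of the expression (with the `2-1`, `4-1` style constants already folded and the `tmp` factor left out) come from the post above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the matrix class discussed in this thread.
template <class T>
class StridedMatrix {
public:
    StridedMatrix(int rows, int cols)
        : rowstep(cols), colstep(1),
          storage_(static_cast<std::size_t>(rows) * cols) {}

    const T* data() const { return storage_.data(); }
    T* data() { return storage_.data(); }

    // Read access, returning by value as in the posted snippet.
    inline T operator()(int i, int j) const {
        return data()[i * rowstep + j * colstep];
    }

    // Write access, added here only so the sketch can fill the matrix.
    inline T& operator()(int i, int j) {
        return data()[i * rowstep + j * colstep];
    }

    int rowstep;  // distance between consecutive rows, in elements
    int colstep;  // distance between consecutive columns, in elements

private:
    std::vector<T> storage_;
};

// Same six-term expression as in the post: 18 calls to operator()(int, int)
// in a single statement, which is what stresses the inliner.
template <class T>
T cofactor00(const StridedMatrix<T>& Z) {
    return -(Z(1,3)*Z(2,2)*Z(3,1)) + Z(1,2)*Z(2,3)*Z(3,1)
         + Z(1,3)*Z(2,1)*Z(3,2) - Z(1,1)*Z(2,3)*Z(3,2)
         - Z(1,2)*Z(2,1)*Z(3,3) + Z(1,1)*Z(2,2)*Z(3,3);
}
```

Whether a given compiler inlines all 18 accessor calls here depends on its heuristics and options, so this sketch is a starting point for a test case rather than a guaranteed reproducer.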

SergeyKostrov
Valued Contributor II
99 Views

>>...After switching from 12.1 to Composer XE 2013 ( Update 1, Windows 64-bit) I am seeing a consistent 10-15% slowdown...

My first question to myself was how such a significant slowdown is possible. However, after seeing your expression I'm really concerned, since I also use lots of templates and different C++ operators ( declared as inline ) to simplify processing. So, my question: is this a feature or a bug in the latest version of the Intel C++ compiler?
Georg_Z_Intel
Employee
99 Views

Hello,

When comparing major compiler versions, it is no surprise that you see differences in optimization. We have a conservative release model that some people like and others don't: only major version updates (12.x -> 13.x) are subject to bigger changes. The update releases in between should not show much performance variation, but they also offer fewer new features.

I'm not surprised you see a change when moving from 12.1 to 13.0. You tuned your compiler option set for 12.1, either intentionally or indirectly, so it works best with 12.1 and not necessarily with any other major version. In-lining is a complicated topic, and the heuristics used are, by definition, never optimal for all scenarios.

The following can help get the function above in-lined again:

  • Use IPO (try both single-file and multi-file)
  • Try /Qinline-forceinline so that "inline" keywords force in-lining (as long as in-lining limits are not exceeded)
  • Use #pragma forceinline
  • Slightly(!) increase the inlining factor (/Qinline-factor) or the other in-lining limits
  • Also give /O2 a chance; /O3 is not always better

Best regards,
Georg Zitzlsberger
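As an illustration of the force-inline idea (not code from this thread), here is a minimal sketch that uses the `__forceinline` keyword accepted by MSVC-compatible compilers on Windows, with a GCC/Clang fallback. This is a related mechanism, not the `#pragma forceinline` or `/Qinline-forceinline` options themselves. The macro name `MY_FORCEINLINE`, the struct `Strided2D`, and the helper `trace3` are hypothetical.

```cpp
// Hypothetical portability macro: pick the compiler's "always inline" spelling.
#if defined(_MSC_VER)
#  define MY_FORCEINLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define MY_FORCEINLINE inline __attribute__((always_inline))
#else
#  define MY_FORCEINLINE inline  // plain hint only; no guarantee
#endif

// Small strided view with the same indexing formula as the accessor above.
struct Strided2D {
    const double* base;
    int rowstep;
    int colstep;

    MY_FORCEINLINE double operator()(int i, int j) const {
        return base[i * rowstep + j * colstep];
    }
};

// Uses the accessor several times in one expression.
inline double trace3(const Strided2D& m) {
    return m(0, 0) + m(1, 1) + m(2, 2);
}
```

Even a forced request can be refused by a compiler in some situations, so it is worth confirming the result in the optimization report rather than assuming it.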
JenniferJ
Moderator
99 Views

vasci_ wrote:

Let me rephrase this: compiler 13.0 is not inlining the function when operator()(int, int) is used many times in a complex expression, while 12.1 seemingly does inline it. In the following expression (this is auto-generated code), 13.0 seems to be generating function calls for operator()(int, int) rather than in-line code, even at /O3 /Ob2.

        result(0,0)=(-(Z(2-1,4-1)*Z(3-1,3-1)*Z(4-1,2-1)) + Z(2-1,3-1)*Z(3-1,4-1)*Z(4-1,2-1) + Z(2-1,4-1)*Z(3-1,2-1)*Z(4-1,3-1) - Z(2-1,2-1)*Z(3-1,4-1)*Z(4-1,3-1) - Z(2-1,3-1)*Z(3-1,2-1)*Z(4-1,4-1) +         Z(2-1,2-1)*Z(3-1,3-1)*Z(4-1,4-1))*tmp;

Is it possible to send a test case? It would be better to find out why.

Also, can you check this report: "/Qopt-report-phase:ipi /Qopt-report-routine:the_func_name"? Does it say why the function is not inlined?

Jennifer

SergeyKostrov
Valued Contributor II
99 Views

>>...Also can you check this report: "/Qopt-report-phase:ipi /Qopt-report-routine:the_func_name". does it say why it is not inlined?

Thanks, Jennifer. Could you try a different test-case?

...
const float Pi = 3.141592653589793;

inline float area( const float r )
{
    return ( Pi * r * r );
}

void main( void )
{
    printf( "The area is: %f\n", area( area( area( area( area( 2.0 ) ) ) ) ) );
}
...
AndrewC
New Contributor I
99 Views

I'm looking into producing a report of why it is not inlined. When using /Qopt-report-routine:the_func_name, how do you specify a C++ template operator() as "the_func_name"?

Georg_Z_Intel
Employee
99 Views

Hello,

Please use the mangled names you already find in the /Qopt-report output.

Edit: To give an example, let's assume a template function "foo", defined like this...

[cpp]
template <class T> T foo(T x) { ... }
[/cpp]

The mangled name for foo<int> would look like ??$foo@H@@YAHH@Z, for foo<float> like ??$foo@M@@YAMM@Z, etc. There's a nice on-line C++ demangler; just look for c++filtjs.

Best regards,
Georg Zitzlsberger
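If the report is hard to search, one way to see that each instantiation gets its own name is to let the compiler print it. The sketch below is an assumption-laden illustration: it relies on the Microsoft `__FUNCDNAME__` macro for the decorated name (assumed here to be available in MSVC-compatible mode) and falls back to `__PRETTY_FUNCTION__` on GCC/Clang, which gives a readable signature instead of a mangled one. The names `SYMBOL_NAME` and `foo_name` are hypothetical.

```cpp
#include <string>

// Hypothetical helper macro: a per-instantiation name string.
#if defined(_MSC_VER)
#  define SYMBOL_NAME __FUNCDNAME__        // decorated (mangled) name
#else
#  define SYMBOL_NAME __PRETTY_FUNCTION__  // readable signature with template arguments
#endif

// Each instantiation of this template returns its own name string.
template <class T>
std::string foo_name(T) {
    return SYMBOL_NAME;
}
```

Calling `foo_name(1)` and `foo_name(1.0f)` yields two different strings, one per instantiation; on Windows these are the strings to look for in the /Qopt-report output.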
AndrewC
New Contributor I
99 Views

Using /Qopt-report there is a significant difference between 12.1 and 13.0 when inlining this function.

Just to make sure there is no confusion: this seems to be a very specific issue under very specific circumstances. Once this routine was "fixed", the performance of our benchmarks using 13.0 vs 12.1 was similar, if not better.

jimdempseyatthecove
Black Belt
99 Views

What happens when you use

...
inline T Matrix::operator()(const int i, const int j) const
...

Also, several months ago I had an issue where inline would not inline; however, replacing it with forceinline did work.
Later, inline would work again. I never figured out what triggered the behavior.

Jim Dempsey

AndrewC
New Contributor I
99 Views

I am looking at the inlining of the expressions that use z(i,j). Class T is a "DComplex" (DP Complex)

      template <class T>
      inline T Matrix::operator()(int i, int j) const
      {
          return data()[i*rowstep+j*colstep];
      }

FYI, undecoration of functions....

Undecoration of :- "??R?$RWGenMat@VDComplex@@@@QEBA?AVDComplex@@HH@Z"
is :- "public: class DComplex __cdecl RWGenMat<class DComplex>::operator()(int,int)const __ptr64"
Undecoration of :- "?data@?$RWGenMat@VDComplex@@@@QEBAPEBVDComplex@@XZ"
is :- "public: class DComplex const * __ptr64 __cdecl RWGenMat<class DComplex>::data(void)const __ptr64"
Undecoration of :- "??0DComplex@@QEAA@AEBV0@@Z"
is :- "public: __cdecl DComplex::DComplex(class DComplex const & __ptr64) __ptr64"

12.1 appears to inline all three of these functions in a call to z(i,j):

-> INLINE (MANUAL): ??R?$RWGenMat@VDComplex@@@@QEBA?AVDComplex@@HH@Z(751) (isz = 12) (sz = 25 (5+20))
1>      -> INLINE (MANUAL): ?data@?$RWGenMat@VDComplex@@@@QEBAPEBVDComplex@@XZ(753) (isz = 0) (sz = 6 (2+4))
1>      -> INLINE (MANUAL): ??0DComplex@@QEAA@AEBV0@@Z(752) (isz = 3) (sz = 12 (4+8))

In particular, 12.1 reports 378 inlines of DComplex const * __ptr64 __cdecl RWGenMat<class DComplex>::data(void)const __ptr64

13.0 does not report ANY inlines of this function. This seems to be what I am seeing ( performance drop due to no-inline )

13.0 reports something a bit "odd", that is not seen in the 12.1 report....is this a clue?
1>  IPO DEAD STATIC FUNCTION ELIMINATION;?data@?$RWGenMat@VDComplex@@@@QEBAPEBVDComplex@@XZ;0>
1>  DEAD STATIC FUNCTION ELIMINATION:
1>    (?data@?$RWGenMat@VDComplex@@@@QEBAPEBVDComplex@@XZ)
1>    Routine is dead extern
1>




SergeyKostrov
Valued Contributor II
99 Views

Hi everybody,

Here are a couple of notes:

Note 1: vasci_, did you check the report for the Debug or the Release configuration?

Note 2: I see that a call to another indexing C++ operator, data()[ offset ], is used in:

...
template <class T> inline T Matrix::operator()(int i, int j) const { return data()[i*rowstep+j*colstep]; }
...

I would use a pointer to a data set directly without calling the additional indexing C++ operator.
AndrewC
New Contributor I
99 Views

- This is release configuration

- not sure what you are referring to when you say "I would use a pointer to a data set directly without calling the additional indexing C++ operator.".

data() returns a "raw" pointer. The line of code data()[i*rowstep+j*colstep] is a "C" array operation, that is, simple pointer arithmetic. I am sure the compiler can deal with that.
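To make the point concrete: for a raw pointer, the built-in subscript `p[k]` is defined as `*(p + k)`, so no user-defined indexing operator is involved. A minimal sketch, with hypothetical function names:

```cpp
#include <cstddef>

// Built-in subscript on a raw pointer.
inline double via_subscript(const double* p, std::ptrdiff_t k) {
    return p[k];
}

// The same access spelled as explicit pointer arithmetic.
inline double via_arithmetic(const double* p, std::ptrdiff_t k) {
    return *(p + k);
}
```

With `k = i*rowstep + j*colstep`, both functions read the same element.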

AndrewC
New Contributor I
99 Views

Anyway, the bottom line for 12.1 vs 13.0 is:

  • Identical code
  • Identical compiler options
  • Different inlining results for a complex expression, which negatively affect the performance of my code.

I will try to bundle this up in an acceptable way for Premier Support.

SergeyKostrov
Valued Contributor II
99 Views

Thanks for the feedback.

>>...- not sure what you are referring to when you say "I would use a pointer to a data set directly without calling the additional
>>indexing C++ operator.".

Here is an example:

...
template < class T, ... > class TDataSet
{
public:
    ...
    inline T * operator[]( RTint iIndex )
    {
        ...
        return ( T * )m_ptData2D[ iIndex ];
    };
    ...
private:
    ...
    T **m_ptData2D;
    ...
};
Georg_Z_Intel
Employee
99 Views

Hello,

I've finally reproduced the problem and forwarded it to engineering (DPD200242982). As soon as I learn more I'll let you know.

Thank you & best regards,
Georg Zitzlsberger
AndrewC
New Contributor I
99 Views

Great! Thanks for the effort to track this down!

Georg_Z_Intel
Employee
99 Views

Hello,

Engineering just added an improvement for Intel(R) Composer XE 2013 SP1, which is currently in beta. The initial release will be available at the end of this year. 13.x won't get this improvement because we only do stability fixes there right now.

There's also a "workaround": if you use /Qip, you should get it in-lined with the current 13.x compilers as well.

Best regards,
Georg Zitzlsberger
SergeyKostrov
Valued Contributor II
29 Views

Hi Georg,

I wonder if Intel software engineers do regular performance evaluations of the Intel C++ compilers against other C++ compilers ( for different platforms ), at least for the most important /O-like options, like /O2. In a set of my recent tests the Intel C++ compiler did not perform well when compared to some older C++ compilers. Unfortunately, I can't provide any sources for these tests ( more than 2,500 code lines in total ), but the final ranking is as follows:

1. MinGW version 3.4.2 ( winner / released in 2004 / outperformed by ~5% )
2. Microsoft C++ compiler versions of VS 2005 & VS 2008
3. Intel C++ compiler versions v12.x & v13.x
4. Borland C++ compiler version 5.x

Note: The optimize-for-speed option /O2 was used in all test cases, for a recursive matrix multiplication algorithm with single-precision floating-point data types.

I understand that my information is too generic and doesn't provide enough technical details, but believe me that the code efficiency of the MinGW C++ compiler is really good (!).