I'm currently testing Intel C++ 13.0 for our uses, but I'm coming across some strange performance issues. Some of our unit tests, when compiled with it, show a 50% speed drop compared with MSVC++.
Using VTune, I found some machine clear issues emanating from our hot loops. Looking in more detail at the assembler output revealed that the std::vector array accesses weren't being inlined, but were instead using a function call!
Here's the assembler output. I am using /O2 /Qip /Oi, and I've also tried /Ox.
000000013FBDE189 mov rcx,qword ptr [rbp+120h]
000000013FBDE190 mov edx,dword ptr [yx]
000000013FBDE193 add edx,dword ptr
000000013FBDE196 movsxd rdx,edx
000000013FBDE199 call std::vector<float, std::allocator<float> >::operator[] (013FB76880h)
000000013FBDE19E mov r12,rax
000000013FBDE1A1 mov rcx,qword ptr [rbp+120h]
000000013FBDE1A8 mov edx,dword ptr [yx]
000000013FBDE1AB add edx,dword ptr
000000013FBDE1AE movsxd rdx,edx
000000013FBDE1B1 call std::vector<float, std::allocator<float> >::operator[] (013FB76880h)
Any idea why the compiler is using a call instead of a direct array access here? It's just a std::vector<float> object.
I've been trying loads of different compiler options to try to get it inlined, but to no avail. Can anyone help?
As you indicated, there are quite a few options to adjust in-lining limits in ICL (mostly visible in ICL /help). I don't have personal experience with std::allocator but I suppose it may require an external library call even if you succeed in in-lining the stl template, so it may be difficult for the compiler to make exactly the best choice for you. In case of doubt, the compiler may choose limits which avoid excessive in-lining, compile time memory consumption, and code expansion.
ICL 14.0 has introduced more aggressive optimization for some cases of STL; I don't know if that would apply here. The release should be available within a few days.
If you require a professional opinion, you may need to provide a reproducer which shows how MSVC++ is more efficient for you.
It shouldn't require an external function call. MSVC manages to inline the vector access just fine. I've noticed the same thing happening with other small functions as well, which are marked inline in headers, yet are appearing as actual calls inside very hot loops. Only tiny template functions.
Is there a way to get the diagnostics to reveal why a function wasn't inlined? I tried looking at the Interprocedural Optimizer Phase diagnostics, but they never mentioned anything about the offending functions.
I will give ICL 14 a go when it is released.
Try using __forceinline as opposed to inline. I had a similar issue with this too. Depending on some unknown factor, the compiler would or would not elect to inline. __forceinline fixed this for me.
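As a rough sketch of that suggestion, a small header function can be marked through a portability macro. FORCE_INLINE, element_at and sum_all are hypothetical names made up for illustration, not anything from the original code. Note that this only forces the wrapper itself to be inlined. It is an assumption, not a guarantee, that the compiler then also inlines std::vector::operator[] inside it.

```cpp
#include <cstddef>
#include <vector>

// FORCE_INLINE is a hypothetical macro name chosen for this sketch.
// ICL on Windows defines _MSC_VER, so it takes the __forceinline branch.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

// Hypothetical stand-in for one of the tiny header functions that showed up
// as a real call in the hot loop.
FORCE_INLINE float element_at(const std::vector<float>& v, std::size_t i)
{
    return v[i];
}

// Hot-loop stand-in: every iteration goes through the force-inlined helper.
float sum_all(const std::vector<float>& v)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += element_at(v, i);
    return total;
}
```

Checking the disassembly of sum_all afterwards is still the only way to confirm whether a call remains.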
Note, the aggressive optimizations, including inlining, now seem to require IPO to be enabled.
Btw, all std::vectors are defined as taking an allocator, which defaults to the standard allocator. This is the same for all std container classes AFAIK. (see 2nd param @ http://msdn.microsoft.com/en-us/library/vstudio/9xd04bzs.aspx)
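A quick way to convince yourself of this is a compile-time check. This is only a sketch and assumes the compiler has C++11 static_assert and <type_traits> available.

```cpp
#include <memory>
#include <type_traits>
#include <vector>

// The allocator is a defaulted second template parameter, so both spellings
// name exactly the same type. That is why the disassembly shows
// std::vector<float, std::allocator<float> > for a plain std::vector<float>.
static_assert(std::is_same<std::vector<float>,
                           std::vector<float, std::allocator<float> > >::value,
              "std::vector<float> uses std::allocator<float> by default");

// allocator_type exposes the allocator a given instantiation was built with.
static_assert(std::is_same<std::vector<float>::allocator_type,
                           std::allocator<float> >::value,
              "allocator_type is std::allocator<float>");
```

If either assertion were false the file would fail to compile, so a successful build is the result.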
I will sort out some example code in the morning, but it's nothing out of the ordinary. Just declare a vector, reserve and access elements with [ ].
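In the meantime, a minimal sketch of what such a reproducer might look like is below. The function names and the index arithmetic (base + x) are assumptions loosely modelled on the disassembly, not the actual test code. One caveat: reserve only sets capacity, so the sketch fills the vector with push_back before indexing it, since operator[] on elements that don't exist yet is undefined behaviour.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the hot loop. The line indexing data[] is the
// call site to inspect in the disassembly.
float sum_row(const std::vector<float>& data, int base, int width)
{
    float total = 0.0f;
    for (int x = 0; x < width; ++x)
        total += data[static_cast<std::size_t>(base + x)];
    return total;
}

// Builds a vector holding 0, 1, 2, ..., n-1.
std::vector<float> make_ramp(std::size_t n)
{
    std::vector<float> v;
    v.reserve(n);  // capacity only; size() is still 0 here
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<float>(i));  // create the elements before indexing them
    return v;
}
```

Building this with the same /O2 /Qip /Oi options under both compilers and comparing the code generated for sum_row should show whether the call to operator[] appears in isolation.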