I'm evaluating ICC 11.1.054 for compiling our large C++ application. I originally wanted to compare its OpenMP implementation with that of MSVC 2005, but when compiling our application and running some test cases I see a significant slowdown (1.7x and 3.1x) in two important regions of our code.
Other important regions are showing a (smallish) speedup under ICC, which is what I expected.
The only difference I can see between the regions that are faster and those that are slowed down is that the latter are very heavy on templates and rely on significant inline expansion etc. to produce efficient code.
Have other people seen this? Can anyone suggest a solution?
Under MSVC 2005 we compile with: -O2 -Oi -Ot -Oy -Ob2
I have tried this with ICC, and also tried /Ox and (just) /O2, without any variation in timings.
I appreciate any help.
This may be hard for you to do. Start by producing an assembly listing (with hex code) of the template expansions. You can compare the MSVC and ICC listings to see where the code differs.
Often you will find there is a problem in your templates that is exposed by one compiler or the other. Often this relates to not using "const" in the template when appropriate. You may find other changes to the templates that will improve performance (for both compilers).
Template metaprogramming is the only "programming language" I know of where you write hundreds of lines of code to produce _no_ binary output (in other words, the optimizer is able to eliminate all or part of the template code).
Thank you for your response. Unfortunately, I really have no idea how to usefully compare the assembly listings.
One thought I had was to use VTune to look for hotspots in the ICC-generated code. But alas, I currently can't run VTune.
I'm not really using template metaprogramming - just compute-heavy algorithms where the inner loops call functions of a templated type. I do use boost smart-pointers and STLport (hash_map etc).
For the record, I did also try /Qip, /Qipo and /QxHost, all of which slowed it down further.
Template-expanded code is generally inlined, so a profiler will not have separate hot spots to home in on.
Compile your code using one of the compilers and place a break point on an instantiation of a template that is giving you performance problems. When you reach the break point, open the debugger disassembly window. Use the mouse to select and copy the section (or Alt-Print-Screen) and paste it into Notepad (for text) or Paint/Word (for a window snapshot).
Then repeat with the code built by the other compiler.
You do not need to understand assembly code just the general concept
more assembly code == slower code
Note, do this test in a Release build because you want to investigate the optimized code. Template code in a Debug build tells you nothing about what will be generated in a Release build.
Using the disassembly window you might find places where a value is copied to the stack and read back unnecessarily. This is generally due to an unnecessary (to you) temporary being created and used. This often occurs when an argument to a template could be "const" but isn't specified as "const". In some cases the C++ specification _requires_ a copy to be made. A non-compliant compiler might avoid making the copy (and produce the faster code). Correcting the template is the proper way to address the issue.
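To make the point about required copies concrete, here is a minimal sketch. The type Tracked and the function names are made up for illustration and are not from the original poster's code. It counts copy-constructor calls when a template takes its argument by value versus by const reference.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical payload type that counts how many times it is copied.
struct Tracked {
    static int copies;
    std::vector<double> data;

    explicit Tracked(std::size_t n) : data(n, 1.0) {}
    Tracked(const Tracked& other) : data(other.data) { ++copies; }
};
int Tracked::copies = 0;

// Parameter taken by value: every call with a named object must copy it.
template <typename T>
double sum_by_value(T item) {
    double total = 0.0;
    for (std::size_t i = 0; i < item.data.size(); ++i) total += item.data[i];
    return total;
}

// Parameter taken by const reference: no copy is needed.
template <typename T>
double sum_by_const_ref(const T& item) {
    double total = 0.0;
    for (std::size_t i = 0; i < item.data.size(); ++i) total += item.data[i];
    return total;
}

// Calls the by-value template `calls` times and returns the copy count.
int copies_made_by_value(int calls) {
    Tracked t(16);
    Tracked::copies = 0;
    double total = 0.0;
    for (int i = 0; i < calls; ++i) total += sum_by_value(t);
    assert(total == calls * 16.0);  // both versions compute the same sum
    return Tracked::copies;
}

// Same loop through the const-reference template.
int copies_made_by_const_ref(int calls) {
    Tracked t(16);
    Tracked::copies = 0;
    double total = 0.0;
    for (int i = 0; i < calls; ++i) total += sum_by_const_ref(t);
    assert(total == calls * 16.0);
    return Tracked::copies;
}
```

Because the copy constructor here has a visible side effect, the compiler is not allowed to drop the copy in the by-value version, and each of those copies also allocates a new vector buffer. With a trivially copyable type the optimizer may well remove the copy, so treat this as an illustration of the rule rather than a measurement.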
This turned out to be my own fault, unrelated to ICC.
I was linking the ICL version with the default CRT allocator under WinXP, but the MSVC version was using the SmartHeap allocator. And the sections that were slowed down are apparently allocation-heavy enough that they were massively slowed down by the default allocator.
Actually just switching all heaps to low-fragmentation mode (http://msdn.microsoft.com/en-us/library/aa366750%28VS.85%29.aspx) gives better performance than even SmartHeap.
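For anyone wanting to try the same thing, a rough sketch of switching every heap in the process to low-fragmentation mode follows. It uses the Win32 calls GetProcessHeaps and HeapSetInformation described on the linked MSDN page. The function name enable_low_fragmentation_heaps is my own, and the non-Windows branch exists only so the snippet compiles elsewhere.

```cpp
#include <cassert>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Tries to switch every heap in the current process to the
// low-fragmentation heap (LFH). Returns the number of heaps for which
// the call succeeded; always 0 when not built for Windows.
int enable_low_fragmentation_heaps() {
#ifdef _WIN32
    // First call asks only for the number of heaps.
    DWORD count = GetProcessHeaps(0, NULL);
    if (count == 0) return 0;

    std::vector<HANDLE> heaps(count);
    assert(!heaps.empty());
    DWORD got = GetProcessHeaps(count, &heaps[0]);
    // A larger result means heaps were created in between; give up
    // rather than work from a truncated list.
    if (got == 0 || got > count) return 0;

    int enabled = 0;
    for (DWORD i = 0; i < got; ++i) {
        ULONG mode = 2;  // 2 selects the low-fragmentation heap
        if (HeapSetInformation(heaps[i], HeapCompatibilityInformation,
                               &mode, sizeof(mode))) {
            ++enabled;
        }
    }
    return enabled;
#else
    return 0;  // LFH is a Windows feature; nothing to do elsewhere
#endif
}
```

Call it once early in startup, before the allocation-heavy work begins. As far as I know the call fails for some heaps (for example those created with HEAP_NO_SERIALIZE) and when heap debugging options are active, such as when the process is started under a debugger, so a return value lower than the heap count is not necessarily an error.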
The Intel compiler *did* speed up these template-heavy sections after all, one of them even 3x. I haven't (yet) found any regions where it is slower than MSVC.
But thanks to all who answered - I will look into using 'restrict' in the future & maybe comparing template code-sizes.
>>But thanks to all who answered - I will look into using 'restrict' in the future & maybe to comparing template code-sizes.
Larger code size is not always slower code, sometimes it is faster code. An example is register packing/unpacking for small copy operations as opposed to calling a library function to copy the data.
Timing test runs of the program is the only certain way to tell.
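A minimal timing harness along those lines might look like the sketch below. It assumes a C++11 compiler for std::chrono and lambdas, which is newer than the compilers discussed in this thread. The names best_seconds and allocation_heavy are made up for illustration; the workload is only a stand-in for an allocation-heavy region.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical allocation-heavy workload: many small heap blocks.
// Returns the total number of ints allocated, as a simple checksum.
std::size_t allocation_heavy(std::size_t n) {
    std::vector<std::vector<int> > blocks;
    for (std::size_t i = 0; i < n; ++i) {
        blocks.push_back(std::vector<int>(1 + i % 8, 1));  // 1..8 ints each
    }
    std::size_t checksum = 0;
    for (std::size_t i = 0; i < blocks.size(); ++i) checksum += blocks[i].size();
    return checksum;
}

// Runs `work` `runs` times and returns the fastest wall-clock time in
// seconds. Taking the minimum reduces noise from other processes.
template <typename F>
double best_seconds(F work, int runs) {
    assert(runs > 0);
    double best = -1.0;
    for (int r = 0; r < runs; ++r) {
        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
        work();
        std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(stop - start).count();
        if (best < 0.0 || s < best) best = s;
    }
    return best;
}
```

Usage would be something like best_seconds([] { allocation_heavy(100000); }, 5), built once with each compiler (or each allocator) using the same Release flags, then comparing the numbers. Make sure the result of the workload is actually used somewhere, or the optimizer may remove the work you are trying to time.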
Glad you found your problem.