Intel® C++ Compiler

MSVC++ 2008 SP1 outperforms Intel C++ 11.1.065 by a Factor of 900


We are using the ancient version 9.1 of the Intel C++ compiler (Windows) and are thinking about upgrading to the latest version. So I ran a small test taken from a performance-critical part of a real-life application to see the improvement for one specific task: multiplication and addition of arrays of floating-point numbers.

My most significant result is: the Microsoft VC++ 2008 SP1 compiler outperforms v11.1 by a factor of ~900 under certain conditions and outperforms v10.1 and v9.1 under every condition I tested. There is also one condition where v11.1 outperforms MSVC++ by a factor of ~2.

Some other results (for this specific test case):
  • The Intel C++ compilers improved from v9.1 to 11.1.
  • v9.1 and v10.1 behave almost identically.
  • General SSE2-only optimization performs a little better than specific SSE4.x optimization.
  • I had a look at some (not all) of the generated assembly instructions. It seems that none of the compilers makes efficient use of SSE (SIMD); they just replace the FP instructions with their SSE counterparts.
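
To illustrate the last point, here is a minimal sketch of the difference between element-by-element code and code that actually uses SSE as SIMD. It assumes the kernel is an element-wise out[i] = a[i] * b[i] + c[i]; the attached source may differ, and the function names are hypothetical. The packed version uses the standard SSE intrinsics from <xmmintrin.h> and falls back to the plain loop when SSE is not available.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

// One element per step. Compiled for SSE without vectorization, this
// typically becomes one scalar multiply and one scalar add per element.
void mul_add_scalar(const float* a, const float* b, const float* c,
                    float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];
}

// Four elements per step using packed SSE instructions.
// Unaligned loads/stores are used so the arrays need no special alignment.
void mul_add_packed(const float* a, const float* b, const float* c,
                    float* out, std::size_t n)
{
    std::size_t i = 0;
#ifdef HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        const __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
    }
#endif
    // Remaining elements (or everything, if SSE is not available).
    for (; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];
}
```

Both functions should produce the same values; whether the packed version is actually faster depends on the compiler, the switches and memory bandwidth, so it needs to be measured.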

The CPU used for the test was an Intel Core i7-860, the OS was Windows 7 x64. You can find the source code attached. I also tried to find useful compiler switches:

Common switches: /O2 (MSVC) /O3 (Intel) /Ob2 /Oi /Ot /Oy /GS-

  Core2   = /QxT
  SSE4.1  = /QxS
  SSE4.2  = /QxSSE4.2
  SSE2    = /arch:SSE2
  precise = /fp:precise
  fast    = /fp:fast

--- Intel C++ 9.1.040
  Core2 precise   : 14945505 s; sum=0.178366
  Core2 fast      : 14255968 s; sum=0.177309
  SSE2 precise    : 14404405 s; sum=0.178366
  SSE2 fast       : 10423662 s; sum=0.177309
  non-SSE precise : 23180518 s; sum=0.178366
  non-SSE fast    : 11306941 s; sum=0.178911

--- Intel C++ 10.1.032
  SSE4.1 precise  : 14924068 s; sum=0.178366
  SSE4.1 fast     : 10351496 s; sum=0.177309
  SSE2 precise    : 14510782 s; sum=0.178366
  SSE2 fast       : 10462389 s; sum=0.177309
  non-SSE precise : 22592039 s; sum=0.178366
  non-SSE fast    : 11324479 s; sum=0.178911

--- Intel C++ 11.1.065
  SSE4.2 precise  :    14089 s; sum=0.178366
  SSE4.2 fast     : 10662940 s; sum=0.177309
  SSE2 precise    :    12856 s; sum=0.178366
  SSE2 fast       : 10522354 s; sum=0.177309
  non-SSE precise :    21797 s; sum=0.178366
  non-SSE fast    : 11586162 s; sum=0.178638

--- MSVC++ 2008 SP1
  SSE2 precise    :    26343 s; sum=0.178366
  SSE2 fast       :    11577 s; sum=0.178366
  non-SSE precise :    23546 s; sum=0.178366
  non-SSE fast    :    12806 s; sum=0.180000

Look at the "SSE2 fast" configuration, which we currently use for the project this little test comes from. There, MSVC++ outperforms v11.1.065 by a factor of about 909 (10522354/11577).

I hope there is some room for improvement in future versions of the Intel C++ compiler.

UPDATE: The huge difference seemed stranger and stranger to me. The factor of 900 is also quite close to 1000, which is the number of iterations my test case uses to repeat the same calculation. After some more testing it became clear that the compilers detect or assume that the input data does not change and that the final result of the calculation will always be the same. Thus they execute only 1 iteration instead of 1000. The major difference between the compilers results from their code analysis, not from the code they generate for the actual calculation. This also means my test case suffers from a design flaw, since I wanted to compare the performance of the actual calculation. Still, it at least tests the code analysis skills of the compilers, which can obviously result in big differences. It seems strange that this analysis works differently in v11.1.065 for the precise and fast floating-point configurations.
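
One way to avoid this flaw is to make every repetition depend on data that changes, and to consume every result. The following is only a sketch under assumptions: the kernel (sum of a[i] * b[i] + c[i]), the input values and all names are hypothetical stand-ins for the attached source. It makes it much harder for a compiler to collapse the repetitions into one pass, but it is not a guarantee.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical kernel standing in for the original test.
float multiply_add_sum(const std::vector<float>& a,
                       const std::vector<float>& b,
                       const std::vector<float>& c)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i] + c[i];
    return sum;
}

struct BenchResult {
    double seconds;   // wall-clock time for all repetitions
    float  checksum;  // accumulates the result of every repetition
};

// Assumes count > 0.
BenchResult run_benchmark(std::size_t count, int iterations)
{
    std::vector<float> a(count, 0.5f), b(count, 0.25f), c(count, 0.125f);
    float checksum = 0.0f;

    const auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it) {
        // Change one input element per repetition, so the passes are
        // not provably identical.
        a[static_cast<std::size_t>(it) % count] += 1.0f;
        // Use every result, so no pass is dead code.
        checksum += multiply_add_sum(a, b, c);
    }
    const auto stop = std::chrono::steady_clock::now();

    return { std::chrono::duration<double>(stop - start).count(), checksum };
}
```

Printing the checksum together with the time also gives a quick sanity check that all configurations computed the same thing.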

Regards, Toni
3 Replies

Thanks for the problem report and the test case. I will look into the issue and get back to you.


Hi Toni

That's a classic benchmark pitfall:

Never use constants for benchmarks.

const int iCount = 1000 * 1000;
const int iIterations = 1000;

The compilers then solve a lot of the calculations at compile time instead of at run time. Use variables marked as volatile instead. This prevents precalculation.
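
A minimal sketch of what this could look like, assuming a simple sum-of-products kernel (the attached source may differ). iCount and iIterations are the names from the post; fScale, fSink and run_test are hypothetical. Note that volatile sizes alone only stop compile-time evaluation, so the sketch also re-reads a volatile input factor in every repetition and writes each result to a volatile sink.

```cpp
#include <cstddef>
#include <vector>

// Same values as in the post, but volatile: the compiler has to load them
// at run time and cannot treat them as compile-time constants.
volatile int iCount = 1000 * 1000;
volatile int iIterations = 1000;

// Hypothetical helpers: an input factor the compiler cannot assume to be
// unchanged, and a sink that makes every result observable.
volatile float fScale = 1.0f;
volatile float fSink = 0.0f;

float run_test()
{
    const int count = iCount;            // one volatile read each
    const int iterations = iIterations;

    std::vector<float> a(static_cast<std::size_t>(count), 0.5f);
    std::vector<float> b(static_cast<std::size_t>(count), 0.25f);

    float sum = 0.0f;
    for (int it = 0; it < iterations; ++it) {
        const float scale = fScale;      // re-read in every repetition
        sum = 0.0f;
        for (int i = 0; i < count; ++i)
            sum += scale * a[i] * b[i];
        fSink = sum;                     // volatile write, once per repetition
    }
    return sum;
}
```

With /fp:fast a compiler is still allowed to reorder the arithmetic, so it is worth checking the generated assembly to confirm that the inner loop really runs in every repetition.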

Regards as

I reproduced the performance issue and filed a report. I will let you know when this issue is resolved.