MSVC++ 2008 SP1 outperforms Intel C++ 11.1.065 by a Factor of 900

TZeit · ‎06-12-2010

Hello,

we are using the ancient version 9.1 of the Intel C++ compiler (Windows) and are thinking about upgrading to the latest version. So I ran a small test, taken from a performance-critical part of a real-life application, to see the improvement for one specific task: multiplication and addition of arrays of floating point numbers.

My most significant result is: the Microsoft VC++ 2008 SP1 compiler outperforms v11.1 by a factor of ~900 under certain conditions and outperforms v10.1 and v9.1 under every condition I tested. There is also one condition where v11.1 outperforms MSVC++ by a factor of ~2.

Some other results (for this specific test case):

The Intel C++ compilers improved from v9.1 to 11.1.
v9.1 and v10.1 behave nearly identically.
General SSE2-only optimization performs a little better than specific SSE4.x optimization.
I had a look at some (not all) of the generated assembler instructions. It seems that none of the compilers makes efficient use of SSE (SIMD); they just replace the FP instructions with their scalar SSE counterparts.

The CPU used for the test was an Intel Core i7-860, the OS was Windows 7 x64. You can find the source code attached. I also tried to find useful compiler switches:


/O2 (MSVC) /O3 (Intel) /Ob2 /Oi /Ot /Oy /GS-

Core2   = /QxT
SSE4.1  = /QxS
SSE4.2  = /QxSSE4.2
SSE2    = /arch:SSE2
precise = /fp:precise
fast    = /fp:fast

--- Intel C++ 9.1.040
Core2 precise   : 14945505 s; sum=0.178366
Core2 fast      : 14255968 s; sum=0.177309
SSE2 precise    : 14404405 s; sum=0.178366
SSE2 fast       : 10423662 s; sum=0.177309
non-SSE precise : 23180518 s; sum=0.178366
non-SSE fast    : 11306941 s; sum=0.178911

--- Intel C++ 10.1.032
SSE4.1 precise  : 14924068 s; sum=0.178366
SSE4.1 fast     : 10351496 s; sum=0.177309
SSE2 precise    : 14510782 s; sum=0.178366
SSE2 fast       : 10462389 s; sum=0.177309
non-SSE precise : 22592039 s; sum=0.178366
non-SSE fast    : 11324479 s; sum=0.178911

--- Intel C++ 11.1.065
SSE4.2 precise  :    14089 s; sum=0.178366
SSE4.2 fast     : 10662940 s; sum=0.177309
SSE2 precise    :    12856 s; sum=0.178366
SSE2 fast       : 10522354 s; sum=0.177309
non-SSE precise :    21797 s; sum=0.178366
non-SSE fast    : 11586162 s; sum=0.178638

--- MSVC++ 2008 SP1
SSE2 precise    :    26343 s; sum=0.178366
SSE2 fast       :    11577 s; sum=0.178366
non-SSE precise :    23546 s; sum=0.178366
non-SSE fast    :    12806 s; sum=0.180000

Look at the "SSE2 fast" configuration, which we currently use for the project this little test comes from. You can see that MSVC++ outperforms v11.1.065 by a factor of about 909 (10522354/11577 ≈ 908.9).

I hope there is some room for improvement for the future versions of the Intel C++ compiler.

UPDATE: The huge difference seemed increasingly strange to me. The factor of 900 is also quite close to 1000, which is the number of iterations my test case uses to repeat the same calculation. After some more testing it became clear that the compilers detect or assume that the input data never changes and that the final result of the calculation will always be the same. Thus they execute only 1 iteration instead of 1000. The major difference between the compilers therefore results from their code analysis, not from the code they generate for the actual calculation. This also means my test case suffers from a design flaw, since I wanted to compare the performance of the actual calculation. Still, it at least tests the code analysis skills of the compilers, which can obviously result in big differences. It seems strange that this analysis works differently in v11.1.065 for the precise and fast floating point configurations.
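
A minimal sketch of how the test could be restructured so that the 1000 passes cannot simply be collapsed into one. The attached source is not reproduced in this post, so the kernel and the names `multiply_add_sum`, `run_benchmark` and `BenchResult` are hypothetical stand-ins. The idea is only that every pass changes the input a little and contributes to a result that is used afterwards:

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <vector>

// Hypothetical stand-in for the real kernel: multiply and add arrays of
// floats and return the sum. b and c must be at least as long as a.
float multiply_add_sum(const std::vector<float>& a,
                       const std::vector<float>& b,
                       const std::vector<float>& c)
{
    assert(b.size() >= a.size() && c.size() >= a.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i] + c[i];
    return sum;
}

struct BenchResult {
    double seconds;   // CPU time as reported by std::clock()
    float  checksum;  // accumulated over all passes, so no pass is dead code
};

BenchResult run_benchmark(std::vector<float>& a,
                          const std::vector<float>& b,
                          const std::vector<float>& c,
                          int iterations)
{
    BenchResult result;
    result.seconds = 0.0;
    result.checksum = 0.0f;
    if (a.empty())
        return result;

    const std::clock_t start = std::clock();
    for (int it = 0; it < iterations; ++it) {
        // Change one input element per pass, so the result of the previous
        // pass cannot be reused as-is.
        a[static_cast<std::size_t>(it) % a.size()] += 1.0f;
        result.checksum += multiply_add_sum(a, b, c);
    }
    result.seconds = static_cast<double>(std::clock() - start) / CLOCKS_PER_SEC;
    return result;
}
```

Printing `checksum` after the run keeps the whole calculation observable. This does not guarantee that a compiler performs no clever transformation at all, but the passes are no longer identical, which was the flaw in my original test.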

Regards, Toni

Mark_S_Intel1 · ‎06-14-2010

Toni,

Thanks for the problem report and the test case. I will look into the issue and get back to you.

--mark

Andreas_S_ · ‎06-19-2010

Hi Toni

That's a classic benchmark pitfall:

Never use constants for benchmarks.

(
const int iCount = 1000 * 1000;
const int iIterations = 1000;
)

The compilers then solve a lot of the calculations at compile time rather than at run time. Use variables marked as volatile instead. This prevents the precalculation.
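
A rough sketch of what that could look like. `iCount` and `iIterations` are the names from the test case. The kernel, `run_kernel` and the extra volatile `fSeed` are assumptions of mine for illustration, not the original code. Reading `fSeed` into the input on every pass is there because volatile sizes alone would not stop a compiler from noticing that every pass computes the same thing:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// volatile: the compiler has to load these at run time and cannot treat
// them as compile-time constants.
volatile int   iCount      = 1000 * 1000;
volatile int   iIterations = 1000;
volatile float fSeed       = 0.5f;   // hypothetical extra input value

float run_kernel()
{
    const int count      = iCount;       // one volatile read each
    const int iterations = iIterations;
    assert(count > 0 && iterations > 0);

    std::vector<float> a(static_cast<std::size_t>(count), 0.5f);
    std::vector<float> b(static_cast<std::size_t>(count), 2.0f);

    float total = 0.0f;
    for (int it = 0; it < iterations; ++it) {
        // Volatile read on every pass: the compiler must assume the input
        // may differ each time.
        a[0] = fSeed;

        float sum = 0.0f;
        for (int i = 0; i < count; ++i)
            sum += a[i] * b[i];

        total += sum;   // every pass contributes to the returned value
    }
    return total;
}
```

Print the returned value at the end so the calculation is actually used. With /fp:fast a compiler is still allowed to reorder the additions, so treat this as a way to reduce precalculation, not as a guarantee.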

Regards, as

Mark_S_Intel1 · ‎07-15-2010

Toni,

I reproduced the performance issue and filed a report. I will let you know when this issue is resolved.

Thanks,
--mark
