Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Composer XE 2013.0 Update 1 generates slower code than 12.1

AndrewC
New Contributor III
525 Views

After switching from 12.1 to Composer XE 2013 (Update 1, Windows 64-bit) I am seeing a consistent 10-15% slowdown across the board (code is built and benchmarked on a quad-core Xeon). The C++ code is compiled with /O3, no auto-parallelization.

Is this a known issue to be fixed in an update?

0 Kudos
32 Replies
Armando_Lazaro_Alami
102 Views

Hi. Reading this topic, I decided to re-compile some of my own test codes with Intel 12.1.4.325, Intel 13.1.0.149 and MSVC 2008. Here are some results, in seconds:

                13.1.0.149    12.1.4.325    MSVC 2008
NNS brute          43.87         52.55        70.40
NNS NearPT          0.363         0.411        0.2893
NNS kdtree          1.138         1.152        1.074
Mixed code          3.732         3.649        4.937

All compilations used equivalent optimization options, generating 32-bit code, running under Windows 7 on an i7 920, with the target processor set to SSE2 only. NNS is "Nearest Neighbor Search", in three flavors. "Mixed code" uses 64-bit integers and floating-point operations, including some transcendental functions.

0 Kudos
Armando_Lazaro_Alami
102 Views

Hi Sergey. I was impressed by the times you reported with MinGW, so I tested the latest version available. But my results are very poor for the same test cases as in my previous post. I am not familiar with MinGW; my switches were -O2 -msse3. Results:

NNS  brute       107.06

NNS NearPt     1.989

NNS kdtree      5.951

Mixed code     7.336  (very good for 64-bit ints, very bad for transcendental functions and Taylor series)

Could you please recommend better switches for maximum performance with SSE3 under MinGW?

Thanks.  

0 Kudos
AndrewC
New Contributor III
102 Views

The original issue I reported was a very specific one related to inlining inside a 'bottleneck' function. Once this specific issue was worked around, I found no other performance issues with compiler 13.1.
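
( Editor's note: the actual workaround is not shown in the thread; the following is a hypothetical sketch of the general pattern, with illustrative names only. When the compiler stops inlining a small helper on a hot path, forcing the inline with __forceinline, which ICL and MSVC accept on Windows, can restore the lost performance. )

    // Hypothetical illustration only - not the code from this thread.
    static __forceinline double clamp01(double x)      // small helper on the hot path
    {
        return x < 0.0 ? 0.0 : (x > 1.0 ? 1.0 : x);
    }

    void bottleneck(double* data, int n)                // the 'bottleneck' function
    {
        for (int i = 0; i < n; ++i)
            data[i] = clamp01(data[i] * 0.5 + 0.25);    // this call must stay inlined
    }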

0 Kudos
SergeyKostrov
Valued Contributor II
102 Views
[ Armando wrote ]
>>... I am not familiar with MinGW, my switches were -O2 -msse3

These options look right.

[ Vasci_ wrote ]
>>...The original issue I reported was a very specific one related to inlining inside a 'bottleneck' function. Once this specific issue was worked around, I found no other performance issues with compiler 13.1...

Everybody wants to get the best return from the Intel C++ compiler, and that is why these performance evaluations are done. Personally, I want to be confident in the Intel C++ compiler and, of course, I understand that it is simply impossible for Intel software developers / engineers to test the compiler with tens of thousands of different algorithms. The C/C++ codes I've tested are very portable, and I'm concerned that there are no performance gains in some cases. I'm less concerned when some "cool" C++11 feature is partially supported or not supported at all. I simply need to do the processing as fast as possible.
0 Kudos
SergeyKostrov
Valued Contributor II
102 Views
Here is a short follow-up... I've done additional verification and I see that the situation is more complicated. The MinGW C++ compiler ( v3.4.2 ) outperformed the Intel C++ compiler ( v12.x ) by ~19 percent, and this is a significant difference.

[ MinGW C++ compiler - -O2 ]
...
Matrix Size : 1024 x 1024
...
Strassen HBC - Pass 1 - Completed: 4.32800 secs
Strassen HBC - Pass 2 - Completed: 2.57800 secs
Strassen HBC - Pass 3 - Completed: 2.57800 secs
Strassen HBC - Pass 4 - Completed: 2.56300 secs
  Note: Best time ( BT1 ) for MinGW - ~19 percent faster than BT2
Strassen HBC - Pass 5 - Completed: 2.57800 secs
...

[ Intel C++ compiler - /O2 ]
...
Matrix Size : 1024 x 1024
...
Strassen HBC - Pass 1 - Completed: 4.92200 secs
Strassen HBC - Pass 2 - Completed: 3.20300 secs
Strassen HBC - Pass 3 - Completed: 3.18700 secs
  Note: Best time ( BT2 ) for ICC
Strassen HBC - Pass 4 - Completed: 3.20400 secs
Strassen HBC - Pass 5 - Completed: 3.18700 secs
...

Do you want me to compare the MinGW C++ compiler ( v3.4.2 ) vs. the Intel C++ compiler ( v13.x ) on the following system?

Dell Precision Mobile M4700
Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit
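
( Editor's note: the benchmark source is not posted in this thread. A minimal, self-contained sketch of the pass-timing scheme described above might look like the following; the kernel here is only a placeholder for the real Strassen HBC code, and all names are illustrative. )

    #include <stdio.h>
    #include <time.h>

    /* Placeholder for the real Strassen HBC kernel - does some arbitrary work. */
    static double benchmark_kernel(int n)
    {
        double x = 0.0;
        for (long i = 0; i < (long)n * n; ++i)
            x += (double)i * 0.5;
        return x;
    }

    int main(void)
    {
        const int passes = 5;
        double best = 1.0e30;
        volatile double sink = 0.0;     /* keeps the kernel from being optimized away */

        for (int p = 1; p <= passes; ++p)
        {
            clock_t t0 = clock();
            sink += benchmark_kernel(1024);                  /* 1024 x 1024 case */
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

            printf("Strassen HBC - Pass %d - Completed: %.5f secs\n", p, secs);
            if (secs < best)
                best = secs;                                 /* compilers are compared on best time */
        }
        printf("Best time: %.5f secs\n", best);
        return 0;
    }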
0 Kudos
Armando_Lazaro_Alami
102 Views

Hi Sergey. Your results are astonishing to me! My source code is not C++, it is C99, which could make a big difference for the compiler optimizer (only my guess). In our case performance is very important, because we develop medical physics systems for radiation therapy planning and image-guided neurosurgery; in some regions a solution takes several seconds or minutes. I frequently use a battery of test cases in C99 that resembles real problems. I will try some other codes with MinGW, but my first impressions with it were frustrating. For the Intel compiler I mostly use:

/O3 /Ob2 /Oi /Ot /Oy /Qip /GA  /GF /MT  /GS- /arch:SSE2 /fp:fast=2  /Qfp-speculation:fast /fp:double /Qparallel /Qstd=c99 

The use of a conservative SSE2 target is to avoid problems with users keeping old hardware, and with some AMD CPUs. Some of these switches are unknown to me in MinGW, for example /Qparallel.
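
( Editor's note: a rough, approximate mapping of some of those Intel switches onto their nearest GCC/MinGW counterparts is sketched below. The flags are not exact equivalents, and a couple of them require a newer GCC than the MinGW 3.4.x mentioned earlier in the thread. )

  /O3          ->  -O3
  /arch:SSE2   ->  -msse2 -mfpmath=sse          (32-bit builds)
  /fp:fast=2   ->  -ffast-math                  (relaxed floating-point semantics)
  /Oy          ->  -fomit-frame-pointer
  /Qip         ->  same-file inlining is already on at -O2/-O3; -flto for cross-file (GCC 4.5+)
  /Qparallel   ->  -ftree-parallelize-loops=N   (auto-parallelizer, GCC 4.3+)
  /Qstd=c99    ->  -std=c99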

When I have new results I will post them here.

Armando

0 Kudos
SergeyKostrov
Valued Contributor II
102 Views
>>...The use of a conservative SSE2 target is to avoid problems with users keeping old hardware, and with some AMD CPUs...

This is absolutely the right approach, and I cannot assume that everybody has a computer with a latest Intel CPU supporting the AVX or AVX2 instruction sets. I also don't think that the choice of pure C, C99, C++ or C++11 should affect a compiler's optimization. For example, Borland C++ is a 15+ year old technology, and it outperforms almost all modern C++ compilers when no optimization switches are used at all when compiling sources. It is a clear indication of how efficient its default code generation is. I have been trying to bring this matter to the attention of Intel software engineers for some time, and I'm not sure that resolute steps to reduce the overheads of the Intel C++ compiler's default code generation have been taken. Thanks for the information about your software; it is very interesting to know.
0 Kudos
TimP
Honored Contributor III
102 Views

Armando Lazaro Alaminos Bouza wrote:

  My source code is not C++, it is C99, which could make a big difference for the compiler optimizer (only my guess). In our case performance is very important, because we develop medical physics systems for radiation therapy planning and image-guided neurosurgery; in some regions a solution takes several seconds or minutes. I frequently use a battery of test cases in C99 that resembles real problems. I will try some other codes with MinGW, but my first impressions with it were frustrating. For the Intel compiler I mostly use:

/O3 /Ob2 /Oi /Ot /Oy /Qip /GA  /GF /MT  /GS- /arch:SSE2 /fp:fast=2  /Qfp-speculation:fast /fp:double /Qparallel /Qstd=c99 

The use of a conservative SSE2 target is to avoid problems with users keeping old hardware, and with some AMD CPUs. Some of these switches are unknown to me in MinGW, for example /Qparallel.

C99 vs C++ makes no difference to the optimizer, with the compilers mentioned here, unless your style changes (as it well might).

g++ accepts the C99 restrict qualifier if it is spelled __restrict (and Intel C++ for Linux accepts that spelling as well). ICL accepts restrict in C++ code with the option /Qrestrict. Both the Intel and GNU compilers make good use of restrict pointers ( * __restrict ptr ).

You have a point in that some of the C99 features relevant to optimization will not work with all C++ compilers.

Intel 14.0 beta shows better optimization of certain STL usage as well as of plain C99 features enabled by __restrict, where the currently released Intel compilers require #pragma ivdep.
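
( Editor's note: a minimal sketch of the kind of loop this affects; the function and array names are illustrative only. With the pointers qualified __restrict, the compiler may assume the arrays do not overlap and vectorize the loop directly; without it, ICC can be given the same hint with #pragma ivdep, or may fall back to a runtime overlap check. )

    #include <stddef.h>

    /* __restrict promises the compiler that dst, a and b do not alias,
       which is what allows the loop to be vectorized without a runtime check. */
    void scaled_add(float* __restrict dst,
                    const float* __restrict a,
                    const float* __restrict b,
                    float s, size_t n)
    {
        /* #pragma ivdep */             /* alternative hint accepted by ICC
                                           when __restrict cannot be used   */
        for (size_t i = 0; i < n; ++i)
            dst[i] = s * a[i] + b[i];
    }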

I find such long option strings confusing. I will mention that /fp:double could prevent some optimizations on the float data type, due to the requirement to promote so many operations to double. It also prevents optimizations on sum reductions, where it would be better to write the double casts and reduction variables into your source code, so as to control where the promotion is applied.

If you use only double data types, I think /fp:double would have the same effect as /fp:source.
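
( Editor's note: a hedged sketch of what "writing the double casts and reduction variables in the source" might look like; the names are illustrative. The precision of the sum is fixed in the code itself, so it no longer depends on /fp:double vs /fp:source, and the single-precision loads can still be optimized. )

    #include <stddef.h>

    /* Explicit double reduction variable and explicit cast: the accumulation
       precision is controlled by the source, not by the /fp: switch. */
    double sum_float_array(const float* x, size_t n)
    {
        double sum = 0.0;                 /* reduction variable declared as double */
        for (size_t i = 0; i < n; ++i)
            sum += (double)x[i];          /* each term promoted explicitly */
        return sum;
    }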

0 Kudos
Armando_Lazaro_Alami
102 Views

Thanks for the clarifications! I stopped using an explicit restrict switch because the C99 option ( /Qstd=c99 ) should include that. As for "source" vs "double" floating-point evaluation, I use "source" in code where single precision is acceptable, and "double" where I need to preserve all the bits of the representation. Of course, mixing has a performance penalty.

0 Kudos
SergeyKostrov
Valued Contributor II
102 Views
>>...Of course, mixing has a performance penalty...

It is up to you to decide what is more important, that is, precision or performance. I'm confident that some balance between these two almost opposite goals can always be found. When you have time, take a look at the forum topic "Mixing of Floating-Point Types ( MFPT ) when performing calculations. Does it improve accuracy?" at software.intel.com/en-us/forums/topic/361134. Thanks.
0 Kudos
Armando_Lazaro_Alami
102 Views

Hi  Sergey,  Good point and topic !   (  source  vs  double  :   precision vs performance )

I learned about that three years ago, when I migrated my projects from Watcom C (OpenWatcom at the time) to Intel. In Watcom, every floating-point operation is processed with the traditional FPU (80 bits), and the compiler uses an FPU register as the accumulator. So my first .exe generated with Intel gave different and worse results. Using the switch /fp:double in Intel was enough to reach the same results, as far as they are physically relevant.
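
( Editor's note: a small, self-contained illustration of the effect being described, not code from this thread. If the compiler keeps the running sum in an 80-bit x87 register, the small increments survive; when each addition is done in 64-bit double, as in SSE2 code, they are rounded away. )

    #include <stdio.h>

    int main(void)
    {
        double sum = 1.0e16;            /* near the limit of double's 53-bit mantissa */

        for (int i = 0; i < 1000; ++i)
            sum += 1.0;                 /* exact in an 80-bit x87 accumulator; each add
                                           rounds back to 1.0e16 in 64-bit double      */

        /* x87 accumulator kept in a register: 10000000000001000.0
           strict 64-bit double (SSE2) math:   10000000000000000.0 */
        printf("%.1f\n", sum);
        return 0;
    }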

By the way, if you would like to try a good old C/C++ compiler, take a look at Watcom. I think it has no support for modern flavors of C++. But, for example, integer processing, bit operations, etc. are great. Its floating-point processing is weak, because there is no use of SSEx. Another drawback (for me) is the lack of OpenMP support. In the past (some 15 years ago) Watcom was the performance winner in most contests, ahead of Borland, MS, etc.

0 Kudos
SergeyKostrov
Valued Contributor II
102 Views
>>...In the past (some 15 years ago) Watcom was the performance winner in most contests, ahead of Borland, MS, etc...

I know, because I used the Watcom C++ compiler to develop NetWare Loadable Modules ( NLMs ) between 1994 and 1998.
0 Kudos