Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

optimization options

thomasmang
Beginner
Hi,

I am evaluating the trial version of Intel Composer XE 2011 for Windows (integrated into MSVC8). I have a scientific application (Markov Chain Monte Carlo; 100% ANSI/ISO C++ console application) to be run on an Intel Core i7-2600K processor.
I have compared the performance of the Intel executable to an executable produced by MS Visual Studio 2008 Express, and much to my surprise the Intel version, with full optimizations enabled including processor-specific ones, reaches only ~70% of the performance of the Visual Studio binary. The application is single-threaded, and nothing changes at all when using the automatic parallelization option. Given that the chain runs for weeks, this is a tremendous slowdown and no compelling reason to purchase the compiler...

Below are the options I used. Are there any recommendations as to which of these I should try changing?
Most of the calculations are ordinary math operations such as log, exp, multiplication, and division; no significant memory component is involved.

/c /Ox /Og /Oi /Ot /Qipo /GA /I /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MT /GS- /fp:strict /Za /W3 /nologo /Wp64 /Zi /Qfp-speculation:strict /QxSSE4.1 /Qopt-matmul- /Quse-intel-optimized-headers /Qcilk-serialize /Qintel-extensions-


thanks,
Thomas
TimP
Honored Contributor III
Guessing at what you are doing could take us far afield. For example, if "usual math operations" means mixing double math functions with float data types, and you use the Microsoft 32-bit default (equivalent to ICL /arch:IA32), that might explain it.
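To illustrate the pitfall I mean (a made-up fragment, not your code): C's log() takes and returns double, so applying it to float data converts on every call.

#include <math.h>

// made-up fragment: log() here is the double-precision C function,
// so each float element is promoted to double and the result converted back
float sum_logs(const float* x, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += (float)log(x[i]);   // float -> double -> float on every iteration
    return s;
}

With logf(x[i]) (or std::log on a float argument in C++) the whole computation can stay in single precision, and under 32-bit x87 code generation the cost profile of the two versions differs considerably.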
mecej4
Honored Contributor III
If
(i) Compiler-A is good at optimizing source-code Pattern-X and only "so-so" with Pattern-Y, and

(ii) Compiler-B is good at optimizing Pattern-Y and "so-so" with Pattern-X,

how do you expect the output of the two compilers to perform if

(iii) your code contains a fraction x of Pattern-X and a fraction (1-x) of Pattern-Y?

Will the result be independent of x?
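To make that concrete with made-up numbers: suppose Compiler-A's output needs 0.8 s per unit of Pattern-X work and 1.2 s per unit of Pattern-Y, while Compiler-B needs 1.2 s and 0.8 s respectively. The ratio of total run times is then

T_A / T_B = (0.8 x + 1.2 (1 - x)) / (1.2 x + 0.8 (1 - x)),

which runs from 1.5 at x = 0 down to about 0.67 at x = 1, so the comparison can swing either way on x alone.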

Secondly, what motivated you to select that specific set of options? A desire for speed, or something else? Did you use the same options with both compilers?

If you want specific answers, you will need to post a test case which will enable one to reproduce the behavior that you described.
Dale_S_Intel
Employee
As others have pointed out, an actual test case would be extremely helpful, as we're just guessing without one. But the first thing that jumps out at me is that you're running on a Core i7-2600 but compiling with /QxSSE4.1. If you're building on the same machine you run on, try /QxHost; otherwise use /QxAVX to get the maximum benefit on that processor.
Even if you could just show us some of the performance-critical asm code generated by the two compilers, we might be able to figure some things out, but a runnable test case would be most helpful.
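For reference, both compilers can write an annotated assembly listing next to the object file (option spellings from memory, and yourfile.cpp is just a stand-in):

icl /c /FAs /Ox yourfile.cpp
cl  /c /FAs /O2 yourfile.cpp

Comparing the generated .asm for the hottest function would already tell us a lot.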
Thanks!

Dale
TimP
Honored Contributor III
/QxSSE4.1 normally would be satisfactory for running on a variety of recent CPUs. From the limited description given, I suppose that no auto-vectorization is occurring, so I would not expect to be able to make useful comments about which SSE architecture switch would be best. I have some cases where /QxSSE2 runs significantly faster than the newer options on Core i7. None of the SSE options would deal with the case where 32-bit x87 code is fastest, if that is in fact what you found.
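One way to check whether auto-vectorization happens at all is the vectorization report; if I remember the spelling for this compiler version correctly (yourfile.cpp standing in for one of your sources):

icl /c /QxSSE4.1 /Qvec-report2 yourfile.cpp

This prints, per loop, either "LOOP WAS VECTORIZED" or the reason it was not.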
thomasmang
Beginner
Thanks for your help, guys.

The code base is over 10,000 lines long and, moreover, proprietary, so both aspects pretty much rule out posting it here.
Most data are stored in a std::vector at initialization but the bulk of the computations are really calls to

type operator+(type, type)
type operator-(type, type)
type operator*(type, type)
type operator/(type, type)
type operator=(type&, type)
type std::exp(type)
type std::log(type)

where type is either int, double, or long double (int only for the former five, however), but only seldom mixing them (double -> long double casts or the reverse). Other operations are just std::vector::iterator offsets, and sometimes a resize or reserve.
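To give a flavor, here is a simplified, made-up sketch in the same spirit (not the real code):

#include <cmath>
#include <vector>

// simplified sketch: accumulate a Gaussian log-likelihood over the data
double log_likelihood(const std::vector<double>& x, double mu, double sigma)
{
    double s = 0.0;
    for (std::vector<double>::size_type i = 0; i < x.size(); ++i)
    {
        const double z = (x[i] - mu) / sigma;
        s += -std::log(sigma) - 0.5 * z * z;
    }
    return s;
}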

However, it occurred to me that MSVC uses a 64-bit representation for long double (the same as double), while I think that is not the case for Intel, so that might be part of the story.
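A quick way to check what each compiler actually does:

#include <iostream>

int main()
{
    // 8 would mean long double is stored as a 64-bit double
    std::cout << "sizeof(long double) = " << sizeof(long double) << std::endl;
    return 0;
}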

I have a second code base, very similar in length and logic to the first, but where I consistently use double for all floating-point calculations and which has a somewhat stronger memory-access component. That one was run on a different machine (a Core i3 of some sort) and compiled with a few different options, and there the Intel compiler produces astonishingly 50% faster code compared to MSVC9, which is a very nice result.
I will need to investigate in depth how the floating-point type used, the optimization options set, and the hardware the program runs on affect things. I will try to keep you updated.

thanks,
Thomas
TimP
Honored Contributor III
Quoting thomasmang (the operator list and the remarks on long double and the second code base, from the post above):

ICL optimization of std::vector code may differ greatly from MSVC's. It often requires the restrict extension or #pragma ivdep (neither supported by MSVC) to equal or exceed MSVC's code-generation quality. There is a tendency to depend on auto-vectorization for full optimization; even when successful, it prefers loop counts of 100 or more, at least by default.
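For example, a minimal sketch (my illustration; #pragma ivdep is the ICL extension mentioned above):

#include <vector>

// assert to ICL that the loop has no cross-iteration dependence
void scale_add(std::vector<double>& y, const std::vector<double>& x, double a)
{
    double* __restrict py = &y[0];        // restrict: promise there is no aliasing
    const double* __restrict px = &x[0];
    const int n = (int)y.size();
    #pragma ivdep                         // ICL extension: ignore assumed dependences
    for (int i = 0; i < n; ++i)
        py[i] += a * px[i];
}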
The guide options in current ICL are supposed to give you hints about optimization at loop level.
ICL treats long double as double by default, consistent with MSVC.
You still haven't indicated which options you use with MSVC (32-bit or 64-bit, which /arch:, with or without MSVC /fp:fast, which is equivalent to ICL /fp:source).
thomasmang
Beginner
MSVC 32-bit, with options /O2 /Ot /arch:SSE2 /fp:precise

Weird; the second program, where Intel clearly produces the faster code, makes heavier use of std::vector, even though I am using neither restrict, the #pragma, nor auto-vectorization.

Om_S_Intel
Employee
Auto-vectorization is on by default when using /O2 with the Intel compiler.
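So a quick experiment, assuming I recall the switch correctly, is to rebuild with auto-vectorization disabled and compare timings:

icl /O2 /Qvec- yourfile.cpp

If the two builds run at the same speed, auto-vectorization is not what separates the two compilers here.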