Hi,
I am evaluating the trial version of Intel Composer XE 2011 for Windows (integrated into MSVC8). I have a scientific application (Markov Chain Monte Carlo; 100% ANSI/ISO C++ console application) to be run on an Intel Core i7-2600K processor.
I have compared the performance of the Intel executable to an executable produced by MS Visual Studio 2008 Express, and much to my surprise the Intel version, with full optimizations enabled (including processor-specific ones), reaches only ~70% of the performance of the Visual Studio binary. The application is single-threaded, and nothing changes at all when the automatic parallelization option is used. Given that the chain runs for weeks, this is a tremendous slowdown and no compelling reason to purchase the compiler...
Below are the options I used. Are there any recommendations on which of these I should try to change?
Most of the calculations are ordinary math operations such as log, exp, multiplication, and division; no strong memory component is involved.
/c /Ox /Og /Oi /Ot /Qipo /GA /I /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MT /GS- /fp:strict /Za /W3 /nologo /Wp64 /Zi /Qfp-speculation:strict /QxSSE4.1 /Qopt-matmul- /Quse-intel-optimized-headers /Qcilk-serialize /Qintel-extensions-
thanks,
Thomas
8 Replies
Guessing at what you are doing could take us far afield. For example, if "usual maths operations" means mixing double math functions with float data types, and you use the Microsoft 32-bit default (the equivalent of ICL /arch:IA32), that might explain it.
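To illustrate the kind of mixing Tim means, here is a minimal sketch (hypothetical code, not from Thomas's application): holding data in float but routing it through the double-precision log forces a widening conversion and a double-precision library call per element, while a loop kept uniformly in one precision gives the compiler a much better shot at fast code.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Mixed precision: float data pushed through double-precision log.
    // Each iteration widens float -> double and calls the double routine.
    double sum_logs_mixed(const std::vector<float>& data) {
        double s = 0.0;
        for (std::size_t i = 0; i < data.size(); ++i)
            s += std::log(static_cast<double>(data[i]));
        return s;
    }

    // Uniform precision: the float overload std::log(float) keeps the
    // whole loop in single precision, which is far friendlier to SSE.
    float sum_logs_float(const std::vector<float>& data) {
        float s = 0.0f;
        for (std::size_t i = 0; i < data.size(); ++i)
            s += std::log(data[i]);
        return s;
    }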
If
(i) Compiler-A is good at optimizing source code Pattern-X and only "so-so" with Pattern-Y, and
(ii) Compiler-B is good at optimizing Pattern-Y and "so-so" with Pattern-X,
what do you expect the performances of the outputs of the two compilers to be if
(iii) your code contains fraction x of Pattern-X and fraction (1-x) of Pattern-Y?
Will the result be independent of x?
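Written out as a sketch, with hypothetical per-pattern costs t_{C,P} (the time compiler C's output spends per unit of work on pattern P):

    T_A(x) = x * t_{A,X} + (1 - x) * t_{A,Y}
    T_B(x) = x * t_{B,X} + (1 - x) * t_{B,Y}

The ratio T_A(x) / T_B(x) is independent of x only in the special case t_{A,X} / t_{B,X} = t_{A,Y} / t_{B,Y}; in general, which compiler "wins" depends on the mix of patterns in your particular code.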
Secondly, what motivated you to select that specific set of options? A desire for speed, or something else? Did you use the same options with both compilers?
If you want specific answers, you will need to post a test case which will enable one to reproduce the behavior that you described.
As others have pointed out, an actual test case would be extremely helpful, as we're just guessing without one, but the first thing that jumps out at me is that you're running on a Core i7-2600K but compiling with /QxSSE4.1. If you're building on the same machine you run on, try /QxHost; otherwise use /QxAVX to get the maximum benefit on that processor.
Even if you could show us some performance-critical asm code generated by the two compilers, we might be able to figure some things out, but a runnable test case would be most helpful.
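For reference, the suggested option change would look roughly like this on the Composer XE command line (a sketch; the source file name mcmc.cpp is made up). The first line targets the instruction set of the build machine (AVX on an i7-2600K); the second explicitly targets AVX, e.g. when building on an older machine:

    icl /O2 /QxHost mcmc.cpp
    icl /O2 /QxAVX mcmc.cpp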
Thanks!
Dale
/QxSSE4.1 normally would be satisfactory for running on a variety of recent CPUs. From the limited description given, I suppose that no auto-vectorization is occurring, so I would not expect to be able to make useful comments about which SSE architecture switch would be best. I have some cases where /QxSSE2 runs significantly faster than the newer options on Core i7. None of the SSE options would cover the case where 32-bit x87 code is fastest, if that is in fact what you found.
Thanks for your help guys.
The code base is over 10,000 lines long and, furthermore, proprietary, so both aspects pretty much exclude posting it here.
Most data are stored in a std::vector at initialization but the bulk of the computations are really calls to
type operator+(type, type)
type operator-(type, type)
type operator*(type, type)
type operator/(type, type)
type operator=(type&, type)
type std::exp(type)
type std::log(type)
where type is either int, double, or long double (int only for the former five, however), but they are only seldom mixed (double -> long double casts or the reverse). Other operations are just std::vector::iterator offsets, sometimes a resize or reserve.
However, it occurred to me that MSVC uses a 64-bit representation for long double (the same as double), while I think that is not the case for Intel, so that might be part of the story.
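One quick way to check this, as a sketch (this checker is not part of the original code): compile the snippet below with each compiler and compare the reported sizes. As Tim notes further down, ICL on Windows actually defaults to a 64-bit long double, matching MSVC; the 80-bit x87 format is only enabled by a switch such as /Qlong-double.

    #include <cstdio>

    // Prints the storage size of double and long double so the two
    // compilers' representations can be compared directly.
    int main() {
        std::printf("sizeof(double)      = %u\n", (unsigned)sizeof(double));
        std::printf("sizeof(long double) = %u\n", (unsigned)sizeof(long double));
        return 0;
    }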
I have a second code base, very similar in length and logic to the first, but in which I consistently use double for all floating-point calculations and which has a somewhat stronger memory-access component. That one was run on a different machine (a Core i3 of some kind) and compiled with a few different options, and there the Intel compiler produces code that is an astonishing 50% faster than MSVC9's, which is a very nice result.
I will need to investigate in depth how the floating-point type used, the optimization options set, and the hardware the program runs on affect things. I will try to keep you updated.
thanks,
Thomas
Quoting thomasmang
type operator+(type, type)
type operator-(type, type)
type operator*(type, type)
type operator/(type, type)
type operator=(type&, type)
type std::exp(type)
type std::log(type)
where type is either int, double, or long double (int only for the former five, however), but they are only seldom mixed (double -> long double casts or the reverse). Other operations are just std::vector::iterator offsets, sometimes a resize or reserve.
However, it occurred to me that MSVC uses a 64-bit representation for long double (the same as double), while I think that is not the case for Intel, so that might be part of the story.
I have a second code base, very similar in length and logic to the first, but in which I consistently use double for all floating-point calculations and which has a somewhat stronger memory-access component. That one was run on a different machine (a Core i3 of some kind) and compiled with a few different options, and there the Intel compiler produces code that is an astonishing 50% faster than MSVC9's, which is a very nice result.
The guide options in the current ICL are supposed to give you hints about optimization at loop level.
ICL treats long double as double by default, consistent with MSVC.
You still haven't indicated which options you use with MSVC: 32-bit or 64-bit, which /arch: setting, and with or without MSVC /fp:fast (which is equivalent to ICL /fp:source).
MSVC 32-bit, with options /O2 /Ot /arch:SSE2 /fp:precise.
Weird: the second program, where Intel clearly produces the faster code, makes heavier use of std::vector, even though I am using neither restrict, the #pragma, nor auto-vectorization.
Auto-vectorization is on by default when using /O2 with the Intel compiler.
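One way to confirm what the compiler did with your loops, sketched under the assumption that the /Qvec-report switch is available in your Composer XE version (the file name mcmc.cpp is made up):

    icl /O2 /Qvec-report2 mcmc.cpp

Report level 2 lists both the loops that were vectorized and those that were not, which would show quickly whether the std::vector loops in either program are being vectorized at all.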
