Hi,
I am evaluating the trial version of Intel Composer XE 2011 for Windows (integrated into MSVC8). I have a scientific application (Markov Chain Monte Carlo; 100% ANSI/ISO C++ console application) to be run on an Intel Core i7-2600K processor.
I have compared the performance of the Intel executable to an executable produced by MS Visual Studio 2008 Express, and much to my surprise the Intel version, with full optimizations enabled (including processor-specific ones), reaches only ~70% of the performance of the Visual Studio binary. The application is single-threaded, and nothing changes at all when the automatic parallelization option is used. Given that the chain runs for weeks, this is a tremendous slowdown and no compelling reason to purchase the compiler...
Below are the options I used. Are there any recommendations on which of these I should try to change?
Most of the calculations are ordinary math operations such as log, exp, multiplication, and division; no strong memory component is involved.
/c /Ox /Og /Oi /Ot /Qipo /GA /I /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MT /GS- /fp:strict /Za /W3 /nologo /Wp64 /Zi /Qfp-speculation:strict /QxSSE4.1 /Qopt-matmul- /Quse-intel-optimized-headers /Qcilk-serialize /Qintel-extensions-
thanks,
Thomas
8 Replies
Guessing at what you are doing could take us far afield. For example, if "usual maths operations" means mixing double math functions with float data types, and you use the Microsoft 32-bit default (the equivalent of ICL /arch:IA32), that might explain it.
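To illustrate the kind of mixing Tim means, here is a minimal sketch (hypothetical code, not from Thomas's application): holding data in float but routing it through the double-precision log forces a widening conversion and a double-precision library call per element, while a loop kept uniformly in one precision gives the compiler a much better shot at fast code.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Mixed precision: float data pushed through double-precision log.
    // Each iteration widens float -> double and calls the double routine.
    double sum_logs_mixed(const std::vector<float>& data) {
        double s = 0.0;
        for (std::size_t i = 0; i < data.size(); ++i)
            s += std::log(static_cast<double>(data[i]));
        return s;
    }

    // Uniform precision: the float overload std::log(float) keeps the
    // whole loop in single precision, which is far friendlier to SSE.
    float sum_logs_float(const std::vector<float>& data) {
        float s = 0.0f;
        for (std::size_t i = 0; i < data.size(); ++i)
            s += std::log(data[i]);
        return s;
    }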
If
(i) Compiler-A is good at optimizing source code Pattern-X and only "so-so" with Pattern-Y, and
(ii) Compiler-B is good at optimizing Pattern-Y and "so-so" with Pattern-X,
what do you expect the performances of the outputs of the two compilers to be if
(iii) your code contains fraction x of Pattern-X and fraction (1-x) of Pattern-Y?
Will the result be independent of x?
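Written out as a sketch, with hypothetical per-pattern costs t_{C,P} (the time compiler C's output spends per unit of work on pattern P):

    T_A(x) = x * t_{A,X} + (1 - x) * t_{A,Y}
    T_B(x) = x * t_{B,X} + (1 - x) * t_{B,Y}

The ratio T_A(x) / T_B(x) is independent of x only in the special case t_{A,X} / t_{B,X} = t_{A,Y} / t_{B,Y}; in general, which compiler "wins" depends on the mix of patterns in your particular code.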
Secondly, what motivated you to select that specific set of options? A desire for speed, or something else? Did you use the same options with both compilers?
If you want specific answers, you will need to post a test case which will enable one to reproduce the behavior that you described.
As others have pointed out, an actual test case would be extremely helpful, as we're just guessing without one, but the first thing that jumps out at me is that you're running on a Core i7-2600K but compiling with /QxSSE4.1. If you're building on the same machine you run on, try /QxHost; otherwise use /QxAVX to get the maximum benefit on that processor.
Even if you could show us some performance-critical asm code generated by the two compilers, we might be able to figure some things out, but a runnable test case would be most helpful.
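For reference, the suggested option change would look roughly like this on the Composer XE command line (a sketch; the source file name mcmc.cpp is made up). The first line targets the instruction set of the build machine (AVX on an i7-2600K); the second explicitly targets AVX, e.g. when building on an older machine:

    icl /O2 /QxHost mcmc.cpp
    icl /O2 /QxAVX mcmc.cpp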
Thanks!
Dale
/QxSSE4.1 normally would be satisfactory for running on a variety of recent CPUs. From the limited description given, I suppose that no auto-vectorization is occurring, so I would not expect to be able to make useful comments about which SSE architecture switch would be best. I have some cases where /QxSSE2 runs significantly faster than the newer options on Core i7. None of the SSE options would cover the case where 32-bit x87 code is fastest, if that is in fact what you found.
Thanks for your help guys.
The code base is over 10,000 lines long and, furthermore, proprietary, so both aspects pretty much exclude posting it here.
Most data are stored in a std::vector at initialization but the bulk of the computations are really calls to
type operator+(type, type)
type operator-(type, type)
type operator*(type, type)
type operator/(type, type)
type operator=(type&, type)
type std::exp(type)
type std::log(type)
where type is either int, double, or long double (int only for the former five, however), but they are only seldom mixed (double -> long double casts or the reverse). Other operations are just std::vector::iterator offsets, sometimes a resize or reserve.
However, it occurred to me that MSVC uses a 64-bit representation for long double (the same as double), while I think that is not the case for Intel, so that might be part of the story.
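One quick way to check this, as a sketch (this checker is not part of the original code): compile the snippet below with each compiler and compare the reported sizes. As Tim notes further down, ICL on Windows actually defaults to a 64-bit long double, matching MSVC; the 80-bit x87 format is only enabled by a switch such as /Qlong-double.

    #include <cstdio>

    // Prints the storage size of double and long double so the two
    // compilers' representations can be compared directly.
    int main() {
        std::printf("sizeof(double)      = %u\n", (unsigned)sizeof(double));
        std::printf("sizeof(long double) = %u\n", (unsigned)sizeof(long double));
        return 0;
    }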
I have a second code base, very similar in length and logic to the first, but in which I consistently use double for all floating-point calculations and which has a somewhat stronger memory-access component. That one was run on a different machine (a Core i3 of some kind) and compiled with a few different options, and there the Intel compiler produces code that is an astonishing 50% faster than MSVC9's, which is a very nice result.
I will need to investigate in depth how the floating-point type used, the optimization options set, and the hardware the program runs on affect things. I will try to keep you updated.
thanks,
Thomas
Quoting thomasmang
type operator+(type, type)
type operator-(type, type)
type operator*(type, type)
type operator/(type, type)
type operator=(type&, type)
type std::exp(type)
type std::log(type)
where type is either int, double, or long double (int only for the former five, however), but they are only seldom mixed (double -> long double casts or the reverse). Other operations are just std::vector::iterator offsets, sometimes a resize or reserve.
However, it occurred to me that MSVC uses a 64-bit representation for long double (the same as double), while I think that is not the case for Intel, so that might be part of the story.
I have a second code base, very similar in length and logic to the first, but in which I consistently use double for all floating-point calculations and which has a somewhat stronger memory-access component. That one was run on a different machine (a Core i3 of some kind) and compiled with a few different options, and there the Intel compiler produces code that is an astonishing 50% faster than MSVC9's, which is a very nice result.
The guide options in the current ICL are supposed to give you hints about optimization at loop level.
ICL treats long double as double by default, consistent with MSVC.
You still haven't indicated which options you use with MSVC: 32-bit or 64-bit, which /arch: setting, and with or without MSVC /fp:fast (which is equivalent to ICL /fp:source).
MSVC 32-bit, with options /O2 /Ot /arch:SSE2 /fp:precise.
Weird: the second program, where Intel clearly produces the faster code, makes heavier use of std::vector, even though I am using neither restrict, the #pragma, nor auto-vectorization.
Auto-vectorization is on by default when using /O2 with the Intel compiler.
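One way to confirm what the compiler did with your loops, sketched under the assumption that the /Qvec-report switch is available in your Composer XE version (the file name mcmc.cpp is made up):

    icl /O2 /Qvec-report2 mcmc.cpp

Report level 2 lists both the loops that were vectorized and those that were not, which would show quickly whether the std::vector loops in either program are being vectorized at all.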
