Solved: How to match -O3 from 10.1 in 11.0

dpeterc · 01-15-2009

The meaning of the default compiler options has changed dramatically from compiler version 10.1 to 11.0.

I used to compile my programs with -O3, and it made a reasonably sized executable that ran OK on my customers' various CPUs.

With the introduction of 11.0, the very same makefile generates very different code, since many more advanced optimizations are enabled in -O3. The most disappointing effect was that on older CPUs the program fails with "illegal instruction", as described in the release notes, section 2.4.2 (Instruction Set Default Changed to Require Intel Streaming SIMD Extensions 2).

Why does the compiled program include (check with "strings compiled_binary") the following messages:

Fatal Error: This program was not built to run on the processor in your system.
Windows XP 64-bit Edition Version 2003 or newer should be used.
Intel Core Duo processors and compatible Intel processors with supplemental Streaming SIMD Extensions 3 (SSSE3) instruction support
Intel Pentium 4 and compatible Intel processors with Streaming SIMD Extensions 3 (SSE3) instruction support
Intel Pentium 4 and compatible Intel processors. Enables new optimizations in addition to Intel processor-specific optimizations
Intel processors with SSE4.2 and POPCNT instructions support
Intel Pentium M and compatible Intel processors
Intel processors with Swing New Instructions support
Intel processors with MOVBE instructions support

Why can the Intel compiler use the cpuid instruction for automatic CPU dispatch, but not in a regular compilation, to give a meaningful message? And why waste space on messages that are never used?
It is documented in the release notes, but does that upgrade a bug into a feature?
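For illustration, here is a minimal sketch of the kind of startup check the question is asking for. It assumes GCC or Clang on x86, whose `__builtin_cpu_supports` builtin is used here; it is not a feature of the Intel 11.0 compiler, and `check_cpu_or_explain` is a hypothetical helper name. Note that the file holding such a check would itself have to be built for the baseline instruction set, or it could fault before printing anything.

```c
#include <stdio.h>

/* Returns 1 if the running CPU reports SSE2 support, 0 otherwise.
   Assumption: GCC or Clang on x86; on any other compiler or platform
   this conservatively reports "not supported". */
static int cpu_has_sse2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    return 0;
#endif
}

/* Hypothetical helper: returns 1 if the program may continue, or
   writes a one-line explanation to `out` and returns 0. */
static int check_cpu_or_explain(FILE *out)
{
    if (cpu_has_sse2())
        return 1;
    fprintf(out, "This build requires SSE2, which this CPU does not report.\n");
    return 0;
}
```

A program would call `check_cpu_or_explain(stderr)` first thing in `main` and exit if it returns 0.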

So now I compile the program with -mia32, but very few optimizations are used that way.
So few that compiling with -O1 makes no speed difference compared with -O3; the binary is just 40% bigger with -O3.

Now finally to my real question:
Which set of compiler options must I use with compiler 11.0 to get the same level of optimization I had on 10.1 when I compiled the program with -O3? The program should run at the same speed and on the same set of CPUs as before.

P.S.
One more thing about the error messages.
Since I use Linux version of the compiler, the error message should be:
Windows XP 64-bit Edition Version 2003 or newer should never be used.
Or should it also include messages about all other operating systems?

TimP · 01-15-2009

If you are using -mia32, you must be using the 32-bit compiler, where this option should match the previous default (no SSE code). A majority of the previous architecture switches are still available, and have alternate new names. You don't even say clearly which earlier compiler you are comparing with, which CPU architecture you intend to target, so it's impossible to give much of an answer.
Much of the increased code size with default options came in with 10.0, when -ip became a default, and -O3 began to imply hpo options. Large loops are excessively "distributed" (split into smaller loops). 11.0 tends to regroup the vectorized portions effectively (as far as performance is concerned, but at the cost of code size).
Splitting of both vector and non-vector loops can be controlled with #pragma distribute point. Of course, this is annoying, given that the 9.1 compiler did a good job without it, particularly for large applications.
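To make "distribution" concrete, here is a small sketch of the transformation done by hand in plain C, without the Intel-specific pragma. The function names and the example loop are hypothetical, and the arrays are assumed not to overlap. The first version does two unrelated jobs in one loop; the second splits them so the independent part stands alone.

```c
#include <stddef.h>

/* Fused form: one loop body does two unrelated jobs. */
void fused(const int *a, int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        b[i] = a[i] * 2;                       /* independent per element */
        c[i] = (i > 0 ? c[i - 1] : 0) + a[i];  /* running sum: depends on c[i-1] */
    }
}

/* Distributed form: same results, split by hand into two loops.
   The first loop has no loop-carried dependence; the second keeps it. */
void distributed(const int *a, int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] * 2;
    for (size_t i = 0; i < n; i++)
        c[i] = (i > 0 ? c[i - 1] : 0) + a[i];
}
```

Whether the split form is faster depends on the compiler and the loop; the trade-off described above is that each extra loop also adds code.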
-O1 is documented as a compromise between code size and performance, with no vectorization. It is frequently appropriate if there is no advantage to be gained from vectorization or hpo (loop splits, fusion, interchange, ...).

dpeterc · 01-15-2009

Thank you for your reply, Tim.
Quoting - tim18

You don't even say clearly which earlier compiler you are comparing with, which CPU architecture you intend to target, so it's impossible to give much of an answer.

I am comparing it to icc 10.1, like I said in the first line of my post.

As to the target CPU architecture: Pentium 4 or newer. Since I develop on a Quad Core, I would also like to get some of the speedups of the newer processors.

So, would these options on 11.0
-O1 -axT -mia32
give me similar performance and run on the same processors as this 10.1 compile option?
-O3
The code size is roughly the same.

On 10.1, with my application, there was a significant speedup (20%) going from -O1 to -O3.
On 11.0, with my application, there is about a 5% performance drop going from -O1 to -O3.

Sorry, can't share my code. But it is loopy integer code, CPU bound.
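Since the real code cannot be shared, a comparison like the one above can be reproduced with a small timing harness built once per option set. This is only a sketch: `workload` is a hypothetical stand-in for the actual integer loops, and `best_seconds` is an assumed helper name.

```c
#include <time.h>

/* Hypothetical stand-in workload: a CPU-bound integer loop. */
unsigned workload(unsigned n)
{
    unsigned acc = 0;
    for (unsigned i = 0; i < n; i++)
        acc = acc * 31u + (i ^ (i >> 3));
    return acc;
}

/* Runs the workload `reps` times and returns the best CPU time in
   seconds. The last computed value is stored in *result so the
   compiler cannot discard the work. Returns -1.0 if reps <= 0. */
double best_seconds(unsigned n, int reps, unsigned *result)
{
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        clock_t t0 = clock();
        unsigned v = workload(n);
        clock_t t1 = clock();
        double s = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || s < best)
            best = s;
        *result = v;
    }
    return best;
}
```

Taking the best of several runs reduces noise from other processes; the same binary should be timed on each CPU model that has to be supported.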

TimP · 01-15-2009

-mSSE2 will work on P4. It would be a very unusual situation where -mia32 might perform better. If you really want the compiler to choose between 2 code paths where that will speed up SSE3-capable CPUs, -axSSE3 should generate a separate SSE3 path only where the compiler expects a significant advantage over SSE2, and (except for very specialized situations) you would get all the performance of SSSE3. With integer code, there's probably even less chance of gain from more than a single code path aimed at the oldest CPU you must support.
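As a rough picture of what "2 code paths" means, the sketch below selects between two implementations of one routine through a function pointer. This is a hand-written analogy, not the compiler's actual dispatch mechanism. All names are hypothetical, both paths are plain C here, and how the `has_sse3` flag is obtained is left to the caller.

```c
#include <stddef.h>

typedef long (*sum_fn)(const int *, size_t);

/* Baseline path: in a real build this would live in a file compiled
   for the oldest CPU that must be supported. */
static long sum_baseline(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Alternate path: stands in for a file built for newer CPUs.
   Here it is just the same sum, unrolled by four. */
static long sum_sse3(const int *v, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)  /* leftover elements */
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}

/* Picks a path once; callers then use the returned pointer. */
sum_fn select_sum(int has_sse3)
{
    return has_sse3 ? sum_sse3 : sum_baseline;
}
```

Each extra path duplicates the routine in the binary, which is why a single path aimed at the oldest supported CPU can be the better trade for integer code.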