Are there any general conclusions about how the O and xT flags affect the performance of the code?

max_w_nimitz · ‎02-06-2008

I've been testing how different flags affect the speed and the accuracy of my code.
The usual recommendation is to use the -O3 -xT flags wherever you can to improve the performance of your code. Indeed, when I used other flags, I got different results for both accuracy and speed. In my case, -O0 and -O0 -xT are of course the slowest, with -O0 -xT slightly faster than plain -O0. Also, -O1 -xT seems to be the fastest option. Another thing I found is that -xT seems to matter more than the choice between -O2 and -O3: with -xT, the speed is the same whether I add -O2, -O3, or neither. Without -xT, -O2 and -O3 also give the same results as each other, but those results differ from the -xT case. What I couldn't figure out is the difference in accuracy between -O1 -xT and -xT alone. Could anyone explain this to me, or give me a more detailed view of the flags I'm using?

-xT = -O2 -xT = -O3 -xT
-O2 = -O3
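
For anyone who wants to repeat this kind of comparison, here is a minimal timing sketch in Python. It assumes the compiler is invoked as `ifort`; the source file `test.f90`, the executable `./a.out`, and the helper names `best_time` and `compare_flags` are placeholders for illustration, not anything from my actual setup.

```python
import subprocess
import time

# Flag combinations discussed in this post.
FLAG_SETS = ["-O0", "-O0 -xT", "-O1 -xT", "-xT", "-O2", "-O2 -xT", "-O3", "-O3 -xT"]


def best_time(cmd, repeats=3):
    """Run cmd `repeats` times and return the shortest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


def compare_flags(source="test.f90", exe="./a.out", compiler="ifort"):
    """Build `source` with each flag set and time the resulting executable.

    `source`, `exe`, and `compiler` are assumed placeholder names.
    """
    timings = {}
    for flags in FLAG_SETS:
        # Rebuild from scratch with this flag set, then time the run.
        subprocess.run([compiler, *flags.split(), source, "-o", exe], check=True)
        timings[flags] = best_time([exe])
    return timings
```

Calling `compare_flags()` returns a dict mapping each flag set to its best run time. Taking the minimum over a few repeats reduces noise from other activity on the machine; comparing accuracy would additionally require saving and diffing the program output for each build.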

oh_moose · ‎02-07-2008

-xT is equivalent to -march=core2 (just to make this discussion a bit more transparent)

I assume you have an Intel Core 2 processor. Apparently the Intel hardware team and the Intel compiler team have done a good job of coordinating their efforts to improve the performance of the Core 2 architecture over its predecessors. Your results may be different if you run the program on an older Intel Pentium 4.

If you do not see any difference between -O2 and -O3, then perhaps that is because your simple (?) test application does not benefit from the "aggressive data dependency analysis" (see documentation).

You could use the options -S -fverbose-asm -fsource-asm to produce assembler code and study the effects. Of course, it would be nicer to have official documentation from Intel.

TimP · ‎02-07-2008

A little more information would be needed to give much of an answer. The default (without -xT) changed, for the 64-bit compiler, when ifort 10.0 came in. -xT with -O1 or higher asks the compiler to use SSE, SSE2, SSE3, and SSSE3 code where possible. If you don't have any vectorizable source code, the main difference is in whether you choose x87 code with all expressions evaluated in double precision (if you have a 32-bit compiler) or SSE. If you have code which needs -assume protect_parens or -fp-model precise, the results of not using that option with default real will be entirely different between x87 and SSE code.
Simply adding -xT to -O0 probably does nothing.
The effect of -O1 changed between 9.1 and 10.0. With 10.0, it disables auto-vectorization. That could run faster if none of your loops have lengths suitable for vectorization, yet the compiler would otherwise generate vector code in case they do.
In my experience, -xT is usually slower than -xW or -xP, but it wasn't meant to be so. It's faster if your loop vectorizes with that option, but not with the others.
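
To illustrate the x87 versus SSE point above, here is a rough Python emulation, not actual compiler output. It assumes default real is single precision and mimics the two cases by rounding to single precision either after every operation (SSE-like) or only at the end (x87-like, with intermediates held in double). The helper names are made up for this sketch. The last two lines show why an option that honors parentheses, such as -assume protect_parens, can change results.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def single_steps(a, b):
    """(a + b) - a with every intermediate rounded to single precision."""
    return f32(f32(a + b) - a)


def wide_intermediate(a, b):
    """(a + b) - a evaluated in double, rounded to single only at the end."""
    return f32((a + b) - a)


a, b = f32(2.0 ** 24), f32(1.0)
print(single_steps(a, b))       # 0.0: 2**24 + 1 is not representable in single
print(wide_intermediate(a, b))  # 1.0: the double intermediate keeps the 1

# Reassociating a parenthesized expression changes the answer.
x = 1e-16
print((x + 1.0) - 1.0)  # 0.0
print(x + (1.0 - 1.0))  # 1e-16
```

The same source expression gives 0.0 or 1.0 depending only on where the rounding happens, which is the kind of difference you can see between 32-bit x87 code and SSE code even when nothing vectorizes.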