Intel® C++ Compiler

Intrinsics and -G7 questions


Two questions for the compiler team:

1) As many people have found, with ICC5 and ICC6 the -G7 option gives worse performance (by a few percent) than -G6 on many pieces of optimized code. This has been tested on numerous pieces of code and on both Willamette and Northwood P4s; -G7 is never measurably better than -G6.

Can anyone elaborate on what the -G7 option really does? I have read the manual, so I'd like to know more than "it optimizes your application to use as many of the features as possible of the processor you specify without making it incompatible with earlier processors."
Specifically, what makes (or can make) -G7 run slower than -G6 on P4s?

2) a) Can you explain why __m128, __m128i and __m128d are not compatible? What's more, they cannot be type-cast into one another. As intrinsics are not ambiguous, and the xmm register set is one for all of them, what was the reason for that?
b) Is it for future compatibility with a hypothetical xmm register set that would be split into 8 ps, 8 pd and 8 epi registers?
c) Is this... feature... present in ICC 7, too?

BTW, thanks for the ICC, best-of-the-best-of-the-best.


I can't speak for the compiler team, but I have used the combination -QxK -G7 extensively. Combinations such as -Qxi -G7 may not be effective as often. You may be more interested in finding out what happens in your code than in reading my answer. Why not use -S and compare some of your code generation results, or use VTune to find out what is happening?

Among the changes expected with -G7 would be avoidance of integer multiply and shift instructions. For example, multiply by 8 (a constant visible to the compiler) would be accomplished by add instructions. Masking operations, such as (i & 4095), are likely to change to longer instruction sequences.
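To illustrate the kind of substitution described above, here is a minimal C++ sketch of the two forms written out at source level. The function names are hypothetical, and this only shows that the two forms compute the same value; it is not actual compiler output, and which form is faster depends on the target processor and the surrounding code.

```cpp
#include <cstdint>

// Multiply by 8 the "short" way: one shift instruction.
inline std::uint32_t times8_shift(std::uint32_t x) {
    return x << 3;
}

// Multiply by 8 the "add chain" way: three dependent adds.
// x + x = 2x, 2x + 2x = 4x, 4x + 4x = 8x.
inline std::uint32_t times8_adds(std::uint32_t x) {
    std::uint32_t t = x + x;  // 2x
    t = t + t;                // 4x
    return t + t;             // 8x
}
```

Note that the add chain is three dependent operations, so it is longer in code size and only wins where each add is much cheaper than a shift.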

The Intel compilers don't go as far as gcc -march=pentium4 does, in eliminating all possible shift or integer multiply instructions in favor of long add chains. Still, there may be situations where you have so much else going on, that the shorter code sequence would be faster, even though the -G7 sequence would be faster in isolation.

There are some changes with -G7 which are beneficial even on P-III; since you appear not to have those situations in your code, I won't pursue that.
First, thanks for the answer; sorry for taking so long to reply. During the last few days I found an hour to look at the asm output. BTW, a slight change in the variables used caused -G7 performance to vary; -G6 speed was stable and either equal or slightly faster. On the worse-performing code, the difference between -G6 and -G7 was in the range of 15%.

Not counting one "inc eax" changed to "add eax,1", the following changes were made in the -G7 version:
- subsequent psrldq instructions were blocked (grouped together) instead of interleaved,
- all non-dependent int adds were blocked,
- all int adds used in address calculations were pushed one instruction down, closer to the dependent instruction.

While P4's ooo-ness is incredible, all the changes I found were anti-ILP. Any educated guess why?


It's pretty hard to tell from your description exactly why the code might be slower with -G7 than with -G6 when running on a Pentium 4 processor.

As Tim Prince replied, -G7 controls the relative cost the compiler assigns to various instructions so that it more closely matches their cost on the Pentium 4 processor. In particular, adds are used to replace left shifts and multiplies when that can be done, when it is cheaper than the corresponding shift/multiply, and when it doesn't hurt code size "too much".

Inc/decs are documented in the Pentium 4 optimization guide:
They can be slower than the equivalent add/sub instructions because inc/dec only partially set the flags register. So the compiler chooses not to use those instructions.

As for why the code differences seem to be reducing ILP, there are two possible reasons:
1. This is an unintended side-effect of some other instruction selection issue/G7 optimization issue.
2. The compiler is trying to reduce register pressure. Generally spills/reloads are more expensive than many forms of simple arithmetic, or simple memory loads. So the compiler will reduce ILP in many cases to try to reduce register pressure, knowing that the significant OOOness of the Pentium 4 processor will do a better job of regaining the ILP than it would if there were more register spills and reloads.

I can't really make many other guesses about what the issues are without more access to source code and being able to use VTune to zero in on the causes of the performance degradation. If you are interested in pursuing in more depth exactly what is causing the degradation in your application please contact customer support.

Kevin B. Smith
IA32 Code Generator Team Leader
> 1. This is an unintended side-effect of some other
> instruction selection issue/G7 optimization issue.
> [...]
> Kevin B. Smith
> IA32 Code Generator Team Leader

I am very happy to hear from you. The reason turned out to be twofold, though the loop (here pseudo-) code is very simple (combines are simple sse2 int intrinsics):

x = *(from);
y = *(from + 16);
*(to) = combine(x, y);
x = *(from + 32);
*(to + 16) = combine(y, x);
*(to + 32) = combine(x, key);

ICC did two things here. One, not related to -G7, is the problem of NOT introducing a new register variable when it is practical. Can't the compiler do something like an ooo CPU core and "rename variables" when there is no dependency? Changing the code to the version below solved the problem; it also allowed the reads and writes to be blocked.

x = *(from);
y = *(from + 16);
z = *(from + 32);
*(to) = combine(x, y);
*(to + 16) = combine(y, z);
*(to + 32) = combine(z, key);

The second issue is more closely connected to the P4 L1/L2 cache write problem, described in detail (and still not answered) in the "P4 L1/L2..." thread. The -G7 option (unlike -G6) changed the order of the last two writes: it executes *(to + 32) first and only then *(to + 16). Changing the order of the writes turned out to be the worst performance inhibitor here. Any insights into why, and workarounds if there are any, would be very, very appreciated, either here or in the other thread.

Once again, thank you for your time. Regards, Anna