Check out the new VTune(TM) Analyzer optimization guide for Core(TM) i7 processors

Shannon_C_Intel · ‎02-20-2009

We have a new presentation describing how to do an initial performance investigation on Intel Core i7 platforms. It includes which events to sample and what methodology to use. See it here and let us know what you think.

TimP · ‎02-24-2009

Quoting - Shannon Cepeda (Intel)

We have a new presentation describing how to do an initial performance investigation on Intel Core i7 platforms. It includes which events to sample and what methodology to use. See it here and let us know what you think.

It's a good start.
Some random comments:
It's good that you hint that it's easier to "improve" CPI by generating more instructions than by actually improving performance.
You talk about having Intel compilers remove branches for alignment as if it had already been accomplished. This might be confusing when one drills down and sees the extra code branches.
You might mention the gcc setting -march=barcelona -msse4 in order to optimize for Core i7 (-msse3 if optimizing for both i7 and those other CPUs).
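
To make the CPI point concrete, here is a minimal sketch in C. The function names and the event counts used with it are hypothetical, not taken from the presentation; it only shows that a build can have a lower CPI and still take more cycles.

```c
#include <stdint.h>

/* Cycles per instruction from two raw event counts.
   Assumes instructions is non-zero. */
double cpi(uint64_t cycles, uint64_t instructions)
{
    return (double)cycles / (double)instructions;
}

/* Returns 1 when build B shows a lower CPI than build A
   but needs more cycles, i.e. CPI "improved" while the run got slower. */
int lower_cpi_but_slower(uint64_t cycles_a, uint64_t instr_a,
                         uint64_t cycles_b, uint64_t instr_b)
{
    return cpi(cycles_b, instr_b) < cpi(cycles_a, instr_a)
        && cycles_b > cycles_a;
}
```

With made-up counts of 1000 cycles and 500 instructions for build A, and 1100 cycles and 1100 instructions for build B, CPI drops from 2.0 to 1.0 although build B is slower, which is why elapsed cycles or time should be checked alongside CPI.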

srimks · ‎02-25-2009

Quoting - Shannon Cepeda (Intel)

We have a new presentation describing how to do an initial performance investigation on Intel Core i7 platforms. It includes which events to sample and what methodology to use. See it here and let us know what you think.

Thanks Shannon,

It's very informative. The ideas in these slides could probably help with any Intel processor from the last 2-5 years, such as the quad-core 5300 or Itanium, as well as the current Core i7.

Do you have more slides or documentation like this about digging into application analysis with Intel VTune? I already refer to the Intel VTune book and David Levinthal's analysis (his 16-page document), but I am looking for more if I can get it.

~BR

Eric_M_Intel2 · ‎03-05-2009

Quoting - tim18

It's a good start.
Some random comments:
It's good that you hint that it's easier to "improve" CPI by generating more instructions than by actually improving performance.
You talk about having Intel compilers remove branches for alignment as if it had already been accomplished. This might be confusing when one drills down and sees the extra code branches.
You might mention the gcc setting -march=barcelona -msse4 in order to optimize for Core i7 (-msse3 if optimizing for both i7 and those other CPUs).

The way to "remove" the extra branches for data alignment is to compile with /QxSSE4.2 (-xSSE4.2).

In the presentation I tried to explain this: "*Advanced Note: Intel SSE included movups, but on previous IA-32 and Intel 64 processors movups was not the best way to load data from memory. Now on Intel Core i7 processors movups and movupd are as fast as movaps and movapd on aligned data, so when the Intel Compiler uses /QxSSE4.2 it removes the if condition that checks for aligned data, speeding up loops and reducing the number of instructions needed to implement them."

More Info can be found at: http://intel.wingateweb.com/US08/published/sessions/NGMS002/SF08_NGMS002_100r.pdf
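
As a rough illustration of the alignment check discussed here, the following sketch writes the two forms by hand with SSE intrinsics. It is not compiler output: the function names are hypothetical, and it assumes an x86 compiler that provides xmmintrin.h. The first function carries an explicit run-time test that selects an aligned-load loop (movaps) or an unaligned-load loop (movups); the second uses unaligned loads everywhere and needs no test.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_loadu_ps, _mm_mul_ps, ... */

/* Versioned form: test alignment once, then run one of two loops. */
void scale_versioned(float *dst, const float *src, float k, size_t n)
{
    const __m128 vk = _mm_set1_ps(k);
    size_t i = 0;

    if ((((uintptr_t)src | (uintptr_t)dst) & 15u) == 0) {
        /* Both pointers are 16-byte aligned: aligned loads and stores. */
        for (; i + 4 <= n; i += 4)
            _mm_store_ps(dst + i, _mm_mul_ps(_mm_load_ps(src + i), vk));
    } else {
        /* At least one pointer is misaligned: unaligned loads and stores. */
        for (; i + 4 <= n; i += 4)
            _mm_storeu_ps(dst + i, _mm_mul_ps(_mm_loadu_ps(src + i), vk));
    }
    for (; i < n; ++i)  /* scalar remainder */
        dst[i] = src[i] * k;
}

/* Single-version form: unaligned loads and stores, no alignment test. */
void scale_unaligned(float *dst, const float *src, float k, size_t n)
{
    const __m128 vk = _mm_set1_ps(k);
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dst + i, _mm_mul_ps(_mm_loadu_ps(src + i), vk));
    for (; i < n; ++i)  /* scalar remainder */
        dst[i] = src[i] * k;
}
```

Both functions produce the same results for any pointers. Whether the second form is as fast on aligned data depends on the processor, which is the point the advanced note makes about Core i7.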

TimP · ‎03-05-2009

Quoting - Eric Moore (Intel)

In the presentation I tried to explain this: "*Advanced Note: Intel SSE included movups, but on previous IA-32 and Intel 64 processors movups was not the best way to load data from memory. Now on Intel Core i7 processors movups and movupd are as fast as movaps and movapd on aligned data, so when the Intel Compiler uses /QxSSE4.2 it removes the if condition that checks for aligned data, speeding up loops and reducing the number of instructions needed to implement them."

Eric,
The compiler doesn't remove versioning for alignment, even when both branches are identical. Compiler team people say this isn't feasible; it's not even committed for a future version. Yes, it would be desirable to remove an aligned branch when it differs only in using aligned vs unaligned loads, but, unless that actually happens, I question the docs saying it does so. Yes, it is possible to edit the asm code, remove aligned branches, and demonstrate your premise.
The reasoning applies also to Barcelona and newer AMD processors, but the simplification would need to be available under -msse3 options to work for them.
Here are the Core i7 specific optimizations actually present in the newer compilers:
The xsse4.2 option sometimes suppresses full cache line unrolling which is done to avoid a cache line split on earlier CPUs, but doesn't do so consistently. This also fits the recommendation to use movups rather than lengthier alternatives.
The sse4.2 option also engages the sse4 horizontal dot product instructions, such as dpps, but in some examples this only partly recovers performance which the 9.1 compilers gained by other means; it may be better to change source code so as to avoid this.
One final thing; the sse4.2 option places compares correctly for compare-and-jump micro-op fusion.
These few changes don't carry much weight when there is any chance that the code may need to run on the sse4.1 CPUs, unless it can be assured that there will be no run-time fault due to invoking the newer compile option.
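
One way to guard against running SSE4.2 code on a CPU that lacks it is a run-time feature check before choosing a code path. The sketch below is an assumption for illustration only: it uses the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports on an x86 target, not any Intel compiler mechanism, and the enum and function names are hypothetical.

```c
/* Returns 1 when the running CPU reports SSE4.2, else 0.
   Relies on GCC/Clang builtins available on x86 targets. */
int cpu_has_sse42(void)
{
    __builtin_cpu_init();  /* safe to call more than once */
    return __builtin_cpu_supports("sse4.2") != 0;
}

enum code_path { PATH_BASELINE = 0, PATH_SSE42 = 1 };

/* Chooses which build of a routine to call on this machine. */
enum code_path pick_code_path(void)
{
    return cpu_has_sse42() ? PATH_SSE42 : PATH_BASELINE;
}
```

A caller would check pick_code_path() once at startup and dispatch to a routine compiled with the newer option only when PATH_SSE42 is returned, falling back to the baseline build otherwise.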

TimP · ‎05-20-2009

I've been asking questions about some of these aspects of optimization for Intel Core i7 and Xeon 5500 CPUs, and why there is little difference between the SSE4.1 and SSE4.2 implementations. By the way, I spent about a day finding out the Intel-approved terminology: Intel Microarchitecture code-named Nehalem. No, I haven't seen it so identified in VTune; expect Core i7 there, even though certain OEMs never use that terminology, which in any case is restricted to single-socket platforms.
The topic seems more appropriate on the AVX forum, so I will post there.