CPI architecture/port concepts

srimks · ‎02-28-2009

Hi.

What are the concepts for designing ports within a processor to minimize CPI? Normally, for multi-core processors, a CPI below ~1.0 is targeted for better performance. For Xeon processors, it has been suggested that CPI should be targeted at around ~0.75 - 0.50.

Would this range be, or has it been, targeted much lower for Nehalem (45nm) or the future Sandy Bridge (32nm) processors?

I am already aware of how to improve CPI from the programming point of view and with the various events; here the question is framed from the architecture or port-design point of view.

Note: CPI can get as low as 0.25 cycles per instruction with current Intel processors.
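
For reference, a minimal sketch of how the figures above relate, assuming CPI is computed in the usual way as unhalted core cycles divided by retired instructions (the VTune events CPU_CLK_UNHALTED.CORE and INST_RETIRED.ANY). The sample counts are invented for illustration; a CPI of 0.25 corresponds to retiring 4 instructions per cycle.

```python
def cpi(unhalted_core_cycles: int, instructions_retired: int) -> float:
    """Cycles per instruction from two event counts."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return unhalted_core_cycles / instructions_retired


def ipc(cpi_value: float) -> float:
    """Instructions per cycle is the reciprocal of CPI."""
    return 1.0 / cpi_value


# Hypothetical counts, not measurements from any real run.
cycles = 3_000_000
instructions = 4_000_000
sample_cpi = cpi(cycles, instructions)
print(f"CPI = {sample_cpi:.2f}, IPC = {ipc(sample_cpi):.2f}")
print(f"CPI 0.25 means {ipc(0.25):.0f} instructions retired per cycle")
```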

~BR

TimP · ‎02-28-2009

Minimizing CPI is not a very interesting goal. For one thing, it excludes use of parallel instructions (e.g. vectorization). It is usually possible to use more instructions than the most efficient way to finish the job would require, whether efficiency is measured by number of clock ticks or instructions. So you can minimize CPI by adding unnecessary instructions, taking care only that the extra instructions are faster than the useful ones.
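
A small sketch of that argument with made-up numbers: padding a job with cheap extra instructions lowers CPI while the run takes more cycles. The instruction counts and per-instruction costs are hypothetical.

```python
def cycles_and_cpi(groups):
    """groups: list of (instruction_count, cycles_per_instruction) pairs.
    Returns (total_cycles, overall_cpi)."""
    instructions = sum(n for n, _ in groups)
    cycles = sum(n * c for n, c in groups)
    return cycles, cycles / instructions


# Hypothetical job: 1000 useful instructions averaging 1.0 cycle each.
useful = [(1000, 1.0)]
# Same job padded with 1000 unnecessary instructions at 0.25 cycle each.
padded = [(1000, 1.0), (1000, 0.25)]

base_cycles, base_cpi = cycles_and_cpi(useful)
pad_cycles, pad_cpi = cycles_and_cpi(padded)
print(f"useful only: {base_cycles:.0f} cycles, CPI {base_cpi:.3f}")
print(f"padded:      {pad_cycles:.0f} cycles, CPI {pad_cpi:.3f}")
```

The padded version reports the better CPI and the worse cycle count, which is why CPI alone is a poor target.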
Minimizing CPI could be even more counter-productive than maximizing threaded performance scaling by choosing a method which maximizes serial execution time. That kind of goal also excludes use of vectorization in combination with threaded parallel execution.
In MPI applications, the low CPI which you favor is seen in MPI_Wait spin loops, where CPI is immaterial, unless you see an advantage in maximizing the number of instructions. We do favor spin waits for profiling convenience, but in release configurations, sched_yield() (or some Windows equivalent) is invoked after a small elapsed time, so as to give up the CPU to other potential use, rather than hogging it so as to execute and discard the maximum number of instructions. It's easy to increase the number of instructions by setting the environment variables to increase the spin wait time prior to sched_yield.
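
The spin-then-yield pattern described above can be sketched as follows. This is an illustration only, not the code of any MPI library: the function name and the spin_seconds parameter (standing in for the environment-controlled spin time) are hypothetical, and time.sleep(0) is used as a fallback where os.sched_yield is unavailable.

```python
import os
import threading
import time


def spin_then_yield(is_done, spin_seconds=0.001):
    """Busy-wait on is_done() for spin_seconds, then yield the CPU between polls.
    Returns (spin_polls, yield_polls) so the two phases can be compared."""
    yield_cpu = getattr(os, "sched_yield", lambda: time.sleep(0))
    spin_polls = yield_polls = 0
    deadline = time.monotonic() + spin_seconds
    while not is_done():
        if time.monotonic() < deadline:
            spin_polls += 1      # pure spinning: many cheap, discarded instructions
        else:
            yield_polls += 1
            yield_cpu()          # give the core to other runnable work
    return spin_polls, yield_polls


done = threading.Event()
threading.Timer(0.05, done.set).start()   # hypothetical "message arrives" after 50 ms
print(spin_then_yield(done.is_set, spin_seconds=0.001))
```

Raising spin_seconds increases the number of polls executed in the spin phase without making the awaited event arrive any sooner.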
Certain vendors' MPI implementations even look at the general load on the system, preferring spin waits (high number of instructions, thus low CPI) when the system is dedicated to one task, but yielding the CPU sooner when other activity is detected. So then you can't get low CPI when the system is busy with multiple tasks, nor would you have any reason to try, when the objective is throughput, not low CPI.

srimks · ‎02-28-2009

Tim.

Yes, I totally agree with you. I have also seen that with explicit vectorization, the CPI for code executed on SMP systems increases, and it also has a negative impact on bus utilization. Articles on VTune from Intel (David Levinthal) recommend keeping CPI within ~0.75 - 0.5. I am not sure whether this claim is empirical or a realistic benchmark.

I did ask Intel how they came up with such empirical concepts for CPI and its relation to other cycle events. Somehow I don't see any mathematical relation, nor any analytical or concrete proofs from them.

But the query I had asked was: what are the architectural/port features that bring CPI below ~0.75 by design for multi-core processors, and what would be the targeted range of CPI for Nehalem (45nm) & Sandy Bridge (32nm)?

~BR

TimP · ‎03-01-2009

Even if vectorization raises CPI from 0.7 to 0.9, it would usually be a clear win, as at least twice as much useful work is accomplished for each instruction. I think I disagree with your idea about "negative impact on bus utilization." Vectorization does tend to run up against bus bandwidth limitations. When care isn't taken to avoid splitting loops, the same data may be required to cross the bus again. In that case bus utilization numbers are useless; it clearly takes longer to do the job over multiple times. But, there is no virtue in reducing bus saturation by slowing down the rate at which useful work is accomplished, if that is what you advocate.
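
To put numbers on the first point, a sketch in which execution time is taken as instruction count times CPI: if vectorization halves the instruction count (two elements per instruction) while CPI rises from 0.7 to 0.9, total cycles still drop. The instruction count is hypothetical.

```python
def total_cycles(instructions: int, cpi: float) -> float:
    """Execution time in core cycles = instruction count x CPI."""
    return instructions * cpi


scalar_instructions = 1_000_000                    # hypothetical scalar loop
vector_instructions = scalar_instructions // 2     # assumes 2 elements per instruction

scalar = total_cycles(scalar_instructions, 0.7)
vector = total_cycles(vector_instructions, 0.9)
print(f"scalar: {scalar:.0f} cycles, vector: {vector:.0f} cycles, "
      f"speedup: {scalar / vector:.2f}x")
```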
Nehalem is even more dependent on vectorization for good performance. I wouldn't be surprised to see increases in CPI on Sandy Bridge, with wider vector instructions.
I doubt that the designers of the new chips set reduced CPI in general, rather than real performance gains, as a goal. Intel learned from the P4 experience that marketing-oriented goals like increasing CPU clock frequency and the number of instructions executed for a given job, without regard to useful performance and power consumption, were not the best way to go. I can be derogatory when that message hasn't reached software performance workers.
The bug which began to be dealt with in Linux compiler 11.0/081, where vectorized loops with multiple assignments were often distributed (split) down to a separate loop for each assignment, kept CPI artificially depressed. It could require 60% more than the optimum number of instructions to accomplish the job. So you would never find the performance problem if you looked only at CPI.

Peter_W_Intel · ‎03-01-2009

My comments are -

1. The CPI ratio is not the first consideration when optimizing your code - it is an architecture-level optimization.

2. For a multi-core system, the first consideration is Parallelism (algorithm-level optimization).

Parallelism = CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.TOTAL_CYCLES * Number-of-Cores
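
A worked instance of the formula, as a sketch only: it assumes both event counts are aggregated over the same cores and the same interval, so the result reads as the average number of cores that were unhalted. The counts are hypothetical.

```python
def parallelism(unhalted_core: int, total_cycles: int, num_cores: int) -> float:
    """Parallelism = CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.TOTAL_CYCLES * cores."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return unhalted_core / total_cycles * num_cores


# Hypothetical counts for a 2-core system: the cores were unhalted for
# 1.5e9 of the 2.0e9 cycles sampled.
print(parallelism(1_500_000_000, 2_000_000_000, 2))
```

A result equal to Number-of-Cores would mean every core was unhalted for the whole interval; it does not by itself show that the cycles were spent on useful work (a spin wait also counts as unhalted).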

Regards, Peter

srimks · ‎03-01-2009

Peter,

As quoted by you -

"For multi-core system, first consideration is Parallelism (algorithm level optimization).
Parallelism = CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.TOTAL_CYCLES * Number-of-Cores"

(a) In the EBS selection of events, I don't see "CPU_CLK_UNHALTED.TOTAL_CYCLES"; the only events I see are CPU_CLK_UNHALTED.CORE.samples, CPU_CLK_UNHALTED.CORE% & CPU_CLK_UNHALTED.CORE.events. Did I miss something?

(b) Which of the above three CPU_CLK_UNHALTED.CORE.xxxxx events should I use for "CPU_CLK_UNHALTED.CORE"?

(c) My system is a Quad Core 5300 machine, which means it has 2 dies with 4 cores each, so in total it has 8 cores. So I can check whether parallelism is successful with the above formula, as you said.

Would the value obtained from the above formula indicate that parallelism has been 100% effective on the Quad Core 5300?

Could you quote some examples?

~BR

srimks · ‎03-01-2009

Tim,

I had a case where CPI went beyond 1.0 with vectorization, even after properly using compiler calls. Targeting a CPI of ~0.5 - 1.0 is no doubt good; I have been doing that with proper tuning of the code, and I did achieve it successfully in some cases.

Thanks for your inputs.

~BR

TimP · ‎03-01-2009

Quoting - Peter Wang (Intel)

1. The CPI ratio is not the first consideration when optimizing your code - it is an architecture-level optimization.

2. For a multi-core system, the first consideration is Parallelism (algorithm-level optimization).

Parallelism = CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.TOTAL_CYCLES * Number-of-Cores

Regards, Peter

CPI was listed as 2nd most important consideration on the .pdf posted here last week. Do you agree that's too high?
Effective use of parallel instructions (vectorization) should be undertaken before threaded parallelism. I don't think you meant that, as it's not in your formula.

srimks · ‎03-01-2009

Tim/Peter.

As quoted: "CPI was listed as 2nd most important consideration on the .pdf posted here last week."

Can I have the link to the PDF that Tim is talking about?

I have an article published on VTune which talks about CPI and other optimization scenarios, "Using Intel VTune Performance Analyzer Events/ Ratios & Optimizing Applications": http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/

~BR

Peter_W_Intel · ‎03-01-2009

What I said about event names applies to Intel Core 2 Duo processors.

I have no Intel Core 2 Quad processor on hand - sorry, I can't provide you with the corresponding event names.

Regards, Peter

Peter_W_Intel · ‎03-01-2009

Another thing I would recommend is that developers use Intel Thread Profiler to find the Concurrency Level (CL) of their code. Based on the results, the developer can 1) change serial code to parallel; 2) reduce overhead on sync objects; 3) reduce wait time; 4) balance the workload across threads / processors, etc.

Regards, Peter

srimks · ‎03-02-2009

Peter,

I think that, being an Intel guy, you shouldn't answer the query by saying "I have no Intel Core 2 Quad processor on hand - sorry, I can't provide you corresponding event names." Rather than leaving it negative, you should direct the query to some other Intel guy who can answer it w.r.t. the Quad Core 5300, and say that someone from Intel will be responding here soon.

Intel people should take responding to queries on ISN seriously.

From what I have observed over my last 4 months on this ISN forum, better answers and quicker responses are probably given by non-Intel people than by Intel people themselves. I really appreciate those people for their inputs and time.

~BR

Thomas_W_Intel · ‎03-02-2009

Quoting - srimks

Tim/Peter.

As quoted: "CPI was listed as 2nd most important consideration on the .pdf posted here last week."

Can I have the link to the PDF that Tim is talking about?

I have an article published on VTune which talks about CPI and other optimization scenarios, "Using Intel VTune Performance Analyzer Events/ Ratios & Optimizing Applications": http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/

~BR

BR,

I think section 2 of this article corresponds to what Peter was trying to point out: You need to ensure that your application is properly threaded (application level) before you start worrying about CPI (architecture level). VTune can assist you in verifying this, if you measure how many of the available clockticks you are actually using. Intel Thread Profiler is another tool that can help you at this stage.

CPI is merely a measure of how well the hardware is able to execute the instruction flow. Looking at the CPI may guide you to the portions of your code where you can take better advantage of the underlying CPU architecture. However, CPI doesn't tell how useful the executed instructions actually are. For example, a different algorithm might result in a much better running time -- and at the end of the day, this is what you care about, isn't it? Similarly, different instructions like vector instructions can improve your running time. If your CPI increases but your running time decreases by switching to vector instructions, who cares?

Having a high CPI just tells you that there is room for improvement on the architectural level. It doesn't tell you that there isn't any other way to improve the application.

The value 0.75-0.5 is based on experience of what you can achieve in well-tuned CPU-bound applications. In other words, if you already have a CPI of 0.5 for a function, don't be frustrated if you cannot improve on that. On the other hand, if you have a function with a CPI of 10 and it is one of the hot functions in your application and you have exploited all the other means to improve on a system and an application level, then you should look into this.
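
The triage described above could be sketched roughly as follows. The function name, the sample profile, and the cut-off values (the share of clockticks that counts as "hot", the CPI that counts as "already good") are assumptions for illustration, not VTune output or Intel guidance beyond the 0.5-0.75 range mentioned in this post.

```python
def functions_worth_tuning(profile, hot_share=0.10, good_cpi=0.75):
    """profile: dict of name -> (clockticks, instructions_retired).
    Returns (name, cpi, share) for hot functions whose CPI exceeds good_cpi,
    sorted by share of total clockticks, largest first."""
    total = sum(ticks for ticks, _ in profile.values())
    picks = []
    for name, (ticks, instructions) in profile.items():
        share = ticks / total
        cpi = ticks / instructions
        if share >= hot_share and cpi > good_cpi:
            picks.append((name, cpi, share))
    return sorted(picks, key=lambda item: item[2], reverse=True)


# Hypothetical per-function event counts.
profile = {
    "solve": (6_000_000, 600_000),      # CPI 10.0 and hot: worth a look
    "kernel": (3_000_000, 6_000_000),   # CPI 0.5: hot but already well tuned
    "logging": (500_000, 100_000),      # CPI 5.0, but a small share of the run
}
for name, cpi, share in functions_worth_tuning(profile):
    print(f"{name}: CPI {cpi:.2f}, {share:.0%} of clockticks")
```

This only ranks candidates; as noted above, system-level and application-level improvements should be exhausted first.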

Kind regards

Thomas

Peter_W_Intel · ‎03-03-2009

Hi,

Today I found an Intel Core 2 Quad machine and confirmed that the event CPU_CLK_UNHALTED.TOTAL_CYCLES exists on this system (actually, the events on Core 2 Quad are similar to those on Core 2 Duo).

Do you use the latest product, v9.1 Update 1?

I think that 5300 is Core 2 Quad, T5300 is Core 2 Duo, and E5300 is Pentium (which has no CPU_CLK_UNHALTED.TOTAL_CYCLES).

You can use the vtl command to export the supported event names on your system - "vtl query -c sampling" - to check if CPU_CLK_UNHALTED.TOTAL_CYCLES exists.

Regards, Peter

srimks · ‎03-03-2009

Peter,

I tried the command below as suggested, but got the following message -
-----
$ vtl query -c sampling
VTune Performance Analyzer 9.1 for Linux* build 152
Copyright (C) 2000-2008 Intel Corporation. All rights reserved.

Could not get NUM_PHYSICAL_CPUS value from environment XML file
-----

The processor I am using is "Intel Xeon CPU X5355 @ 2.66GHz", which is an 8-core machine.

The only events which I see on this machine in GUI mode are -

CPU_CLK_UNHALTED.CORE%
CPU_CLK_UNHALTED.BUS%
CPU_CLK_UNHALTED.CORE.events
CPU_CLK_UNHALTED.BUS.events
CPU_CLK_UNHALTED.BUS.samples

I did run all the EBS events from Advance Performance Tuning, Basic Tuning, etc. with VTune v9.1.

Could you suggest the equivalent of the Parallelism formula you quoted for the "Intel Xeon CPU X5355 @ 2.66GHz" machine -

"Parallelism = CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.TOTAL_CYCLES * Number-of-Cores"

~BR

Peter_W_Intel · ‎03-03-2009

Thanks for the detailed info on your processor - this is an Intel Xeon processor, a quad-core server processor which was launched 2 years ago. This processor is a first-generation product for the Dual Core architecture - it is NOT in the Intel Core 2 Quad family. That is why there is no event named CPU_CLK_UNHALTED.TOTAL_CYCLES.

In my other thread, I suggested that you use Intel Thread Profiler to find out your code's parallelism.

Thanks, Peter

Thomas_W_Intel · ‎03-04-2009

You can also get an impression of the concurrency level of your application by looking at the "Sampling Over Time" view in VTune. It allows you to depict which threads are working over time. In order to use it on Linux, you need to use VTune 9.0 update 1 or later and set the environment variable VTUNE_OVER_TIME.

There are certain pitfalls with this methodology, e.g. you might get the impression that all threads are working, but in fact they are waiting in busy loops. And even if you see that a thread is waiting, it is usually hard to identify from sampling why the thread is waiting.

The advantage over Thread Profiler is that the complete system is monitored. This is important if there are several applications involved. Furthermore, the overhead is lower and you can restrict your measurement to a time interval instead of the complete run.

Kind regards
Thomas

srimks · ‎03-04-2009

Thanks to all for their responses, but the query as asked in the beginning was -

"Would this range be, or has it been, targeted much lower for Nehalem (45nm) or the future Sandy Bridge (32nm) processors?" This was basically to ask what the experimentally determined RATIO & LIMITS numbers are for commonly used EBS events on Nehalem (Core i7).

Do the event RATIOS & LIMITS presented in http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/ reflect a VTune analysis of Nehalem Core i7? Please confirm.

Are the numbers mentioned in the above link empirical, or are they justified by analysis of different test cases of an executed application?

Please confirm.

~BR

Thomas_W_Intel · ‎03-05-2009

The range for CPI is still the same for Core i7 (Nehalem), and the theoretical limit is still 0.25. The recommendation is based on measurements with well-tuned CPU-bound applications.

Other ratios in this text do not apply to Core i7 anymore, e.g. the FSB is replaced by QPI with completely different events in the uncore.

srimks · ‎03-05-2009

Thomas,

Thanks for making a correction w.r.t. this article on LIMITS & RATIOS for Core i7. Could you elaborate more on "the FSB is replaced by QPI with completely different events in the uncore"? What is uncore here?

True, FSB has been replaced by QPI, so the LIMITS & RATIOS numbers will change. This article mentions "45nm Hi-k", which also refers to Nehalem; please correct me if this is wrong.

~BR

srimks · ‎03-05-2009

Thomas/Peter,

Nehalem (Intel Core i7) is one of Intel's latest processors in "45nm Hi-k" silicon-based technology. From the links I have read on Nehalem, it seems that the old tradition of having an FSB (Front Side Bus) in Intel processors has been replaced by QPI (QuickPath Interconnect).

This article discusses the LIMITS & RATIOS of events w.r.t. FSB, so this analysis can't be considered for Nehalem, but the article certainly gives insight into the key EBS events to watch while using VTune to profile an application on a micro-architecture.

The only thing in this article that can be considered for Nehalem is the theoretical limit of CPI ~ 0.25, as quoted by you (Thomas); the remaining LIMITS & RATIOS content can't be considered for Nehalem, because the article doesn't consider measurements done w.r.t. QPI.

I haven't seen any VTune articles (by David Levinthal or by anyone else from Intel) distinguishing the VTune profiling analysis w.r.t. specific Intel multi-core processors (Intel Xeon, Core Duo, Core2 Duo, Core2 Quad, Core i7, Intel Pentium D and Pentium) to give Intel users a better understanding of profiling with VTune. I think Intel should consider publishing articles on VTune that are specific to the EBS events of each multi-core processor, for better learning by its users.

~BR