CPI architecture/port concepts - Page 2

srimks · 02-28-2009

Hi.

What are the concepts for designing ports within a processor to minimize CPI? Normally, for multi-core processors, a CPI below ~1.0 is targeted for better performance. For Xeon processors, it has been suggested that a CPI of around 0.75 to 0.50 should be targeted.

Would this range be, or has it been, targeted much lower for Nehalem (45nm) or the future Sandy Bridge (32nm) processors?

This question is not about how to improve CPI from a programming point of view or by using various events; I already know those procedures. It is framed from an architecture or port-design point of view.

Note: CPI can get as low as 0.25 cycles per instruction with current Intel processors.

~BR

robert-reed · 03-06-2009

Quoting - srimks

Nehalem (Intel Core i7) is one of Intel's latest processors, built on "45nm Hi-k" silicon technology. Referring to the links for Nehalem, it seems that the old tradition of having an FSB (Front Side Bus) in Intel processors has been replaced by QPI (QuickPath Interconnect).

The LIMITS and RATIOS article has a lot of good information but is already starting to show its age. The use of "45nm Hi-k Intel Core processor" is clearly a way of writing a public article about Nehalem before the Intel Core i7 nomenclature was announced, and the article should be re-edited at least to use Intel Core i7 (you can find the same thing in some VTune analyzer documentation, or at least the "45nm" part). The focus on CPI as the first thing to look at when doing performance tuning, which has been of value in certain contexts, is less important as a general diagnostic aid. One place where it was used a lot was in transaction processing, but usually in close association with path length (the average number of instructions in the transaction processing loop), and only because CPI * path length = a measure of time (some number of cycles), which is the average transaction time. The goal then is to minimize both CPI and path length to improve performance. Achieving the minimum CPI, which hasn't changed from Intel Core to Intel Core i7, is an idealized and impossible goal, since it would mean retiring four instructions every cycle (these are all four-wide issue machines), and would require a lot of processing to cover the latency of even a single memory reference. (To be precise, these are four-wide issue micro-instruction architectures, but many instructions translate to a single micro-instruction, so the numbers are usually close.)
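The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the counter totals and the path length are made-up numbers, and the function names are hypothetical rather than part of any VTune API. In practice the two inputs to `cpi` would come from events such as CPU_CLK_UNHALTED.CORE and INST_RETIRED.ANY.

```python
# Illustrative only: all numbers below are invented for the example.

ISSUE_WIDTH = 4                # four-wide issue machine, as described above
MIN_CPI = 1 / ISSUE_WIDTH      # idealized lower bound: 0.25 cycles per instruction


def cpi(unhalted_cycles, instructions_retired):
    """Cycles per instruction from two raw event totals."""
    return unhalted_cycles / instructions_retired


def transaction_cycles(cpi_value, path_length):
    """Average transaction time in cycles: CPI * path length."""
    return cpi_value * path_length


# Hypothetical totals for one measurement run.
cycles = 1_500_000
instructions = 2_000_000
path_length = 40_000           # average instructions per transaction (assumed)

measured_cpi = cpi(cycles, instructions)
per_transaction = transaction_cycles(measured_cpi, path_length)
ideal = transaction_cycles(MIN_CPI, path_length)

print(f"measured CPI:            {measured_cpi:.2f}")
print(f"cycles per transaction:  {per_transaction:.0f}")
print(f"idealized lower bound:   {ideal:.0f}")
```

Because the product is what matters, halving the path length helps exactly as much as halving CPI, which is why the two were tracked together.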

Much of this primary focus on CPI has been superseded by newer techniques like looking for stalls coincident with hot spots. And newer techniques are coming online as Performance Monitoring Unit improvements become available, so the picture will evolve. However, the more things change, the more they stay the same. Among the new features of Intel Core i7 is QPI. It still bears the same general relation to the core architecture, even though it connects to different and more things (along with the integrated memory controllers), which complicates the formulae. There may be similar ratios (e.g., QPI saturation?) that have a bearing on some kinds of stalls.

I haven't seen any VTune articles (by David Levinthal or by anyone else from Intel) distinguishing VTune profiling analysis with respect to specific Intel multi-core processors (Intel Xeon, Core Duo, Core2 Duo, Core2 Quad, Core i7, Intel Pentium D and Pentium) to help Intel users better understand profiling with VTune. I think Intel should consider publishing articles on VTune that are specific to each multi-core processor's EBS events, so that its users can learn more effectively.

I assume you are aware of the reference section of the VTune analyzer documentation called Processor Events and Advice? It may not have the tutorial level you're looking for, or assemble all the formulae together into some complete, cycle-accounting whole, but it's a start. More should be forthcoming as we dig deeper and find the time to write about it. Stay on us. We appreciate your enthusiasm.

srimks · 03-07-2009

Hello Peter/Thomas/Robert.

Thanks for responding.

I am looking to profile a multi-file C++ application of 8,000-10,000 lines on Nehalem and compare it with an "Intel Xeon CPU X5355 @ 2.66GHz" processor. Could you suggest the key things to compare between the two processors, and the key things to check for performance on Nehalem, since the Nehalem processor has some new features compared with older Intel processors?

Do you think I should start a new thread (VTune Profiling on Nehalem), or can I continue in this thread?

~BR
Mukkaysh Srivastav

robert-reed · 03-16-2009

If you want to see how a program compares on two architectures, the place to start is with a comparative hot spot analysis of the same program running on comparable instances of the architectures, to see how individual functions scale. You might see a uniform scaling, or you might see some hot spots get hotter or cooler. Focus on those and drill down to the source code level to find the regions that are taking more or less time within the function. These changing hot spots are the most important to understand, since they'll have the biggest effect on your program. Cycle accounting, stall analysis, it's been called various things, but figuring out what's delaying the instructions is the next step. Though the architectures are different, they are also similar and have similar debug events that may be more or less effective in determining the state of the corresponding stages: all have a front end (instruction decoding) and a back end (resource scheduling, dispatch, retirement). The number of events of interest, particularly with the Intel Core™ i7 processor, is too large to enumerate here.
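To make the "uniform scaling versus hotter or cooler" idea concrete, here is a minimal Python sketch. It assumes you have already exported per-function times (or sample counts) from each run into a dictionary; the function names, the numbers, and the 15% tolerance are all invented for illustration and do not reflect any tool's export format.

```python
def compare_hotspots(times_a, times_b, tolerance=0.15):
    """Compare per-function times from two runs of the same program.

    times_a, times_b: dict mapping function name -> time (or samples).
    Returns (overall_ratio, report), where report rows are
    (name, ratio, relative, label), sorted so the functions that
    deviate most from the overall scaling come first.
    """
    # Only functions seen in both runs, with nonzero time in run A.
    common = sorted(f for f in set(times_a) & set(times_b) if times_a[f] > 0)
    total_a = sum(times_a[f] for f in common)
    total_b = sum(times_b[f] for f in common)
    overall = total_b / total_a          # how the whole program scaled

    report = []
    for name in common:
        ratio = times_b[name] / times_a[name]
        relative = ratio / overall       # 1.0 means "scaled like the program"
        if relative > 1 + tolerance:
            label = "hotter"
        elif relative < 1 - tolerance:
            label = "cooler"
        else:
            label = "uniform"
        report.append((name, ratio, relative, label))

    report.sort(key=lambda row: abs(row[2] - 1), reverse=True)
    return overall, report


# Hypothetical per-function seconds on the two machines.
xeon = {"parse": 4.0, "solve": 10.0, "write": 2.0}
nehalem = {"parse": 2.0, "solve": 3.0, "write": 3.0}

overall, report = compare_hotspots(xeon, nehalem)
print(f"overall scaling: {overall:.2f}")
for name, ratio, relative, label in report:
    print(f"{name:6s} ratio={ratio:.2f} relative={relative:.2f} {label}")
```

The functions at the top of the report are the ones to drill into first, since they moved against the overall trend.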

A couple of Intel tools provide the means to directly compare runs. Both Intel Parallel Amplifier and PTU (available on whatif.intel.com for VTune™ analyzer license holders) offer tools to compare runs. PTU also comes with some predefined sample groups to collect events of significance, called configurations, which use selected events per architecture. Besides the basic collections, the one I'm looking at has six special configurations for Intel Core 2 processors and ten for the Intel Core i7 processor. These configurations are provided to look for specific types of stalls. For example, the Intel Core 2 processor configuration called "Bandwidth" collects BUS_DRDY_CLOCKS.THIS_AGENT, BUS_TRANS_BURST.SELF and the ubiquitous CPU_CLK_UNHALTED.CORE. This sounds pretty close to what you're looking for.

Thomas_W_Intel · 03-16-2009

Quoting - srimks

Thomas/Peter,

Nehalem (Intel Core i7) is one of Intel's latest processors, built on "45nm Hi-k" silicon technology. Referring to the links for Nehalem, it seems that the old tradition of having an FSB (Front Side Bus) in Intel processors has been replaced by QPI (QuickPath Interconnect).

This article discusses the LIMITS & RATIOS of events with respect to the FSB, so its analysis can't be applied to Nehalem. It does, however, give insight into the key EBS events to watch when using VTune to profile an application on a given micro-architecture.

The only thing in this article that applies to Nehalem is the theoretical CPI limit of ~0.25, as quoted by you (Thomas). The remaining contents of LIMITS & RATIOS can't be applied to Nehalem, because the article doesn't consider measurements made with respect to QPI.

I haven't seen any VTune articles (by David Levinthal or by anyone else from Intel) distinguishing VTune profiling analysis with respect to specific Intel multi-core processors (Intel Xeon, Core Duo, Core2 Duo, Core2 Quad, Core i7, Intel Pentium D and Pentium) to help Intel users better understand profiling with VTune. I think Intel should consider publishing articles on VTune that are specific to each multi-core processor's EBS events, so that its users can learn more effectively.

~BR

BR,

The latest version of VTune, Intel VTune Performance Analyzer 9.1 Update 2 for Linux, contains predefined ratios for Intel Core i7 processors. This is not quite yet what you are looking for, but we are heading in this direction.

Kind regards
Thomas

srimks · 03-16-2009

Thanks Robert/Thomas.

I will certainly explore Core i7 using Intel VTune v9.1 (Update 2) along the lines you suggested.

~BR
Mukkaysh Srivastav

srimks · 03-20-2009

Quoting - Peter Wang (Intel)

Hi,

Today I found an Intel Core 2 Quad machine and confirmed that the event CPU_CLK_UNHALTED.TOTAL_CYCLES exists on this system. (Actually, the events in Core 2 Quad are similar to those in Core 2 Duo.)

Are you using the latest product, v9.1 Update 1?

I think that 5300 is Core 2 Quad, T5300 is Core 2 Duo, and E5300 is Pentium (which has no CPU_CLK_UNHALTED.TOTAL_CYCLES).

You can use the vtl command to list the event names supported on your system: run "vtl query -c sampling" and check whether CPU_CLK_UNHALTED.TOTAL_CYCLES appears.
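If you save that listing to a file or capture it as text, a small script can do the check for you. The sketch below is an assumption-laden illustration: it treats the listing as plain text and looks for the event name as a whole token, without assuming anything about the real output layout of vtl. The sample listing and the helper name are made up.

```python
def event_supported(listing, event_name):
    """Return True if event_name appears as a whole token in the listing text."""
    wanted = event_name.strip().upper()
    for line in listing.splitlines():
        # Split on whitespace and commas so partial names do not match.
        for token in line.replace(",", " ").split():
            if token.upper() == wanted:
                return True
    return False


# Made-up stand-in for text captured from "vtl query -c sampling".
sample_listing = """\
CPU_CLK_UNHALTED.CORE
CPU_CLK_UNHALTED.TOTAL_CYCLES
INST_RETIRED.ANY
"""

print(event_supported(sample_listing, "CPU_CLK_UNHALTED.TOTAL_CYCLES"))
```

Matching whole tokens means a prefix such as CPU_CLK_UNHALTED alone is not reported as supported.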

Regards, Peter

Peter, Thanks.

I checked on Nehalem (Core i7): it has this event, and Core i7 also comes with many sampling events that were not present in older Intel processors. I also see some SIMD-related sampling events.

Thanks to the Intel development team who brought these sampling events to the Core i7 processor.

~BR