What are the concepts for designing ports within a processor to minimize CPI? Normally, for multi-core processors, a CPI below ~1.0 is targeted for better performance. For Xeon processors, it has been suggested that CPI should be targeted around ~0.75 - 0.50.
Would this range be, or has it been, targeted much lower for Nehalem (45nm) or the future Sandy Bridge (32nm) processors?
The above question is well aware of how to improve CPI; the asker knows the procedure from a programming point of view and also knows how to improve CPI with various events. Here the question is framed from an architecture or port-design point of view.
Note: CPI can get as low as 0.25 cycles per instruction with current Intel processors.
Minimizing CPI could be even more counter-productive than maximizing threaded performance scaling by choosing a method that maximizes serial execution time. Such a goal also rules out combining vectorization with threaded parallel execution.
In MPI applications, the low CPI which you favor is seen in MPI_Wait spin loops, where CPI is immaterial, unless you see an advantage in maximizing the number of instructions executed. We do favor spin waits for profiling convenience, but in release configurations sched_yield() (or some Windows equivalent) is invoked after a short elapsed time, so as to give up the CPU to other potential uses rather than hogging it to execute and discard the maximum number of instructions. It's easy to increase the instruction count by setting environment variables to lengthen the spin-wait time prior to sched_yield().
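The spin-then-yield pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular MPI's implementation; the spin budget and the fallback for platforms without os.sched_yield are assumptions.

```python
import os
import time
import threading

# Fall back to sleep(0) where os.sched_yield is unavailable (e.g. Windows).
yield_cpu = getattr(os, "sched_yield", lambda: time.sleep(0))

def spin_then_yield_wait(done: threading.Event, spin_budget_s: float = 0.001) -> None:
    """Busy-spin for a short budget (fast wakeup, artificially low CPI),
    then start yielding the CPU so other work can run."""
    deadline = time.monotonic() + spin_budget_s
    while not done.is_set():
        if time.monotonic() >= deadline:
            yield_cpu()  # give up the CPU instead of hogging it

done = threading.Event()
threading.Timer(0.01, done.set).start()  # stand-in for the awaited MPI message
spin_then_yield_wait(done)
```

During the initial spin the loop retires many discarded instructions (hence the low CPI seen in MPI_Wait); after the budget expires it yields instead.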
Certain vendors' MPI implementations even look at the general load on the system, preferring spin waits (a high instruction count, thus low CPI) when the system is dedicated to one task, but yielding the CPU sooner when other activity is detected. So you can't get low CPI when the system is busy with multiple tasks, nor would you have any reason to try, when the objective is throughput, not low CPI.
Yes, I totally agree with you. I have also seen that with (explicit) vectorization, the CPI of code executed on SMP systems increases, and it also has a negative impact on bus utilization. Articles on VTune from Intel (David Levinthal) recommend keeping CPI within ~0.75 - 0.5; is this claim empirical or a realistic benchmark? I'm not sure.
I did ask Intel how they came up with such empirical figures for CPI and the relations for other cycle events; somehow I don't see any mathematical relation, nor any analytical or concrete proofs from them.
But the query I had asked was: what are the architectural/port features that allow CPI to be designed to less than ~0.75 for multi-core processors, and what would be the targeted CPI range for Nehalem (45nm) and Sandy Bridge (32nm)?
Nehalem is even more dependent on vectorization for good performance. I wouldn't be surprised to see increases in CPI on Sandy Bridge, with wider vector instructions.
I doubt that the designers of the new chips set reduced CPI in general, rather than real performance gains, as a goal. Intel learned from the P4 experience that marketing-oriented goals like increasing CPU clock frequency and the number of instructions executed for a given job, without regard to useful performance and power consumption, were not the best way to go. I can be derogatory when that message hasn't reached software performance workers.
The bug which began to be dealt with in the Linux compiler 11.0/081, where vectorized loops with multiple assignments were often distributed (split) into a separate loop for each assignment, kept CPI artificially depressed. It could require 60% more than the optimum number of instructions to accomplish the job, so you would never find the performance problem if you looked only at CPI.
As quoted by you:
"For a multi-core system, the first consideration is parallelism (algorithm-level optimization).
Parallelism = CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.TOTAL_CYCLES * Number-of-Cores"
(a) In the EBS selection of events, I don't see "CPU_CLK_UNHALTED.TOTAL_CYCLES"; the events I do see are CPU_CLK_UNHALTED.CORE.samples, CPU_CLK_UNHALTED.CORE%, and CPU_CLK_UNHALTED.CORE.events only. Did I miss something?
(b) Which of the above three events with the CPU_CLK_UNHALTED.CORE.xxxxx suffix should I use for "CPU_CLK_UNHALTED.CORE"?
(c) My system uses quad-core Xeon 5300-series processors (each package is two dual-core dies); the machine has 8 cores in total. So I can check whether parallelism has been successful with the above formula, as you said.
Will the value obtained from that formula indicate that parallelism has been 100% effective on this machine?
Could you quote some examples?
I had a case where CPI went beyond 1.0 with vectorization, even after properly using compiler options. Targeting CPI ~0.5 - 1.0 is no doubt good; I had been doing that with proper tuning of the code and achieved it successfully in some cases.
Thanks for your inputs.
Effective use of parallel instructions (vectorization) should be undertaken before threaded parallelism. I don't think you meant that, as it's not in your formula.
As quoted: "CPI was listed as the 2nd most important consideration in the .pdf posted here last week."
Can I have the link to that PDF Tim is talking about?
I have an article published on VTune which talks about CPI and other optimization scenarios, "Using Intel VTune Performance Analyzer Events/ Ratios & Optimizing Applications": http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimi...
I think, being an Intel guy, you shouldn't answer the query by saying "I have no Intel Core 2 Quad processor on hand - sorry, I can't provide you corresponding event names." Rather, you should direct it to some other Intel person who can answer the query w.r.t. the Xeon 5300 series, and say "someone from Intel will be responding soon here" rather than making it negative.
Somehow, Intel people should take responding to queries on ISN seriously.
Probably better answers and quicker responses are given by non-Intel people in this ISN forum than by Intel people themselves, as I have observed during my last 4 months here. I really appreciate those people for their inputs and time.
I think section 2 of this article corresponds to what Peter was trying to point out: you need to ensure that your application is properly threaded (application level) before you start worrying about CPI (architecture level). VTune can assist you in verifying this, if you measure how many of the available clockticks you are actually using. Intel Thread Profiler is another tool that can help you at this stage.
CPI is merely a measure of how well the hardware is able to execute the instruction flow. Looking at the CPI may guide you to the portions of your code where you can take better advantage of the underlying CPU architecture. However, CPI doesn't tell how useful the executed instructions actually are. For example, a different algorithm might result in a far better running time -- and at the end of the day, this is what you care about, isn't it? Similarly, different instructions, like vector instructions, can improve your running time. If your CPI increases but your running time decreases by switching to vector instructions, who cares?
Having a high CPI just tells you that there is room for improvement at the architectural level. It doesn't tell you that there isn't any other way to improve the application.
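The vectorization point above can be made with simple arithmetic. The event counts below are hypothetical, chosen only for illustration: a vectorized version retires far fewer instructions, so CPI rises even while total cycles, i.e. running time, fall.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction: the ratio discussed in this thread."""
    return cycles / instructions

# Hypothetical event counts for the same job, scalar vs. vectorized:
scalar = {"cycles": 2_000_000_000, "instructions": 4_000_000_000}  # CPI 0.50
vector = {"cycles": 1_200_000_000, "instructions": 1_200_000_000}  # CPI 1.00

assert cpi(**scalar) < cpi(**vector)        # CPI looks "worse" ...
assert vector["cycles"] < scalar["cycles"]  # ... yet the job finishes sooner
```

This is exactly why running time, not CPI, should be the figure of merit.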
The value 0.75-0.5 is based on experience of what you can achieve in well-tuned CPU-bound applications. In other words, if you already have a CPI of 0.5 for a function, don't be frustrated if you cannot improve on that. On the other hand, if one of the hot functions in your application has a CPI of 10, and you have exhausted all other means of improvement at the system and application level, then you should look into it.
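One way to apply such a rule of thumb is as a triage filter over a hotspot profile, rather than as a goal in itself. The sketch below uses made-up function names and event counts purely for illustration:

```python
def high_cpi_hotspots(profile: dict, threshold: float = 0.75) -> dict:
    """Return hot functions whose CPI exceeds the rule-of-thumb threshold."""
    return {name: counts["cycles"] / counts["instructions"]
            for name, counts in profile.items()
            if counts["cycles"] / counts["instructions"] > threshold}

# Hypothetical per-function event counts from a sampling run:
profile = {
    "well_tuned_kernel": {"cycles": 500,  "instructions": 1000},  # CPI 0.5
    "pointer_chasing":   {"cycles": 9000, "instructions": 900},   # CPI 10.0
}
suspects = high_cpi_hotspots(profile)  # only "pointer_chasing" is flagged
```

Only functions that are both hot and above the threshold deserve architecture-level attention.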
I tried the command below as suggested but got the following message:
$ vtl query -c sampling
VTune Performance Analyzer 9.1 for Linux* build 152
Copyright (C) 2000-2008 Intel Corporation. All rights reserved.
Could not get NUM_PHYSICAL_CPUS value from environment XML file
The processor used by me is the "Intel Xeon CPU X5355 @ 2.66GHz", and the machine has 8 cores.
The only events I see with this machine in GUI mode are -
I performed all the EBS events from Advanced Performance Tuning, Basic Tuning, etc. with VTune v9.1.
Could you suggest the same Parallelism formula, as quoted by you, for the "Intel Xeon CPU X5355 @ 2.66GHz" machine:
"Parallelism = CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.TOTAL_CYCLES * Number-of-Cores"
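Reading the quoted formula as unhalted cycles summed over all cores, divided by the total cycles available (elapsed cycles times the core count), a sketch for an 8-core machine might look like this. The event counts are invented for illustration; this is my reading of the formula, not an official definition:

```python
def parallelism(unhalted_core_cycles: int, total_cycles: int, num_cores: int) -> float:
    """Fraction of available cycles actually used:
    CPU_CLK_UNHALTED.CORE summed over cores / (elapsed cycles * core count)."""
    return unhalted_core_cycles / (total_cycles * num_cores)

# 8 cores, 1e9 elapsed cycles each, 6e9 unhalted cycles summed over cores:
p = parallelism(6_000_000_000, 1_000_000_000, 8)  # 0.75, i.e. 75% utilization
```

A value near 1.0 would mean all cores were busy for the whole interval; 100% effective parallelism corresponds to that limit.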
You can also get an impression of the concurrency level of your application by looking at the "Sampling Over Time" view in VTune. It lets you see which threads are working over time. In order to use it on Linux, you need VTune 9.0 update 1 or later and must set the environment variable VTUNE_OVER_TIME.
There are certain pitfalls with this methodology, e.g. you might get the impression that all threads are working, when in fact they are waiting in busy locks. But even when you can see that a thread is waiting, it is usually hard to identify with sampling why the thread is waiting.
The advantage over Thread Profiler is that the complete system is monitored. This is important if several applications are involved. Furthermore, the overhead is lower, and you can restrict your measurement to a time interval instead of the complete run.
Thanks to all for their responses, but the query as asked in the beginning was -
"Would this range be, or has it been, targeted much lower for Nehalem (45nm) or the future Sandy Bridge (32nm) processors?" This was basically meant to ask: what are the experimentally determined RATIO & LIMITS numbers for commonly used EBS events on Nehalem (Core i7)?
Do the events RATIO & LIMITS as presented in http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimi... apply to Nehalem (Core i7) VTune analysis? Please confirm.
Are the numbers mentioned in the above link empirical, or do they carry justification from analysis done with different test cases of an executed application?
The range for CPI is still the same for Core i7 (Nehalem), and the theoretical limit is still 0.25. The recommendation is based on measurements with well-tuned CPU-bound applications.
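For context, the 0.25 floor follows from the pipeline's retirement width: Core 2 and Nehalem can retire up to 4 instructions per cycle, so the best achievable CPI is the reciprocal of that width.

```python
RETIRE_WIDTH = 4        # instructions retired per cycle on Core 2 / Nehalem
min_cpi = 1 / RETIRE_WIDTH  # 0.25: the theoretical CPI floor quoted above
```

Any measured CPI below this value would indicate a measurement or event-count error, not faster hardware.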
Other ratios in this text no longer apply to Core i7; e.g. the FSB is replaced by QPI, with completely different events in the uncore.
Thanks for the correction w.r.t. this article on LIMITS & RATIOS for Core i7. Could you elaborate on "the FSB is replaced by QPI with completely different events in the uncore"? What is the uncore here?
True, the FSB has been replaced by QPI, so the LIMITS & RATIOS numbers will be modified. In this article "45nm Hi-k" is mentioned, which also refers to Nehalem; please correct me if I'm wrong.
Nehalem (Intel Core i7) is one of Intel's latest processors, built on "45nm Hi-k" silicon technology. Referring to the links about Nehalem, it seems that the old tradition of having an FSB (Front Side Bus) in Intel processors has been removed by incorporating QPI (QuickPath Interconnect).
This article discusses the LIMITS & RATIOS of events w.r.t. the FSB, so this analysis can't be applied to Nehalem, but the article certainly gives insight into the key EBS events to watch while using VTune to profile an application on a given micro-architecture.
The only thing in this article that can be applied to Nehalem is the theoretical CPI limit of ~0.25, as quoted by you (Thomas); the remaining LIMITS & RATIOS content can't be applied to Nehalem, because this article doesn't consider measurements done w.r.t. QPI.
I haven't seen any VTune articles (by David Levinthal or anyone else from Intel) distinguishing the VTune profiling analysis for specific Intel multi-core processors (Intel Xeon, Core Duo, Core 2 Duo, Core 2 Quad, Core i7, Intel Pentium D and Pentium) for a better understanding of profiling with VTune. I think Intel should consider publishing articles on VTune specific to each multi-core processor's EBS events, for better learning for its users.