Re: ATTN: Lexi, Re: Why are multipliers low-locked?

elesueur · ‎06-12-2008

This may seem a silly question on the surface, but I think it's still valid...

I have a Xeon X3353, which has a maximum core clock of 2.667GHz.

It has 3 P-states: P2 (2.00GHz), P1 (2.33GHz) and P0 (2.66GHz), which correspond to multipliers of 6, 7 and 8 respectively, and VIDs of 1A, 1D and 20.

Why is the lowest multiplier 6? Why can't I go to 1.667GHz or any lower?

It seems pointless to provide frequency scaling when the range is so minimal. If I want to save real energy, I need to be able to drop it down to 333MHz when running horrible memory bound applications - why is EIST so limiting in that respect?

I can understand why the upper multiplier would be locked - so people can't overclock and get performance for free, but surely lowering it doesn't have any consequences for anybody...

levicki · ‎06-24-2008

I am not an Intel engineer but I will try to explain:

In order to save power it is not enough to decrease the frequency, you need to decrease the voltage as well. There is a certain voltage point (let's say 0.825V) below which the CPU won't be able to operate at all, much less reliably. Given that today's CPUs already have low operating voltages to begin with, there is only so much (or so little) saving headroom available.

elesueur · ‎06-24-2008

I'm aware of the dependence on voltage to run stably, but remember that frequency is an important (although not as much as voltage) part of the power equation,

p = CfV^2

Even if the voltage has to remain at a relatively higher VID, the frequency could still be reduced to further increase energy savings.
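To put rough numbers on the equation above, here is a minimal sketch comparing a frequency-only drop with a drop in both frequency and voltage. The voltages are illustrative assumptions, not measured values for the X3353, and static (leakage) power is ignored.

```python
def dynamic_power(c, f_hz, v):
    """Dynamic switching power, P = C * f * V^2."""
    return c * f_hz * v ** 2


def relative_power(f_hz, v, f_ref_hz, v_ref):
    """Power at (f, v) as a fraction of power at the reference point.

    The capacitance C cancels out, so 1.0 is used for it.
    """
    return dynamic_power(1.0, f_hz, v) / dynamic_power(1.0, f_ref_hz, v_ref)


# Hypothetical operating points: the voltages are made up for illustration.
F_REF, V_REF = 2.667e9, 1.20

freq_only = relative_power(2.0e9, 1.20, F_REF, V_REF)      # same voltage
freq_and_volt = relative_power(2.0e9, 1.05, F_REF, V_REF)  # lower voltage too

print(f"frequency only:        {freq_only:.2f} of reference power")
print(f"frequency and voltage: {freq_and_volt:.2f} of reference power")
```

Under these assumptions the frequency-only drop still saves power in proportion to f, just less than a combined drop would.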

The FID/VID values should be able to be manipulated independently (to a degree), i.e. I would deem it reasonable for Intel to limit the minimum VID for a particular FID, but changing FID alone should not cause any instability.

As it stands, my Xeon X3353 uses about 80W of power at idle. It is probably sitting in C3 at that point, which is the lowest power consumption it can possibly get to.

I'm not pretending to know the reason why the multiplier is low locked at 6, I'm sure there is probably a very good reason - I'd just like to know what it is!

Although my Energy/Frequency graphs would look a lot more interesting if I could see the rest of the curve :-)

levicki · ‎06-24-2008

elesueur,

As you said yourself, voltage is the more important part of the power equation. Try plotting a graph to visualize it, and you will see that changing the voltage is more beneficial than changing the frequency.

Frequency changing may be limited for several reasons and I am just speculating here because I do not know for sure.

My first guess would be diminishing returns — i.e. you only lose computing power but the consumption stays more or less the same.

Second guess would be even higher power consumption if the transistors are working outside of the frequency range they were tuned for.

Third guess would be the effects of wire length and layout (parasitic capacitance, etc) which could cause instability if CPU is clocked outside of a certain optimal frequency range.

Finally, let's not forget that CPU caches work at the core speed. Running them at the FSB speed, which is what you are suggesting with a 1x multiplier, would kind of negate any advantage of having a cache in the first place.

None of what I am saying may be accurate or even true, but the point is that modern CPUs were simply not designed to run at low frequencies efficiently.

If you would like to save power (or more likely reduce heat/noise) I would suggest you try to figure out why your CPU isn't entering a lower power state. Kill any unnecessary background tasks which may eat CPU time, check for spyware, rogue drivers, etc. You could also consider an upgrade to more power efficient components such as 45nm CPUs (Core 2 Duo E8200 or Core 2 Quad Q9450 are my favorites), 65nm chipsets, mainboards with powersaving features such as Gigabyte Dynamic Energy Saver, DDR3 memory, etc.

elesueur · ‎06-24-2008

Yea, I agree with you on all counts...

The Xeon X3353 is a 45nm quad core (well, 2 Core 2 chips stitched together).

It is going into the various sleep states, but what I'm trying to do is generate a model which chooses core frequencies based on performance counter values... this first requires a system characterisation where the benchmarks are run at the various frequencies (pstates) to see the dependence of performance on the core frequency. If a workload is highly memory bound (lots of cache misses forcing accesses from main memory which is still at the same frequency), changing core frequency has little effect on performance, but there is potential to make real energy savings.
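The characterisation described above can be sketched with a simple two-part runtime model, assuming runtime splits into a part that scales with core frequency and a memory-stall part that does not. The function names and workload numbers are hypothetical and only meant to show the shape of the dependence.

```python
P_STATES_HZ = [2.0e9, 2.333e9, 2.667e9]  # the three P-states from this thread
F_MAX_HZ = max(P_STATES_HZ)


def runtime_s(core_cycles, mem_stall_s, f_hz):
    """Runtime = core-bound cycles / f + memory stall time (independent of f)."""
    return core_cycles / f_hz + mem_stall_s


def slowdown(core_cycles, mem_stall_s, f_hz):
    """Runtime at f_hz relative to runtime at the highest P-state."""
    return (runtime_s(core_cycles, mem_stall_s, f_hz)
            / runtime_s(core_cycles, mem_stall_s, F_MAX_HZ))


# Hypothetical workloads: both take about 1 s at the highest frequency.
cpu_bound = dict(core_cycles=2.667e9, mem_stall_s=0.0)
mem_bound = dict(core_cycles=0.2667e9, mem_stall_s=0.9)

for name, w in (("cpu-bound", cpu_bound), ("memory-bound", mem_bound)):
    for f in P_STATES_HZ:
        print(f"{name:12s} {f / 1e9:.3f} GHz  slowdown x{slowdown(f_hz=f, **w):.3f}")
```

In this toy model the memory-bound workload barely slows down at 2.0GHz, which is the effect the benchmarks are meant to measure on real hardware.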

I'm not suggesting that running the core frequency at FSB speed would be a good idea, I'm merely asking if there is a technical reason behind the lowest multiplier being 6.

levicki · ‎06-25-2008

I wasn't aware that Xeon 3353 is a 45nm part. It's getting harder to keep track of part numbers by the day.

Anyway, I suggested a few technical reasons why a multiplier lower than 6 wouldn't work. I also forgot to mention mainboard and BIOS validation issues.

As for the problem at hand, if I were you I would focus on improving data locality and thus code efficiency instead of trying to reduce power consumption while running inefficient code. I believe that the former is a task for a software engineer, while the latter is a task for a hardware engineer working on a CPU design team.

elesueur · ‎06-25-2008

That is not the aim of our research...

The aim is to provide a mechanism by which to choose the most energy optimal frequency at which to run a workload, based on power and performance predictions using performance counter values at timeslice granularity.

Yes, improving data locality would decrease cache misses and memory references, but that's not what we're trying to do. We're looking at existing workloads, and trying to base optimal frequency predictions around general system characteristics, such as cache misses, which give an indication of how memory bound a certain workload is on a specific platform. We can then choose a frequency which yields the optimal power/performance and run at that frequency for the length of a timeslice. We then look at the performance counter values which were recorded for that timeslice, and pick again based on model predictions.
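A minimal sketch of the per-timeslice choice described above, under loud assumptions: the power figure for each P-state, the use of cache misses as the only predictor, and the fixed miss penalty are all hypothetical placeholders, not the actual model from this research.

```python
# Hypothetical package power (W) at each core frequency (Hz); made-up numbers.
POWER_W = {2.0e9: 70.0, 2.333e9: 75.0, 2.667e9: 82.0}


def predict_time_s(instructions, misses, f_hz, cpi_core=1.0, miss_penalty_s=100e-9):
    """Predicted time to retire the same work at f_hz.

    The core part scales with frequency; the cache-miss part does not,
    because main memory stays at the same speed.
    """
    return instructions * cpi_core / f_hz + misses * miss_penalty_s


def pick_frequency(instructions, misses):
    """Pick the P-state with the lowest predicted energy (power * time)."""
    return min(POWER_W,
               key=lambda f: POWER_W[f] * predict_time_s(instructions, misses, f))


# Counter values recorded over the previous timeslice (hypothetical).
print(pick_frequency(instructions=1e9, misses=0))    # compute-bound slice
print(pick_frequency(instructions=1e9, misses=2e7))  # miss-heavy slice
```

With these invented numbers the compute-bound slice is sent to the highest frequency and the miss-heavy slice to the lowest. A real policy would replace both the power table and the time predictor with values fitted from the characterisation runs.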

The hardware engineers provide the mechanism for software engineers to base policy on. The limits of the hardware (i.e. number of power states and hence frequency/voltage range) determine how well the policy can be implemented in software, and the extent to which software can take advantage of the hardware's abilities to save power.

I value your suggestions as to why there are limits imposed on the multiplier, I would like to hear what an Intel engineer has to say though...

levicki · ‎06-26-2008

I hope you won't get offended but to me that research looks kind of pointless.

You have to work with what you have. Asking why the lowest multiplier limit is 6 won't improve your situation in any way, because for the current CPU generation that limit won't change.

The next CPU generation (Nehalem) won't have an FSB. It will have a different cache hierarchy, an integrated memory controller, improved power and performance efficiency, and most likely more power states.

In other words, the research you do now is short-lived: it will become obsolete in a few months, while code improvements leading to greater efficiency would last for generations to come.

Finally, if EIST and C1E are activated, the CPU is already changing power states on its own based on workload and it is doing that pretty well. Installing additional software and drivers to control power states and estimate load based on counters which may be in use (by VTune or some other application) just adds to the workload and complexity; it is counter-intuitive to say the least.

I admit that the idea is interesting, but I still believe that your effort should be directed elsewhere. I'll leave you to the Intel engineers now.

Lexi, could you please pass the question about why multipliers are low-locked at 6 to the Intel CPU engineering team and post any response you get here?

Intel_Software_Netw1 · ‎06-26-2008

Already done - either our contacts will post here directly or I'll relay responses.
-Lexi

elesueur · ‎06-26-2008

IgorLevicki:
Finally, if EIST and C1E are activated, the CPU is already changing power states on its own based on workload and it is doing that pretty well.

The last time I checked, most CPUs did not automatically change their core frequency and voltage based on workload. It is up to the OS to do this, and all we are doing is implementing a model to make it possible to base a frequency choice on something more complex than just 'overall cpu utilisation' like the ondemand governor of cpufreq currently does.

The Freescale i.MX31 has hardware controlled DVFS, but not any x86 CPU that I know of...

The work we are doing is more interesting on battery powered devices where race-to-halt may not actually be the most energy optimal policy. In fact some benchmarks we've done on the Xeon have shown that the most energy optimal frequency is not the highest. It all depends on the memory subsystem and the workload characteristics.

IgorLevicki:

I admit that the idea is interesting, but I still believe that your effort should be directed elsewhere. I'll leave you to the Intel engineers now.

I think maybe you're misunderstanding the idea... If you're still interested, I can try to explain more thoroughly what it is we're doing... My only intention in asking the question here is to gain an understanding as to why there were limits imposed on EIST, and possibly some insight into the mechanisms that are activated when I write a value to the IA32_PERF_CTL register.
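For readers following along, this is roughly what a P-state request looks like from software. It is a sketch under assumptions: the layout used here (FID in bits 15:8, VID in bits 7:0 of the low word) is the one commonly described for Core 2 era parts and should be checked against the documentation for the specific processor. The helper names are hypothetical, and reading the register through the Linux `msr` driver needs root and the `msr` kernel module loaded.

```python
import os
import struct

IA32_PERF_STATUS = 0x198  # current P-state (read-only)
IA32_PERF_CTL = 0x199     # requested P-state


def pack_perf_ctl(fid, vid):
    """Build the low 16 bits of a P-state request (assumed FID/VID layout)."""
    if not (0 <= fid <= 0xFF and 0 <= vid <= 0xFF):
        raise ValueError("fid and vid must each fit in 8 bits")
    return (fid << 8) | vid


def unpack_perf_ctl(value):
    """Split a register value into (fid, vid) under the same assumed layout."""
    return (value >> 8) & 0xFF, value & 0xFF


def read_msr(cpu, msr):
    """Read a 64-bit MSR via /dev/cpu/<cpu>/msr (Linux, root only)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


# The lowest P-state quoted in this thread: multiplier 6, VID 0x1A.
print(hex(pack_perf_ctl(6, 0x1A)))
```

The sketch only reads and encodes values; it deliberately does not write to IA32_PERF_CTL, since what happens inside the processor on such a write is exactly the open question here.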

levicki · ‎06-27-2008

You are right, EIST transitions are not automatic. I came to the wrong conclusion by assuming that since EIST is enabled/disabled in BIOS (by setting proper MSR bit) it must be fully automatic.

I also wrongly assumed based on the information you (have not) provided that your project is for Windows where changing kernel behavior would be next to impossible — you should have been more clear about your project goals. Anyway, now I understand what you want to accomplish.

You will be happy to hear that the soon-to-be-released E-0 stepping of the E8500, E8400 and Xeon E3110 will support ACNT2, an improved mechanism for determining processor utilization which is meant to be used for more efficient P-state determination.

Frank_W_Intel · ‎06-27-2008

The data bus is quad-pumped -- that imposes an inherent limit of 4:1 on the machine. If the data were to enter the machine faster than the machine could process it, it would be a highly unbalanced machine.

There are also some built-in portions of the bus spec that require single cycle turn-around, such as TRDY# to DBSY#. These require certain amounts of internal clocking to support, which essentially sets the minimum ratio at 6:1.
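The ratios above can be worked through with a little arithmetic. This sketch assumes a base bus clock of about 333MHz, inferred from the 2.667GHz maximum clock at a multiplier of 8 quoted earlier in the thread; the function names are hypothetical.

```python
FSB_HZ = 2.667e9 / 8  # base bus clock inferred from the top P-state, ~333 MHz
DATA_PUMP = 4         # quad-pumped data bus: 4 transfers per bus clock


def core_clock_hz(multiplier, fsb_hz=FSB_HZ):
    """Core clock for a given core:bus ratio."""
    return multiplier * fsb_hz


def core_to_data_ratio(multiplier):
    """Core clocks available per data transfer on the quad-pumped bus."""
    return multiplier / DATA_PUMP


for m in (4, 6, 7, 8):
    print(f"{m}:1  core {core_clock_hz(m) / 1e9:.3f} GHz, "
          f"{core_to_data_ratio(m):.2f} core clocks per data transfer")
```

At 4:1 the core would have exactly one clock per data transfer, which is the inherent limit described above; the bus protocol's turn-around requirements then push the practical minimum up to 6:1.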

elesueur · ‎06-29-2008

MADfwildgru:
The data bus is quad-pumped -- that imposes an inherent limit of 4:1 on the machine. If the data were to enter the machine faster than the machine could process it, it would be a highly unbalanced machine.

There are also some built-in portions of the bus spec that require single cycle turn-around, such as TRDY# to DBSY#. These require certain amounts of internal clocking to support, which essentially sets the minimum ratio at 6:1.

Thanks for that - exactly what I wanted.

Igor Levicki:

You will be happy to hear that the soon-to-be-released E-0 stepping of the E8500, E8400 and Xeon E3110 will support ACNT2, an improved mechanism for determining processor utilization which is meant to be used for more efficient P-state determination.

Yea, that looks very interesting! There isn't much info about it though...