
Power Readings on Intel Sapphire Rapids

techsun
Beginner

I am working on a two-socket machine with Intel(R) Xeon(R) Gold 5418Y (Sapphire Rapids) processors. The TDP of each socket is 185 watts.

I am using the 64-bit MSR_PKG_POWER_LIMIT MSR (0x610) to set a power cap of 140 W (about 75% of TDP) at the socket level.

These are the values I set in each bit field:

Pkg Power Limit #1 ---> bits 14:0 = 140 W

Enable limit #1 ---> bit 15 = 1

Pkg clamping limit #1 ---> bit 16 = 1

Time window Power Limit #1 ---> bits 23:17 = 0 (976 microseconds)

Reserved ---> bits 31:24 = no change

Pkg Power Limit #2 ---> bits 46:32 = 140 W

Enable limit #2 ---> bit 47 = 1

Pkg clamping limit #2 ---> bit 48 = 1

Time window Power Limit #2 ---> bits 55:49 = 0 (976 microseconds)

Reserved ---> bits 62:56 = no change

Lock bit ---> bit 63 = 0
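
For reference, the field layout listed here can be packed into a 64-bit value roughly as in the following Python sketch. It assumes a power unit of 1/8 W (bits 3:0 of MSR_RAPL_POWER_UNIT, 0x606, equal to 3); that value should be read from the machine rather than taken for granted. The function name and parameters are hypothetical, and reserved bits are simply left at zero here, whereas real code would preserve them with a read-modify-write.

```python
def encode_pkg_power_limit(watts, power_unit_bits=3, time_window_field=0,
                           enable=True, clamp=True, lock=False):
    """Pack the same settings into limit #1 and limit #2 of a 64-bit value.

    power_unit_bits is ASSUMED to be 3 (1/8 W per count); read it from
    MSR_RAPL_POWER_UNIT bits 3:0 on the real machine.
    """
    raw_power = int(round(watts * (1 << power_unit_bits)))
    if not 0 <= raw_power < (1 << 15):
        raise ValueError("power limit does not fit in 15 bits")
    if not 0 <= time_window_field < (1 << 7):
        raise ValueError("time window field does not fit in 7 bits")

    half = (raw_power                    # bits 14:0  power limit
            | (int(enable) << 15)        # bit 15     enable
            | (int(clamp) << 16)         # bit 16     clamping
            | (time_window_field << 17)) # bits 23:17 time window

    value = half | (half << 32)          # same settings for limit #2
    if lock:
        value |= 1 << 63                 # bit 63     lock
    return value
```

Under those assumptions, encode_pkg_power_limit(140) gives 0x0001846000018460, since 140 W at 1/8 W per count is 1120 (0x460).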

I run an application on one socket that uses all cores of that socket. At the same time, I read MSR_PKG_ENERGY_STATUS (0x611) every 100 ms and calculate power by dividing the energy consumed in each interval by 100 ms. Sometimes I observe power readings above the limit I set. Ideally, all power readings should be below 140 W. I never experienced this kind of issue on previous architectures. If I am doing something wrong, please suggest the correct way to read power.
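
A minimal Python sketch of the power calculation described here, for anyone reproducing it. The energy counter in MSR_PKG_ENERGY_STATUS is 32 bits wide and wraps around, so the difference is taken modulo 2^32. The energy unit exponent comes from bits 12:8 of MSR_RAPL_POWER_UNIT (0x606) and must be read on the target machine; the function names and the value 14 used in the comment are only illustrative assumptions.

```python
def energy_delta_joules(prev_raw, cur_raw, energy_unit_bits):
    """Energy between two raw 32-bit counter readings, handling wraparound.

    energy_unit_bits is the exponent from MSR_RAPL_POWER_UNIT bits 12:8
    (one count = 1 / 2**energy_unit_bits joules); e.g. 14 is an ASSUMED value.
    """
    delta_counts = (cur_raw - prev_raw) & 0xFFFFFFFF
    return delta_counts / (1 << energy_unit_bits)


def average_power_watts(prev_raw, cur_raw, elapsed_s, energy_unit_bits):
    """Average power over the interval between the two readings."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return energy_delta_joules(prev_raw, cur_raw, energy_unit_bits) / elapsed_s
```

Note that the result is an average over the sampling interval, so it says nothing about shorter excursions inside that interval.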

 

techsun
Beginner

Hi @McCalpinJohn,

It would be a great help if you could share any suggestions on this.

 

Thanks

McCalpinJohn
Honored Contributor III

I have not attempted to use those power limit controls on Sapphire Rapids.  Note that you have set both power limit windows to ~1 second, but are reading every 0.1 seconds.   The windowed limit might be working fine, with excursions that get averaged out over the 1 second window.  You might try looking at the 1 second averages?

techsun
Beginner

Sorry for the late reply, John.

I haven't set the time window to ~1 sec. It is 976 microseconds.

I calculated the time window using the formula given in the Intel documentation:

'''''''''''''

Time Window for Power Limit #1 (bits 23:17):   Indicates the time window for power limit #1

Time limit = 2^Y * (1.0 + Z/4.0) * Time_Unit

Here, “Y” is an unsigned integer represented by bits 21:17, and “Z” is an unsigned integer represented by bits 23:22.

Here, the Time Unit is 976 microseconds. 

'''''''''''''''

So I set bits 23:17 to 0, which makes the time window equal to the time unit (976 microseconds).
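
The quoted formula can be checked with a short Python sketch. It assumes the time unit is 2^-10 s (about 976 microseconds), which corresponds to a value of 10 in bits 19:16 of MSR_RAPL_POWER_UNIT (0x606); the function name is hypothetical.

```python
def decode_time_window(field, time_unit_bits=10):
    """Time window in seconds for the 7-bit field in bits 23:17.

    Implements: Time limit = 2^Y * (1.0 + Z/4.0) * Time_Unit
    Y is the low 5 bits of the field (register bits 21:17),
    Z is the high 2 bits of the field (register bits 23:22).
    time_unit_bits=10 is an ASSUMPTION (Time_Unit = 1/1024 s).
    """
    y = field & 0x1F
    z = (field >> 5) & 0x3
    time_unit = 1.0 / (1 << time_unit_bits)
    return (2 ** y) * (1.0 + z / 4.0) * time_unit
```

With an all-zero field this gives 1/1024 s, about 976.6 microseconds, which matches the value stated here.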

 

McCalpinJohn
Honored Contributor III

Sorry -- I confused 976 microseconds with 976 milliseconds.   

My understanding of the RAPL power-limiting mechanism is that it *reacts* to history rather than *predicts* the future.  If your processor (like most Intel processors) only updates the RAPL energy counters at intervals of 1 millisecond, I would expect any single sample to be able to exceed the limit, after which the frequency would be lowered.   This does not explain why 100-millisecond intervals would show power exceeding the specified limit, so that confuses me too....

Remember that RAPL stands for "Running Average Power Limit", so the limit applies to a running average. If you set the averaging interval to equal a single sample, you will not actually be averaging anything. Setting both limit #1 and limit #2 to use the same single-sample "averaging interval" might provide weaker control on the power than you would expect. Since you are measuring power at 100-millisecond intervals, you might try programming limit #1 to 140 Watts with a longer window (10ms to 100ms) and see if this changes the observed results.
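
To program a longer window such as the 10 ms to 100 ms range suggested here, the 7-bit field has to be chosen so that 2^Y * (1 + Z/4) * Time_Unit lands close to the target. A small Python sketch of that search follows. It assumes the 976 microsecond time unit (2^-10 s) quoted earlier in the thread, the function name is hypothetical, and it does not check any limits the hardware may place on the usable range.

```python
def closest_time_window_field(target_s, time_unit_bits=10):
    """Return (field, window_s): the 7-bit encoding closest to target_s.

    field = (Z << 5) | Y, to be placed in bits 23:17 (or 55:49).
    time_unit_bits=10 is an ASSUMPTION (Time_Unit = 1/1024 s).
    """
    time_unit = 1.0 / (1 << time_unit_bits)
    best = None
    for y in range(32):          # Y occupies 5 bits
        for z in range(4):       # Z occupies 2 bits
            window = (2 ** y) * (1.0 + z / 4.0) * time_unit
            error = abs(window - target_s)
            if best is None or error < best[0]:
                best = (error, (z << 5) | y, window)
    return best[1], best[2]
```

Under these assumptions a 10 ms target maps to Y=3, Z=1 (about 9.77 ms) and a 100 ms target maps to Y=6, Z=2 (93.75 ms), because the encoding only offers a coarse grid of window lengths.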

Caveats -- I have only done a small number of studies with this feature on SKX and CLX processors, but I did find that the requested limit was obeyed accurately over long time scales (seconds or longer).  If I recall correctly I never tested with limit #1 and limit #2 set to the same values, and I never studied the effect of modifying the "clamping limit" bits.

techsun
Beginner

Thanks, John.

I tried your solution and adjusted the time window to different values (ranging from 1.2ms to 5ms). The results changed, and all power readings were below the power limit. However, the values are significantly lower than the power limit (by 4% - 17%), which is unexpected. I'm not sure why this is happening.

 

connorimes
Novice

I would measure the actual time elapsed between reads and use that as your divisor when computing power, rather than assuming your polling interval.  Since you said your application is using all cores on the socket, there may be competition for a core and so your read intervals might be non-trivially longer than the sleep time in your power polling thread/application.  This would result in measuring energy consumed over a longer time interval than you expected (and thus a higher power estimate than in reality, if you assume the time interval rather than measuring it).
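
A minimal Python sketch of this idea, dividing by the measured interval instead of the nominal sleep time. The energy reader is passed in as a function because the way the counter is read (MSR driver, sysfs, a library) is outside the scope of this sketch; all names are hypothetical, and the reader is assumed to return cumulative joules with wraparound already handled.

```python
import time


def sample_power(read_energy_joules, interval_s, samples,
                 clock=time.monotonic, sleep=time.sleep):
    """Return a list of average power values (watts), one per interval.

    The divisor is the time actually elapsed between reads, so a delayed
    wake-up does not inflate the power estimate.
    """
    readings = []
    prev_t, prev_e = clock(), read_energy_joules()
    for _ in range(samples):
        sleep(interval_s)
        now_t, now_e = clock(), read_energy_joules()
        readings.append((now_e - prev_e) / (now_t - prev_t))
        prev_t, prev_e = now_t, now_e
    return readings
```

For example, if a nominal 100 ms sleep actually takes 150 ms and 21 J are consumed, dividing by the measured 0.15 s gives 140 W, while dividing by the assumed 0.1 s would report 210 W.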

techsun
Beginner

Hi Imes, 

Thanks for the reply.

Actually, I am already using the time-stamp counter (the IA32_TIME_STAMP_COUNTER MSR) to measure the elapsed time between reads, so I don't think this is the issue.
