Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Effect of DVFS on memory load latency information using PEBS on Haswell

sbhal1
New Contributor I
833 Views

Hi,

Does anyone know whether the memory load latency information collected with Precision Event Based Sampling (PEBS) on Haswell is affected when DVFS or any other power control is in use? More specifically, which clock does PEBS use to measure latency: the TSC frequency, the frequency given by the ratio of APERF to MPERF, or something else?

As I understand it, the TSC frequency is not affected by DVFS or any other power control, while the ratio of APERF to MPERF is.

Thanks,

Sridutt

3 Replies
Travis_D_
New Contributor II

Which specific PMC events are you referring to when you say "memory load latency information"?

About APERF and MPERF, at least MPERF should not be affected. It ticks at TSC frequency on most chips. The exceptions are Skylake and later (where it ticks at "nominal CPU frequency", which is subtly different from TSC frequency on those chips) and one of the Phi-type chips, where it counts only every 1,000 clocks or so.

APERF, of course, counts at the actual CPU frequency.
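To make the APERF/MPERF relationship concrete, here is a minimal sketch of how the average core frequency over an interval could be estimated from two snapshots of each counter. The function name and parameters are hypothetical, and `mperf_rate_hz` is an assumption the caller must supply (the TSC frequency on most chips, per the note above). Both counters only advance while the core is not halted, so this is an average over active time, not wall-clock time.

```python
MASK64 = (1 << 64) - 1  # IA32_APERF and IA32_MPERF are 64-bit counters


def effective_frequency_hz(aperf_start, aperf_end,
                           mperf_start, mperf_end,
                           mperf_rate_hz):
    """Estimate the average core frequency over a sampling interval.

    mperf_rate_hz is the rate at which MPERF ticks (an assumption
    supplied by the caller, e.g. the TSC frequency on most chips).
    """
    # Masking keeps the deltas correct if a counter wrapped around.
    d_aperf = (aperf_end - aperf_start) & MASK64
    d_mperf = (mperf_end - mperf_start) & MASK64
    if d_mperf == 0:
        raise ValueError("MPERF did not advance over the interval")
    # APERF counts at the actual frequency, MPERF at a fixed one,
    # so their ratio scales the fixed rate to the actual rate.
    return mperf_rate_hz * d_aperf / d_mperf
```

For example, if APERF advanced half as far as MPERF on a part where MPERF ticks at 3.0 GHz, the core averaged about 1.5 GHz while it was running.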

sbhal1
New Contributor I


By the ratio of APERF to MPERF, I meant that APERF changes, which in turn changes the ratio.

By "memory load latency information", I mean Section 18.8.1.2, "Load Latency Performance Monitoring Facility", in the Intel Software Developer's Manual, Volume 3B, page 18-40 (September 2016). It characterizes the average load latency to different levels of the cache/memory hierarchy.

Suppose the number of accesses that took, say, 'x' cycles to complete increases or decreases when the frequency is changed using DVFS or Dynamic Duty Cycle Modulation (T-states). I want to know whether that is because of the change in frequency itself (the cycle length increases) or because the cache/memory access behavior (hit rate, latency, etc.) improved or worsened.

McCalpinJohn
Honored Contributor III

Section 18.8.1.2 says that the Load Latency Performance Monitoring Facility counts in core cycles.  To convert to seconds, you will need to know what the frequency was at the time that the load occurred.  This may be inconvenient.
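As a minimal sketch of that conversion (the function name is hypothetical, and the core frequency is assumed to be known from some other source, which is the hard part):

```python
def cycles_to_ns(latency_cycles, core_freq_hz):
    """Convert a latency reported in core cycles to nanoseconds.

    core_freq_hz must be the core frequency in effect when the load
    executed; the cycle count alone does not carry that information.
    """
    if core_freq_hz <= 0:
        raise ValueError("core frequency must be positive")
    return latency_cycles / core_freq_hz * 1e9


# The same reported cycle count means different wall-clock times
# at different core frequencies.
for freq_hz in (3.0e9, 1.5e9):
    print(f"200 cycles at {freq_hz / 1e9:.1f} GHz = "
          f"{cycles_to_ns(200, freq_hz):.1f} ns")
```

The point of the example is that two samples with identical cycle counts are only comparable in time if they were taken at the same core frequency.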

On the plus side, it is extremely unlikely that the processor could allow the frequency to change *during* the execution of a load.  The first paragraph of Section 6.6 of Volume 3 of the Intel Architecture SW Developer's Manual notes:

"All interrupts are guaranteed to be taken on an instruction boundary."

If the core stall required for a frequency change is implemented by the same mechanism that is used for interrupts, then this is enough to ensure that the frequency cannot change while the load is executing -- the interrupt associated with the frequency change must occur either before or after the load.

Some caveats, of course....

  • I don't know enough about Intel's frequency-change implementation to know if it is guaranteed to behave like an interrupt.
  • It does not look like the PEBS record records the current processor p-state as part of the atomic information gathered. 
    • This information can easily be read in the PEBS interrupt handler (from the IA32_PERF_STATUS MSR), but one can imagine horrible cases in which the frequency-change "interrupt" occurs in the middle of the PEBS interrupt handler, causing it to read the new performance ratio instead of the ratio that was in effect when the load actually executed.
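For illustration only, here is a user-space sketch of reading and decoding IA32_PERF_STATUS. It is not the in-kernel PEBS handler described above and has the same race with frequency changes. Assumptions: Linux with the `msr` kernel module loaded and root privileges for `read_msr`; the current core ratio sits in bits 15:8 of the MSR, as on Sandy Bridge and later client/server parts; and the reference clock is 100 MHz. The function names are hypothetical.

```python
import os
import struct

IA32_PERF_STATUS = 0x198   # MSR address from the Intel SDM
BUS_CLOCK_HZ = 100e6       # assumption: 100 MHz reference clock


def read_msr(cpu, msr):
    """Read a 64-bit MSR via the Linux msr driver (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The file offset selects the MSR address.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


def core_ratio(perf_status_raw):
    """Extract the current core ratio (assumed to be bits 15:8)."""
    return (perf_status_raw >> 8) & 0xFF


def ratio_to_hz(ratio, bus_clock_hz=BUS_CLOCK_HZ):
    """Convert a core ratio to a frequency in Hz."""
    return ratio * bus_clock_hz
```

On a suitable system, `ratio_to_hz(core_ratio(read_msr(0, IA32_PERF_STATUS)))` would give the frequency of CPU 0 at the moment of the read, which may or may not be the frequency that was in effect when a sampled load executed.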

Some more notes:

  • Intel processors typically run the Core, L1 Cache, and L2 Cache at the same frequency.  Loads that hit in the L1 or L2 take the same number of cycles independent of the core frequency.
  • There is usually a frequency transition between the L2 and the L3.  If the L3 is running at a fixed frequency, then one would expect the reported cycles of latency for an L3 hit to *decrease* as the core frequency is *decreased*.  This is because part of the processing of the cache hit takes place at the L3 frequency, which is *increasing* relative to the core frequency in this scenario.
  • There are additional frequency transitions for QPI, DRAM, other chips' "uncore", and maybe more.   The number of cycles reported for the Load Latency will depend on those frequencies, even if the local core frequency remains constant.
    • For example, on a 2-socket system using the Xeon E5 "Sandy Bridge" generation of processors, the latency for *local* memory reads appears to include an additive component consisting of 12 cycles of the "uncore" frequency of the *remote* socket.
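
The L3 effect described in these notes can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of any real processor: the latency is split into a part that takes a fixed number of core cycles and a part that takes a fixed number of cycles in a fixed-frequency L3/uncore domain. The function name and all numbers are hypothetical.

```python
def reported_core_cycles(core_part_cycles, uncore_part_cycles,
                         core_freq_hz, uncore_freq_hz):
    """Toy model of an L3-hit latency as reported in core cycles.

    core_part_cycles:   work done in the core clock domain
    uncore_part_cycles: work done in the L3/uncore clock domain
    The uncore part takes a fixed amount of time, so its cost in
    core cycles scales with core_freq_hz / uncore_freq_hz.
    """
    if core_freq_hz <= 0 or uncore_freq_hz <= 0:
        raise ValueError("frequencies must be positive")
    return core_part_cycles + uncore_part_cycles * (core_freq_hz / uncore_freq_hz)


# Hypothetical split: 20 core cycles plus 16 uncore cycles,
# with the uncore held at 2.0 GHz while the core frequency drops.
for core_hz in (3.0e9, 2.0e9, 1.0e9):
    cycles = reported_core_cycles(20, 16, core_hz, 2.0e9)
    print(f"core at {core_hz / 1e9:.1f} GHz: {cycles:.0f} core cycles")
```

In this model the reported cycle count falls as the core slows down, even though the wall-clock latency of the L3 portion has not changed, so a shift in the latency histogram does not by itself indicate a change in cache behavior.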