
Get The Most Out of Your Intel Cloud Infrastructure with Virtual Performance Monitoring Units

PhilipArellano

Summary

When systems underperform, it’s all hands on deck to identify and resolve the source of the problem, and there are numerous tools and methods for analyzing performance data to find a bottleneck. Cloud platforms present an additional issue in that many analysis tools behave differently or are not available in virtual environments. Performance monitoring (perfmon) provides a powerful analysis option for debugging performance issues by observing hardware events directly, but it wasn't widely available on cloud virtual machines (VMs) until recent generations. Virtual Performance Monitoring Units (vPMUs) enable perfmon analysis in the cloud and unlock hardware events and metrics that can't be observed otherwise.

Using vPMUs for performance optimization can deliver dramatic performance improvements. Netflix achieved a 3x performance increase by using vPMU analysis to identify and resolve cache-line sharing bottlenecks that traditional profiling tools missed (discussed later in this paper). This represents significant cost savings potential: instead of scaling horizontally with more instances, organizations can optimize existing workloads to extract maximum value from their cloud infrastructure investments.

Cloud service providers such as Amazon Web Services (AWS) and Google Cloud Platform (GCP) expose Virtual Performance Monitoring Units (vPMUs) on select Intel® Xeon® Processor instances that allow you to measure critical performance parameters like instruction cycles, cache misses, and branch mispredictions. While Intel publishes the complete list of supported performance monitoring (perfmon) events and metrics, virtualized cloud instances typically support only a subset of these capabilities. Additionally, since perfmon counters provide low-level hardware telemetry, it isn't always clear when and how they'll be most effective for your specific use case.

In this post, we'll show you how to leverage vPMUs to profile your workload performance, walk through a real-world Netflix optimization case study, and provide comprehensive lists of perfmon events and metrics supported across specific AWS and GCP instance types.

vPMU Description

Performance Monitoring Units (PMUs) provide dedicated hardware support for measuring performance parameters. Intel exposes a comprehensive set of perfmon events for each processor that count specific hardware actions, such as CPU cycles, instructions retired, and L2 cache misses. You can combine these raw events using formulas to calculate higher-level performance metrics like instructions per cycle (IPC) or L2 cache misses per instruction (L2 MPI).
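To make the relationship between events and metrics concrete, here is a minimal sketch that derives IPC and L2 MPI from raw event counts. The counts are invented for illustration, and the event names in the comments are representative examples that vary by microarchitecture:

```python
# Hypothetical raw perfmon event counts collected over a sampling window.
# These values are made up purely to illustrate the metric formulas.
cycles = 4_000_000_000          # e.g., CPU_CLK_UNHALTED.THREAD
instructions = 6_000_000_000    # e.g., INST_RETIRED.ANY
l2_misses = 30_000_000          # an L2 miss event for this core

# Instructions per cycle (IPC): higher generally means better utilization.
ipc = instructions / cycles

# L2 cache misses per instruction (L2 MPI): lower is generally better.
l2_mpi = l2_misses / instructions

print(f"IPC: {ipc:.2f}")        # IPC: 1.50
print(f"L2 MPI: {l2_mpi:.4f}")  # L2 MPI: 0.0050
```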

Since we're focusing specifically on virtualized cloud environments, we'll use the term vPMUs throughout this post. While vPMUs are accessed similarly to their bare-metal counterparts, there are limitations that significantly reduce the set of perfmon events supported on cloud instances. In particular, the nature of cloud instances requires logical isolation between VMs sharing a hardware platform, so events that provide information about resources shared between cores could allow VMs to learn about their neighbors, breaking the isolation between VMs unless handled in a very precise way. Consequently, cloud service providers (CSPs) either choose not to expose these uncore or offcore events at all, or they make available only a few carefully selected events that have gone through proper design and validation to ensure isolation.

You have the option to manually program PMU events directly (detailed in the Intel® Software Developer's Manual, Volume 3B, Chapter 20), but most performance engineers prefer established tools like Linux perf or Intel's emon (event monitor) that abstract away the low-level complexity. Since manual programming is typically reserved for specialized scenarios requiring minimal overhead, we'll focus on these more accessible general-purpose tools.

PMU events fall into two categories: architectural events that remain consistent across processor generations, and non-architectural events that vary between microarchitectures.

Putting vPMUs to Work

Performance engineers rely on vPMUs to optimize both systems and compilers for real-world application performance gains. The typical workflow involves recording event counts during normal workload operation, then post-processing this data into higher-level metrics. Microarchitectural analysis of these metrics reveals performance hotspots and bottlenecks that become prime targets for optimization efforts.

Top-Down Microarchitecture Analysis

One method we recommend for understanding perfmon metrics is top-down microarchitecture analysis (TMA). This systematic approach organizes dozens of individual perfmon metrics into a clear hierarchy, starting with four high-level categories (referred to as “level 1” metrics): front end, back end, bad speculation, and retiring. The first three categories account for bottlenecks: pipeline slots that are stalled or wasted and therefore prevent instructions from retiring. Each level 1 metric then breaks down into progressively more detailed sub-metrics (level 2, level 3, and so on), creating a structured path from broad performance characterization to specific optimization opportunities.

This method is described in greater detail in the Intel® VTune™ Profiler Performance Analysis Cookbook.
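As an illustration of how the level 1 categories are derived, the sketch below implements the classic four-wide level 1 TMA formulas from Intel's original methodology. Exact event names and the pipeline width vary by microarchitecture, and the input counts here are invented, so treat this as a conceptual sketch rather than a drop-in calculation for any specific processor:

```python
# Level 1 TMA breakdown for a hypothetical 4-wide core, using the classic
# formulation. All inputs are raw perfmon event counts; the numbers passed
# in below are invented for illustration.
def tma_level1(clk_thread, idq_uops_not_delivered, uops_issued,
               uops_retired_slots, recovery_cycles, width=4):
    slots = width * clk_thread  # total issue slots available
    frontend_bound = idq_uops_not_delivered / slots
    bad_speculation = (uops_issued - uops_retired_slots
                       + width * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    # Whatever slots remain unaccounted for are attributed to the back end.
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {"frontend_bound": frontend_bound,
            "bad_speculation": bad_speculation,
            "retiring": retiring,
            "backend_bound": backend_bound}

breakdown = tma_level1(clk_thread=1_000_000,
                       idq_uops_not_delivered=800_000,
                       uops_issued=2_600_000,
                       uops_retired_slots=2_400_000,
                       recovery_cycles=25_000)
for category, fraction in breakdown.items():
    print(f"{category}: {fraction:.1%}")
```

The four fractions always sum to one, which is what makes the hierarchy useful: each level 1 bucket tells you what share of pipeline slots to go investigate at level 2.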

Recommended Tools

Rather than manually programming perfmon model-specific registers (MSRs), we strongly recommend using established tools that abstract away the low-level complexity. Here are our top recommendations.

Perf

Linux perf serves as the industry-standard performance analysis tool, capable of measuring perfmon events and calculating higher-level metrics. Find comprehensive usage information on the Linux perf man page and Brendan Gregg's excellent perf tutorial site.
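For scripted workflows, perf stat's machine-readable CSV mode (-x,) is convenient to post-process. The sketch below parses sample output into a dict of event counts; the sample lines are illustrative, and the CSV field layout can differ between perf versions, so treat the column indices as assumptions:

```python
# Illustrative `perf stat -x,` output: counter value, unit, event name, and
# trailing bookkeeping fields. These lines are fabricated sample data.
sample = """\
4000000000,,cycles,1000000000,100.00,,
6000000000,,instructions,1000000000,100.00,,
"""

def parse_perf_csv(text):
    counts = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(",")
        value, event = fields[0], fields[2]
        if value.isdigit():  # skips "<not supported>" / "<not counted>"
            counts[event] = int(value)
    return counts

counts = parse_perf_csv(sample)
print(counts["instructions"] / counts["cycles"])  # 1.5
```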

PerfSpect

Intel maintains PerfSpect, a specialized tool that leverages perf to capture key perfmon events and presents the results as a curated set of high-level metrics, including the hierarchical TMA metrics, without requiring a kernel driver. This tool is best for initial investigations where a kernel driver cannot be used. Refer to the PerfSpect GitHub repository for more information.

VTune

Intel® VTune™ Profiler is our comprehensive performance analysis solution designed for algorithm hotspot analysis and microarchitectural bottleneck identification. It excels at revealing complex interactions between workloads and processor hardware, though it can be somewhat heavy to install and use. The suite includes emon, which we recommend for lightweight profiling scenarios. 

Emon

Bundled with Intel® VTune™ Profiler, Emon (Event Monitor) provides command-line performance profiling through direct perfmon event measurement. While Emon’s perfmon functionality overlaps significantly with perf, it uses a dedicated driver for event collection, its spreadsheet output format is designed for performance analysis workflows, and it is regularly maintained and updated with Intel’s recommended set of useful perfmon events and metrics. Consult our Emon user guide for detailed implementation guidance.

Real-world Success Story: Netflix Performance Optimization

Our collaborative blog post with Netflix, "Seeing through hardware counters: a journey to threefold performance increase" (November 2022), demonstrates the real-world impact of perfmon-driven optimization. Netflix migrated a Java microservice from m5.4xl instances (16 vCPUs) to m5.12xl instances (48 vCPUs), expecting nearly triple throughput improvement from increasing the vCPU count per instance. Instead, they initially observed only a 25% increase in Requests per Second (RPS), a clear indication that something was limiting their scaling efficiency.

Investigation Methodology

Netflix began their investigation using high-level analysis tools, including perf and JVM-specific profilers (hotspot statistics and Java Flight Recorder). While these higher-level approaches identified a symptom, a bimodal distribution pattern in which the lower band exhibited lower CPU utilization and latency and the upper band higher, they didn't supply sufficient context to understand the root of the problem or how to solve it. Suspecting the answer might be hidden deeper in the microarchitecture, they turned next to low-level perfmon analysis. This high-to-low-level strategy reflects our recommended approach: start with high-level telemetry, since low-level perfmon metrics can be difficult to interpret without broader performance context.

Diagnosing Cache-Line Sharing with vPMUs

Using PerfSpect, Netflix captured detailed perfmon metrics and compared performance characteristics between underperforming and optimal modes. The analysis revealed significantly elevated counts for metrics associated with "false sharing" — including L1 and L3 cache activity spikes and increased MACHINE_CLEARS events. False sharing occurs when multiple threads repeatedly access unrelated data that happens to reside on the same cache line, causing unnecessary cache coherency traffic.

Netflix resolved the initial issue by padding the problematic variables to use full cache lines, preventing different threads from allocating into the same cache line. However, subsequent VTune analysis revealed a second sharing problem. This time, however, it was "true sharing" where multiple threads legitimately competed for dependent data. They solved this by restructuring data access patterns to minimize contention.

These changes resulted in a total 3.5x improvement over the initial throughput they observed on the m5.12xl instance type.
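Netflix's fix was in their Java service, but the layout principle behind the false-sharing fix is language-independent. The ctypes sketch below, assuming 64-byte cache lines (typical for Intel® Xeon® processors), shows how padding separates two hot counters that would otherwise share a line:

```python
import ctypes

CACHE_LINE = 64  # assumed cache line size in bytes

class PackedCounters(ctypes.Structure):
    # Both 8-byte counters fit in one cache line: if two threads on
    # different cores each hammer one of them, the line ping-pongs
    # between cores (false sharing).
    _fields_ = [("a", ctypes.c_uint64),
                ("b", ctypes.c_uint64)]

class PaddedCounter(ctypes.Structure):
    # Pad each counter out to a full cache line.
    _fields_ = [("value", ctypes.c_uint64),
                ("_pad", ctypes.c_uint8 * (CACHE_LINE - 8))]

class PaddedCounters(ctypes.Structure):
    _fields_ = [("a", PaddedCounter),
                ("b", PaddedCounter)]

print(ctypes.sizeof(PackedCounters))  # 16 -> a and b share one line
print(ctypes.sizeof(PaddedCounters))  # 128 -> one full line each
```

In Java, the same effect is typically achieved with field padding or the JVM's @Contended annotation; the tradeoff in either language is extra memory per counter in exchange for eliminating coherency traffic.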

Key Takeaways

Netflix's success with debugging this complex performance issue demonstrates both the power of Top-down Microarchitectural Analysis (TMA) and the importance of understanding event correlation patterns that indicate specific performance issues. This case study also highlights a key challenge in vPMU work: success requires substantial knowledge and experience to interpret trends in the data and to translate that into feasible optimizations.

vPMU Availability Across Cloud Platforms

As mentioned earlier, perfmon events come in two varieties: architectural events (consistent across processor generations) and non-architectural events (microarchitecture-specific). This means perfmon capabilities can vary significantly between different processor models.

Virtualized environments add another complexity layer, as cloud service providers may selectively enable or disable specific events or counting mechanisms. This variability makes vPMU data collection and analysis more challenging on cloud instances compared to bare-metal systems.

To address these challenges, we've compiled comprehensive lists of available perfmon events and metrics across various cloud instance types, providing clear visibility into what's supported where.

Bare-Metal

Bare-metal systems (non-virtualized environments) expose all PMUs without virtualization-imposed restrictions. We use AWS and GCP metal instances as our control baseline, providing a comparison reference for VM instances. These event sets align with the perfmon specifications detailed in our GitHub repository.

Cloud Service Provider Implementations

When CSPs enable vPMUs on VM instances, they typically provide a subset of bare-metal capabilities. The exact available events and metrics depend on instance type, underlying processor generation, and CSP-specific configuration parameters. Some platforms require explicit vPMU enablement and visibility level configuration.

AWS

AWS offers vPMU support across Intel-based 5th generation (m5, c5, etc.) and newer EC2 instance families, including both metal and virtualized variants; however, on the 5th and 6th generations, vPMU support is only offered on full-socket or metal instance-types. On 7th generation AWS EC2 instance families, vPMUs are supported on all instance-types.

  • Metal Instances
    AWS metal instances provide equivalent PMU support to bare-metal systems.
  • Virtualized Instances
    AWS virtualized instances support a subset of perfmon events and metrics. Notably, many uncore or offcore events are not supported to ensure isolation between VMs. Additionally, the fixed counter for CPU cycles is unavailable, though CPU cycles can still be measured using programmable counters. This means perf, PerfSpect, and VTune maintain full CPU cycle counting capability, while emon may have limitations on these instances. Some additional individual counters may be missing—consult our detailed instance-specific documentation for complete compatibility information.

GCP

GCP supports PMUs on both metal and virtualized instances. Supported metal instances include c4.metal, c3.metal, and z3.metal, which support all PMU counters. For virtualized instances, GCP provides tiered vPMU support in the c4 instance family with three distinct capability levels: architectural, standard, and enhanced. Some individual counters may be unavailable depending on the tier; reference our detailed compatibility documentation for instance-specific details.

Unlike AWS, GCP requires vPMUs to be explicitly enabled, and configured to one of three PMU types: Architectural, Standard, and Enhanced. Details on how to enable vPMUs in GCP can be found in Google’s GCP documentation.

  • Architectural PMU Type
    Includes only the architectural events specified in Intel's Software Developer Manual. This is a focused set of fundamental performance events that remain consistent across processor microarchitectures.
  • Standard PMU Type
    Expands beyond architectural events to include hardware events within the core complex such as core, L1 cache, and L2 cache monitoring. This tier includes all architectural events plus core-level performance counters.
  • Enhanced PMU Type
    Provides the most comprehensive coverage, including off-core hardware events such as L3 cache and memory subsystem monitoring. The Enhanced tier includes all architectural and standard events plus system-level performance visibility. Currently, full-socket or dual-socket c4 instance-types on Intel® Xeon® 6 processors (formerly Granite Rapids) support Enhanced PMU capabilities, including, but not limited to, c4-standard-144, c4-standard-288, and c4-highmem-144.

vPMU Support Reference Materials

We've developed comprehensive GitHub repositories containing detailed availability matrices for each cloud platform and instance type. These resources provide specific event and metric availability for various instance types.

Methodology

We used perf stat to collect vPMU data for various instance types on AWS and GCP and separated each event and metric into succeeded and failed per instance type. These two categories are further split into additional categories based on the value perf stat returned when collecting each metric.

The command we used is shown below:

perf stat --timeout 100 -e EVENT stress-ng -m num_cores

We ran each event for 100 milliseconds, targeting stress-ng matrix multiplication as a load generator to stimulate the system into returning more interesting values from the vPMUs.

As an alternate method, Edwin Chiu details his usage of PMUs in Google Kubernetes Engine (GKE).

Metrics versus Events

Since metrics are the result of formulas using events as inputs, the success or failure of a metric is based on the events it uses: if all the events in a metric's formula are supported, the metric is considered supported as well; if one or more events are not supported, then neither is the metric.
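This rule can be expressed as a one-line check. The event and metric names below are illustrative, not taken from any specific event list:

```python
# A metric is supported only if every event in its formula is supported.
def metric_supported(metric_events, supported_events):
    return all(event in supported_events for event in metric_events)

# Hypothetical per-instance-type supported-event set.
supported = {"cycles", "instructions"}

print(metric_supported({"instructions", "cycles"}, supported))     # True  (e.g., IPC)
print(metric_supported({"l2_misses", "instructions"}, supported))  # False (e.g., L2 MPI)
```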

Document Layout

  • Succeeded Events and Metrics
    Based on the value returned by perf stat, we consider these as supported by that instance-type.
    • Non-Zero
      The event or metric returned a non-zero value, so there is high confidence that these are supported.
    • Zero
      The event or metric returned a value of zero. While we did not count any occurrences of the event, it’s likely that the event simply didn’t occur within the 100 millisecond sampling window. In most cases, a zero value indicates that the event or metric is supported.
      Events or metrics reporting zero values are more ambiguous with GCP vPMUs. This is because the mechanism GCP uses to disable events for Standard and Architectural PMU types seems to cause the disabled events—and thus the metrics that use them—to report zero. As such, we suggest that any GCP metric that reports Zero in Standard or Architectural PMU types but returns a non-zero value in Enhanced, is not supported. There may be some cases where a metric reports zero in Enhanced as well as the others where it was disabled, but these cases have not been exhaustively tested in our data.
    • Not a Number (NaN)
      A small subset of metrics returned a value of “NaN” indicating that none of the underlying events failed, but after calculating the metric formula, the result was not a number. In the cases that we inspected, these were the result of a divide-by-zero, where the event used in the formula’s divisor returned zero.


  • Failed Events and Metrics
    Based on the value returned by perf stat, we consider these as not supported by that instance-type.
    • Not Supported
      On some occasions, after targeting an event or metric with “perf stat”, the process returned but reported that some events were “not supported”. We consider these events and the metrics that use them as not supported by that instance-type.
    • Error
      Most failed events and metrics were identified when “perf stat” errored out while reading the events. These events are not supported by that instance-type.
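The classification above can be sketched as a small function. The "&lt;not supported&gt;" string is an assumption based on common perf stat output, and the NaN case mirrors the divide-by-zero behavior described for metrics:

```python
import math

# Classify the value perf stat reports for a single event, using the
# categories from the document layout above.
def classify(value):
    if value is None:
        return "Error"              # perf stat errored out while reading
    if value == "<not supported>":
        return "Not Supported"
    return "Non-Zero" if float(value) != 0 else "Zero"

# The NaN case applies to metrics: a formula whose divisor event counted
# zero evaluates to NaN even though every underlying event succeeded.
def metric_value(numerator, denominator):
    return math.nan if denominator == 0 else numerator / denominator

print(classify("123456"))                # Non-Zero
print(classify("0"))                     # Zero
print(classify("<not supported>"))       # Not Supported
print(classify(None))                    # Error
print(math.isnan(metric_value(100, 0)))  # True
```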

 

Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.