
Enhance The Performance of Intel® Data Direct I/O (DDIO) Workloads Using Intel® VTune™ Profiler

Authors:

Nikita Shiledarbaxi, Software Technical Marketing Engineer, Intel

Rob Mueller-Albrecht, Software Tools Marketing Manager, Intel

 

Profile uncore hardware performance events in Intel® Xeon® processors with oneAPI

 

Intel® Data Direct I/O (DDIO) technology is a hardware feature available in Intel® Xeon® processors. It improves I/O performance by making the processor cache the main junction for I/O data flowing into and out of Intel® Ethernet controllers and adapters. To monitor the efficiency of DDIO and of Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d), which enables independent execution of multiple operating systems and applications, it is crucial to keep an eye on uncore events, i.e., events happening outside the CPU core. Intel® VTune™ Profiler, the oneAPI-powered performance analysis and debugging tool, lets you analyze uncore hardware events and thus enhance the performance of DDIO workloads.

In this blog, we will discuss leveraging VTune Profiler to analyze and optimize directed I/O performance. Before we dive into the profiling technique, let us briefly discuss Intel DDIO technology.

 

Overview of the Intel® Data Direct I/O (DDIO) Technology

 

Intel DDIO is an Intel® Integrated I/O feature first introduced in 2012 as part of the Intel Xeon processor E5 family and Intel Xeon processor E7 v2 family. It aims to enhance system-level I/O performance by routing I/O data into and out of the processor differently from the classical I/O model.

Before DDIO technology was available, I/O operations were slow and processor cache was a very limited resource. All data coming in from or going out to an Ethernet controller or adapter had to be stored in or retrieved from the host processor’s main memory. Data in main memory had to be brought into the cache before the processor could operate on it. This resulted in frequent memory read and write operations. On some older architectures, it also triggered extra, speculative read operations from the I/O hub. Too many memory accesses often degrade I/O performance and increase system power consumption.

Given that processor cache is no longer a scarce resource, Intel DDIO technology was introduced to restructure the flow of I/O data by making the processor cache (instead of the main memory) the primary source and destination of I/O data. 

Depending on the type of workload running on the server or workstation, DDIO brings advantages such as:

  • Increase in bandwidth,
  • Decrease in latency,
  • Lower power consumption,
  • Higher transaction rates, and more.

The DDIO technology requires no industry enabling. It has no hardware dependencies and requires no changes to your software application, drivers, or operating systems.

→ Detailed information on the Intel DDIO technology is available here.
 

Boost DDIO Performance Using Intel® VTune™ Profiler

 

An uncore event refers to a function executed in the uncore part of a CPU, outside the processor core itself, that nevertheless impacts overall processor performance. Such events can, for example, relate to the operation of the I/O stacks, the memory controller, and the Intel® Ultra Path Interconnect (UPI)[1] block.

A recently published recipe in the VTune Profiler Cookbook describes how the tool's Input and Output analysis can help you count these types of uncore hardware events. The results help you understand the traffic and behavior of Peripheral Component Interconnect Express (PCIe)[2] devices and, in turn, analyze DDIO and VT-d efficiency.

The recipe describes the process of running the Input and Output analysis, analyzing the results, and grouping the resulting I/O metrics. It requires a 1st Gen or later Intel Xeon Scalable processor and VTune Profiler version 2023.2 or later. The I/O metrics and events discussed in the recipe are based on the 3rd Gen Intel Xeon Scalable processor, but the methodology also applies to the latest generation of Intel Xeon processors.

NOTE: For detailed hardware and software configuration requirements, refer to the ‘Ingredients’ section of the recipe.

 

Perform I/O Analysis with VTune Profiler

First, run the Input and Output analysis of VTune Profiler on your application. The analysis lets you choose from various platform-level metrics for examining the utilization of the CPU, buses, and I/O subsystems. Enabling the option to analyze PCIe traffic gives you metrics that measure how efficiently Intel DDIO is being used.
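
As a rough illustration of this step, the sketch below assembles a command line for launching the Input and Output analysis from a script. It assumes the `vtune` command-line tool is on your PATH, that the analysis type is named `io`, and that PCIe traffic collection is controlled by a knob called `collect-pcie-bandwidth`; confirm the knob names for your version with `vtune -help collect io`. The application name `./my_io_app` and the result directory are placeholders, not names from the recipe.

```python
import shlex


def build_io_analysis_command(app_argv, result_dir="r000io", pcie_bandwidth=True):
    """Return the argv list for a VTune Input and Output collection.

    app_argv is the command line of the application to profile
    (hypothetical here). The knob name below is an assumption; check it
    against `vtune -help collect io` on your installation.
    """
    if not app_argv:
        raise ValueError("app_argv must name the application to profile")
    knob_value = "true" if pcie_bandwidth else "false"
    cmd = ["vtune", "-collect", "io", "-result-dir", result_dir]
    cmd += ["-knob", f"collect-pcie-bandwidth={knob_value}"]
    cmd += ["--", *app_argv]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    command = build_io_analysis_command(["./my_io_app", "--duration", "30"])
    print(shlex.join(command))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without VTune Profiler; passing the returned list to `subprocess.run` would start the actual collection.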

 

Analyze the I/O Metrics

The report generated by the Input and Output analysis can be examined in the VTune Profiler GUI or VTune Profiler Web Server. The recipe demonstrates the analysis of various I/O performance metrics in the VTune Profiler Web Server interface, such as:

  • Information on CPU execution time
  • Utilization of the physical core, DRAM, PCIe and Intel UPI links through a platform diagram
  • PCIe Traffic Summary, i.e., metrics measuring inbound PCIe traffic (initiated by I/O devices) and outbound PCIe traffic (initiated by CPU). These metrics help calculate PCIe bandwidth and effective utilization, latency for inbound read/write requests, CPU/IO conflicts, and more.
  • Metrics to understand how efficiently the workload uses Intel VT-d technology for re-mapping incoming I/O device memory addresses to different host addresses
  • DRAM and UPI bandwidth utilization
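
To make the bandwidth and utilization figures more concrete, here is a minimal sketch of how an observed PCIe byte count could be turned into a bandwidth and compared with the theoretical peak of a link. It is not the formula VTune Profiler uses internally; the function names are illustrative, and the peak ignores packet and protocol overhead, so real links saturate below 100%. The per-lane transfer rates and line encodings are the standard values for PCIe generations 1 through 5.

```python
# Per-lane raw transfer rate (GT/s) and line-encoding efficiency
# for each PCIe generation (8b/10b for Gen1-2, 128b/130b for Gen3-5).
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def link_peak_bytes_per_sec(generation, lanes):
    """Theoretical one-direction peak of a PCIe link, in bytes per second."""
    rate_gt, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt * 1e9 * efficiency / 8 * lanes


def observed_bandwidth(bytes_transferred, seconds):
    """Average bandwidth over a sampling interval, in bytes per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_transferred / seconds


def utilization(bytes_transferred, seconds, generation, lanes):
    """Fraction of the theoretical link peak used during the interval."""
    peak = link_peak_bytes_per_sec(generation, lanes)
    return observed_bandwidth(bytes_transferred, seconds) / peak


if __name__ == "__main__":
    # Hypothetical measurement: 6 GB of inbound writes in 1 s on a Gen3 x16 link.
    share = utilization(6e9, 1.0, generation=3, lanes=16)
    print(f"Gen3 x16 peak: {link_peak_bytes_per_sec(3, 16) / 1e9:.2f} GB/s")
    print(f"Utilization: {share:.1%}")
```

Inbound and outbound traffic travel on separate lanes of the link, so it is reasonable to evaluate each direction against the peak on its own.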

 

Grouping the Analysis Results

The Input and Output analysis gives you in-depth information about the I/O metrics of PCIe devices and about uncore performance events. You can group these metrics into the different views available in the VTune Profiler GUI or Web Server to derive conclusions such as:

  • I/O metrics (average latencies, CPU/IO conflicts, VT-d metrics, and more) per device
  • Number of uncore events per group of I/O devices
  • Correlation of PCIe traffic with DRAM and UPI bandwidths

→ Refer to the VTune Profiler cookbook's recipe for more details on profiling uncore hardware events in Data Direct I/O applications.
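
The groupings listed above are produced by the VTune Profiler views themselves, but the underlying idea can be sketched in a few lines. The example below assumes you have a set of per-sample records with a device name, a device group, a latency, and an uncore event count; the field names and sample values are hypothetical and do not reflect any VTune Profiler export format.

```python
from collections import defaultdict

# Hypothetical per-sample records; field names and values are illustrative only.
SAMPLES = [
    {"device": "nic0", "group": "network", "latency_ns": 400.0, "uncore_events": 1200},
    {"device": "nic0", "group": "network", "latency_ns": 600.0, "uncore_events": 800},
    {"device": "nvme0", "group": "storage", "latency_ns": 900.0, "uncore_events": 500},
]


def mean_latency_per_device(samples):
    """Average the latency of all samples belonging to each device."""
    totals = defaultdict(lambda: [0.0, 0])  # device -> [latency sum, sample count]
    for sample in samples:
        entry = totals[sample["device"]]
        entry[0] += sample["latency_ns"]
        entry[1] += 1
    return {device: total / count for device, (total, count) in totals.items()}


def events_per_group(samples):
    """Sum the uncore event counts of all samples in each device group."""
    counts = defaultdict(int)
    for sample in samples:
        counts[sample["group"]] += sample["uncore_events"]
    return dict(counts)


if __name__ == "__main__":
    print(mean_latency_per_device(SAMPLES))
    print(events_per_group(SAMPLES))
```

Per-device averages correspond to the first grouping in the list, and per-group sums to the second; correlating PCIe traffic with DRAM and UPI bandwidth would need time-aligned samples, which this sketch does not model.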
 

What’s Next?

 

Get started with VTune Profiler today – analyze, optimize, and fix hardware and software performance bottlenecks in a variety of applications, including HPC and AI/ML workloads! In addition to I/O analysis, the tool enables several other types of analysis, such as HPC performance characterization analysis, hotspots analysis, memory consumption and allocation analysis, and much more.

Check out other AI, HPC, and rendering tools in our extensive software portfolio powered by oneAPI.

 

Get the Software

 

The Intel VTune Profiler is available as a part of the Intel® oneAPI Base Toolkit. You can also download a standalone version of the tool.  

 
