Analyze and Optimize Performance on High Bandwidth Memory (HBM) CPUs using Intel® VTune™ Profiler

Nikita_Shiledarbaxi · 01-31-2024

High Bandwidth Memory (HBM) is a memory interface that uses 3D stacking technology to provide high bandwidth while consuming less power. The Intel® Xeon® CPU Max Series is the only x86-based processor with HBM. Its HBM feature helps boost the performance of the 4th Gen Intel® Xeon® Scalable Processors (code-named Sapphire Rapids) in real-world applications, including AI/ML, HPC, data analytics, and modeling. The HBM CPUs achieve up to 5x better performance on memory-bandwidth-bound workloads[1][2] and up to 20x speed-up for Numenta AI technology used for Natural Language Processing (NLP)[3]. Optimizing HBM usage is crucial for memory-bound or data-intensive applications. This blog highlights our recent webinar on how Intel® VTune™ Profiler helps optimize workload performance on HBM CPUs.

The webinar covered the following topics:

4th Gen Intel® Xeon® Scalable Processors in brief
HBM on the Intel® Xeon® CPU Max Series
Overview of Intel VTune Profiler
Practical demonstration of profiling a system with HBM using Intel VTune Profiler on the Intel® Developer Cloud platform

Check out the complete webinar recording.

4th Gen Intel® Xeon® Scalable Processor: An Overview

The 4th Gen Intel® Xeon® Scalable Processors stand out with 14 built-in accelerators, the highest number among CPUs on the market. Developers can harness these accelerators through Intel® oneAPI tools to improve performance. Recent Intel® accelerator engines and software optimizations result in an average 2.9x increase in performance per watt[4] across various workloads. Intel® toolkits offer advanced compilers, libraries, analysis/debug tools, and optimized frameworks that simplify the development of accelerated solutions. These toolkits make efficient use of Intel® hardware accelerators and instruction sets, including Intel® Advanced Matrix Extensions (Intel® AMX), Intel® QuickAssist Technology (Intel® QAT), Intel® Data Streaming Accelerator (Intel® DSA), and Intel® In-Memory Analytics Accelerator (Intel® IAA).

About the HBM on Intel® Xeon® CPU Max Series

The Intel® Xeon® CPU Max Series comes with:

64GB of ultra-high-bandwidth in-package memory
4 stacks of HBM2e
>1GB of HBM capacity per core

Depending on the memory bandwidth required by your workload, you can use HBM in one of three modes:

HBM Only: The system utilizes only HBM; there is no Dual In-line Memory Module (DIMM) attached to the server.
HBM Flat Mode: Both the Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM) and the HBM are available to be used by the system.
HBM Caching Mode: The HBM acts as a cache for the DDR SDRAM.
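
To make the Flat Mode description concrete, here is a minimal Python sketch for Linux. It rests on an assumption that is not stated in the webinar summary: that in Flat Mode the HBM is exposed to the operating system as NUMA nodes with memory but no CPUs. The helper names (parse_cpulist, memory_only_nodes) are hypothetical, and the sysfs layout should be checked on your own system.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as '0-3,8' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist means the node has no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def memory_only_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the ids of NUMA nodes that have no CPUs attached.

    Assumption: in HBM Flat Mode, the HBM shows up as such CPU-less nodes.
    """
    nodes = []
    for node_dir in Path(sysfs_root).glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.exists() and not parse_cpulist(cpulist.read_text()):
            nodes.append(int(node_dir.name[len("node"):]))
    return sorted(nodes)


if __name__ == "__main__":
    print("Memory-only NUMA nodes:", memory_only_nodes())
```

If the sketch reports CPU-less nodes, a tool such as numactl (for example, numactl --membind=<node> ./app) can place an application's allocations on them, so that a profiling run reflects HBM rather than DDR traffic.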

Intel® VTune™ Profiler: A Brief Introduction

Intel VTune Profiler is a versatile application analysis tool that helps identify software bottlenecks across hardware components such as the CPU, GPU, FPGA, memory, and caches. Its Hotspots analysis lets you follow the execution flow of an application to evaluate its algorithms. You can also identify and examine the top hotspots (the sections of the program that consume the most execution time) and CPU utilization.

The tool supports multiple languages such as C, C++, DPC++, Fortran, Python*, Go, and Java*. It operates on diverse operating systems, including Windows* and Linux*. It is compatible with containers and VMs, though modifications may be needed for permissions and configurations. It helps you in several ways:

Enables fast application analysis without requiring recompilation or additional flags.
Features a powerful GUI for visualizing system components and application execution timelines, efficiently filtering out unnecessary data, and running queries for organized insights.
Provides both graphical and command-line interfaces for flexibility and ease of use, allowing profiling data collection from local and remote servers.

HBM Utilization Analysis using Intel® VTune™ Profiler

The Intel VTune Profiler GUI allows you to analyze HBM utilization through its Microarchitecture Exploration analysis and HPC Performance Characterization analysis (both with the ‘Analyze memory bandwidth’ option enabled) and Memory Access analysis.
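
The same three analyses can also be launched from the command-line interface. The Python sketch below only assembles the command lines; it does not run them. The analysis type names (uarch-exploration, hpc-performance, memory-access) and the collect-memory-bandwidth knob are written from memory of the VTune CLI and should be treated as assumptions to verify with vtune -help collect <analysis> on your installation. The helper build_vtune_command and the application name ./my_app are hypothetical.

```python
import shlex

# Assumed mapping from GUI analysis names to VTune CLI analysis types;
# verify against your VTune installation.
ANALYSES = {
    "Microarchitecture Exploration": "uarch-exploration",
    "HPC Performance Characterization": "hpc-performance",
    "Memory Access": "memory-access",
}

# Analyses for which the GUI offers the 'Analyze memory bandwidth' option.
NEEDS_BANDWIDTH_KNOB = {"uarch-exploration", "hpc-performance"}


def build_vtune_command(analysis, result_dir, app_cmd):
    """Return the argument list for one VTune collection run."""
    cli_name = ANALYSES[analysis]
    cmd = ["vtune", "-collect", cli_name]
    if cli_name in NEEDS_BANDWIDTH_KNOB:
        # Assumed CLI counterpart of the 'Analyze memory bandwidth' checkbox.
        cmd += ["-knob", "collect-memory-bandwidth=true"]
    cmd += ["-result-dir", result_dir, "--"] + list(app_cmd)
    return cmd


if __name__ == "__main__":
    for name, cli_name in ANALYSES.items():
        cmd = build_vtune_command(name, "r_" + cli_name, ["./my_app"])
        print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it straightforward to pass them to subprocess.run later, or to print them for use in a job script on a remote server.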

The Bandwidth Utilization section provides a histogram of bandwidth utilization (as shown in Fig. 1 below), allowing users to analyze the Intel® Ultra Path Interconnect (Intel® UPI), HBM, and DDR performance for their application. This feature enables a detailed examination of how these components are running overall and how the data bandwidth utilization can be optimized at any given time during the execution flow.

Fig.1: Bandwidth Utilization Histogram
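
To illustrate what a bandwidth utilization histogram summarizes, the following Python sketch bins bandwidth samples and reports how much time was spent at each level. It is not VTune's implementation: the sample values, the sampling interval, the peak bandwidth, and the Low/Medium/High thresholds are all made-up assumptions chosen for illustration.

```python
def bandwidth_histogram(samples_gbs, bin_width_gbs, interval_s):
    """Map each bandwidth bin (lower edge, in GB/s) to the seconds spent in it.

    Each sample is assumed to represent `interval_s` seconds of execution.
    """
    hist = {}
    for bw in samples_gbs:
        bin_start = int(bw // bin_width_gbs) * bin_width_gbs
        hist[bin_start] = hist.get(bin_start, 0.0) + interval_s
    return dict(sorted(hist.items()))


def classify(bw_gbs, peak_gbs, low=0.3, high=0.7):
    """Label a bandwidth value relative to an assumed peak (hypothetical thresholds)."""
    frac = bw_gbs / peak_gbs
    if frac < low:
        return "Low"
    if frac < high:
        return "Medium"
    return "High"


if __name__ == "__main__":
    # Hypothetical HBM bandwidth samples in GB/s, one every 0.5 seconds.
    samples = [10, 20, 250, 260, 900]
    peak = 1000  # assumed peak bandwidth for this illustration only
    for bin_start, seconds in bandwidth_histogram(samples, 100, 0.5).items():
        print(f"{bin_start:>4} GB/s: {seconds:.1f} s ({classify(bin_start, peak)})")
```

A run that spends most of its time in the high bins for DDR and in the low bins for HBM would suggest that data placement, rather than raw bandwidth, is the first thing to examine.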

The Timeline Pane provides a deeper view to analyze package or socket traffic, revealing insights into how the individual CPUs or sockets communicate with memory and each other, especially regarding cross-socket traffic.

Additionally, the platform diagram in the Memory Access analysis shows traffic from each socket to HBM and DRAM, depending on which memory is installed and how it is configured. The diagram also depicts the memory modules and cross-socket traffic, which helps identify Non-Uniform Memory Access (NUMA) issues.

For a detailed understanding of HBM utilization analysis using Intel VTune Profiler, check out the documentation: Microarchitecture Analysis Group

Watch the webinar from [00:14:15] for a practical demonstration on how to profile a system with HBM using Intel VTune Profiler on Intel Developer Cloud.

What’s Next?

Check out the webinar and get started with Intel VTune Profiler to optimize workload performance and take the best possible advantage of Intel® Xeon® CPU Max Series, the only x86-based processor with HBM. We encourage you to check out other AI, HPC, and Rendering tools in Intel’s oneAPI-powered software portfolio.

Reference

ChatGPT 3.5 and Shiledarbaxi, N. (2024, January 21). Summarize the transcript [AI-generated text]. OpenAI. https://chat.openai.com

[1] Visit intel.com/performanceindex (Events: Supercomputing 22) for workloads and configurations. Results may vary.

[2] 2S Intel Xeon Max CPU vs. 2S AMD EPYC 7773X and 2S 3rd Gen Intel® Xeon® 8380

[3] Numenta BERT-Large AMD Milan: Tested by Numenta as of 11/28/2022. 1-node, 2x AMD EPYC 7R13 on AWS m6a.48xlarge, 768 GB DDR4-3200, Ubuntu 20.04 Kernel 5.15, OpenVINO 2022.3, BERT-Large, Sequence Length 512, Batch Size 1. Intel® Xeon® 8480+: Tested by Numenta as of 11/28/2022. 1-node, 2x Intel® Xeon® 8480+, 512 GB DDR5-4800, Ubuntu 22.04 Kernel 5.17, OpenVINO 2022.3, Numenta-Optimized BERT-Large, Sequence Length 512, Batch Size 1. Intel® Xeon® Max 9468: Tested by Numenta as of 11/30/2022. 1-node, 2x Intel® Xeon® Max 9468, 128 GB HBM2e 3200 MT/s, Ubuntu 22.04 Kernel 5.15, OpenVINO 2022.3, Numenta-Optimized BERT-Large, Sequence Length 512, Batch Size 1.

[4] Geomean of following workloads: RocksDB (IAA vs ZTD), ClickHouse (IAA vs ZTD), SPDK large media and database request proxies (DSA vs out of the box), Image Classification ResNet-50 (AMX vs VNNI), Object Detection SSD-ResNet-34 (AMX vs VNNI), QATzip (QAT vs zlib)