Memory bandwidth has become a bottleneck for increasingly memory-bound workloads, especially in high-performance computing (HPC), artificial intelligence, machine learning, and data analytics. The next-generation Intel® Xeon® Scalable processor, code-named Sapphire Rapids plus HBM, specifically targets these workloads, which have traditionally been served by overprovisioning memory devices. As shown in Figure 1, this processor provides 64 GB of high-bandwidth memory (HBM) per socket, organized as four HBM2e stacks, in addition to eight channels of DDR5. In this document, we describe how software optimization can achieve the best performance using HBM.
Figure 1: HBM organized as four HBM2e stacks
HBM can be exposed to software using three different memory modes. These memory modes are selected through the BIOS menu when the system boots up. These modes are:
- HBM-only mode
- Flat mode
- Cache mode
The following subsections describe each mode in detail.
If the system does not have any DDR modules installed, it boots up in HBM-only mode. In this mode, HBM appears to software as a single flat memory address space, no differently from how DDR appears on DDR-only systems. Existing software does not have to take any additional steps to use HBM in this mode, provided the total memory consumption (that of the application, the operating system (OS), and other memory-resident services) does not exceed the 64GB maximum HBM capacity per socket.
If the total memory consumption is more than 64GB per socket, the user may decompose the application across multiple sockets or nodes, as is often done for HPC applications using the Message Passing Interface (MPI). It may also be possible to lower the memory consumption of libraries (for example, MPI buffers) and OS services (for example, file caches) so that total memory consumption remains below 64GB. To identify the memory consumption of the system, the user may use the numastat Linux utility or the Intel® VTune™ Profiler.
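As a quick sketch of that check (assuming a Linux system; numastat may not be installed everywhere), per-node memory usage can be inspected from the shell, falling back to the kernel's per-node meminfo files:

```shell
# Show per-NUMA-node memory usage. numastat gives a per-node breakdown;
# if it is unavailable, the kernel's per-node meminfo files are a fallback.
numastat -m 2>/dev/null ||
  grep -E -H 'Mem(Total|Free|Used)' /sys/devices/system/node/node*/meminfo
```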
When DDR modules are installed on a system, the user can select flat mode. In this mode, both DDR and HBM address spaces are visible to software. On a socket, each address space is exposed as a separate Non-uniform Memory Access (NUMA) node as shown in Figure 2.
Figure 2: HBM and DDR exposed as two separate NUMA nodes on the first socket.
By default, all allocations go to NUMA node 0, which is DDR. As a result, the OS and other services allocate memory from DDR, leaving the memory on the HBM node entirely available to applications. This behavior is the primary advantage of flat mode over HBM-only mode, where HBM also serves non-application tasks.
Exposing DDR and HBM as two separate NUMA nodes allows the use of standard Linux NUMA utilities and interfaces (for example, numactl and libnuma) to place an application in the desired address space. For instance, to allocate all of an application's memory from HBM (NUMA node 1), we can use the standard numactl utility:
numactl -m 1 ./a.out
The -m (or --membind) option tells numactl to bind all memory allocations to NUMA node 1 (HBM). In this case, if an application’s allocations exceed the capacity of NUMA node 1, the allocation fails. Instead, we can use the -p (or --preferred) option to prefer NUMA node 1 for memory allocation:
numactl -p 1 ./a.out
In this case, all allocations are satisfied from NUMA node 1 until it fills up. Once the preferred node is full, allocations fall back to the next closest node, NUMA node 0 (DDR). This policy accommodates applications with footprints larger than 64GB, although the allocations that land on DDR are served at DDR bandwidth, which is lower than HBM bandwidth.
Instead of using numactl, it is also possible to use the Intel MPI environment variable I_MPI_HBW_POLICY for programs to allocate memory from HBM. For example,
mpirun -genv I_MPI_HBW_POLICY hbw_bind -n 2 ./a.out
mpirun -genv I_MPI_HBW_POLICY hbw_preferred -n 2 ./a.out
Since the user does not have to specify the HBM node number explicitly, this approach is more convenient for MPI applications in flat mode, especially when there is more than one socket.
On systems with both HBM and DDR, cache mode allows HBM to function as a memory-side cache, which caches the contents of DDR. In this mode, HBM is transparent to all software because the HBM cache is managed by hardware memory controllers. The entire DDR space is visible to software, and hence this mode can be used to run applications with footprints that exceed 64GB per socket.
Since the HBM cache is transparent to software, cache mode does not require any software modifications to take advantage of HBM. A symmetric population of DIMMs among the four memory controllers is required for cache mode. For best performance, all eight DDR channels should be populated.
In cache mode, HBM is organized as a direct-mapped cache. Direct-mapped caches can have high miss rates due to conflict misses in certain situations. These conflict misses occur when two or more addresses in DDR map to the same location in the HBM cache, forcing the cache to hold only one of them. Such conflicts lead to lower performance and higher performance variability.
Linux kernels v5.4 and later support two features that can mitigate the effects of the direct-mapped HBM cache in some situations.
Linux kernel mitigations for HBM cache mode
Fake NUMA: This feature is enabled using a kernel boot option (numa=fake). It allows the physical memory of a system to be divided into “fake” NUMA nodes. For instance, 256GB of physical memory on a socket can be uniformly divided into four fake NUMA nodes of 64GB each using the kernel boot option numa=fake=4U. Allocations within such a 64GB fake NUMA node are guaranteed to be conflict-free. The main advantage of fake NUMA is for running applications with memory footprints that fit within 64GB in cache mode.
Without fake NUMA, even applications with footprints smaller than 64GB can encounter conflict misses in the HBM cache due to physical memory fragmentation. Fake NUMA allows the user to place an application entirely in one of the fake NUMA nodes (for example, by using numactl), thereby completely avoiding conflict misses in the HBM cache. This placement leads to higher overall performance and lower performance variability. When cache mode is selected to support a variety of applications with small and large memory footprints, fake NUMA is strongly recommended to provide the best performance for applications with footprints below 64GB.
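As a hedged configuration sketch (the file path, node number, and binary name are illustrative), fake NUMA is enabled on the kernel command line and an application is then bound to one fake node with standard tools:

```shell
# /etc/default/grub: add the fake-NUMA option, then regenerate the grub
# configuration (e.g., with grub2-mkconfig) and reboot.
GRUB_CMDLINE_LINUX="... numa=fake=4U"   # four uniform fake nodes per socket

# After reboot, fake nodes are listed like real ones:
#   numactl -H
# Place the application entirely in one 64GB fake node (node 2 here):
#   numactl -m 2 ./a.out
```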
Page Shuffling (Randomization): This feature causes memory pages to be allocated using a random placement policy, leading to a more uniform distribution of pages across all physical memory. The result is higher performance predictability, which is especially useful when the application memory footprint exceeds 64GB. Note that users must still use fake NUMA for best performance when the memory footprint is less than 64GB. This feature is enabled in kernel v5.4 or later with the boot option page_alloc.shuffle=y, and its status can be checked via the file /sys/module/page_alloc/parameters/shuffle.
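A small sketch to verify the setting at run time (the parameter file exists only on kernels built with this feature; it typically reads Y when enabled):

```shell
# Report whether page shuffling is enabled; the parameter file is present
# on kernels v5.4+ and reads Y (enabled) or N (disabled).
f=/sys/module/page_alloc/parameters/shuffle
if [ -r "$f" ]; then
  echo "page shuffling: $(cat "$f")"
else
  echo "page shuffling: parameter not present on this kernel"
fi
```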
The BIOS allows a socket to be partitioned into four different NUMA nodes, as shown in Figure 3. This feature, called SNC4 (Sub-NUMA Clustering), offers the potential for higher bandwidth and lower latency. In this mode, each node has two DDR channels and one HBM2e stack with 16GB of capacity. SNC4 is available in all memory modes. From a software point of view, a single socket with SNC4 is analogous to a four-socket Intel Xeon processor-based system containing four NUMA domains. Software must be NUMA-optimized to make the best use of SNC4. The recommended usage is at least one MPI rank per NUMA node (for example, 4, 8, or 12 MPI ranks per socket).
Figure 3: SNC4 partitions CPU into 4 NUMA nodes.
All standard Linux NUMA utilities and libraries work with SNC4 as they do on any other NUMA system. In flat mode, using Intel MPI environment variable I_MPI_HBW_POLICY is more convenient with SNC4 because the user does not have to identify HBM node numbers explicitly.
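As a launch-line sketch of the rank-per-NUMA-node recommendation (the binary name is illustrative; I_MPI_PIN_DOMAIN is an Intel MPI process-pinning setting):

```shell
# Eight ranks on a two-socket SNC4 system: one rank per NUMA node,
# with Intel MPI pinning each rank to its own NUMA domain.
mpirun -genv I_MPI_PIN_DOMAIN numa -n 8 ./a.out
```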
Best Practices for Enabling HBM
In each memory mode, the best performance using HBM is obtained when the memory footprint of an application fits entirely within the 64GB-per-socket HBM capacity (or 128GB per two-socket node). Consequently, getting an application to fit within 64GB of HBM per socket is the most important step toward obtaining the best performance with HBM.
Techniques such as spreading work across multiple nodes or sockets using MPI, and using shared-memory techniques such as OpenMP within a socket, can help reduce memory consumption on a socket. In addition, it may be possible to reduce the impact of OS file caches, communication buffers, and similar overheads by shrinking their sizes, especially in HBM-only mode. These optimizations reduce the pressure on HBM.
When the application footprint is much larger than 64GB per socket, cache mode can be used with page shuffling enabled to reduce run-to-run variability and variability across nodes of a multi-node cluster. Fake NUMA is highly recommended in cache mode when the application fits within 64GB of HBM.
SNC4 is recommended in all memory modes when the application can be easily decomposed (for example, using MPI) to run on four NUMA nodes per socket.
HPC software can achieve higher levels of performance from the high-bandwidth memory in the next-generation Intel Xeon Scalable processor supporting HBM. HBM is exposed to software through three memory modes: HBM-only, flat, and cache. Using these modes is straightforward: HBM-only and cache modes require no modifications to existing applications, while in flat mode applications can use standard Linux NUMA utilities. Sub-NUMA clustering can reduce the effects of conflict misses in cache mode and improve performance. Next-generation Intel Xeon Scalable processors with HBM will provide unprecedented compute capability for improved performance of HPC and AI applications.
Connect with us and let us know how we can help you optimize your Next Gen Intel Xeon plus HBM systems for your workloads!
Ruchira Sasanka, Senior HPC Application Engineer
Ruchira is a Senior HPC Application Engineer in Intel Corporation’s Super Computer Group working on enabling HPC applications on various Intel processors and platforms. His current focus is on optimizing systems and applications for high-bandwidth memory. He received his PhD in Computer Science from the University of Illinois at Urbana-Champaign in 2005.
Chuck Yount, Principal Engineer
Chuck is a Principal Engineer in Intel Corporation’s Super Computer Group working on enabling and tuning HPC applications for current and future Intel-based clusters. His current focus is on developing optimizations and code-generation techniques on CPUs and GPUs for stencil-based operations such as finite-difference methods. He received a B.S. in Computer Engineering from North Carolina State and a Ph.D. in Electrical and Computer Engineering from Carnegie Mellon.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.