Accelerating AI/ML and Data-centric Applications with Temporal Caching

Allison_Goodman · 03-04-2022

The tech industry is in the midst of “the fifth epoch of distributed computing.”[i] It is a crucial evolutionary moment, driven by the need to support artificial intelligence (AI), machine learning, and data-centric workloads that require real-time data insights. The world is not short of data to power these insights. However, computing resources are often bottlenecked by the inability to quickly retrieve the necessary data. This is especially true in a massive, distributed computing system, where multiple processors share the same data. In a distributed system, the concept of memory caching can be complicated.

During the 2022 International Solid-State Circuits Conference (ISSCC), Dr. Frank Hady presented the need for system-level near memory compute for overcoming the data bottlenecks in future AI data processing systems. The session was titled, “We’ve rethought our commute; Can we rethink our data’s commute?” In his session, Dr. Hady made the following points:

AI compute needs are increasing exponentially and demand that data centers maximize performance and minimize system energy consumption.
There is a direct correlation between total cost of ownership (TCO) and system energy, because data movement through the memory hierarchy is costly.

The talk introduced system-level relevance criteria for understanding the likely success of near memory compute solutions. This blog continues that discussion by introducing a novel approach to data management that unifies temporal caching data access methods to maximize performance in these increasingly important applications. The foundation for this new approach is Intel® Optane™ persistent memory (PMem), serving as a secondary memory tier.

Background and Terminology

Data is retrieved from main memory and stored in cache memory according to the principle of locality. This principle acknowledges that programs tend to access a relatively small portion of the memory address space at any given time. The general rule of thumb is that the average application spends 90 percent of its time accessing about 10 percent of the data.[ii] Now, there are two different types of locality:

Temporal locality (locality in time) is a program’s tendency to reuse the same data items or instructions within a short period during execution. If a program references an instruction or data variable frequently, that item should be kept close to the CPU, because it is likely to be referenced again soon.
Spatial locality (locality in space) is a program’s tendency to reference items whose addresses are close to those of recently referenced items. Once a program touches one address, nearby addresses are likely to be referenced soon.
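
To make temporal locality concrete, the following minimal Python sketch replays two synthetic access traces through a small least-recently-used (LRU) cache and compares hit rates. The trace generators, the cache size, and the 90/10 split (borrowed from the rule of thumb above) are illustrative assumptions, not measurements from any real workload.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.entries[key] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def skewed_trace(n_items, n_accesses, hot_fraction=0.1, hot_share=0.9, seed=0):
    """Yield keys so that hot_share of accesses go to hot_fraction of the items."""
    rng = random.Random(seed)
    n_hot = max(1, int(n_items * hot_fraction))
    for _ in range(n_accesses):
        if rng.random() < hot_share:
            yield rng.randrange(n_hot)           # hot item: reused often
        else:
            yield rng.randrange(n_hot, n_items)  # cold item: rarely reused


def uniform_trace(n_items, n_accesses, seed=0):
    """Yield keys with no locality: every item is equally likely."""
    rng = random.Random(seed)
    for _ in range(n_accesses):
        yield rng.randrange(n_items)


def measure(trace, capacity):
    cache = LRUCache(capacity)
    for key in trace:
        cache.access(key)
    return cache.hit_rate()


skewed = measure(skewed_trace(1000, 50_000), capacity=100)
uniform = measure(uniform_trace(1000, 50_000), capacity=100)
print(f"skewed trace hit rate:  {skewed:.2f}")
print(f"uniform trace hit rate: {uniform:.2f}")
```

With the same cache size, the skewed trace should hit far more often than the uniform one, which is the effect a temporal cache relies on.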

Note: The concept of “semantic locality” as part of knowledge management is beyond the scope of this blog.

If a distributed computing system is using “true sharing,” the involved processors must synchronize their caches to ensure program correctness, which can slow down cache access. In this blog, we focus on temporal caches, which can be accessed far faster than most other caches because programs with high temporal locality tend to have fewer true-sharing cache misses. A true-sharing cache miss occurs when one processor writes a data word that another processor has cached: the write invalidates the block in the other processor’s cache, so that processor’s next access to the word misses.

Traditional I/O-centric data management practices that rely on Load-Store instructions to access in-memory data and POSIX file I/O to access persistent data do not provide the consistency, durability, or integrity that AI and other data-intensive applications require. As an alternative, we advocate using secondary memory options like High Bandwidth Memory (HBM) and Intel Optane PMem to unify temporal cache data access and minimize latency. In essence, Intel Optane PMem creates a tiered memory system, where the DRAM serves as an L4 cache for the large-capacity and low-latency Intel Optane PMem.

Advantages and Challenges Associated with Temporal Caching

Temporal caching is used in a variety of use cases:

Time-series data. Many scheduling, banking, medical, and scientific applications manage temporal data. A temporal database stores data relating to time instances. More specifically, it associates data with a start time and an end time value. Data can be time-stamped with two concepts of time: a valid time interval (when the data event occurs in modeled reality) and a transaction time interval (the period over which the event information is stored in the database).
Network operations. Temporal content caching can help improve network operation and end user experience by reducing the distance that packets must travel within a network.[iii]
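
The two notions of time described for time-series data can be sketched as a bitemporal record, that is, a row that carries both a valid time interval and a transaction time interval. The class name, field names, and account data below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class BitemporalRecord:
    key: str
    value: float
    valid_from: datetime   # when the fact becomes true in the modeled reality
    valid_to: datetime     # when it stops being true (exclusive)
    tx_from: datetime      # when the database stored this version
    tx_to: Optional[datetime] = None  # when it was superseded (None = current)


def as_of(records, key, valid_at, known_at):
    """Value of `key` that was valid at `valid_at`, as the database knew it at `known_at`."""
    for r in records:
        if (r.key == key
                and r.valid_from <= valid_at < r.valid_to
                and r.tx_from <= known_at
                and (r.tx_to is None or known_at < r.tx_to)):
            return r.value
    return None


jan = lambda day: datetime(2022, 1, day)
records = [
    # Balance recorded on Jan 5, later found to be wrong and superseded on Jan 20.
    BitemporalRecord("acct-1", 100.0, jan(1), datetime(2022, 2, 1), tx_from=jan(5), tx_to=jan(20)),
    # Corrected balance for the same valid period, stored on Jan 20.
    BitemporalRecord("acct-1", 120.0, jan(1), datetime(2022, 2, 1), tx_from=jan(20)),
]

before_fix = as_of(records, "acct-1", valid_at=jan(10), known_at=jan(10))
after_fix = as_of(records, "acct-1", valid_at=jan(10), known_at=jan(25))
print(before_fix, after_fix)
```

Both queries ask about the same moment in modeled reality (January 10); only the transaction time differs, so the first returns the value as originally stored and the second returns the correction.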

While the benefits of a temporal cache are clear, developers face several challenges when designing one. These challenges include dimensioning the temporal cache (especially relevant to content delivery networks), as well as improving the energy efficiency of memory management. The latter challenge can be partially addressed by using runtime-assisted dead region management (RADAR) to predict and evict dead blocks in last-level caches.[iv]

The Current Temporal Caching Model

An example of a general temporal cache model is a parallel job that is designed to implement a query execution plan against immutable data, which is presented as a table described by a schema. (A query execution plan is a sequence of steps used to access data in a SQL relational database management system.) The table is generated from the log constructs that consist of the current mutable (unsealed) log segment, as well as many immutable (sealed) log segments. Amazon Redshift, Databricks Delta Lake, F1 Lightning, and Procella all share a common architecture that uses this model, as depicted in Figure 1.

Figure 1. Common high-level data warehouse architecture.

As shown in the diagram, the current design uses storage (locally attached NAND SSDs) to buffer the intermediate results of a query following the first scan/merge operation. This data must be paged into memory before further processing. Unfortunately, the access latency of NAND SSDs is orders of magnitude higher than the latency of accessing the same data in memory.

Aside from high latency, complexity is another problem with the current temporal caching model. Developers typically use one interface (Load-Store) for memory-resident data, and a second I/O-based interface (POSIX) for data that is in storage. The rest of this blog explores a fascinating question: What if I/O operations could be eliminated from all operations in the query execution plan after the initial scan/merge?

Implementing a Memory-Centric Temporal Cache

As we noted at the beginning of this blog, computing resources are often bottlenecked by how quickly they can retrieve data, which makes effective use of memory caches all the more important. With that in mind, let’s explore the possibility of enabling a query engine to access datasets via a unified Load-Store interface – that is, in memory – using the same log as described for the “current model.” This log ingests new data while ensuring that order, durability, and integrity requirements are met.

In the new model, the query engine fills buffers in a DRAM-backed heap when accessing data that resides in the unsealed log segment. Sealed log segments that are stored in this shared distributed log are accessed by copying the data to a temp file in a direct access (DAX)-enabled file system.[v] This file is then memory mapped, thus treating these segments as a pool that is backed by secondary memory. Sealed log segments that have migrated to tertiary object storage can be accessed similarly by using the same copy/memory-mapping technique. A depiction of this proposed framework is shown in Figure 2.

Figure 2. Memory-centric data warehouse implementation.

This copy/memory-map facility is an early embodiment of the envisioned memory-centric temporal cache, in which the query engine ensures that data is uniformly accessed via Load-Store operations.
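
The copy/memory-map idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation described in this blog: the fixed-width record layout, the function names, and the segment file name are all hypothetical, and an ordinary temporary directory stands in for a DAX-enabled file system backed by secondary memory. Python’s mmap module by itself does not guarantee DAX behavior.

```python
import mmap
import os
import shutil
import struct
import tempfile

# Hypothetical fixed-width row: int64 timestamp, float64 value.
RECORD = struct.Struct("<qd")


def write_segment(path, rows):
    """Stand-in for a sealed log segment produced elsewhere in the shared log."""
    with open(path, "wb") as f:
        for ts, value in rows:
            f.write(RECORD.pack(ts, value))


def map_sealed_segment(source_path, cache_dir):
    """Copy a sealed segment into cache_dir and memory-map the copy read-only."""
    fd, local_path = tempfile.mkstemp(dir=cache_dir, suffix=".seg")
    os.close(fd)
    shutil.copyfile(source_path, local_path)
    with open(local_path, "rb") as f:
        # The mapping remains valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def scan(buffer):
    """Read rows from any bytes-like buffer: a heap bytearray or an mmap."""
    for offset in range(0, len(buffer), RECORD.size):
        yield RECORD.unpack_from(buffer, offset)


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "sealed-0001.seg")  # hypothetical name
    write_segment(source, [(1, 10.0), (2, 20.0)])

    # Unsealed data lives in an ordinary DRAM-backed heap buffer.
    unsealed = bytearray(RECORD.pack(3, 30.0))

    # On a real system, cache_dir would be a directory on a DAX mount.
    sealed = map_sealed_segment(source, cache_dir=workdir)
    try:
        rows = list(scan(sealed)) + list(scan(unsealed))
    finally:
        sealed.close()

print(rows)
```

The point of the sketch is that scan() never issues a read call: the same function walks the heap buffer and the mapped file, which mirrors the uniform Load-Store access the query engine is meant to see.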

What’s Next for Memory-centric Temporal Caching?

The following are several opportunities for optimizing memory-centric temporal caching:

The columnar nature of the layout of the data is well suited for single instruction, multiple data (SIMD)-based operations, which could improve operational efficiency.
In-network and computational storage techniques can make it possible to offload predicate and aggregate operators as pushdowns at the source of the data, either in-stream or in the object store. A “pushdown” improves SQL query performance by moving parts of the SQL computation as close to the data as possible, which can help filter data before the result of the query is returned to the processor.[vi]
Programmable switches, SmartNICs, and workload libraries may potentially enable the scaling of the shared, distributed log’s total order to encompass all the shards of a table, as well as the tables within the system.
The emergence of fabric technologies such as Compute Express Link (CXL)[vii] memory pooling may eliminate the copy operation in the temporal cache. Instead, data sources would be mapped directly into secondary memory and paged in as needed. This approach could significantly improve access latency for ever-growing query workloads. The response of the compute ecosystem to the CXL initiative has been very enthusiastic.
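
As a rough illustration of the pushdown opportunity listed above, the following Python sketch compares shipping every row to the caller with evaluating the predicate and aggregate at the data source. The ColumnStore class and its methods are hypothetical, and the per-row Python loop is only a stand-in: a real engine would run the same scan over the contiguous columns with SIMD instructions or on a computational storage device.

```python
from array import array


class ColumnStore:
    """Hypothetical data source with a columnar layout: one array per column."""

    def __init__(self, timestamps, values):
        self.timestamps = array("q", timestamps)  # contiguous int64 column
        self.values = array("d", values)          # contiguous float64 column

    def fetch_all(self):
        """No pushdown: return every row, plus the number of rows moved."""
        rows = list(zip(self.timestamps, self.values))
        return rows, len(rows)

    def sum_where(self, ts_min, ts_max):
        """Pushdown: filter and aggregate at the source, return a single result."""
        total = sum(
            v for t, v in zip(self.timestamps, self.values) if ts_min <= t < ts_max
        )
        return total, 1


store = ColumnStore(range(1000), (float(i) for i in range(1000)))

# Without pushdown, the caller receives all rows and filters them itself.
rows, moved_naive = store.fetch_all()
naive_total = sum(v for t, v in rows if 100 <= t < 200)

# With pushdown, only the aggregate crosses the boundary.
pushed_total, moved_pushed = store.sum_where(100, 200)

print(naive_total, moved_naive)
print(pushed_total, moved_pushed)
```

Both paths compute the same answer; the difference is how much data has to travel from the source to the processor before the result is known.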

For more information, see the following:

Additional research papers are also available.