
Good AI Starts with Good Data: Battling Silent Data Corruption with Computational Storage

Michael_Mesnier

Today’s AI, particularly Large Language Models and other forms of Deep Learning, already requires enormous amounts of training data, and with AI still in its infancy, data volumes are only expected to grow. This growth not only introduces new system bottlenecks (CPU, memory, I/O, and storage), but also raises new concerns about the very real risk of silent data corruption. Organizations do not want to be questioning the quality of their training data when building AI models.

Storage systems can repair corrupted data using replication or erasure coding, but performing data integrity checks takes time. It’s best to perform these checks continuously and in the background, a process known as data scrubbing, to minimize disruption to applications. Ideally, corrupted data is repaired before an application needs it. However, in practice, these integrity checks are often run infrequently or, worse, disabled entirely due to the cost.
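To see why the cost adds up, consider what a background scrubber has to do. Below is a minimal sketch in Python; the `files`, `verify`, and `repair` callables are hypothetical stand-ins for the storage system's own machinery, and the rate limit is what keeps the scrub from starving foreground application I/O:

```python
import time

def scrub_forever(files, verify, repair, max_mb_per_sec=50):
    """Continuously walk the data set, verifying every file.

    Hypothetical callables: `files` yields (path, stored_checksum)
    pairs, `verify` returns (bytes_read, ok), and `repair` rebuilds
    a file from replicas or erasure-coded data.
    """
    budget_bps = max_mb_per_sec * 1024 * 1024  # read budget, bytes/sec
    while True:
        for path, stored in files():
            start = time.monotonic()
            nbytes, ok = verify(path, stored)
            if not ok:
                repair(path)
            # Sleep so the average read rate stays under the budget,
            # leaving I/O headroom for applications.
            min_duration = nbytes / budget_bps
            elapsed = time.monotonic() - start
            if elapsed < min_duration:
                time.sleep(min_duration - elapsed)
```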

How can computational storage assist with data integrity checks?

Reducing I/O and associated processing is exactly the motivation behind computational storage, so data scrubbing represents a real use case. Self-scrubbing storage, so to speak.

Checking the integrity of one file is not the problem. You simply read the file, calculate a checksum (or hash) of the file, and compare that against a previously stored checksum. If the checksums agree, the file’s integrity has not been compromised. If not, you reconstruct the file.
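As a concrete illustration, here is a minimal host-side version of that check in Python. SHA-256 stands in for whatever checksum the system stores (a deployment might use CRC-32C or CRC-64, as discussed below), and the helper names are ours:

```python
import hashlib

CHUNK_SIZE = 1 << 20  # read 1 MiB at a time to bound memory use

def file_checksum(path: str) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, stored_digest: str) -> bool:
    """Recompute the file's digest and compare it to the stored one."""
    return file_checksum(path) == stored_digest
```

Note that every byte of the file crosses the I/O path and the host CPU just to confirm that nothing has changed, which is exactly the cost that motivates offloading.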

But you must do this over your entire data set. It’s a read-intensive workload where you read every file, from every file system, on every drive in your storage cluster. It’s an application killer if left unchecked, as it can consume all available I/O and increase load on host CPUs and memory.

This is where computational storage can assist. By offloading integrity checks to block storage, we can perform data integrity checks in situ (inside a single SSD or storage server) and save the costs associated with processing the I/O and running the integrity checks on the host CPU. SSDs and storage servers already have a variety of built-in engines for device-internal data integrity checks, such as CRC-32C and CRC-64, and these same accelerators can be used for end-to-end file checksums.

However, we must first teach block storage about “files” and how to perform an end-to-end checksum on a file, which may be scattered across many regions of a storage device. Without computational storage techniques, a storage device cannot directly process files for a host, as it lacks data awareness. A storage device only sees sectors (hard drives) or pages (SSDs), not files and directories.
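One way to picture the resulting host–device contract: the host resolves a file into the on-device extents that back it (on Linux, the FIEMAP ioctl exposes this mapping) and hands only that layout to the device. The `Extent` type and `csd.checksum_extents` command below are hypothetical, sketched for illustration:

```python
from dataclasses import dataclass

@dataclass
class Extent:
    """One contiguous run of blocks backing part of a file."""
    lba: int      # starting logical block address on the device
    blocks: int   # length of the run, in blocks

def checksum_file_on_device(csd, extents: list[Extent]) -> int:
    """Ask the device to checksum a file, given its block layout.

    `csd.checksum_extents` is a hypothetical computational storage
    command: the host sends only the extent list, and the device
    reads and checksums those blocks internally, so the file data
    never crosses the bus or touches host memory.
    """
    return csd.checksum_extents(extents)

# Illustration: a file fragmented into two extents on the device.
layout = [Extent(lba=4096, blocks=2048), Extent(lba=81920, blocks=512)]
```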

Our computational storage research platform

Researchers at Intel Labs created a research platform to teach block storage how to “see” host data structures, like files, and subsequently process the data. This has allowed us to tackle the challenges of computational storage and explore various use cases.

Our research platform, which is based on the NVMe protocol, moves compute functions to either a storage server (e.g., NVMe/TCP) or a single SSD. This approach aligns well with industry-standard programming models that provide Computational Storage Functions (CSFs) via a Computational Storage Array (CSA) or Computational SSD (CSD). In both cases, offloading work to storage can reduce the host’s CPU load, memory footprint, I/O and, in the case of a CSA, network traffic.

Our computational storage stack includes numerous layers as shown in the figure at the top of this blog.

The application layer provides a convenient file-based interface and abstracts away most computational storage details. Applications simply request work on files (e.g., search, filter, uncompress, checksum). This same layer can interface with other computing layers, like Intel® oneAPI and FaaS.
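A hypothetical sketch of such a file-based interface (the class and method names below are ours, not a published API) shows how little an application needs to know about the layers beneath it:

```python
# Hypothetical application-layer interface, for illustration only.
# Applications name files and operations; the layers below map files
# to extents, schedule the work, and issue the NVMe commands.

class ComputationalStorage:
    def checksum(self, path: str, algo: str = "crc64") -> int: ...
    def search(self, path: str, pattern: bytes) -> list[int]: ...
    def uncompress(self, src: str, dest: str) -> None: ...

# A scrubber built on this interface never reads file data itself:
#   cs = ComputationalStorage()
#   if cs.checksum("/data/shard-0001.bin") != stored_crc:
#       repair("/data/shard-0001.bin")
```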

The scheduling layer balances work across the available compute resources. The aggregation layer handles data that is spread across multiple storage devices, as is the case with erasure-coded data. Finally, the device layer speaks NVMe: it creates computational storage commands and interfaces directly with CSAs and CSDs.
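For the aggregation layer, one plausible strategy (ours for illustration; the blog does not specify the platform's actual scheme) is to have each device checksum only the shard it stores and let the host combine the small per-shard digests, Merkle-style, rather than reading the shards themselves:

```python
import hashlib

def aggregate_shard_digests(shard_digests: list[bytes]) -> bytes:
    """Combine per-device shard digests into one file-level digest.

    Each device computes and returns a digest over its own shard;
    the host hashes the concatenation of those digests. Only a few
    bytes per shard cross the network, not the shards themselves.
    """
    h = hashlib.sha256()
    for digest in shard_digests:
        h.update(digest)
    return h.digest()
```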

In the case of a CSA, we have an additional layer that virtualizes and manages compute resources, providing the foundation for a secure, multi-tenant programming environment. Within this block storage server, a combination of silicon and system software accelerates compute- and I/O-intensive operations. Our solution uses Intel® Xeon® processors with support for CRC-64 (via the Intel® Data Streaming Accelerator), a powerful end-to-end check for corrupted data.

We are using this platform both for research and for ecosystem enabling.

Our ecosystem partners

Computational storage will require strong ecosystem support, and we are actively working with ISVs, IHVs, and SNIA. ISV engagement is at the top of the stack (the application layer) and IHV engagement is at the bottom (the device layer).

We are working closely with SSD vendors, including Solidigm, to build CSD prototypes optimized for Intel server platforms. Scott Shadley, Director of Long-Term Strategy at Solidigm, says, “Solidigm’s PoC CSD is based on a Gen-5 SSD product, which includes a low-cost, high-efficiency and high-performance ASIC for data integrity calculations. We look forward to working with Intel to optimize the solution for server platforms and integrate into a complete E2E software stack.” More information on Solidigm’s future CSD can be found in their recent blog.

Intel Labs future computational storage work

Computational storage has had many false starts, going back decades, and the industry still lacks a widespread use case. We believe that data scrubbing could be that use case. The enormous data growth fueled by AI will make offloaded data integrity checks a requirement, not just a nice-to-have.

With a widespread use case in place, an ecosystem will mature and become a foundation for other use cases. The same computational storage protocol that we’re using for data scrubbing can be used for functions higher up in the stack, to accelerate other operations like AI (training and inference) and big data (sorting, searching, filtering).

We look forward to the opportunities that computational storage will bring and are actively inviting others to collaborate. Email michael.mesnier@intel.com to learn more and join in this collaborative effort.

For more information, also see this Intel Labs blog.

About the Author
Mike Mesnier is a Principal Engineer in Intel Labs. He joined Intel in 1997 and has contributed to a variety of storage research projects, including Internet SCSI (iSCSI), Object-based Storage Devices (OSD), Relative Fitness Modeling, Differentiated Storage Services (storage QoS), Storage Analytics (ML for storage) and, most recently, Computational Storage. Mike received his PhD in Computer Engineering from Carnegie Mellon University. In his free time, Mike enjoys sailing on the Columbia River in Portland, Oregon.