
Intel and Collaborators Present Latest Database Research at VLDB 2023


Scott Bair is a key voice at Intel Labs, sharing insights into innovative research for inventing tomorrow’s technology.

 

Highlights:

  • The 49th International Conference on Very Large Data Bases (VLDB) will run from August 28th to September 1st, 2023, in Vancouver, Canada.
  • Intel presents eight co-authored contributions across the research, industrial, and demonstrations tracks of the main conference.
  • Intel researchers also have a paper accepted at the Second International Workshop on Composable Data Management Systems (CDMS), which is co-located with the VLDB Conference.
  • Intel Labs is proud to highlight Nesime Tatbul, a senior research scientist at Intel’s Parallel Computing Lab (PCL), who received one of this year’s Distinguished Associate Editor awards in recognition of her impeccable service to PVLDB Volume 16.

 

This year’s International Conference on Very Large Data Bases (VLDB) will run from August 28th to September 1st in Vancouver, Canada. Research talks, tutorials, demonstrations, and workshops at the conference will cover a broad range of topics in data management and in database and information systems research, the technological cornerstones of the 21st century’s emerging applications.

Intel is pleased to share eight co-authored contributions across the research, industrial, and demonstrations tracks of the main conference. These publications include results from joint research projects with collaborators from academia and industry, such as those from MIT DSAIL and Meta. Intel researchers also have a paper accepted at the Second International Workshop on Composable Data Management Systems (CDMS), which is co-located with the VLDB Conference. This paper introduces Gluten – an Intel-led open-source software project with contributors from the broader industrial community.

In addition to the works presented, Intel Labs is proud to highlight Nesime Tatbul, a senior research scientist at Intel’s Parallel Computing Lab (PCL), who received one of this year’s Distinguished Associate Editor awards in recognition of her impeccable service to PVLDB Volume 16.

To learn more about Intel’s co-authored publications, read on below.

 

Research Track

Robust Query Driven Cardinality Estimation under Changing Workloads

Query-driven cardinality estimation models learn from a historical log of queries. They are lightweight, with low storage requirements and fast training and inference, and are easily adaptable to any kind of query. Unfortunately, such models can suffer unpredictably bad performance under workload drift, i.e., if the query pattern or data changes. This makes them unreliable and hard to deploy. In this paper, researchers analyze why models become unpredictable under workload drift, and introduce modifications to the query representation and neural network training techniques that make query-driven models robust to its effects. First, they emulate workload drift in queries involving unseen tables or columns by randomly masking out some table or column features during training. This forces the model to make predictions with missing query information, relying more on robust features based on up-to-date DBMS statistics that remain useful even when query or data drift happens. Second, they introduce join bitmaps, which extend sampling-based features to be consistent across joins using ideas from sideways information passing. Finally, they show how both of these ideas can be adapted to handle data updates. The paper shows significantly greater generalization than past works across different workloads and databases. For instance, a model trained with these techniques on a simple workload (JOBLight-train), with 40k synthetically generated queries of at most 3 tables each, is able to generalize to the much more complex Join Order Benchmark, which includes queries with up to 16 tables, and improve query runtimes by 2× over PostgreSQL. The work also shows similar robustness results with data updates and across other workloads. The researchers discuss the situations where they expect, and see, improvements, as well as more challenging workload-drift scenarios where these techniques do not improve much over PostgreSQL. However, even in the most challenging scenarios, the proposed models never perform worse than PostgreSQL, while standard query-driven models can get much worse than PostgreSQL.
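
The first of these ideas, random feature masking, translates naturally into code. Below is a minimal, hypothetical sketch in PyTorch; the function name, tensor layout, and masking probability are illustrative assumptions, not the paper’s actual implementation:

    import torch

    def mask_table_features(features, mask_prob=0.1):
        # features: (batch, n_tables, d), where each row holds one table's
        # query features (ids, column encodings, sample bitmaps).
        # Randomly zeroing whole rows during training emulates queries over
        # tables unseen at training time, pushing the model to rely on
        # drift-robust features derived from up-to-date DBMS statistics.
        batch, n_tables, _ = features.shape
        keep = (torch.rand(batch, n_tables, 1) > mask_prob).float()
        return features * keep

In that spirit, a genuinely unseen table at inference time would produce the same all-zero signal the model was trained to tolerate.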

 

Extract-Transform-Load for Video Streams

Social media, self-driving cars, and traffic cameras produce video streams at large scale and low cost. However, storing and querying video at such scale is prohibitively expensive. This paper proposes to treat large-scale video analytics as a data warehousing problem: video is a format that is easy to produce but needs to be transformed into an application-specific format that is easy to query. Analogously, this work defines the problem of Video Extract-Transform-Load (V-ETL). V-ETL systems need to reduce the cost of running a user-defined V-ETL job while also giving throughput guarantees to keep up with the rate at which data is produced. The researchers found that no current system sufficiently fulfills both needs and therefore propose Skyscraper, a system tailored to V-ETL. Skyscraper can execute arbitrary video ingestion pipelines and adaptively tunes them to reduce cost at minimal or no quality degradation, e.g., by adjusting sampling rates and resolutions to the ingested content. Skyscraper can be provisioned with cheap on-premises compute and uses a combination of buffering and cloud bursting to deal with workload peaks caused by expensive processing configurations. Experiments showed that Skyscraper significantly reduces the cost of V-ETL ingestion compared to adaptations of current state-of-the-art systems, while providing robustness guarantees that those systems lack.
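
The interplay of buffering and cloud bursting can be pictured with a toy ingestion loop. Everything below (function names, the capacity threshold, the drain policy) is an illustrative assumption, not Skyscraper’s actual interface:

    from collections import deque

    def ingest(frames, process_local, process_cloud, capacity=900):
        # Frames queue on cheap on-premises compute; when an expensive
        # processing configuration lets the backlog grow past capacity,
        # part of it is burst to rented cloud workers so ingestion keeps
        # pace with the stream's arrival rate.
        buffer = deque()
        for frame in frames:
            buffer.append(frame)
            if len(buffer) >= capacity:
                while len(buffer) > capacity // 2:
                    process_cloud(buffer.popleft())
            process_local(buffer.popleft())

In the real system, this buffering works alongside the adaptive tuning of sampling rates and resolutions described above.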

 

PLIN: A Persistent Learned Index for Non-Volatile Memory with High Performance and Instant Recovery

Non-Volatile Memory (NVM) has emerged as a promising candidate for next-generation main memory. Although many tree indices have been proposed for NVM, they generally use B+-tree-like structures. To further improve the performance of NVM-aware indices, this work considers integrating learned indexes into NVM. The challenges of such an integration are twofold: (1) existing NVM indices rely on small nodes to accelerate insertions with crash consistency, while learned indices use huge nodes to obtain a flat structure; and (2) the node structure of learned indices is not NVM-friendly, meaning that accessing a learned node causes multiple NVM block misses. This paper therefore proposes a new persistent learned index called PLIN. The novelty of PLIN lies in four aspects: an NVM-aware data placement strategy, locally unordered and globally ordered leaf nodes, a model copy mechanism, and a hierarchical insertion strategy. In addition, PLIN targets the NVM-only architecture and can support instant recovery. The researchers also present optimistic concurrency control and fine-grained locking mechanisms to make PLIN scale to concurrent requests. They conducted experiments on real persistent memory with various workloads, comparing PLIN with APEX, PACtree, ROART, TLBtree, and Fast&Fair. The results show that PLIN achieves 2.08x higher insertion performance and 4.42x higher query performance than its competitors on average. Meanwhile, PLIN needs only ~30 μs to recover from a system crash.
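
The interplay of a learned model with locally unordered leaf slots can be sketched in miniature. The toy class below is a conceptual illustration only; its names, plain-array layout, and parameters say nothing about PLIN’s actual NVM-aware design:

    class ToyLearnedNode:
        # A linear model maps each key to a "home" slot. Inserts take the
        # nearest free slot within an error window (slots stay locally
        # unordered), and lookups probe that same small window.
        def __init__(self, lo_key, hi_key, capacity=64, err=4):
            self.slots = [None] * capacity          # (key, value) pairs
            self.err = err
            self.lo_key = lo_key
            self.slope = (capacity - 1) / max(hi_key - lo_key, 1)

        def _window(self, key):
            home = int(self.slope * (key - self.lo_key))
            return range(max(0, home - self.err),
                         min(len(self.slots), home + self.err + 1))

        def insert(self, key, value):
            for i in self._window(key):
                if self.slots[i] is None:
                    self.slots[i] = (key, value)
                    return True
            return False                            # window full: split node

        def get(self, key):
            for i in self._window(key):
                if self.slots[i] is not None and self.slots[i][0] == key:
                    return self.slots[i][1]
            return None

Keeping entries unordered within the model’s error window means an insert touches only one slot, which suggests how small, crash-consistent NVM writes can coexist with large, flat learned nodes.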

 

Similarity Search in the Blink of an Eye with Compressed Indices

Nowadays, data is increasingly represented as vectors. Retrieving the vectors most similar to a given query from among millions or billions is a ubiquitous problem, known as similarity search, that is relevant to a wide range of applications. Graph-based indices are currently the best-performing techniques for billion-scale similarity search. However, their random memory access pattern makes it challenging to realize their full potential. This work presents new techniques and systems for creating faster and smaller graph-based indices. To this end, researchers introduce a novel vector compression method, Locally-adaptive Vector Quantization (LVQ), that uses per-vector scaling and scalar quantization to improve search performance through fast similarity computations and a reduced effective bandwidth, while decreasing memory footprint and barely impacting accuracy. LVQ, when combined with a new high-performance computing system for graph-based similarity search, establishes the new state of the art in terms of performance and memory footprint. For billions of vectors, LVQ outcompetes the second-best alternatives: (1) in the low-memory regime, by up to 20.7x in throughput with up to a 3x reduction in memory footprint, and (2) in the high-throughput regime, by 5.8x with 1.4x less memory.
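
The flavor of per-vector scaling plus scalar quantization is easy to demonstrate. The sketch below compresses each vector with its own scale into 8-bit codes; the names are invented, and this is a bare-bones illustration of the general idea, not Intel’s LVQ codec:

    import numpy as np

    def encode(x):
        # Per-vector scaling: each vector's own min/max defines its scale,
        # so quantization adapts locally to that vector's value range.
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / 255 or 1.0
        return np.round((x - lo) / scale).astype(np.uint8), lo, scale

    def decode(codes, lo, scale):
        return codes.astype(np.float32) * scale + lo

    vec = np.random.randn(128).astype(np.float32)
    codes, lo, scale = encode(vec)                   # 4x smaller than float32
    error = np.abs(vec - decode(codes, lo, scale)).max()   # stays small

Because the compressed vectors are smaller and similarity can be computed on them cheaply, a graph traversal reads less memory per hop, which is where the throughput gains come from.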

 

Industrial Track

AutoSteer: Learned Query Optimization for Any SQL Database

This paper presents AutoSteer, a learning-based solution that automatically drives query optimization in any SQL database that exposes tunable optimizer knobs. AutoSteer builds on the Bandit optimizer (Bao) and extends it with new capabilities (e.g., automated hint-set discovery) to minimize integration effort and facilitate usability in both monolithic and disaggregated SQL systems. The team successfully applied AutoSteer on PostgreSQL, PrestoDB, SparkSQL, MySQL, and DuckDB – five popular open-source database engines with diverse query optimizers. They then conducted a detailed experimental evaluation with public benchmarks (JOB, Stackoverflow, TPC-DS) and a production workload from Meta’s PrestoDB deployments. The evaluation shows that AutoSteer can not only outperform these engines’ native query optimizers (e.g., up to 40% improvements for PrestoDB) but can also match the performance of Bao-for-PostgreSQL with reduced human supervision and increased adaptivity, as it replaces Bao’s static, expert-picked hint-sets with those that are automatically discovered. Researchers also provide an open-source implementation of AutoSteer together with a visual tool for interactive use by query optimization experts.
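
A caricature of the steering loop helps make this concrete. The knob names below are real PostgreSQL optimizer settings used purely as examples, and the brute-force search is a stand-in for AutoSteer’s automated hint-set discovery and Bao-style learned model:

    import itertools

    KNOBS = ("enable_hashjoin", "enable_mergejoin", "enable_nestloop")

    def steer(query, execute):
        # execute(hints, query) -> latency is supplied by the integration;
        # for PostgreSQL it might prepend "SET <knob> = off" statements.
        best_latency, best_hints = execute((), query), ()
        for r in (1, 2):                             # small hint-sets first
            for disabled in itertools.combinations(KNOBS, r):
                hints = tuple(f"SET {k} = off" for k in disabled)
                latency = execute(hints, query)
                if latency < best_latency:
                    best_latency, best_hints = latency, hints
        return best_hints, best_latency

AutoSteer’s contribution is precisely to avoid this kind of exhaustive, expert-curated enumeration: it discovers promising hint-sets automatically and predicts per query which one to apply.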

 

TPCx-AI - An Industry Standard Benchmark for Artificial Intelligence and Machine Learning Systems

Artificial intelligence (AI) and machine learning (ML) techniques have existed for years, but new hardware trends and advances in model training and inference have radically improved their performance. With an ever-increasing number of algorithms, systems, and hardware solutions, it is challenging to identify good deployments even for experts. Researchers and industry experts have observed this challenge and have created several benchmark suites for AI and ML applications and systems. While these are helpful in comparing several aspects of AI applications, none of the existing benchmarks measures the end-to-end performance of ML deployments. Many have been rigorously developed in collaboration between academia and industry, but none of them is standardized. In this paper, the authors introduce the TPC Express Benchmark for Artificial Intelligence (TPCx-AI), the first industry-standard benchmark for end-to-end machine learning deployments, and the first AI benchmark that represents the pipelines typically found in common ML and AI workloads. TPCx-AI provides a full software kit, which includes a data generator, a driver, and two full workload implementations, one based on Python libraries and one based on Apache Spark. The paper describes the complete benchmark and shows benchmark results for various scale factors. TPCx-AI’s core contributions are a novel unified data set covering structured and unstructured data; a fully scalable data generator that can generate realistic data from GB up to PB scale; and a diverse and representative workload using different data types and algorithms, covering a wide range of aspects of real ML workloads such as data integration, data processing, training, and inference.
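
To give a sense of what "end-to-end" means here, the sketch below chains preprocessing, training, and serving-style inference using scikit-learn on synthetic data. It only illustrates the shape of such a workload and has no connection to the actual TPCx-AI kit or data generator:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 20))     # stand-in for generated data
    y = rng.integers(0, 2, size=10_000)

    # Preprocessing and training as one pipeline ...
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
    model.fit(X[:8_000], y[:8_000])

    # ... followed by inference, the serving phase that an end-to-end
    # benchmark must also measure.
    predictions = model.predict(X[8_000:])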

 

Big Data Analytic Toolkit: A General-Purpose, Modular, and Heterogeneous Acceleration Toolkit for Data Analytical Engines

Query compilation and hardware acceleration are important technologies for optimizing the performance of data processing engines, and recent years have seen many works exploring and adopting these techniques. However, a number of engines still refrain from adopting them, for several reasons. One common reason is that the intricacies of these techniques make engines too complex to maintain. Another major barrier is the lack of widely accepted architectures and libraries for these techniques, which means adoption often starts from scratch and requires a great deal of effort. This paper proposes the Intel Big Data Analytic Toolkit (BDTK), an open-source C++ acceleration toolkit library for analytical data processing engines. BDTK provides lightweight, easy-to-connect, reusable components with interoperable interfaces to support query compilation and hardware accelerators. The query compilation in BDTK leverages vectorized execution and data-centric code generation to achieve high performance. BDTK can be integrated into different engines, helping them adopt query compilation and hardware accelerators to optimize performance bottlenecks with less engineering effort.
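
Data-centric code generation, one of the techniques BDTK packages, is easiest to see in miniature: instead of interpreting a plan operator by operator, the engine emits one fused loop for a pipeline and compiles it. The toy below uses Python’s exec where BDTK uses a C++ compiler stack, and all names are invented:

    def compile_filter_sum(pred_expr):
        # Fuse "filter + aggregate" into a single generated loop: tuples
        # flow through both operators without materializing in between.
        src = (
            "def query(rows):\n"
            "    acc = 0\n"
            "    for x in rows:\n"
            f"        if {pred_expr}:\n"
            "            acc += x\n"
            "    return acc\n"
        )
        namespace = {}
        exec(src, namespace)
        return namespace["query"]

    q = compile_filter_sum("x > 10")   # roughly: SELECT SUM(x) WHERE x > 10
    assert q([5, 12, 30, 7]) == 42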

 

Demonstrations Track

QO-Insight: Inspecting Steered Query Optimizers

Steered query optimizers address the planning mistakes of traditional query optimizers by providing them with hints on a per-query basis, thereby guiding them in the right direction. This paper introduces QO-Insight, a visual tool for exploring the query execution traces of such steered query optimizers. Although steered query optimizers are typically perceived as black boxes, QO-Insight empowers database administrators and experts to gain qualitative insights into their behavior and to improve their performance through visual inspection and analysis.

 

CDMS Workshop

The Gluten Open-Source Software Project: Modernizing Java-based Query Engines for the Lakehouse Era

Year-on-year exponential data growth, and the corresponding growth in machine learning’s appetite to process that data, are transforming the industry’s data management discipline. In response, the data lakehouse architecture has emerged. The transformative nature of the lakehouse architecture and the need to enable a diverse set of query engines to access data residing in a lakehouse are motivating a refactoring of capabilities in these query engines. The industry’s response is the composable data management system (CDMS). This paper introduces the Gluten open-source software (OSS) project, an embodiment of the CDMS concept. Gluten is a Java Native Interface (JNI) bridge that enables Java-based query engines to offload and accelerate processing using native acceleration libraries, such as the Meta-led Velox OSS project.

About the Author
Scott Bair is a Senior Technical Creative Director for Intel Labs, chartered with growing awareness of Intel’s leading-edge research activities, such as AI, neuromorphic computing, and quantum computing. Scott is responsible for driving marketing strategy, messaging, and asset creation for Intel Labs and its joint-research activities. In addition to his work at Intel, he has a passion for audio technology and is an active father of five children. Scott has over 23 years of experience in the computing industry, bringing new products and technology to market. During his 15 years at Intel, he has worked in a variety of roles spanning R&D, architecture, strategic planning, product marketing, and technology evangelism. Scott has an undergraduate degree in Electrical and Computer Engineering and a Master of Business Administration from Brigham Young University.