Optimizing Large-Scale Data Systems with AI

Rick_Johnson · 09-05-2023

Posted on behalf of Nesime Tatbul

Nesime has been overseeing Intel's university research programs at MIT around data management systems since joining Intel Labs about ten years ago. The current program, Data Systems and Artificial Intelligence Lab (DSAIL), primarily focuses on exploring the use of AI/ML for enhancing and optimizing large-scale data systems and their enterprise applications.

AI/ML has made rapid progress recently, and new AI techniques are reaching nearly every application domain and every field of computer science. For example, some companies now use AI to automate the creation and management of software and systems. Others have begun using large language models (LLMs) for various tasks, including research and coding. At Intel Labs and DSAIL, we use novel techniques and developments in AI/ML to enhance, optimize, and automate data systems and their applications. This post highlights some of our recent work involving AI and large-scale data systems.

Applying AI to Large-Scale Data Systems: Research by Intel Labs and DSAIL

In November 2022, Intel co-sponsored a new phase of DSAIL that focuses on instance optimization. Instance optimization means automatically building an entire system or solution for a specific use case, or automatically tailoring a system to a specific hardware or software setting and a particular workload. Traditionally, you would build a general-purpose system and then fine-tune it for your specific use case. However, this approach is no longer economically feasible because so much specialization is happening in applications, software, and hardware. It is no longer practical to manually build systems that deliver optimal performance or cost for such highly specialized situations.

Instance optimization plays a role in many of our current DSAIL and Intel Labs projects. Here are a couple of examples:

Learned Query Optimization: Bao and AutoSteer

Applying machine learning techniques to query optimization can bring substantial performance improvements, but it also poses many practical challenges, such as high training overhead, poor tail performance, and difficulty adapting to changing workload conditions. To address these challenges, at DSAIL, we introduced the Bandit Optimizer (Bao), a method that automatically steers a traditional query optimizer to improve end-to-end query execution performance. More specifically, given a pre-determined collection of “hint-sets” (each hint-set indicating which subset of query rewrite rules should be considered in query planning), Bao learns to steer an existing query optimizer by helping it choose the right hint-set for every incoming query. This way, potential planning mistakes of traditional query optimizers can be avoided.
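To make the idea of choosing among hint-sets more concrete, here is a minimal sketch of a bandit that learns which hint-set gives the lowest latency. It is not Bao's implementation: it swaps in a simple, non-contextual epsilon-greedy strategy that ignores the query itself, and the hint-set names, the latency numbers, and the `simulated_latency` function are all made up for illustration.

```python
import random

# Hypothetical hint-sets: each names optimizer features to disable.
HINT_SETS = [
    frozenset(),                               # default optimizer behavior
    frozenset({"nested_loop_join"}),
    frozenset({"hash_join"}),
    frozenset({"index_scan", "merge_join"}),
]


class HintSetBandit:
    """Epsilon-greedy bandit that tracks the mean latency of each hint-set."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * n_arms
        self.mean_latency = [0.0] * n_arms

    def choose(self):
        # Try every hint-set once before exploiting what was learned.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        # Explore a random hint-set with probability epsilon.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        # Otherwise pick the hint-set with the lowest observed mean latency.
        return min(range(len(self.counts)), key=self.mean_latency.__getitem__)

    def update(self, arm, latency):
        self.counts[arm] += 1
        # Incremental update of the running mean.
        self.mean_latency[arm] += (latency - self.mean_latency[arm]) / self.counts[arm]


def simulated_latency(arm, rng):
    """Stand-in for running a query under a hint-set (made-up numbers)."""
    base = [1.0, 0.6, 1.4, 0.9][arm]
    return base + rng.uniform(-0.05, 0.05)


def run(n_queries=500, seed=0):
    rng = random.Random(seed)
    bandit = HintSetBandit(len(HINT_SETS), seed=seed)
    for _ in range(n_queries):
        arm = bandit.choose()
        bandit.update(arm, simulated_latency(arm, rng))
    return bandit
```

Calling `run()` returns a bandit whose `counts` show most queries routed to the hint-set with the lowest simulated latency. A real steering optimizer would also condition the choice on features of each query and its candidate plans, which this sketch leaves out.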

At Intel Labs, we have generalized Bao into a framework called AutoSteer. AutoSteer makes it easy to apply the technology behind Bao to any SQL database that exposes tunable optimizer knobs. This is achieved through an automated hint-set discovery approach, which removes the need to manually design system-specific hint-sets and makes hint-sets easier to adapt to changing workload conditions. As such, AutoSteer further expands the practical applicability of steering optimizers such as Bao.
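One way to picture automated hint-set discovery is sketched below. This is an assumption about how such a search could look, not AutoSteer's actual algorithm: it disables each knob on its own, keeps the knobs that change the query plan, and then combines those knobs into hint-sets that yield plans not seen before. The knob names and the `toy_plan` function (a stand-in for asking a real database to explain its plan) are hypothetical.

```python
from itertools import combinations

# Hypothetical optimizer knobs and a toy planner with fixed operator preferences.
KNOBS = ["hash_join", "merge_join", "index_scan", "bitmap_scan"]
JOIN_PREFERENCE = ["hash_join", "merge_join", "nested_loop_join"]
SCAN_PREFERENCE = ["index_scan", "seq_scan"]


def toy_plan(disabled):
    """Return the (scan, join) operators the toy planner picks.

    Stand-in for an EXPLAIN call; the last operator in each preference
    list is not a knob, so a choice always exists.
    """
    scan = next(op for op in SCAN_PREFERENCE if op not in disabled)
    join = next(op for op in JOIN_PREFERENCE if op not in disabled)
    return (scan, join)


def discover_hint_sets(knobs, plan_fn, max_size=2):
    """Find knob combinations that lead to distinct query plans."""
    default_plan = plan_fn(frozenset())
    # Step 1: keep only knobs that change the plan when disabled alone.
    effective = [k for k in knobs if plan_fn(frozenset({k})) != default_plan]
    # Step 2: combine effective knobs, keeping combinations with new plans.
    seen_plans = {default_plan}
    hint_sets = []
    for size in range(1, max_size + 1):
        for combo in combinations(effective, size):
            hint_set = frozenset(combo)
            plan = plan_fn(hint_set)
            if plan not in seen_plans:
                seen_plans.add(plan)
                hint_sets.append(hint_set)
    return hint_sets


discovered = discover_hint_sets(KNOBS, toy_plan)
```

In this toy setup, `discovered` holds three hint-sets built from the `hash_join` and `index_scan` knobs. The sketch has a known gap: `merge_join` is dropped in step 1 because it has no effect on its own, even though it would matter once `hash_join` is disabled. A practical discovery procedure would need to handle such interactions.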

We have tested AutoSteer on popular open-source databases, including DuckDB, MySQL, PostgreSQL, PrestoDB, and SparkSQL. Our experiments show that AutoSteer can improve SQL query performance by up to 40% on well-known benchmarks. We have released AutoSteer as open-source software on GitHub and presented a paper and a demo on AutoSteer at the recent VLDB Conference.

Instance-Optimized Clouds via Self-Organizing Data Containers

We have recently started researching instance optimization in the setting of disaggregated cloud data systems. In a joint vision paper, we propose a new storage format for the cloud — self-organizing data containers (SDCs).

SDCs capture rich metadata such as histograms and data access patterns together with data, which can then be used for performance optimizations. These optimizations include a variety of complex data layouts that can adapt to client query workloads. In contrast to simple layouts (e.g., range partitioning) used by traditional cloud storage formats, SDCs can self-optimize their layouts in a workload-aware manner. This will be a key enabler for achieving instance optimization for cloud data workloads.
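The sketch below illustrates why storing metadata next to the data and adapting the layout to the workload can help. It is a simplified illustration, not the SDC format: the only metadata is a per-block minimum and maximum key, the "reorganization" is a plain sort on the queried key, and all names (`Block`, `build_blocks`, `range_query`, `reorganize`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    rows: list      # (key, payload) tuples stored in this block
    min_key: int    # metadata kept alongside the data
    max_key: int


def build_blocks(rows, block_size):
    """Split rows into fixed-size blocks and record min/max key per block."""
    blocks = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        keys = [key for key, _ in chunk]
        blocks.append(Block(chunk, min(keys), max(keys)))
    return blocks


def range_query(blocks, lo, hi):
    """Return rows with lo <= key <= hi and the number of blocks read."""
    blocks_read = 0
    hits = []
    for block in blocks:
        if block.max_key < lo or block.min_key > hi:
            continue  # skipped using metadata only, no data read
        blocks_read += 1
        hits.extend(row for row in block.rows if lo <= row[0] <= hi)
    return hits, blocks_read


def reorganize(blocks, block_size):
    """Rewrite the blocks clustered by key, as a workload-aware layout might."""
    rows = sorted(row for block in blocks for row in block.rows)
    return build_blocks(rows, block_size)


# 100 rows whose keys arrive in scattered order (a permutation of 0..99).
rows = [((i * 37) % 100, f"row{i}") for i in range(100)]
arrival_layout = build_blocks(rows, block_size=10)
clustered_layout = reorganize(arrival_layout, block_size=10)

hits_arrival, read_arrival = range_query(arrival_layout, 40, 49)
hits_clustered, read_clustered = range_query(clustered_layout, 40, 49)
```

Both layouts return the same ten rows, but the clustered layout reads far fewer blocks because the min/max metadata lets the query skip the rest. A self-organizing container would go further by choosing among layouts based on the access patterns it observes.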

AI-Optimized Data Systems and Intel Technologies

At Intel Labs and DSAIL, we are rethinking the design and implementation of data systems to make them more performant and adaptive. We are also looking at how the AI-optimized data systems we are building could benefit from Intel technologies and vice versa.

How could AI-optimized data systems benefit from Intel technologies?

We have done some fascinating work on extending our instance optimization techniques down to the underlying hardware. For example, we have used Intel technologies to improve large-scale data indexing and sorting. In both cases, learning-enhanced data processing algorithms jointly developed with DSAIL were further optimized for Intel hardware, yielding substantial performance improvements.

Learned Indexing on Intel Hardware

In indexing, we combined Intel's CPU-optimized algorithms for DNA sequence search with machine learning enhancements. Combining the two techniques gave a significant performance improvement over what either could achieve in isolation. Hardware and ML working together got us to a point where we now have one of the fastest implementations of an algorithm that is widely used in the community. You can learn more about this research in our joint paper, LISA: Towards Learned DNA Sequence Search. Our solution is also available as open-source software on GitHub, as part of the Intel Labs Trans-Omics Acceleration Library (TAL).
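For readers new to learned indexes, the general idea can be sketched as follows. This is a generic illustration and not the LISA algorithm: a single linear model predicts where a key sits in a sorted array, and a binary search over a small window, bounded by the model's worst observed error, corrects the prediction. The class name `LinearLearnedIndex` is hypothetical, and the sketch assumes a non-empty, sorted list of numeric keys.

```python
import bisect


class LinearLearnedIndex:
    """Linear model over key positions plus an error-bounded local search."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # Least-squares fit of position ~ slope * key + intercept.
        mean_key = sum(sorted_keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in sorted_keys)
        cov = sum((k - mean_key) * (p - mean_pos)
                  for p, k in enumerate(sorted_keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Worst prediction error over the stored keys bounds the search window.
        self.max_err = max(abs(self._predict(k) - p)
                           for p, k in enumerate(sorted_keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of the first occurrence of key, or None."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(n, max(0, guess - self.max_err))
        hi = max(lo, min(n, guess + self.max_err + 1))
        # Binary search only inside the window around the model's guess.
        pos = bisect.bisect_left(self.keys, key, lo, hi)
        if pos < n and self.keys[pos] == key:
            return pos
        return None


keys = [i * i for i in range(1000)]   # deliberately non-linear key spacing
index = LinearLearnedIndex(keys)
```

With these keys, `index.lookup(49)` finds the entry stored at position 7, and a key that is not stored returns `None`. The better the model fits the key distribution, the smaller `max_err` becomes and the less searching each lookup needs, which is where more expressive models and hardware-aware tuning come in.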

Learned Sorting on Intel Hardware

In sorting, we built a powerful system around Intel’s Core i5-12600K processor and a novel sorting technique that uses machine learning. We call this new sorting system ELSAR (External LearnedSort for ASCII Records). In 2022, we entered ELSAR into the Sort Benchmark Competition, and it became the new record holder in energy-efficient sorting for the JouleSort Indy category. Our system completed the sorting task using only 63 kilojoules and achieved 159,000 sorted records per joule, a 40% improvement in sorting performance over the previous record holder. You can find more information about the solution behind ELSAR in our research paper, and about the competition in this Intel Community blog post.
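The core idea behind sorting with a learned model can be sketched in a few lines. This is a simplified, in-memory illustration and not ELSAR itself: it swaps in an empirical CDF built from a random sample as the "model", uses that CDF to place each value into a bucket, and then sorts each bucket. The function name and the parameters `n_buckets` and `sample_size` are hypothetical.

```python
import bisect
import random


def learned_sort(values, n_buckets=16, sample_size=256, seed=0):
    """Sort a list by bucketing values with a CDF estimated from a sample."""
    if len(values) < 2:
        return list(values)
    rng = random.Random(seed)
    # "Train" the model: a sorted sample acts as an empirical CDF.
    sample = sorted(rng.sample(values, min(sample_size, len(values))))

    def cdf(x):
        # Fraction of sampled values <= x; non-decreasing in x.
        return bisect.bisect_right(sample, x) / len(sample)

    # Partition: the predicted CDF decides which bucket a value goes to.
    buckets = [[] for _ in range(n_buckets)]
    for value in values:
        bucket_id = min(int(cdf(value) * n_buckets), n_buckets - 1)
        buckets[bucket_id].append(value)

    # Because the CDF is monotone, buckets are already ordered relative to
    # each other, so sorting each bucket and concatenating gives a full sort.
    result = []
    for bucket in buckets:
        bucket.sort()
        result.extend(bucket)
    return result


data_rng = random.Random(42)
data = [data_rng.gauss(0.0, 1.0) for _ in range(10_000)]
sorted_data = learned_sort(data)
```

When the model matches the data distribution well, the buckets come out roughly equal in size, so each one is small and cheap to sort. In an external sort, similar partitions could be written to and sorted from storage independently.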

Intel Hardware and AI: A Winning Combination for Large-Scale Data Systems

The examples above show how AI-optimized solutions can work together with Intel hardware to deliver the best performance. While the benefits mostly concern database system performance, there are also potential benefits in portability and productivity. Going forward, we will continue to explore the benefits of combining Intel hardware and AI-powered solutions in other data management tasks and applications.

Acknowledgments: This blog post is based on joint work with MIT DSAIL and Intel Labs researchers.

Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Intel technologies may require enabled hardware, software, or service activation.