Intel Co-Sponsors New Phase of MIT DSAIL Program for Instance-Optimized Data Systems Research

Nesime Tatbul · 11-28-2022

Nesime Tatbul is a senior research scientist in the Parallel Computing Lab at Intel Labs and acts as Intel’s lead PI for DSAIL.

Highlights:

Intel co-sponsors a new phase of the Data Systems and Artificial Intelligence Lab (DSAIL) university research program at the Massachusetts Institute of Technology (MIT).
Over the next four years, DSAIL will generalize the vision of instance optimization to a wide variety of data systems and applications.

A new phase of our Data Systems and Artificial Intelligence Lab (DSAIL) university research program at the Massachusetts Institute of Technology (MIT) officially kicked off on October 20-21, 2022, during an annual meeting in Cambridge, MA. Established in 2018, the program pioneered machine learning (ML) for data systems research, exploring the use of modern ML techniques to improve the design and performance of large-scale data systems and applications. This includes enhancing or replacing key components of traditional data systems (e.g., index structures, scheduling algorithms, query optimizers) with their learned counterparts to allow them to adjust automatically to changing data distributions and query workloads. These learned components have been applied in novel use cases through joint projects with Intel, including ML-enhanced DNA sequence search and query optimization. Furthermore, the team built SageDB, an “instance-optimized” accelerator for the open-source PostgreSQL database, showing how these learned components can be integrated in an end-to-end system that outperforms expert-tuned databases on analytical database workloads.
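To make the idea of a "learned counterpart" concrete, here is a minimal sketch of a learned index, assuming a single linear model is enough to approximate where a key sits in a sorted array. It is an illustration only and not DSAIL's or SageDB's implementation; the class name `LinearLearnedIndex` and its methods are hypothetical. The model predicts a position, and the worst prediction error seen at build time bounds a small local search that corrects the guess.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: a linear model predicts a key's slot in a sorted list."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")
        # Least-squares fit of: position ~ slope * key + intercept.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var_key = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos) for i, k in enumerate(self.keys))
        self.slope = cov / var_key if var_key else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # The worst prediction error over the stored keys bounds the search window.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key in the sorted keys, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        # Clamp the window to valid indices; it always contains a stored key's slot.
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None
```

The closer the key distribution is to what the model can represent, the smaller `max_err` becomes and the less searching each lookup needs, which is the sense in which such a component adjusts to the data it holds.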

“Through close collaboration with Intel and our corporate sponsors, we have been able to show that ML can be used to develop novel data systems that successfully adapt to the data, workloads, and hardware environments in which they operate and successfully integrated those systems into a number of real-world applications.” Sam Madden, DSAIL Co-Director and MIT College of Computing Distinguished Professor.

Research Agenda

One of the major thrusts of DSAIL’s continued research agenda is to build instance-optimized systems. These systems self-adjust to handle a workload with near-optimal performance under a given set of operating conditions as if built from scratch for that specific use case. Instance optimization is motivated by growing trends in the variety of data-intensive applications and the heterogeneity of hardware/software platforms where they are being deployed. While specialized solutions can lead to better performance, manually developing and tuning them for each individual use case is not economically feasible. The team’s work to date has shown promise in leveraging ML to overcome this challenge.

In recent years, there have been growing efforts to apply machine learning to algorithmic and system problems, many of them driven by DSAIL. These works include ML applications ranging from video processing to storage layouts to log-structured merge trees and many other data management tasks. However, so far, most research has been focused on improving individual components. In this second phase of DSAIL, a key goal will be to investigate how learned components can be combined to build an entire, holistically instance-optimized system that does not require administrator intervention. In collaboration with co-sponsors Amazon, Google, and Intel, DSAIL will also generalize the vision of instance optimization to a wide variety of data systems and applications through novel designs across edge-to-cloud deployment settings. Examples include hybrid transactional/analytical processing (HTAP) systems, key-value stores, data lakes, and visual data analytics systems. In conjunction with commonsense reasoning based on domain knowledge (e.g., represented as knowledge graphs or probabilistic models), ML techniques will continue to play a central role in the lab’s upcoming research agenda.

Instance-Optimized Clouds

Achieving instance optimization at the cloud scale introduces a new set of challenges and opportunities for research. Cloud service infrastructures are growing more complex, and their cost-performance tradeoffs are getting harder for cloud developers and users to navigate. More fundamentally, the disaggregation of data services in the cloud challenges the performance of traditional data system architectures due to their monolithic designs. In a joint vision paper published at the Conference on Innovative Data Systems Research (CIDR) earlier this year, Intel and MIT proposed a new metadata-rich cloud storage format called Self-organizing Data Containers (SDCs) to enable flexible data layouts that can self-adapt to client workloads. SDCs have three key properties that will enable automated performance optimizations in disaggregated database architectures:

They flexibly support a variety of complex physical data layouts beyond simple column orientation via replication and partitioning.
They explicitly represent rich metadata that can be used for optimizations, such as histograms and data access patterns.
They can self-organize over time as they are exposed to client query workloads.
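The properties above can be illustrated with a small toy model. The sketch below is not the SDC format from the CIDR paper; `ToyDataContainer`, its block layout, and its re-sorting policy are all hypothetical choices made for illustration. It keeps per-block min/max statistics and a count of filtered columns as metadata, uses the statistics to skip blocks, and re-sorts its rows on the most frequently filtered column when asked to reorganize. Replication, which the first property also mentions, is left out.

```python
from collections import Counter


class ToyDataContainer:
    """Hypothetical stand-in for an SDC: rows, metadata, workload-driven re-layout."""

    def __init__(self, rows, block_size=4):
        self.rows = list(rows)          # each row is a dict with the same columns
        self.block_size = block_size
        self.filter_counts = Counter()  # access-pattern metadata: column -> filters seen
        self.sort_column = None
        self._build_blocks()

    def _build_blocks(self):
        # Partition rows into fixed-size blocks and record per-column (min, max).
        self.blocks = []
        for start in range(0, len(self.rows), self.block_size):
            chunk = self.rows[start:start + self.block_size]
            stats = {col: (min(r[col] for r in chunk), max(r[col] for r in chunk))
                     for col in chunk[0]}
            self.blocks.append((stats, chunk))

    def range_query(self, column, low, high):
        """Return (matching rows, number of blocks actually scanned)."""
        self.filter_counts[column] += 1
        scanned, result = 0, []
        for stats, chunk in self.blocks:
            block_min, block_max = stats[column]
            if block_max < low or block_min > high:
                continue  # the metadata proves no row in this block can match
            scanned += 1
            result.extend(r for r in chunk if low <= r[column] <= high)
        return result, scanned

    def reorganize(self):
        """Re-sort rows on the column the observed workload filters on most."""
        if not self.filter_counts:
            return
        hot_column, _ = self.filter_counts.most_common(1)[0]
        if hot_column != self.sort_column:
            self.rows.sort(key=lambda r: r[hot_column])
            self.sort_column = hot_column
            self._build_blocks()
```

Running the same range query before and after `reorganize()` returns the same rows, but the second run can skip more blocks because the layout now matches the filter column.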

Preliminary experiments with real-world visual dashboarding applications indicate that even simple layout optimizations enabled by workload awareness of SDCs can achieve 3-10x speedups over traditional range partitioning. This work represents a foundational first step toward achieving instance optimization in modern cloud databases.

Instance-Optimized Video Processing

Video processing is a prime example of a data-intensive application domain that can substantially benefit from instance optimization. High volumes of video data are generated daily by a wide variety of applications, from social media to traffic monitoring. Applying state-of-the-art ML algorithms to efficiently analyze these datasets in real-world settings presents an interesting set of challenges and opportunities. Prior research by MIT DSAIL (e.g., MIRIS) and Intel Labs (e.g., VDMS) has demonstrated that there is potential for significant performance gains by tailoring these algorithms to the specific data and workload contexts that they are used in. Going forward, the DSAIL team will explore extending these efforts on multiple fronts to enable automated video search and analytics optimizations.

For instance, Video Extract-Transform-Load (V-ETL) is one of the research problems that the DSAIL team is currently investigating in the context of large-scale video data warehouses. To prepare live video streams for analytical queries, these streams, whose content dynamics vary, must be processed through user-defined ingestion pipelines that consist of expensive computer vision tasks, such as object detection and tracking. For resource and cost efficiency, such pipelines need adaptive tuning of parameters such as frame rates and image resolutions as video dynamics change. The team is working on a novel approach that will continuously maintain high video content quality with a low cloud cost budget, even under peak load conditions.
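As a rough illustration of what adaptive parameter tuning could look like, the sketch below picks a frame rate and resolution for the next video segment from a motion score and a cost budget. It is not the team's approach, which the post does not describe; the cost model, the motion thresholds, and the names `PipelineConfig` and `choose_config` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    fps: int
    height: int  # vertical resolution in pixels, assuming a 16:9 frame

    def cost(self):
        # Assumed cost model: work grows with frames per second times pixels per frame.
        width = self.height * 16 // 9
        return self.fps * self.height * width / 1e6


CONFIGS = [PipelineConfig(fps, h) for fps in (5, 15, 30) for h in (360, 720, 1080)]


def choose_config(motion, budget):
    """Pick a config for the next segment.

    motion: score in [0, 1] describing how dynamic the content is (assumed input).
    budget: maximum cost units per second the pipeline may spend.
    """
    # Assumed heuristic: faster scenes need higher frame rates for tracking.
    min_fps = 5 if motion < 0.3 else 15 if motion < 0.7 else 30
    affordable = [c for c in CONFIGS if c.cost() <= budget]
    if not affordable:
        # Nothing fits the budget: fall back to the cheapest configuration.
        return min(CONFIGS, key=PipelineConfig.cost)
    adequate = [c for c in affordable if c.fps >= min_fps]
    if adequate:
        # Highest resolution that fits, at the lowest frame rate that still suffices.
        return max(adequate, key=lambda c: (c.height, -c.fps))
    # The budget cannot cover the needed frame rate: degrade by keeping fps highest.
    return max(affordable, key=lambda c: (c.fps, c.height))
```

With a fixed budget, a calm scene ends up with a high resolution at a low frame rate, while a busy scene trades resolution for frame rate. A real controller would also have to estimate the motion score and the cost of each configuration online.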

“Given the success of the first phase of DSAIL, we are excited to support continued research in this area. This work has the potential to directly inform future design decisions within cloud data centers and enable a wide range of new applications. We look forward to jointly exploring these opportunities with our DSAIL collaborators at MIT and co-sponsors Amazon and Google.” Pradeep Dubey, Intel Senior Fellow and Director of the Parallel Computing Lab at Intel Labs.