
Using Machine Learning to Characterize Database Workloads

By Miroslav Dzakovic

Databases have been helping us manage our data for decades. Like much of the technology that we work with on a daily basis, we may begin to take them for granted and miss the opportunities to examine our use of them—and especially their cost.

For example, Intel stores much of its vast volume of manufacturing data in a massively parallel processing (MPP) relational database management system (RDBMS). To keep data management costs under control, Intel IT decided to evaluate our current MPP RDBMS against alternative solutions. Before we could do that, we needed to better understand our database workloads and define a benchmark that is a good representation of those workloads. We knew that thousands of manufacturing engineers queried the data, and we knew how much data was being ingested into the system. However, we needed more details.

“What types of jobs make up the overall database workload?”  

“What are the queries like?”

“How many concurrent users are there for each kind of query?”

Let me present an example to better illustrate the type of information we needed.

Imagine that you’ve decided to open a beauty salon in your hometown. You want to build a facility that can meet today’s demand for services as well as accommodate business growth. You should estimate how many people will be in the shop at peak times, so you know how many stations to set up. You also need to decide what services you will offer. How many people you can serve depends on three factors: 1) the speed at which the beauticians work; 2) how many beauticians are working; and 3) what services each customer wants (just a trim, or a manicure, hair coloring, and a massage, for example). The “workload” in this case is a function of what the customers want and how many customers there are.

But the workload also varies over time. Perhaps there are periods when a lot of customers just want trims. During other periods (say, before Valentine’s Day), both trims and hair coloring are in demand, and at yet other times a massage might be almost the only demand (say, people using all those massage gift cards they just got on Valentine’s Day). The variation may even be seemingly random, unrelated to any calendar event. If you get more customers at a peak time and you don’t have enough stations or qualified beauticians, people will have to wait, and some may deem it too crowded and walk away.

So now let’s return to the database. For our MPP RDBMS, the “services” are the different types of interactions between the database and the engineers (consumption) and the systems that are sending data (ingestion). Ingestion consists of standard extraction-transformation-loading (ETL), critical path ETL, bulk loads, and within-DB insert/update/delete requests (both large and small). Consumption consists of reports and queries—some run as batch jobs, some ad hoc.

At the outset of our workload characterization, we wanted to identify the kinds of database “services” that were being performed. We knew that, like a trim versus a full service in the beauty salon example, SQL requests could be very simple, very complex, or somewhere in between. What we didn’t know was how to generalize a large variety of these requests into something more manageable without missing something important. Rather than trusting our gut feeling, we wanted to be methodical about it. We took a novel approach to developing a full understanding of the SQL requests: we decided to apply Machine Learning (ML) techniques, including k-means clustering and Classification and Regression Trees (CARTs).

  • k-means clustering groups similar data points according to underlying patterns.
  • CART is a predictive algorithm that produces human-readable criteria for splitting data into reasonably pure subgroups.

In our beauty salon example, we might use k-means clustering and CART to analyze customers and identify groups with similarities such as “just hair services,” “hair and nail services,” and “just nail services.”
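To give a rough feel for the k-means idea, here is a minimal pure-Python sketch that clusters hypothetical per-request CPU times into three groups. The data values, cluster count, and one-dimensional feature are illustrative assumptions, not our actual measurements.

```python
# Minimal 1-D k-means sketch (pure Python). Groups hypothetical
# per-request CPU times (seconds) into k clusters; values are invented.

def kmeans_1d(values, k, iters=20):
    # Initialize centroids by spreading them across the sorted data
    # (assumes k >= 2 and len(values) >= k).
    data = sorted(values)
    centroids = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to its cluster mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical CPU times: quick lookups, mid-size reports,
# and heavy analytical queries.
cpu_times = [0.1, 0.2, 0.15, 5.0, 6.2, 5.5, 120.0, 110.0, 130.0]
centroids, clusters = kmeans_1d(cpu_times, k=3)
print(sorted(centroids))
```

In practice you would use a library implementation over many features at once, but the same assign-then-update loop is what finds the natural groupings.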

For our database, our k-means clustering and CART efforts revealed that ETL requests consisted of seven clusters (predicted by CPU time, highest thread I/O, and running time) and SQL requests could be grouped into six clusters (based on CPU time).
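To show what a CART-style, human-readable split looks like, here is a toy pure-Python sketch that finds a single CPU-time threshold minimizing Gini impurity. The two request labels, the CPU times, and the single-feature split are illustrative assumptions, not our actual model.

```python
# Toy CART-style split: pick the CPU-time threshold that best separates
# two hypothetical request labels ("etl" vs "ad-hoc") by Gini impurity.
# All data is invented for illustration.

def gini(labels):
    # Gini impurity of a two-class label list.
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for label in labels if label == "etl") / n
    return 2 * p * (1 - p)

def best_split(samples):
    # samples: list of (cpu_time, label); try midpoints between
    # consecutive distinct values, keep the lowest weighted impurity.
    best_t, best_score = None, float("inf")
    values = sorted({v for v, _ in samples})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [label for v, label in samples if v <= t]
        right = [label for v, label in samples if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(samples)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

samples = [(0.2, "ad-hoc"), (0.5, "ad-hoc"), (1.1, "ad-hoc"),
           (30.0, "etl"), (45.0, "etl"), (60.0, "etl")]
threshold, impurity = best_split(samples)
print(f"rule: cpu_time <= {threshold:.1f} -> ad-hoc, else etl")
```

A real CART applies this split search recursively over many features; the appeal is that the output reads as plain rules a human can sanity-check.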

Once we had our groupings, we could take the next step, which was to characterize various peak periods. The goal was to identify something equivalent to “regular,” “just before Valentine’s,” and “just after Valentine’s” workload types—but without knowing upfront about any “Valentine’s Day” events. We started by generating counts of requests per group, per hour, based on months of historical database logs. Next, we used k-means clustering again, this time to create clusters of one-hour slots that are similar to each other with respect to their counts of requests per group. Finally, we picked a few one-hour slots from each cluster that had the highest overall CPU utilization to create sample workloads.
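The steps above can be sketched end to end in miniature: each one-hour slot becomes a vector of per-group request counts, the hours are clustered, and the peak-CPU hour in each cluster is kept as a sample workload (we kept a few per cluster; this sketch keeps one). The hour IDs, counts, utilization figures, and two-group/two-cluster setup are all invented for illustration.

```python
# Sketch of the peak-period characterization: cluster one-hour slots by
# their per-group request counts, then keep the highest-CPU hour in each
# cluster as a sample workload. All numbers are illustrative.
import random

def dist(a, b):
    # Squared Euclidean distance between two count vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=25, seed=0):
    # Plain k-means on small vectors; returns a cluster index per point.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda i: dist(p, centroids[i]))
                  for p in points]
        for i in range(k):
            members = [p for p, a in zip(points, assign) if a == i]
            if members:
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign

# Each row: (hour_id, [count_group1, count_group2], cpu_utilization_pct)
hours = [
    ("h01", [500, 10], 40), ("h02", [520, 12], 45),   # ingest-heavy hours
    ("h03", [20, 300], 70), ("h04", [25, 310], 80),   # query-heavy hours
]
assign = kmeans([counts for _, counts, _ in hours], k=2)

# For each cluster, keep the hour with peak CPU as the sample workload.
samples = {}
for (hour_id, _, cpu), a in zip(hours, assign):
    if a not in samples or cpu > samples[a][1]:
        samples[a] = (hour_id, cpu)
print(sorted(samples.values()))
```

With real logs the vectors have one dimension per request group and the clustering runs over months of hourly slots, but the shape of the pipeline is the same.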

The best thing about this process was that it was driven by data and reliable ML-based insights. (This is not the case with my post-Valentine’s massages-only conjecture, because I didn’t have any gift cards.) The workload characterization was essential to benchmarking the cost and performance of our existing MPP RDBMS and several alternatives. You can read the IT@Intel white paper, “Minimizing Manufacturing Data Management Costs,” for a full discussion of how we created a custom benchmark and then conducted several proofs of concept with vendors to run the benchmark.

About the Author
In his role as principal engineer at Intel's Manufacturing IT organization, Miroslav (Miro) Dzakovic leads strategy and realization of data management systems for product manufacturing data. These systems help Intel meet its challenges in time to market, cost, and yield. He is the principal creator of Intel's largest integrated database for manufacturing test data. He earned his master’s degree in CS at Stanford University, specializing in database systems. Miro has created and delivered multiple courses and workshops in the areas of engineering data analysis and machine learning at Intel, and is a published author.