Optimize Google Cloud Costs for Spark SQL Workloads with Intel

RobertoBaturoni · ‎10-14-2024

Co-Authors:

Roberto Baturoni Toledo: Cloud Solution Engineer

Debashis Paul: Cloud Solution Architect

Aura Davila: Program Manager, Strategic Customers

As AI dominates the news, companies are striving to gain every advantage they can from the myriad of data streaming in from devices, users, websites, and more. Big data analytics continues to drive innovation, providing crucial insights into new opportunities, AI technologies, and customer demographics. It’s not a question of whether you will need to add or expand your big data analytics—it’s a question of when. In our Spark blog series, we’ll focus on Apache Spark™ SQL big data analytics workloads and how you can get the biggest bang for your buck with Intel processors. In this blog, we’ll look at Spark SQL performance and value results on Google Cloud™ instances. The next blog will discuss the ways Gluten can improve processor performance and take advantage of the best TCO improvements for your Spark deployments.

Combining Apache Spark with Google Cloud Instances Powered by the Latest Intel Processors

Many enterprise customers use the powerful Apache Spark framework for processing large volumes of data in the cloud. In some use cases, such as retail transaction processing, failing to finish jobs on time can lead to service-level agreement (SLA) violations, which in turn can lead to penalties, lower customer satisfaction, and damage to the business’s reputation. Optimizing Apache Spark performance helps companies meet deadlines, process more data, and handle new projects. It also allows admins to troubleshoot and address any issues without jeopardizing overall performance, resulting in greater resiliency and adaptability.

Apache Spark workloads typically ingest data from multiple sources into files or batches, for example in applications that ingest data from IoT sensors or in streaming applications where it is crucial to unify data processing across multiple languages in real time. Spark then processes these files or batches and transforms the results into a target dataset, which companies use to generate business intelligence dashboards, provide insights to decision-makers, or deliver data to other parties.

In a well-architected Spark cluster, such as one built on Google Cloud N4 instances with 5th Gen Intel Xeon processors, the enhanced processing power enables efficient streaming and processing of large volumes of data. This lets companies deliver the processed data to dependent systems or vendors on time.

Combining open-source Spark with 5th Gen Intel Xeon processors lets companies improve the efficiency and cost-effectiveness of AI workloads, especially in the data preprocessing stages. The latest Intel processors enable Spark to handle complex ETL ("extract, transform, and load") tasks faster and more efficiently, reducing the time required to prepare large datasets for AI models. In addition to shortening the AI development cycle, this lets companies optimize resource usage, which lowers costs. For AI applications that involve large and complex datasets, such as those in deep learning or real-time analytics, the combination of Spark and the latest Intel processors offers critical scalability. Organizations can deploy AI models with speed and accuracy and gain the real-time insights that support effective, data-driven decisions.

Google Cloud Offerings

When you move your Spark SQL workloads to the cloud, Google Cloud offers a range of service options, from infrastructure-as-a-service (IaaS) instances to managed Spark services. For serverless, integrated Spark environments, be sure to take a look at Google Cloud-managed services.(1) However, for workloads where you prefer to create, manage, and scale your own Spark environment and keep more control over it, the IaaS option is the way to go.

Google Cloud offers many different instance families to choose from, which it categorizes by workload resource needs. These categories include general-purpose, storage-optimized, compute-optimized, memory-optimized, and accelerator-optimized. As their names imply, the instances in these categories offer different ratios of memory to CPU cores, better storage performance, or GPUs to meet various workload requirements.(2) Additionally, you can select different vCPU-to-memory ratios within instance families, with “highcpu” or “highmem” types of instances. High-memory instance types are better suited to memory-intensive workloads such as Spark, large-scale data transformations, and large databases, where the extra memory improves performance and execution times.
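To make the vCPU-to-memory ratio concrete, here is a minimal sketch of how a node's shape might translate into Spark executor settings. This is not the configuration used in our testing. The function names, the five-cores-per-executor rule of thumb, the reserved resources, and the 10% overhead allowance are all illustrative assumptions; only the `spark.executor.*` property names are standard Spark settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorLayout:
    executors_per_node: int
    cores_per_executor: int
    heap_gb_per_executor: int


def plan_executors(
    vcpus: int,
    memory_gb: int,
    cores_per_executor: int = 5,      # assumption: common rule of thumb
    reserved_vcpus: int = 1,          # assumption: left for OS and daemons
    reserved_memory_gb: int = 8,      # assumption: left for OS and daemons
    overhead_fraction: float = 0.10,  # assumption: off-heap overhead allowance
) -> ExecutorLayout:
    """Split one node's vCPUs and memory evenly across Spark executors."""
    executors = (vcpus - reserved_vcpus) // cores_per_executor
    if executors < 1:
        raise ValueError("node is too small for the requested executor size")
    memory_per_executor = (memory_gb - reserved_memory_gb) / executors
    # Leave room for per-executor overhead on top of the JVM heap.
    heap_gb = int(memory_per_executor / (1 + overhead_fraction))
    return ExecutorLayout(executors, cores_per_executor, heap_gb)


def to_spark_conf(layout: ExecutorLayout, nodes: int) -> dict:
    """Express a layout as standard Spark executor properties."""
    return {
        "spark.executor.instances": str(layout.executors_per_node * nodes),
        "spark.executor.cores": str(layout.cores_per_executor),
        "spark.executor.memory": f"{layout.heap_gb_per_executor}g",
    }


# Hypothetical shapes: 80 vCPUs at 8 GB per vCPU versus 2 GB per vCPU.
highmem = plan_executors(vcpus=80, memory_gb=640)
highcpu = plan_executors(vcpus=80, memory_gb=160)
```

Under these assumptions, both shapes run the same number of executors per node, but each executor on the higher-memory shape gets roughly four times the heap. That extra headroom is what memory-intensive Spark SQL stages can take advantage of. Check the Google Cloud machine type documentation for the actual memory of any given instance size.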

Google Cloud also features a variety of block storage options to meet various performance and capacity requirements, so you can strike the right balance between performance and cost. For example, Standard Persistent Disk volumes backed by hard disk drives are a good choice for low-cost, standard performance needs, while locally attached SSD options offer better performance.(3) To help you choose the best options for your workload, Google Cloud offers design guides, pricing calculators, comparison guides, and more.

For our testing, we decided to focus on the general-purpose “highmem” Google Cloud instances, as Spark SQL is memory intensive. However, our decision making wasn’t finished there. Users also choose an instance size, as well as the specific series within the instance family they want to use. Older instance series with older processors often run cheaper, but you may be sacrificing performance by using legacy hardware. You can also choose among processor manufacturers, including Intel and AMD. In the general-purpose family, Google offers N-, C-, E-, and T-series instances. The N-series are suggested for things like virtual desktops, medium-traffic web apps, and batch processing. The C-series offer higher CPU frequencies and network limits, and they are best for workloads such as high-traffic web apps, game servers, and network appliances. The E-series instances are for background tasks, low-traffic web servers, and development. Finally, the T-series are great for media transcoding and scale-out workloads.(4) We will explore the new C4 instances in an upcoming blog. For now, let’s look at the testing we did in the N series: the N4 instance featuring 5th Gen Intel® Xeon® Scalable processors, an older N2 instance with 3rd Gen Intel Xeon Scalable processors, and an N2D instance with AMD processors. In the C series, we also tested a C3 instance featuring 4th Gen Intel Xeon Scalable processors and a C3D instance with AMD processors. Read on to see how instances with newer Intel processors can provide a better value for Spark SQL workloads.

Performance Overview

In this section, we look at the performance data we gathered comparing the various instance types and families we tested. For detailed configuration information, see the end of the blog.

Gen Over Gen

First, we’ll look at just the instances that feature Intel Xeon Scalable processors to show how your choices can impact your workload performance and value. We used a benchmark based on TPC-DS that models a general-purpose decision support system with 99 individual database queries.(5) We measured the amount of time it took a single user to complete all 99 queries once against our Spark SQL instance clusters. When we tested the 80 vCPU instances, the N4-highmem-80 with 5th Gen Intel Xeon Scalable processors finished the workload 1.13 times as fast as the N2-highmem-80 instances with older 3rd Gen Intel Xeon Scalable processors, with 1.15 times the performance per dollar.
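The two metrics above can be sketched in a few lines. This is a minimal illustration, not our benchmark harness. It assumes each query is wrapped in a zero-argument callable (for Spark, hypothetically something like `lambda: spark.sql(query_text).collect()`), that performance is the inverse of total runtime, and that performance per dollar divides that performance by the cluster's hourly price. All function and parameter names are hypothetical.

```python
import time
from typing import Callable, Iterable


def time_query_stream(queries: Iterable[Callable[[], object]]) -> float:
    """Run each query once, in order, and return the total elapsed seconds."""
    start = time.perf_counter()
    for run_query in queries:
        run_query()
    return time.perf_counter() - start


def relative_performance(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times as fast the candidate finished compared to the baseline."""
    return baseline_seconds / candidate_seconds


def relative_perf_per_dollar(
    baseline_seconds: float,
    baseline_price_per_hour: float,
    candidate_seconds: float,
    candidate_price_per_hour: float,
) -> float:
    """Ratio of (1 / runtime) / price, candidate over baseline.

    This is equivalent to the baseline's cost for one full run divided by
    the candidate's cost for one full run.
    """
    baseline_run_cost = baseline_seconds * baseline_price_per_hour
    candidate_run_cost = candidate_seconds * candidate_price_per_hour
    return baseline_run_cost / candidate_run_cost
```

With made-up inputs, a candidate that finishes in 1,000 seconds against a baseline of 1,200 seconds at the same hourly price scores 1.2 on both metrics. A lower candidate price raises only the second metric.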

When we compared the same N4-highmem-80 instances to the C3-highmem-88 instances with 4th Gen Intel Xeon Scalable processors, we saw that the N4 completed the queries 1.18 times as fast, with a commanding 1.38 times the performance per dollar. Note that the C3 series doesn’t offer an 80 vCPU instance size, so we chose the closest size with 88 vCPUs.

As these results show, investing in newer instances with newer Intel processors not only increases Spark SQL performance but also provides more value. For every dollar you spend on N4 instances, you’re getting up to 1.38 times the performance compared to the older instances.

Competitive

Now that we’ve compared the N-series instances featuring 5th Gen Intel Xeon Scalable processors to older instances, we can compare them to instances with AMD processors. First, we’ll compare older N2D series instances that can feature either 2nd or 3rd Gen AMD EPYC™ processors. The N4 instance with Intel processors finished the queries 1.30 times as fast as the N2D instance, with 1.19 times the performance per dollar.

Finally, we compared the N4 instance with 5th Gen Intel Xeon Scalable processors to the C3D instance with 4th Gen AMD EPYC processors. Note that the C3D series does not offer an instance with 80 vCPUs, so we opted for the closest option at 90 vCPUs, giving the C3D instance a small advantage. Our testing shows that, even with fewer vCPUs, the N4 instance delivered only slightly lower performance than the C3D instance while providing 1.21 times the performance per dollar.

These results show that, for Spark SQL workloads, Google Cloud instances with the latest Intel processors can provide better value than both older Intel instances and AMD processor-backed instances, along with better performance in every comparison except the larger C3D instance.
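Speed and value do not always move together, and a short calculation shows why. The sketch below assumes that performance per dollar equals relative performance divided by relative cost. The blog does not state the exact formula, so treat that as an assumption. Under it, dividing each reported speedup by the matching performance-per-dollar figure gives the implied cost of the N4 cluster relative to the instance it was compared against. The C3D comparison is left out because no speedup figure is reported for it.

```python
def implied_cost_ratio(speedup: float, perf_per_dollar_ratio: float) -> float:
    """Implied cost of the N4 cluster relative to the comparison cluster.

    Assumes perf_per_dollar_ratio == speedup / cost_ratio, which is an
    assumption about the methodology, not a published formula.
    """
    return speedup / perf_per_dollar_ratio


# (speedup, performance per dollar) pairs as reported in the text above.
REPORTED = {
    "N4 vs N2": (1.13, 1.15),
    "N4 vs C3": (1.18, 1.38),
    "N4 vs N2D": (1.30, 1.19),
}

for name, (speedup, perf_per_dollar) in REPORTED.items():
    ratio = implied_cost_ratio(speedup, perf_per_dollar)
    print(f"{name}: implied relative cost = {ratio:.2f}")
```

Under that assumption, the implied ratios come out to roughly 0.98, 0.86, and 1.09. In other words, the N4 cluster would be slightly cheaper than N2, noticeably cheaper than C3, and somewhat more expensive than N2D, while still coming out ahead on value in all three cases. These are back-of-the-envelope figures derived from the rounded results, not published prices.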

Conclusion

Integrating Apache Spark with newer Google Cloud instances featuring Intel Xeon 5th Gen processors is a powerful way to optimize workloads, enhance performance, and reduce operational costs. Our results show that these newer instances, even if more expensive, can result in much better value. Choosing instances with the latest 5th Gen Intel Xeon Scalable processors can provide up to 1.38 times the performance per dollar, making them the obvious choice for your Spark SQL workloads.

Stay tuned for our next blog, which will discuss the Gluten Spark optimization.

(1) https://cloud.google.com/blog/products/data-analytics/simplify-data-processing-and-data-science-jobs-with-spark-on-google-cloud

(2) https://cloud.google.com/compute/docs/machine-resource

(3) https://cloud.google.com/blog/topics/developers-practitioners/google-cloud-block-storage-options-cheat-sheet

(4) https://cloud.google.com/compute/docs/general-purpose-machines#n4_series

(5) https://medium.com/hyrise/a-summary-of-tpc-ds-9fb5e7339a35

Notices and Disclaimer

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

AMK007 · ‎10-15-2024

Great article -thanks for sharing!

premiumsolutions · ‎12-05-2024

This is a very informative and well-detailed breakdown of the benefits of combining Apache Spark with Google Cloud instances powered by the latest Intel processors. The performance improvements, cost-efficiency, and scalability you’ve highlighted make a strong case for adopting these newer technologies for big data and AI workloads.

I particularly appreciate the clarity in explaining the differences between instance families and how specific configurations, like high-memory instances, cater to memory-intensive workloads such as Spark SQL. The benchmarking results and the performance-per-dollar comparison provide valuable insights for organizations making infrastructure decisions.

For businesses already leveraging big data analytics or planning to expand their AI capabilities, your post offers practical guidance on achieving better results while optimizing costs. Looking forward to the next blog about how Gluten can further enhance processor performance—sounds like a fascinating read!