Increase Spark Performance with Intel CPUs and Gluten

RobertoBaturoni · ‎10-14-2024

Co-Authors:

Roberto Baturoni Toledo: Cloud Solution Engineer

Binwei Yang: Software Engineer (pioneer of the Apache Gluten Project)

Debashis Paul: Cloud Solution Architect

Aura Davila: Program Manager, Strategic Customers

As companies contend with ever-increasing volumes of data streaming in from devices, users, websites, and more, the tools and platforms they select to analyze this data become more important than ever. Big data analytics offers business-critical insights that can also be time-critical, making efficiency and performance paramount. With big data analytics on Apache Spark SQL, workloads tend to run continuously, with a need for high performance to speed time to insight. This means that companies can justify spending a little more overall to achieve better performance per dollar spent. In the previous blog, we explored Spark SQL performance on Google Cloud™ instances. In this next installment, we look at the ways Apache Gluten can improve Spark SQL performance on the processors in your data center.

Spark Facilitates Data Science at Scale

Many organizations use Apache Spark for batch and stream processing, machine learning and other AI applications, and large-scale SQL. Spark uses a distributed model to facilitate data science at scale; data resides on multiple servers across clusters. This distribution necessitates a certain amount of overhead when locating the data for any given query. Query speed, which translates to faster business decisions, is an important element of any Spark workload, and it matters especially for machine learning training workloads.

Using Gluten to Accelerate Spark

While Spark is an effective tool for speeding and simplifying big data processing, companies have been developing tools to enhance it. One such effort is Gluten, Intel’s Optimized Analytics Package (OAP) Spark-SQL execution engine, which improves performance by offloading compute-intensive data processing to native accelerator libraries. Gluten relies on Velox, Meta’s open-source C++ database acceleration library, which provides a vectorized SQL processing engine for optimizing query engines and data processing systems. Gluten is a plugin to Spark that acts as “a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.”(1) With Intel processor accelerators and the Apache Gluten plugin, users can gain significant performance on their Spark workloads.

Gluten works by transforming Spark query execution plans into Substrait (a cross-language specification for data processing) and passing those plans, now readable by the native engine, to native libraries via a JNI call. The execution plan is built out and offloaded to the native engine, where it is processed efficiently (Gluten also controls the native memory allocation), and the result is returned to Gluten as a Columnar Batch. Gluten then returns the data to the Spark JVM as an ArrowColumnarBatch.

Gluten uses a fallback mechanism to invoke vanilla Spark to handle unsupported operators, and a shim layer to support multiple Spark versions. Gluten records metrics from the native engine and displays them in the Spark user interface.
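To make the fallback idea concrete, here is a minimal sketch of how a plan might be split between the native engine and vanilla Spark. This is a toy model and not Gluten's actual code: the PlanNode class, the assign_engines function, the operator names, and the NATIVE_SUPPORTED set are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Hypothetical set of operators the native engine can execute. The real
# list depends on the Gluten version and the backend in use.
NATIVE_SUPPORTED: Set[str] = {"Scan", "Filter", "Project", "HashAggregate", "HashJoin"}


@dataclass
class PlanNode:
    """A toy physical-plan node: an operator name plus child nodes."""
    op: str
    children: List["PlanNode"] = field(default_factory=list)


def assign_engines(node: PlanNode,
                   supported: Set[str] = NATIVE_SUPPORTED) -> List[Tuple[str, str]]:
    """Walk the plan in pre-order and label each operator.

    Supported operators are labeled "native"; anything else falls back
    to "spark-jvm", mirroring the fallback behavior described above.
    """
    engine = "native" if node.op in supported else "spark-jvm"
    labeled = [(node.op, engine)]
    for child in node.children:
        labeled.extend(assign_engines(child, supported))
    return labeled


# Example plan: Project <- PythonUDF <- Filter <- Scan
plan = PlanNode("Project", [PlanNode("PythonUDF", [PlanNode("Filter", [PlanNode("Scan")])])])

for op, engine in assign_engines(plan):
    print(f"{op:10s} -> {engine}")
```

In this toy plan, only the unsupported PythonUDF operator stays on the JVM while the rest is marked for offload. A real implementation also has to convert data between columnar and row formats at each boundary, which this sketch leaves out.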

The Gluten plugin uses Spark’s own framework, control flow, and JVM code while offloading as much compute-intensive data processing to native code as possible. Gluten doesn’t require any changes on the query end, so existing DataFrame APIs and apps will work the same as before, but faster.
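Because Gluten is enabled through Spark configuration rather than query changes, turning it on can look roughly like the sketch below. The property names follow what the Gluten project documents for its Velox backend at the time of writing, but treat them as assumptions to check against the release you deploy (older releases used a different plugin class name). The off-heap size and the my_job.py application name are placeholders, and the Gluten jar must also be on the Spark classpath, which is omitted here.

```python
# Spark properties commonly shown for enabling Gluten with the Velox backend.
# These are assumptions to verify against your Gluten release, not values
# taken from the tests described in this blog.
GLUTEN_CONF = {
    "spark.plugins": "org.apache.gluten.GlutenPlugin",
    "spark.memory.offHeap.enabled": "true",   # native engine uses off-heap memory
    "spark.memory.offHeap.size": "20g",       # placeholder; size for your workload
    "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
}


def spark_submit_args(conf, app):
    """Build a spark-submit argument list from a dict of Spark properties."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# The application itself is unchanged; only the launch configuration differs.
print(" ".join(spark_submit_args(GLUTEN_CONF, "my_job.py")))
```

The same properties could instead be placed in spark-defaults.conf or set on a SparkSession builder; the point is that the job's SQL and DataFrame code does not change.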

The Performance Improvements We Saw

In this section, we look at test results that illustrate how adding Gluten to your Spark applications can improve performance. For detailed configuration information, see the end of the blog. We used two different benchmark tools. One, based on TPC-DS, models a general-purpose decision support system with 99 individual database queries.(2) The other, based on TPC-H, models a general-purpose decision support system with 10 individual database queries.(3) For both, we measured the amount of time it took a single user to complete all of the queries once against our Spark SQL cluster.
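As an illustration of this kind of measurement, the sketch below times a single pass through a list of queries and sums the results. It is not the harness used for the tests in this blog; the function names and the run_query callable are hypothetical. With PySpark, run_query might wrap a call that submits the query text and collects the result.

```python
import time
from typing import Callable, Dict, Iterable


def time_query_run(queries: Iterable[str],
                   run_query: Callable[[str], object],
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Run each query once, in order, and record its elapsed seconds."""
    timings: Dict[str, float] = {}
    for name in queries:
        start = clock()
        run_query(name)          # a single user: queries run one after another
        timings[name] = clock() - start
    return timings


def total_seconds(timings: Dict[str, float]) -> float:
    """Total time to complete every query once."""
    return sum(timings.values())


# Usage with a stand-in workload instead of a real Spark SQL cluster.
timings = time_query_run(["q1", "q2", "q3"], lambda name: sum(range(10_000)))
print(f"completed {len(timings)} queries in {total_seconds(timings):.4f} s")
```

Passing the clock as a parameter keeps the function easy to check with a fake timer, and the per-query timings make it possible to see which queries benefit most.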

4th Generation Intel® Xeon® Scalable Processors

First, we’ll look at the performance impact of adding Gluten to Spark SQL running on servers featuring 4th Generation Intel Xeon Scalable Processors. As the chart below shows, adding Gluten led to up to 3.12x the performance. On the TPC-H-like workload, the accelerator allowed the system to complete the ten database queries more than three times as fast. On the TPC-DS-like workload, Gluten more than doubled the speed of completing all 99 database queries. These improvements mean that answers would get into the hands of decision-makers faster, demonstrating the value of adding Gluten to your Spark SQL workloads.

5th Generation Intel® Xeon® Scalable Processors

Next, let’s examine how Gluten accelerates Spark SQL workloads on servers featuring 5th Generation Intel Xeon Scalable Processors. As the following chart shows, we saw even greater improvements than we did on the servers using older processors, with performance up to 3.34x as high when using Gluten. If you have servers of this generation in your data center, incorporating Gluten into your environment can allow you to get much more from your hardware and shorten time to insight.
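To relate these multipliers to wall-clock time, the short sketch below converts a performance factor into the fraction of baseline runtime it implies. It assumes that performance is the inverse of the time needed to complete all queries, which matches how the tests are described but is our reading rather than a stated formula. The function names are illustrative.

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Performance factor, assuming performance is inverse completion time."""
    if baseline_seconds <= 0 or accelerated_seconds <= 0:
        raise ValueError("durations must be positive")
    return baseline_seconds / accelerated_seconds


def runtime_fraction(speedup_factor: float) -> float:
    """Fraction of the baseline runtime that remains at a given factor."""
    if speedup_factor <= 0:
        raise ValueError("speedup factor must be positive")
    return 1.0 / speedup_factor


# The two headline factors reported in this blog.
for factor in (3.12, 3.34):
    remaining = runtime_fraction(factor)
    print(f"{factor:.2f}x -> {remaining:.1%} of baseline runtime, "
          f"{1 - remaining:.1%} less time")
```

Under that assumption, 3.34x the performance means the full query set finishes in roughly 30% of the time it took without Gluten.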

Implications for Cloud

While we conducted these tests on bare metal hardware in a data center, they clearly demonstrate the potential of Gluten to improve performance, even in the cloud. If you run Spark in the cloud, not only will you see the benefits we discussed in our previous blogs, but you could also enjoy further performance improvements by adding Gluten.

Conclusion

Whether you run your Spark SQL workloads on servers featuring 5th Generation Intel Xeon Scalable Processors or the previous generation, completing analysis quickly is critical to your company’s success. Intel processors can boost performance with native libraries tuned to their instruction sets, and Gluten can take advantage of this by offloading JVM data processing to those native libraries.

Our testing demonstrated that adding the Gluten plugin to Spark SQL workloads can be a straightforward way to double or even triple the speed at which your servers complete database queries. By providing up to 3.34x the performance, Gluten can help your organization get the most out of its data analytics workloads.

(1) https://github.com/apache/incubator-gluten?tab=readme-ov-file#readme

(2) https://medium.com/hyrise/a-summary-of-tpc-ds-9fb5e7339a35

(3) https://medium.com/hyrise/a-summary-of-tpc-ds-9fb5e7339a35

Notices and Disclaimer

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

brooksch · ‎10-15-2024

Great Article!

AMK007 · ‎10-15-2024

Great article -thanks for sharing!

pallavijaini · ‎10-15-2024

Thanks for sharing

MANOJ3 · ‎10-29-2024

Useful