
Introduction to Decision Support Workloads on Azure HDInsight with Intel Processors

VolgaSimsek

Introduction

In the digital age, data has become a precious commodity that drives business insights and innovation. Big data, the vast amounts of structured and unstructured data organizations collect, comes from a variety of sources, including IoT devices, social media platforms, online transactions, and more. This influx of information presents a goldmine of opportunities, but storing it and extracting meaningful insights from it can be challenging without the right tools.(1) And the volume is growing; Gartner forecasts that the worldwide IoT meter market will double between 2020 and 2030.(2)

Big data analytics is the process of dissecting and interpreting these massive datasets to uncover patterns, trends, and correlations that might otherwise remain hidden. Using statistical analysis techniques such as clustering and regression, businesses can transform raw data into actionable insights, make data-informed decisions, optimize processes, and stay ahead of the competition.(3)

When you run big data workloads in the cloud, several factors go into optimizing performance, gaining insights earlier, and keeping expenses down. You must choose the right framework, such as Apache Spark or Apache Hadoop, for your workload. Once you’ve done that, selecting the right optimizations, libraries, and toolkits can accelerate data pipelines and reduce development time and expenses.

The right hardware gives your workloads the resources they need to analyze data quickly, but the options can be overwhelming. Which big data platform do you use? Should you opt for older, less expensive instances or newer ones that deliver better performance?

To help you answer these questions, we evaluated performance on Apache Spark 3.0 using the Microsoft Azure HDInsight service. The testing used a decision support workload and measured the time to complete queries on various dataset sizes. We will walk you through our workload stack and use our results to illustrate how hardware choices and software optimizations can affect data analysis workload performance.

Getting Started on HDInsight

Public cloud usage continues to grow as more companies transition to hosted workloads. Cloud service providers typically offer a wider range of options than traditional on-premises hardware. However, with that flexibility comes complexity, or at least a lot of options to choose from. HDInsight helps narrow the field: Azure offers only a subset of its VM library, optimized for HDInsight clusters. Users must select instance size and specific processor, among other things.(4) We tested both 8 and 16 vCPU sizes with three different Intel processor generations to illustrate the performance differences you might see on your workloads.

Azure offers multiple ways to create an Azure HDInsight cluster, including the Azure Portal GUI, the Azure CLI, Windows PowerShell, and more. If you use the GUI in the Azure Portal, a wizard walks you through six tabs to configure and create the cluster. First, you provide an Azure subscription and resource group, name the cluster, choose the cluster type (Spark 3.0 for our testing), and add cluster credentials. Next, you define your storage, storage account, and optional SQL database types for services such as Hive and Ambari. You then choose networking options and encryption settings. Finally, you select the number and type of worker nodes for your cluster and the size and type of attached storage. The cluster creation wizard also lets you tag your resources and review and create your cluster.(5)
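If you prefer to script cluster creation rather than click through the wizard, the Azure SDK for Python is one route. The sketch below is a minimal, illustrative example only: the resource group, VM sizes, account names, and credentials are placeholders, and exact model classes and method names (for example, begin_create versus create) vary by azure-mgmt-hdinsight version, so check the SDK documentation before relying on it.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.hdinsight import HDInsightManagementClient
from azure.mgmt.hdinsight.models import (
    ClusterCreateParametersExtended, ClusterCreateProperties, ClusterDefinition,
    ComputeProfile, Role, HardwareProfile, OsProfile, LinuxOperatingSystemProfile,
    StorageProfile, StorageAccount,
)

# Placeholders throughout; substitute your own subscription, names, and secrets.
client = HDInsightManagementClient(DefaultAzureCredential(), "<subscription-id>")

ssh_profile = OsProfile(linux_operating_system_profile=LinuxOperatingSystemProfile(
    username="sshuser", password="<ssh-password>"))

params = ClusterCreateParametersExtended(
    location="eastus",
    properties=ClusterCreateProperties(
        cluster_version="4.0",
        os_type="Linux",
        tier="Standard",
        # Spark cluster type with Ambari/gateway credentials.
        cluster_definition=ClusterDefinition(
            kind="Spark",
            configurations={"gateway": {
                "restAuthCredential.isEnabled": "true",
                "restAuthCredential.username": "admin",
                "restAuthCredential.password": "<cluster-password>",
            }},
        ),
        # Head nodes plus the worker pool; VM sizes and counts are examples only.
        compute_profile=ComputeProfile(roles=[
            Role(name="headnode", target_instance_count=2,
                 hardware_profile=HardwareProfile(vm_size="Standard_D13_v2"),
                 os_profile=ssh_profile),
            Role(name="workernode", target_instance_count=20,
                 hardware_profile=HardwareProfile(vm_size="Standard_D13_v2"),
                 os_profile=ssh_profile),
        ]),
        # Default storage for the cluster (Azure Blob Storage in this sketch).
        storage_profile=StorageProfile(storageaccounts=[StorageAccount(
            name="<account>.blob.core.windows.net",
            is_default=True,
            container="<container>",
            key="<storage-key>")]),
    ),
)

# Long-running operation; .result() blocks until the cluster is provisioned.
cluster = client.clusters.begin_create("<resource-group>", "<cluster-name>", params).result()
```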

We tested several CPU types on both 8 vCPU and 16 vCPU VM clusters. For the 8 vCPU cluster, we chose 20 workers; for the 16 vCPU cluster, we chose 10 workers. Azure automatically configured both clusters with two head nodes and three ZooKeeper nodes. We used Apache Spark, a popular framework for big data workloads. With Spark SQL, machine learning libraries, and more, it can process extremely large amounts of data distributed across a large network of servers and VMs in different ways. The Spark environment scales easily and is fault-tolerant thanks to its distributed nature. Companies use Spark for stream processing, machine learning, interactive analytics, static analytics, data integration, and other types of data processing.(6)
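To make that concrete, here is a minimal sketch of how a Spark SQL query runs against data that HDInsight distributes across the worker nodes; the storage path, table, and column names are hypothetical.

```python
from pyspark.sql import SparkSession

# Build (or attach to) a Spark session; on an HDInsight Spark cluster a session
# is typically already available in Jupyter notebooks and spark-submit jobs.
spark = SparkSession.builder.appName("decision-support-demo").getOrCreate()

# Hypothetical path: HDInsight clusters commonly attach Azure Blob Storage
# (wasbs://) or Data Lake Storage Gen2 (abfss://) as the default filesystem.
sales = spark.read.parquet("wasbs://data@youraccount.blob.core.windows.net/sales/")
sales.createOrReplaceTempView("sales")

# A simple decision-support-style aggregation expressed in Spark SQL.
top_regions = spark.sql("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY region
    ORDER BY total_revenue DESC
    LIMIT 10
""")
top_regions.show()
```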

Decision Support Workload Performance

To illustrate the type of performance you could expect and the impact that hardware choices can have, we tested our decision support workload on several configurations. Our benchmark created a pre-populated dataset against which we ran a pre-configured set of 99 database queries representing a wide range of business questions and query types. We tested three dataset sizes (1 TB, 3 TB, and 10 TB) on 8 vCPU and 16 vCPU cluster sizes. We also tested each cluster with three different Intel processors that were available on Azure at the time of testing. For testing details, including the Azure VM cluster configurations we used, see the Test Configuration Details and Results document.
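The benchmark kit drives the 99 queries itself; the sketch below only illustrates the general pattern of timing a batch of Spark SQL queries, with a hypothetical query directory and file naming scheme.

```python
import time
from pathlib import Path
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-timing").getOrCreate()

# Hypothetical layout: one .sql file per benchmark query (q1.sql ... q99.sql).
query_dir = Path("/opt/benchmark/queries")
timings = {}

for query_file in sorted(query_dir.glob("q*.sql")):
    sql_text = query_file.read_text()
    start = time.perf_counter()
    # Collecting forces full execution; real harnesses often write results out instead.
    spark.sql(sql_text).collect()
    timings[query_file.stem] = time.perf_counter() - start

total = sum(timings.values())
print(f"Total runtime for {len(timings)} queries: {total:.1f} s")
```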

Our tests show that newer hardware significantly improved performance. In the cloud, uptime equals cost accrual, so spending less time running queries can mean spending less money. To calculate TCO, we added up the costs of the Azure resources, per TB of data, accrued during the workload's runtime.
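As a rough illustration of that math (not our actual prices or exact methodology), a run's cost scales with the cluster's hourly rate and the wall-clock time of the query set; every rate below is a hypothetical placeholder.

```python
# Illustrative cost arithmetic only; rates are placeholders, not Azure pricing,
# and a real TCO calculation also covers head/ZooKeeper nodes, storage, and I/O.
worker_count = 20
worker_rate_per_hour = 0.50       # hypothetical $/hour per worker VM
head_and_zk_rate_per_hour = 1.00  # hypothetical $/hour for head + ZooKeeper nodes
runtime_hours = 3.2               # measured wall-clock time for the full query set
dataset_tb = 3                    # dataset size used for the per-TB figure

cluster_rate = worker_count * worker_rate_per_hour + head_and_zk_rate_per_hour
total_cost = cluster_rate * runtime_hours
print(f"Total: ${total_cost:.2f}  (${total_cost / dataset_tb:.2f}/TB)")
```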

[Callout graphic]

Opting for the newer VM cluster lowered costs by up to 17.37%, making it a compelling choice for enhancing efficiency and optimizing expenses.

This trend of improved performance and cost savings extended to our 16 vCPU testbeds. VM clusters featuring 2nd Gen Intel Xeon Platinum 8272CL processors reduced query runtimes by up to 14.46% compared to clusters with 1st Gen Intel Xeon Platinum 8171M processors. The 2nd Gen CPU VM clusters also reduced costs by as much as 14.12%, an important consideration for users seeking to improve performance while keeping expenses in check.

When we compare the latest CPU generation HDInsight offers to even older processors, the benefits are greater still.

[Callout graphic]

These results highlight the benefits of adopting newer-generation hardware for big data analysis workloads. Upgrading to newer CPU generations empowers organizations with enhanced performance, faster query processing, and cost savings. 

Tuning for Performance

Now that we’ve shown how critical it is to choose the right hardware, we’d like to highlight the importance of software optimization. For each processor type in both clusters, we ran each workload size two ways: with out-of-the-box (OOB) settings and with Spark optimizations. We then examined how the optimizations improved performance and how the configurations compared to one another.

We share our specific tunings below, but we also encourage you to check the Intel(7) and Spark(8) documentation for tuning tips and tricks beyond what we used, to make sure you’re getting maximum performance and efficiency for your workload. Our values reflect our attempt to use more than 50% of the available compute, which, in our observations, the default configuration could not do.

Some other observations that informed our optimizations include:

  • Reducing the default LZ4 block size improved performance.
  • Memory overhead and JVM heap sizes matter for performance and are a good first place to look when tuning.
    • We found roughly 75% of memory for Spark and at least 10% for overhead to be a good allocation ratio.
  • Increasing the broadcastTimeout setting kept long-running queries from failing.

Adjusting the joinReorder parameters improved performance on long-running queries that join multiple tables. Our tuning reduced both runtime and cost for all three CPU types we tested. On our 8 vCPU cluster, running queries with our optimizations took up to 36.01% less time than on an unoptimized system.
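The sketch below shows how settings of this kind can be applied when building a Spark session. The values are illustrative rather than the exact ones from our runs; the right sizes depend on your VM's memory and core count.

```python
from pyspark.sql import SparkSession

# Illustrative tuning values only; size the executor heap and overhead to your
# VM's memory (roughly 75% heap, at least 10% overhead in our observations).
spark = (
    SparkSession.builder
    .appName("tuned-decision-support")
    # Memory: most of each executor's share goes to the JVM heap, with explicit overhead.
    .config("spark.executor.memory", "24g")
    .config("spark.executor.memoryOverhead", "4g")
    # A smaller LZ4 block size than the 32k default helped in our observations.
    .config("spark.io.compression.lz4.blockSize", "16k")
    # Give long-running broadcast joins more time before they fail (seconds).
    .config("spark.sql.broadcastTimeout", "1200")
    # Enable cost-based optimization so join reordering can kick in on multi-table
    # joins; it relies on table statistics (ANALYZE TABLE ... COMPUTE STATISTICS).
    .config("spark.sql.cbo.enabled", "true")
    .config("spark.sql.cbo.joinReorder.enabled", "true")
    .getOrCreate()
)
```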

[Callout graphic]

Reducing the runtime led to cost advantages. Optimizations on the 16 vCPU cluster decreased costs by up to 60.93%, while those on the 8 vCPU cluster decreased costs by up to 36.03%. Clearly, optimization can help an organization refine resource utilization, respond faster to user demand, and accommodate growing workloads.

Conclusion

Big data analysis is a driving force today, enabling businesses to mine insightful information from enormous datasets. The adoption of IoT devices, which collect data across many areas—from healthcare and retail to manufacturing and even smart cities—has added to the potential of big data analytics.

Our assessment of one big data analysis stack within Azure HDInsight emphasizes the significance of selecting the appropriate hardware and software components. Using CPUs from a more recent generation substantially improved performance, which decreased query runtimes and cut costs. When we compare 2nd Gen CPU-equipped VM clusters to their 1st Gen CPU-based counterparts, we see remarkable gains in query processing speed and reductions in cost. Employing optimizations delivered even greater improvements.

Organizations looking to gain a competitive edge and realize the full potential of their data in the constantly changing world of data-driven decision-making must adopt the appropriate hardware, software, and optimization strategies. By doing so, businesses can operate at peak efficiency, make better decisions, and position themselves for greater success in the future.

 

(1) https://www.intel.com/content/www/us/en/artificial-intelligence/analytics/what-is-big-data.html

(2) https://www.gartner.com/en/documents/3996804

(3) https://www.intel.com/content/www/us/en/artificial-intelligence/what-is-data-analytics.html

(4) https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-supported-node-configuration

(5) https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters

(6) https://developer.hpe.com/blog/spark-101-what-is-it-what-it-does-and-why-it-matters/

(7) https://www.intel.com/content/www/us/en/developer/articles/guide/xeon-performance-tuning-and-solution-guides.html#gs.30j4ji

(8) https://spark.apache.org/docs/latest/tuning.html

 

 

Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See above for configuration details. No product or component can be absolutely secure. Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.