Data collection and analysis are more important than ever in making business decisions. As organizations collect enormous volumes of data from both internal and external sources, they increasingly engage machine learning (ML), deep learning (DL), and other artificial intelligence (AI) in their data analysis workloads. With the right tools, this data can generate insight into new business frontiers—providing customers with customized experiences, identifying fraud and misuse, and yielding other business opportunities.
Graphics processing units (GPUs) are powerful tools for AI workloads, but they can be expensive, and in recent years they have also become difficult to source.
Increasingly, organizations assume that their workloads are large and complex enough to require the compute prowess of one or more GPUs or clusters of cloud machines. In reality, CPUs in the cloud can comfortably support many of these analytical AI workloads. You might work at a smaller company with relatively little data, or your organization might divide its data into manageable subsets. In these environments, your data analysis may not demand top-speed performance.
If you’re looking to run your CPU-backed AI workloads in the public cloud, read on to see what we found in our testing on Microsoft Azure and Amazon Web Services (AWS).
ML and Other AI Workloads
Machine learning workloads train with different types of data. Our testing focused on three specific models: ResNet-50, Wide & Deep, and BERT.
ResNet-50 is a convolutional neural network that analyzes and classifies images. Once you’ve trained a model to recognize a certain object in an image, you can use that model to find this object in new images. ResNet-50 models support applications in medical imaging, self-driving cars, facial recognition, and much more.
Wide & Deep (W&D) models combine wide linear models and deep neural networks. These models allow for analyzing data in both a specific and a general manner. A common example of a Wide & Deep application is one that lets users request food delivery by entering the specific type of food they want (for example, “shrimp scampi”) within their geographical area. This approach provides a better user experience than an app that lets you order from only one specific restaurant. The linear model in a W&D app increases the likelihood of precisely matching your specific request, while the neural network supports more generic requests such as “seafood.” Another real-world example of a W&D model is the YouTube algorithm, which not only finds videos that match exact search terms but also suggests videos that are similar to those the user watches.
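To make the wide/deep split concrete, here is a minimal pure-Python sketch of the idea: a "wide" score that memorizes exact query–item combinations, plus a "deep" score that generalizes through embedding similarity. All of the weights, embeddings, and item names below are hypothetical illustrations, not a real W&D implementation.

```python
# Toy sketch of the Wide & Deep idea. The "wide" part memorizes exact
# (query, item) pairs it has seen; the "deep" part generalizes via dense
# embeddings. All numbers here are hand-made placeholders.

def wide_score(query, item, cross_weights):
    # Wide part: a linear model over memorized cross features.
    return cross_weights.get((query, item), 0.0)

def deep_score(query, item, embeddings):
    # Deep part: a dot product of tiny hand-made vectors stands in for
    # what a trained neural network would learn.
    q, i = embeddings[query], embeddings[item]
    return sum(a * b for a, b in zip(q, i))

def recommend(query, items, cross_weights, embeddings):
    # The final score sums both parts, as in the W&D formulation.
    return max(items, key=lambda it: wide_score(query, it, cross_weights)
                                     + deep_score(query, it, embeddings))

# "shrimp scampi" was ordered before, so the wide part memorizes it;
# "seafood" matches only through embedding similarity (the deep part).
cross = {("shrimp scampi", "Luigi's Shrimp Scampi"): 2.0}
emb = {
    "shrimp scampi": [1.0, 0.2],
    "seafood": [0.9, 0.1],
    "Luigi's Shrimp Scampi": [0.8, 0.3],
    "Veggie Burger": [0.1, 0.9],
}
items = ["Luigi's Shrimp Scampi", "Veggie Burger"]
print(recommend("shrimp scampi", items, cross, emb))  # exact match wins
print(recommend("seafood", items, cross, emb))        # similar item wins
```

Both the specific request and the generic one resolve to the scampi dish, but through different halves of the model, which is exactly the complementary behavior W&D is designed for.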
The last model type we used in testing, BERT (Bidirectional Encoder Representations from Transformers), analyzes text as a natural language processor. Similar to the way that ResNet-50 models train with images, BERT models train with text. After learning common word and phrase combinations, these models can predict text. You’ve probably become accustomed to the way that word processing applications, smart phone keyboards, and email applications can anticipate what you’re about to say once you start typing a sentence or how you might respond to a message you’ve just received. Whether you find these features convenient, creepy, or a little of each, BERT models are likely behind them.
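The "learn common word combinations, then predict text" idea can be illustrated with a deliberately tiny stand-in: a bigram counter that predicts the most frequent next word. A real BERT model uses transformer attention over text in both directions; this sketch, with its made-up three-sentence corpus, only shows the flavor of the prediction task.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows each word in a tiny
# corpus, then predict the most frequent follower. This is a bigram model,
# not BERT; it only illustrates the predict-the-next-text idea.

def train(corpus):
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            followers[prev][nxt] += 1
    return followers

def predict_next(model, word):
    options = model.get(word.lower())
    return options.most_common(1)[0][0] if options else None

corpus = [
    "thank you for your message",
    "thank you for your help",
    "see you soon",
]
model = train(corpus)
print(predict_next(model, "thank"))  # "you"
print(predict_next(model, "your"))   # "message" or "help" (a tie)
```

Scale the corpus up by many orders of magnitude and replace the counting with a bidirectional transformer, and you have the rough shape of how email apps anticipate your replies.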
Choosing the Right Instance
When looking into the broad options for moving your ML workloads to the cloud, there’s a lot to consider. Both AWS and Azure offer extensive turn-key services in the AI domain; these range from targeted services such as Amazon Lex, which helps create chatbots for customer service interactions and engagements, to full-service, end-to-end ML suites such as Azure Machine Learning.
The goal of our tests was to determine which basic VMs/instances could better perform inference tasks within these three models. For this reason, rather than using special services, we ran pre-trained models on basic VM offerings. We tested a range of instance types and sizes, so that you can compare your workload needs to help determine the best fit for your organization.
The Intel Factor
The most important aspect of any system running inference tasks is compute capability. Here, we restrict our discussion to AI workload sizes and types appropriate for broadly available and supported CPU VMs and their well-established ecosystem.
While the number of CPU cores and threads play a crucial role in performance, so do the underlying processor technologies and accelerators. The latest 3rd Generation Intel Xeon Scalable processors build on earlier generations to include a number of features that directly benefit deep learning workloads. In the first generation of Intel Xeon Scalable processors, Intel introduced AVX-512, which added ultra-wide 512-bit vector operations capabilities to enhance computational tasks. Beginning with 2nd Generation Intel Xeon Scalable processors, Intel expanded the AVX-512 benefits with Intel Deep Learning Boost, which uses Vector Neural Network Instructions (VNNI) to further accelerate AI/ML/DL workloads. This boost offers better cache utilization, improves DL performance, and helps avoid bandwidth bottlenecks inherent to DL performance. Recently, with the latest 3rd Generation Intel Xeon Scalable processors, Intel introduced BFloat16 for workloads that don’t require high precision but do require high computational AI resources.
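Before committing to a VM type, you can verify which of these instruction-set features a given machine actually exposes. On Linux, /proc/cpuinfo lists the CPU flags, and the flag names below (avx512f, avx512_vnni, avx512_bf16) are the ones the Linux kernel reports for AVX-512, VNNI, and BFloat16. This is a hedged sketch: on non-Linux systems it simply reports the features as absent.

```python
# Check which AVX-512-family features the current CPU exposes by reading
# the Linux /proc/cpuinfo flags line. Returns an empty set on systems
# without /proc/cpuinfo (e.g., macOS or Windows).

def cpu_flags():
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()

flags = cpu_flags()
for feature, flag in [("AVX-512 Foundation", "avx512f"),
                      ("VNNI (Intel DL Boost)", "avx512_vnni"),
                      ("BFloat16", "avx512_bf16")]:
    print(f"{feature}: {'yes' if flag in flags else 'no'}")
```

Running this on a candidate VM is a quick sanity check that the processor generation you paid for is the one your deep learning framework can actually exploit.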
To read more about Intel deep learning enhancements, see https://www.intel.com/content/dam/www/public/us/en/documents/product-overviews/dl-boost-product-overview.pdf.
To see how AWS instances featuring Intel processors compare against AMD-backed instances (delivering up to 2.94 times the frames per second in one ResNet-50 workload comparison), read on.
Figure 1: Relative ResNet50 performance of 96-vCPU M6i and M6a instances. Higher numbers are better. Source: Principled Technologies.
For those choosing an instance for deep learning and other AI workloads in the public cloud, we recommend considering several factors. First, from the thousands of options available across many regions worldwide, select an instance family that fits your needs. Microsoft Azure, for example, offers many categories of VMs, such as General Purpose and Memory Optimized. For this example, let’s choose the memory-optimized E-Series VMs. The series includes older versions, such as the Esv3 VMs featuring older Intel processors, and newer ones with the latest 3rd Generation Intel Xeon Scalable processors. VMs with other processor generations and VMs with enhanced storage and networking offerings are also part of this family.
For our inference workloads, high-performance storage wasn’t a necessity, so we selected the Esv5 series for comparison. These VMs feature the 3rd Gen Intel Xeon Scalable processors in sizes from 2 to 104 vCPUs. (Note that Esv5 VMs support Azure Ultra Disks, which are helpful for workloads requiring high-performance storage; by not including temporary storage, Azure is able to offer these VMs at a lower cost than the Edsv5 VMs.) Once you’ve chosen the number of vCPUs your workload needs, you can start moving your workload to the cloud.
Amazon also categorizes its hundreds of instance types by the specific hardware component for which they are optimized. This creates categories such as General Purpose, Compute Optimized, Memory Optimized, Storage Optimized, and Accelerated Computing. Because Accelerated Computing uses GPUs, it falls outside the scope of this blog, so we’ll look at only the other four families, which are available and supported in all regions worldwide.
Likely, you’ll want to choose from General Purpose or Compute Optimized instances. Because they typically have a better memory-to-vCPU ratio, we'll use the general-purpose family for this discussion. We know that Intel processors offer great benefits for computational workloads, making the M6i instances featuring 3rd Gen Intel Xeon Scalable processors a good place to start. The only other thing we need to do is select an instance size that fits our workload. The M6i instance series offers everything from a very small 2-vCPU instance to a very large 128-vCPU instance. A bare-metal version of the 128-vCPU instance is also available.
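One simple way to pick a size within a family is to benchmark your model on a small instance, derive a per-vCPU inference rate, and then choose the smallest size that meets your throughput target. The sketch below assumes roughly linear scaling with vCPU count, which real workloads only approximate; the vCPU sizes listed are the published M6i sizes, but the throughput numbers you plug in must come from your own measurements.

```python
# Hypothetical sizing helper: given a measured per-vCPU inference rate and
# a target throughput, pick the smallest M6i size that meets the target.

M6I_VCPU_SIZES = [2, 4, 8, 16, 32, 48, 64, 96, 128]

def smallest_size(target_per_sec, measured_per_vcpu, sizes=M6I_VCPU_SIZES):
    # Assumes near-linear scaling with vCPUs. Real inference workloads
    # scale sublinearly past some point, so validate with a test run.
    for vcpus in sizes:
        if vcpus * measured_per_vcpu >= target_per_sec:
            return vcpus
    return None  # target exceeds the largest size; consider several instances

# Placeholder numbers: 40 inferences/sec per vCPU, target of 500/sec.
print(smallest_size(target_per_sec=500, measured_per_vcpu=40))  # -> 16
```

Because per-vCPU throughput varies by model and batch size, rerun the measurement for each workload rather than reusing one number across models.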
If you need help choosing the number of vCPUs you need for your workload, or are curious about how performance differs at various sizes, continue reading.
AI/ML Performance Testing
We often read about new features and performance upgrades to hardware, but it’s not always easy to know how these improvements will actually impact your workloads. How can you know whether upgrading your instances to newer hardware would be worth the effort? That’s where this testing can help.
Advantages of Intel Xeon Scalable Processors
In our Azure testing on Esv4 VMs, we saw the impact of the accelerators introduced with the Intel Xeon Scalable processors compared to older Esv3 VMs with older E5-2673 v4 processors. In ResNet50 tests, we observed that the Esv4 VMs processed up to 8.4x as many images per second as the Esv3 VMs. With Wide & Deep, we saw up to 3.48x as many samples per second with Esv4 VMs vs. Esv3. New technologies can make a tremendous difference in your workload performance, which can reduce the number of instances you need to purchase and maintain.
Figure 2: Relative ResNet 50 performance of 8-vCPU VMs. Higher numbers are better. Source: Principled Technologies.
3rd Generation vs. 2nd Generation
Next, we ran tests on all three workloads we discussed previously—ResNet-50, Wide & Deep, and BERT—on AWS M6i instances featuring the latest 3rd Gen Intel Xeon Scalable processors and M5n instances with 2nd Generation processors. Our ResNet-50 tests showed that on both 16- and 96-vCPU instances, the M6i instances outperformed the M5n instances by about 20%. With a BERT inference workload, the M6i instances offered up to 45% more throughput than the M5n instances. And, finally, the M6i Wide & Deep performance was up to 33% better than that of the M5n instances.
Figure 3: Relative Wide & Deep performance of 96-vCPU VMs. Higher numbers are better. Source: Principled Technologies.
While the increases in performance were less dramatic than the ones we saw when moving from much older processors to the Intel Xeon Scalable processors, choosing the latest processors can still provide markedly improved performance on your workloads, which can in turn lead to savings.
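To see how a throughput gain translates into savings, it helps to normalize to cost per unit of work. The back-of-envelope sketch below compares cost per million inferences for two hypothetical instances; the prices and throughput figures are placeholders, not quoted AWS or Azure rates.

```python
# Back-of-envelope sketch: a newer instance that is ~20% faster can cost
# less per unit of work even at a slightly higher hourly price.
# All prices and throughput rates below are hypothetical placeholders.

def cost_per_million(throughput_per_sec, price_per_hour):
    seconds_needed = 1_000_000 / throughput_per_sec
    return seconds_needed / 3600 * price_per_hour

old = cost_per_million(throughput_per_sec=100, price_per_hour=1.00)
new = cost_per_million(throughput_per_sec=120, price_per_hour=1.05)
print(f"old: ${old:.2f} per million inferences")
print(f"new: ${new:.2f} per million inferences")
```

Even with a 5% higher hourly price in this made-up scenario, the 20% throughput advantage makes the newer instance cheaper per inference, which is the comparison that matters for steady-state inference fleets.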
Intel Competitive Performance
How do these instances and their processor technologies fare against options backed by AMD? On AWS, we again compared all three workloads using the M6i instances featuring 3rd Generation Intel Xeon Scalable processors to another general-purpose instance series. This time, we chose the M6a instances featuring 3rd Generation AMD EPYC processors—and saw that the M6i instances outperformed them on all three workloads.
The M6i instances running the ResNet-50 workload achieved up to 2.94x the images-per-second rate of the M6a instances. On the BERT workload, the difference was even larger, with the M6i instances delivering up to 6.4x the performance of the M6a instances. And, finally, on the Wide & Deep workload, the M6i instances with 3rd Generation Intel Xeon Scalable processors processed up to 1.67x as many samples per second as the M6a instances with AMD processors.
Figure 4: Relative BERT performance of 16-vCPU M6i and M6a instances. Higher numbers are better. Source: Principled Technologies.
For your AI, ML, and DL workloads, choosing cloud instances enabled by Intel processors featuring accelerators such as AVX-512 and VNNI is an excellent strategy for maximizing workload performance.
When choosing which of the many public cloud instances are best for your machine learning workloads, keep in mind the ways that instances backed by Intel Xeon Scalable processors can outperform instances with older and competing processors.