Introduction
In this second entry in a three-blog series, we'd like to show you how the latest Amazon Elastic Compute Cloud (Amazon EC2) M7i and M7i-flex instances featuring 4th Generation Intel® Xeon® Scalable processors could support your artificial intelligence (AI), machine learning (ML), and deep learning (DL) workloads. In our first blog, we introduced these new instances and the ways they can benefit users generally. Now, we'd like to dig more specifically into how AI, ML, and DL workloads perform on these new instances and how your workloads can benefit from Intel processors.
One report already values the AI market at $136.55 billion and projects it to grow at a rate of 37.3% annually through at least 2030. (1) While you might attribute that growth to high-profile AI uses such as the Google search engine or Tesla's forays into self-driving vehicles, the advertising and media sector currently holds the largest share of the global AI market. (2) AI and ML/DL workloads have spread throughout our world, and their use is ever-expanding. Cloud service providers (CSPs) such as Amazon Web Services (AWS) have been investing in their AI/ML/DL services and infrastructure to help companies embrace these workloads more easily and efficiently. One such investment is hosting instances featuring 4th Generation Intel Xeon Scalable processors and their built-in accelerators for AI workloads.
In this blog, we'll discuss how Intel processors and AWS instances are well-equipped to meet your AI workload needs. We'll then dive into two popular ML/DL model types to show how these instances performed running those workloads.
The M7i Family and 4th Gen Intel Xeon Scalable Processors
As we covered in the previous blog, Amazon EC2 offers M7i and M7i-flex instances, both of which feature the latest generation of Intel Xeon processors. The primary difference is that M7i-flex delivers variable performance at a lower price. For sustained, compute-intensive workloads such as training or running machine learning models, the regular M7i instances are likely the better option, so we'll focus on them in this blog. M7i instances range from 2 to 192 vCPUs to fit a range of needs. Additionally, the ability to attach up to 128 EBS volumes to each instance ensures you'll have plenty of storage for your dataset. The latest Intel Xeon processors also come with several built-in accelerators that help enhance workload performance.
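If you'd like to experiment with these sizes yourself, launching an M7i instance programmatically is straightforward. Below is a minimal sketch using the AWS SDK for Python (boto3); the AMI ID is a placeholder, and you'd substitute your own region, image, and networking details.

```python
import boto3  # assumes AWS credentials are configured

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single M7i instance. The AMI ID below is a placeholder:
# substitute a current Amazon Linux or Ubuntu AMI for your region.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="m7i.4xlarge",       # 16 vCPUs; sizes go up to m7i.48xlarge
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```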
All M7i instances come with the Intel Advanced Matrix Extensions (Intel AMX) accelerator enabled to help users increase deep learning performance. Intel AMX lets AI workloads use the AMX instruction set while non-AI workloads continue to run on the processor's standard instruction set architecture (ISA). AMX is easy for developers to use because Intel has integrated the necessary optimizations into its oneAPI Deep Neural Network Library (oneDNN), which a variety of open-source AI frameworks, such as PyTorch, TensorFlow, and ONNX Runtime, incorporate. (3) According to Intel testing, 4th Gen Intel Xeon Scalable processors with AMX functionality can provide up to 10 times the inference performance of older processors. (4)
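To see this dispatch in action, here's a minimal sketch, assuming a PyTorch build with oneDNN (the default on x86). Setting oneDNN's verbose flag logs which kernels it selects; on a 4th Gen Xeon, you should see avx512_core_amx entries when bfloat16 is in play.

```python
import os

# Log which kernels oneDNN dispatches; set before PyTorch loads oneDNN.
# On a 4th Gen Xeon, look for "avx512_core_amx" entries in the output.
os.environ["ONEDNN_VERBOSE"] = "1"

import torch

model = torch.nn.Linear(1024, 1024).eval()
x = torch.randn(32, 1024)

# bfloat16 autocast lets oneDNN route matrix multiplies to AMX tile
# instructions where the hardware supports them.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)
print(y.shape)
```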
To get the most out of AI, ML, and DL workloads on the latest M7i instances with Intel AMX, engineers and developers must properly tune their workloads. To help in this regard, Intel provides an AI tuning guide that details how to take advantage of Intel processor features across several common models and frameworks. (5) The guide covers everything from OS-level optimizations to specific optimizations for PyTorch, TensorFlow, OpenVINO, and more. Intel also maintains the Intel Model Zoo GitHub repository, which includes pre-trained AI, ML, and DL models pre-validated to run on Intel hardware, guides for running and optimizing AI workloads, best practices, and more. (6)
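As one example of the kind of optimization the tuning guide describes, Intel Extension for PyTorch exposes a one-line model optimization pass. This is a sketch assuming the intel-extension-for-pytorch and torchvision packages are installed; consult the guide itself for the full set of OS- and framework-level settings.

```python
import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex  # assumes the package is installed

model = models.resnet50(weights="IMAGENET1K_V2").eval()

# ipex.optimize applies operator fusion and weight layouts tuned for
# Intel Xeon processors; bfloat16 enables the AMX fast path.
model = ipex.optimize(model, dtype=torch.bfloat16)

x = torch.randn(1, 3, 224, 224)
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
print(out.argmax(dim=1))
```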
Now that you know how Intel and the latest Intel Xeon processors can improve AI, ML, and DL workloads in general, let’s look at how these instances perform with two specific model types: object detection and natural language processing (NLP).
Object Detection Models
Object detection models drive apps and programs that scan images for classification. Several different models fall under this category, including those for 3D medical scans, self-driving vehicle cameras, facial recognition, and more. Two such models we’ll discuss are ResNet-50 and RetinaNet.
ResNet-50 is an image recognition deep learning model that uses a convolutional neural network (CNN) with 50 layers. Users train these models to recognize and classify objects in an image. Many pre-trained ResNet-50 models, including those available in the Intel Model Zoo, (7) are trained on ImageNet, a large image database. (8) Most object detection models have either one or two stages, with two-stage models delivering higher accuracy but slower speeds than single-stage models. While both ResNet-50 and RetinaNet are single-stage models, RetinaNet introduces a Focal Loss function that increases accuracy without sacrificing performance. (9)
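To make this concrete, here's a minimal inference sketch using the ImageNet-pre-trained ResNet-50 that ships with torchvision, standing in for the Model Zoo checkpoints mentioned above; example.jpg is a hypothetical local image.

```python
import torch
from PIL import Image
from torchvision import models

# ResNet-50 pre-trained on ImageNet, with its matching preprocessing.
weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()  # resize, crop, and normalize

img = Image.open("example.jpg")  # hypothetical local image
batch = preprocess(img).unsqueeze(0)  # add a batch dimension

with torch.no_grad():
    probs = model(batch).softmax(dim=1)
score, idx = probs.max(dim=1)
print(weights.meta["categories"][idx.item()], f"{score.item():.3f}")
```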
How quickly these models can analyze images matters greatly, depending on the application. End users don't want a long delay while waiting for their devices to recognize them and unlock. Farmers need to detect plant diseases and harmful insect invasions quickly, before they harm too many crops. Intel testing of M7i instances using the MLPerf RetinaNet model shows that these new instances substantially outperform the older M6i instances, analyzing up to 4.11 times as many samples per second. (10)
According to our ResNet-50 tests, performance scales well as you increase vCPU count, so you can maintain strong performance regardless of dataset and instance size. For example, an M7i instance with 192 vCPUs achieved eight times the ResNet-50 throughput of a 16-vCPU instance. (11) (Note that it's rare to get perfectly linear scaling with most real-world workloads.) You can also get more for your money by selecting higher-performing instances. In our RetinaNet tests, M7i instances analyzed up to 4.49 times as many samples per dollar as their same-sized M6i counterparts. (12) These results show that M7i instances with 4th Gen Intel Xeon Scalable processors are a great choice for object detection deep learning workloads.
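The price-performance math behind claims like this is simple: normalize measured throughput by the instance's hourly price. Here's a sketch with purely hypothetical numbers; plug in your own measured throughput and current AWS on-demand pricing for your region.

```python
def samples_per_dollar(samples_per_sec: float, price_per_hour: float) -> float:
    """Throughput normalized by instance cost (samples per dollar spent)."""
    return samples_per_sec * 3600 / price_per_hour

# Hypothetical throughput and pricing, for illustration only.
m7i = samples_per_dollar(samples_per_sec=120.0, price_per_hour=1.60)
m6i = samples_per_dollar(samples_per_sec=40.0, price_per_hour=1.54)
print(f"M7i delivers {m7i / m6i:.2f}x the samples per dollar of M6i")
```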
Natural Language Processing Models
When you enter a question into a search engine or ask a question of a website's chatbot, you're almost certainly engaging with natural language processing (NLP) engines. NLP models are trained to recognize natural speech patterns in order to understand and engage with language. These models, such as those based on Bidirectional Encoder Representations from Transformers (BERT) machine learning, can go beyond storing and displaying text to understanding and contextualizing it. (13) Word processing software and phone texting apps now provide text prediction based on what the user has already written.

While not every company runs an international search engine such as Google Search, even small companies find value in using chatbots for initial interactions with customers. These companies need a chatbot that is clear, quick, and accurate. Because chatbots and many other applications of NLP models require real-time execution, performance is of utmost importance.

With M7i instances and 4th Generation Intel Xeon Scalable processors, users can see performance improvements with NLP models such as BERT and RoBERTa, a modified, optimized version of BERT. According to one benchmark test, M7i instances running RoBERTa analyzed up to 10.65 times as many sentences per second as a Graviton-based M7g instance with the same vCPU count. (14) When we tested BERT with the MLPerf suite, throughput again scaled well as we increased the vCPU count of M7i instances, with the 192-vCPU instance achieving over 4 times the throughput of the 32-vCPU instance. (15)
One reason for this outstanding performance from the M7i instances is the Intel AMX accelerator we discussed earlier, built into 4th Gen Intel Xeon Scalable processors. With publicly available models pre-optimized for Intel processors and tuning guides for specific models such as BERT, Intel equips customers with everything they need to get the most out of their NLP workloads. (16) As with RetinaNet, M7i instances also delivered much better performance per dollar in these NLP tests, up to 8.62 times that of a same-sized M7g instance. (17)
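To give a concrete feel for the kind of workload these numbers describe, here's a minimal sketch using the Hugging Face transformers pipeline API with the publicly available bert-base-uncased checkpoint; the masked-word prediction below is the pre-training task that gives BERT its contextual understanding.

```python
from transformers import pipeline  # assumes the transformers package

# A pre-trained BERT checkpoint predicting a masked word from context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill_mask("The chatbot answered my [MASK] instantly.")[:3]:
    print(f"{candidate['token_str']:>12}  score={candidate['score']:.3f}")
```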
Conclusion
Cloud decision-makers should evaluate Amazon EC2 M7i instances featuring 4th Generation Intel Xeon Scalable processors for their AI, ML, and DL needs. With built-in Intel AMX acceleration, tuning guides, and optimized models for many popular ML workloads, these instances can deliver up to 10 times the throughput of Graviton-based M7g instances. Stay tuned for more blogs showing you how the latest M7i and M7i-flex instances can support other workload needs, as well.
Watch this short video to learn more about the M7i instances.
(1) https://www.grandviewresearch.com/industry-analysis/artificial-intelligence-ai-market
(2) https://www.grandviewresearch.com/industry-analysis/artificial-intelligence-ai-market
(6) https://github.com/IntelAI/models
(9) https://keras.io/examples/vision/retinanet/
(13) https://huggingface.co/blog/bert-101
Notices and Disclaimers
Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.