The recent excitement over Artificial Intelligence (AI) is driving up demand for computational resources and their cost. As AI models grow larger to support equally fast-growing Generative AI (GenAI) use cases, customers are naturally looking for more cost-effective ways to develop, train, and tune their Foundation Models and Large Language Models (FMs/LLMs).
Amazon Elastic Compute Cloud (Amazon EC2) M7i instances are powered by custom 4th Gen Intel® Xeon® Scalable processors (code-named “Sapphire Rapids”), which bring Intel® Accelerator Engines to a broad audience on Xeon®.
This post shows how M7i instances can enable cost-effective deployment of medium-sized LLMs with up to 13 billion parameters while achieving sub-100 ms latency and up to 4x performance improvements over M6i instances.
While the largest LLMs, such as Amazon Titan and Anthropic Claude, contain hundreds of billions to trillions of parameters, recent research (link) shows that medium-sized models with under 13 billion parameters can match their accuracy for many enterprise use cases. At a fraction of the computational cost, these medium-sized LLMs unlock new possibilities for cost-effective and responsive deployment.
Sidebar:
- Tokens are the units of text in a model's trained vocabulary, typically sub-words rather than whole words or sentences (see the tokenizer sketch after this sidebar).
- Parameters are the internal weights of a machine-learning model; more parameters allow it to capture more complex patterns.
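To make the notion of tokens concrete, here is a minimal sketch of sub-word tokenization using a Hugging Face transformers tokenizer. The model id is illustrative (Llama2 checkpoints are gated and require access approval), and the exact token strings shown in the comments are an assumption that depends on the tokenizer version.

```python
# Minimal sketch: inspecting sub-word tokenization with a Hugging Face tokenizer.
# The model id is illustrative; Llama2 checkpoints are gated and require approval.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

text = "I enjoy walking"
token_ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(token_ids)

# Each entry is a sub-word unit from the model's trained vocabulary,
# not necessarily a whole word.
print(tokens)     # e.g. ['<s>', '▁I', '▁enjoy', '▁walking']
print(token_ids)  # the integer ids the model actually consumes
```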
Llama2 inference on M7i instances
LLMs generate text autoregressively, producing one token at a time. For example, given an input sequence like "I enjoy walking", the model sequentially outputs tokens until it forms a complete sentence or reaches a predefined maximum length. Each new token is predicted by passing the current partial output back into the model. In this way, LLMs produce fluent text by repeatedly feeding their own predictions back as input to extend the sequence.
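The loop below is a minimal sketch of this process, using greedy search with a Hugging Face causal language model. The model id, prompt, and token budget are illustrative; the first pass over the full prompt is the prefill step, and the later single-token passes are the decode steps discussed next.

```python
# Minimal sketch of greedy autoregressive decoding with an explicit KV cache.
# The model id is illustrative; Llama2 weights are gated and require approval.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

input_ids = tokenizer("I enjoy walking", return_tensors="pt").input_ids
past_key_values = None          # KV cache, filled during the prefill step
max_new_tokens = 32

with torch.inference_mode():
    for _ in range(max_new_tokens):
        # First iteration processes the whole prompt (prefill); later iterations
        # pass only the newest token and reuse the cached key-values (decode).
        step_input = input_ids if past_key_values is None else input_ids[:, -1:]
        out = model(input_ids=step_input,
                    past_key_values=past_key_values,
                    use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```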
Deployment of large language models depends on the following metrics:
- Prefill / 1st token latency: The time to generate the 1st token. This step processes the full input prompt and precomputes intermediate values, called key-values, which are stored in a software cache.
- Decoding / 2nd token latency: The time to generate the 2nd and subsequent tokens. This step reuses the precomputed key-value cache, so each token is generated faster than the 1st token.
- Tokens / second: The number of tokens generated per second.
Prefill latency is compute-intensive because it must process all the tokens of the input prompt. The Intel® Advanced Matrix Extensions (AMX) built-in accelerator on M7i instances reduces 1st token latency compared to other CPU instances. Decoding latency, by contrast, is memory bandwidth-intensive: it must read all the model parameters from memory while using the key-value cache to save compute. M7i instances come with DDR5 memory, whose higher bandwidth helps reduce decoding latency.
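The sketch below illustrates one way to measure the two latencies separately, reusing the model and tokenizer from the earlier sketch. The prompt construction and the 128-token decode budget are only stand-ins for the benchmark settings quoted with the figures; this is not the exact measurement script used for the results below.

```python
# Illustrative sketch: separating prefill (1st token) and decode (2nd+ token)
# latency. Assumes the `model` and `tokenizer` objects from the earlier sketch.
import time
import torch

prompt = "I enjoy walking " * 64          # rough stand-in for a longer prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.inference_mode():
    # Prefill: process the full prompt and build the KV cache.
    t0 = time.perf_counter()
    out = model(input_ids=input_ids, use_cache=True)
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    first_token_ms = (time.perf_counter() - t0) * 1e3

    # Decode: generate further tokens one at a time, reusing the cache.
    decode_ms = []
    past = out.past_key_values
    for _ in range(128):
        t0 = time.perf_counter()
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decode_ms.append((time.perf_counter() - t0) * 1e3)

print(f"1st token latency: {first_token_ms:.1f} ms")
print(f"P90 decode latency: {sorted(decode_ms)[int(0.9 * len(decode_ms))]:.1f} ms")
```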
The following figures show the latency data on M7i.8xlarge and M7i.16xlarge for the Llama2 7B and 13B models.
*Batch size 1, Greedy search, input tokens 256 and 1024, P90 decoding latency for 128 output tokens
Figures 1-4 show that Llama2 7B models on an M7i.8xlarge instance can generate tokens in under 50 ms, making them suitable for latency-sensitive chatbot applications. Similarly, Llama2 13B models can generate tokens in under 100 ms on M7i.16xlarge instances.
Cost savings compared to M6i instances
In the gen-over-gen comparison, we use the M6i.16xlarge and M7i.16xlarge instances. M6i instances support Intel Advanced Vector Extensions 512 (AVX-512) FP32 instructions and Vector Neural Network Instructions (VNNI) for INT8 AI acceleration. In contrast, the Intel AMX built into M7i instances adds BF16 acceleration along with additional INT8 gains.
To compare inference costs, we divide our experiment into two test cases. The first test case runs INT8 on both instance types, while the second runs FP32 on M6i instances and BF16 on M7i instances.
*Batch size 8, Beam width 1
When moving from the FP32 model on M6i.16xlarge to the BF16 model on M7i.16xlarge, we observed a 2.5x increase in tokens per second. Likewise, moving the same INT8 model from M6i.16xlarge to M7i.16xlarge yielded a 1.4x increase in tokens per second.
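The following sketch shows one way to enable BF16 inference with Intel® Extension for PyTorch (IPEX) so that oneDNN can dispatch matrix multiplications to AMX on M7i. It is a simplified, assumption-laden example, not the benchmark script used for the numbers above; the actual test scripts from the llm_feature_branch are linked in the configuration notes below.

```python
# Hedged sketch: BF16 inference with Intel Extension for PyTorch on a Xeon CPU.
# Model id and prompt are illustrative; this is not the exact benchmark script.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"     # illustrative, gated checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Apply IPEX operator and graph optimizations for bf16 on Xeon.
model = ipex.optimize(model, dtype=torch.bfloat16)

input_ids = tokenizer("Summarize: cloud inference on CPUs",
                      return_tensors="pt").input_ids
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```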
AMX acceleration in M7i instances
The built-in Intel AMX accelerator in M7i instances speeds up matrix multiplication using two-dimensional tile registers and tile matrix-multiply instructions. This helps keep LLM latency low and increases throughput, which reduces the cost of inference. For the Llama2 7B model, throughput increases by 6.47x when the batch size grows from 1 to 8, while latency increases only by 1.26x.
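A simple batch-size sweep, sketched below, is one way to observe this throughput-versus-latency trade-off on your own instance. It reuses the BF16 model and tokenizer from the previous sketch; the prompt and token counts are illustrative, and measured numbers will vary by instance size and model.

```python
# Hedged sketch: sweeping batch size to observe the throughput/latency trade-off.
# Assumes the BF16 `model` and `tokenizer` from the previous sketch.
import time
import torch

prompt = "I enjoy walking in the park because"
new_tokens = 128

for batch_size in (1, 2, 4, 8):
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt")
    with torch.inference_mode():
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
        elapsed = time.perf_counter() - start
    tokens_per_sec = batch_size * new_tokens / elapsed
    print(f"batch={batch_size}: {tokens_per_sec:.1f} tokens/s, "
          f"{1e3 * elapsed / new_tokens:.1f} ms per decoding step across the batch")
```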
Conclusion
Amazon EC2 M7i instances, powered by custom 4th Gen Intel® Xeon® Scalable processors, offer a compelling option for deploying medium-sized AI models. With low latency and efficient AI acceleration, these instances can reduce inference costs by up to 2.5x compared to M6i instances. Models like Llama2 achieve over 90% accuracy in use cases such as chat and summarization, all while maintaining sub-100 ms latency on M7i.
Intel's optimized software stack enables performance gains through quantization and other optimizations, taking full advantage of the built-in Intel® AMX accelerator. By leveraging these capabilities, users can benefit from the inference acceleration of M7i at a competitive total cost of ownership (TCO).
You can launch M7i instances today to start realizing the cost-saving and performance benefits for your AI applications. These new instances' enhanced capabilities make them an attractive option for cost-sensitive yet high-performance AI workloads.
View these blogs from our series to learn more about how to redefine performance and accelerate workloads to meet business objectives.
- Redefine performance. Accelerate Workloads for Your Best Business Outcomes - Intel Community
- Amazon EC2 M7i & M7i Flex - AI Workloads in the Cloud - Intel Community
Disclaimer
Performance varies by use, configuration, and other factors. Performance results are based on testing as of the dates shown in the configurations and may not reflect all publicly available updates; results may vary. Intel does not control or audit third-party data; you should consult other sources to evaluate accuracy. For additional information, visit www.Intel.com/PerformanceIndex.
Performance data were measured by Intel on August 17, 2023, on M7i.16xlarge and M7i.8xlarge instances in us-west-2 with Llama2 7B and Llama2 13B models. OS: Ubuntu 22.04 LTS, kernel 6.2.0-1009-aws. SW: PyTorch 2.1 and Intel Extension for PyTorch 2.1/llm_feature_branch, Transformers 4.31, gcc 12.3.0. Models were quantized using Intel Extension for PyTorch.
Refer to the test scripts here: intel-extension-for-pytorch/examples/cpu/inference/python/llm at llm_feature_branch · intel/intel-extension-for-pytorch (github.com)
About the Authors
Antony Vance is a Principal Engineer at Intel® with 19 years of experience in computer vision, machine learning, deep learning, embedded software, GPU, and FPGA.
Greg Medard is a Solutions Architect with AWS Business Development and Strategic Industries. He helps customers with the architecture, design, and development of cloud-optimized infrastructure solutions. His passion is to influence cultural perceptions by adopting DevOps concepts that withstand organizational challenges along the way. Outside of work, you may find him spending time with his family, playing with a new gadget, or traveling to explore new places and flavors.
Mikołaj Życzyński is an AI Software Development Engineer who collaborates closely with clients to deliver a diverse array of AI solutions on Intel® Xeon® platforms.