
Tuning and Inference for Generative AI with 4th Generation Intel Xeon Processors (Part 3 of 3)

Mohan_Potheri

Part 1 of this blog series introduced Generative AI tuning and inference concepts. Part 2 walked through taking an open-source large language model and tuning it for a specific use case on the latest Intel Xeon processors. In this final part, we look at using the latest Intel Xeon processors for Generative AI inference with a real-world use case.

Inference with Xeon for Generative AI:

Intel Xeon offers a cost-effective, scalable, and versatile solution for LLM inference, democratizing access to powerful generative models and unlocking their potential across a wide range of end-user applications and industries.

Why CPUs for Inference?

While GPUs have long dominated AI training due to their parallel processing prowess, CPUs offer distinct advantages for inference:

  • Cost-effectiveness: CPUs are generally more affordable and readily available than high-end GPUs, making them accessible to a wider range of developers and researchers.
  • Scalability: CPU-based systems are easily scalable, allowing you to adapt your infrastructure to handle growing model sizes and computational demands.
  • Versatility: CPUs excel at diverse tasks beyond just AI, making them valuable for general-purpose computing alongside inference workloads.
  • New Instructions: Intel advancements such as Advanced Matrix Extensions (AMX) and DL Boost provide hardware-accelerated support for key AI operations, significantly boosting CPU performance for inference (a quick way to check for AMX support is sketched below).
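As a quick sanity check before relying on these instructions, you can inspect the CPU flags the kernel exposes. The helper below is a minimal sketch (not part of the blog's tooling); it assumes a Linux host, where 4th Gen Xeon instances such as c7i advertise the amx_tile, amx_bf16, and amx_int8 flags in /proc/cpuinfo:

```python
# Hypothetical helper (illustrative only): report whether a Linux host exposes
# the Intel AMX CPU flags used to accelerate INT8/BF16 matrix operations.
def has_amx(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    with open(cpuinfo_path) as f:
        flags = f.read()
    return all(flag in flags for flag in ("amx_tile", "amx_bf16", "amx_int8"))

if __name__ == "__main__":
    print("Intel AMX available:", has_amx())
```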

Optimizing for CPUs:

Unlocking the full potential of CPUs for generative AI inference requires careful optimization:

  • Model Quantization: Reducing model precision from 32-bit floating point to 8-bit integer can significantly shrink model size and accelerate inference with minimal loss of accuracy (see the sketch after this list).
  • Knowledge Distillation: Transferring knowledge from a larger, pre-trained model to a smaller CPU-compatible model can maintain performance while reducing resource requirements.
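To make the quantization idea concrete, the snippet below is a minimal, illustrative sketch (not the exact flow used in this blog) that converts the Linear layers of a toy PyTorch module from FP32 to INT8 using stock PyTorch dynamic quantization; the TinyNet class is a stand-in for a real model:

```python
import torch
from torch import nn

# Stand-in model (illustrative only); any torch.nn.Module with Linear layers works.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

    def forward(self, x):
        return self.net(x)

fp32_model = TinyNet().eval()

# Post-training dynamic quantization: Linear weights are stored as INT8 and
# activations are quantized on the fly at inference time.
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    print(int8_model(torch.randn(1, 512)).shape)  # torch.Size([1, 10])
```

The same idea carries over to LLMs, where 8-bit weight formats can cut the memory footprint substantially on CPU-only instances.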

Real-World Applications:

The power of CPUs for generative AI is already being harnessed in various fields:

  • Drug Discovery: Researchers are using CPU-powered systems to generate novel drug candidates, accelerating the search for life-saving treatments.
  • Materials Science: CPUs are being used to design new materials with desired properties, leading to breakthroughs in fields like energy and aerospace.
  • Creative Content Generation: Artists and writers are exploring the potential of CPUs to generate original content, from poems and stories to music and paintings.

Inference for Falcon 7B with Amazon EC2 c7i instances:

The compute requirements for inference with the Falcon-7B model were analyzed through a sizing exercise. The metric used was the end-to-end latency of the inference run, with a goal of less than 25 seconds per chat response.
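The benchmark harness itself is not reproduced in this blog, but a rough illustration of how such a latency check could be scripted is shown below; the model ID, prompt, and bfloat16 data type are assumptions for illustration, and the generation settings mirror the max_length=200 and top_k=10 used in the inference command line shown later (Table 5):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative latency check against the < 25 s chat-response target.
model_id = "tiiuae/falcon-7b"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

inputs = tokenizer("What is a large language model?", return_tensors="pt")

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_length=200, do_sample=True, top_k=10)
latency = time.perf_counter() - start

print(tokenizer.decode(output[0], skip_special_tokens=True))
print(f"End-to-end latency: {latency:.1f} s (target: < 25 s)")
```

Table 4 below summarizes the compute infrastructure used for these inference runs.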

 

| Category | Attribute | c7i |
| --- | --- | --- |
| Run Info | Benchmark | Inference of the Falcon 7-billion-parameter model with Hugging Face Accelerate, PyTorch 2.0.1, and Intel Extension for PyTorch 2.0.100 |
| | Date | Nov 10-24, 2023 |
| | Test by | Intel |
| CSP and VM Config | Cloud | AWS |
| | Region | us-east-1 |
| | Instance Type | c7i.8xlarge |
| | CPU(s) | 16 cores |
| | Microarchitecture | AWS Nitro |
| | Instance Cost | 1.428 USD per hour |
| | Number of Instances or VMs (if cluster) | |
| | Iterations and result choice (median, average, min, max) | |
| Memory | Memory | 64 GB |
| | DIMM Config | |
| | Memory Capacity / Instance | |
| Network Info | Network BW / Instance | 12.5 Gbps |
| | NIC Summary | |
| Storage Info | Storage: NW or Direct Attached / Instance | SSD GP2 |
| | Drive Summary | 1 volume, 70 GB |

Table 4: Compute Infrastructure for Falcon-7B Inference

The model tuned in the earlier phase was deployed on the compute infrastructure shown in Table 4. The software components used for inference are shown in Table 5, and a sketch of what the inference invocation might look like in code follows the table.

 

| Category | Attribute | c7i |
| --- | --- | --- |
| Run Info | Benchmark | Inference using the fine-tuned Falcon-7B model with Hugging Face Accelerate, PyTorch 2.0.1, and Intel Extension for PyTorch 2.0.100 |
| | Dates | Nov 10-24, 2023 |
| | Test by | Intel |
| Software | Workload | Generative AI Fine Tuning |
| Workload Specific Details | Command Line | Inference using the fine-tuned Falcon 7B model: `python vmw_peft-tuned-inference.py --checkpoints /mnt/data/llm/aws_best_model_dist1_aws/checkpoint-1100/ --max_length 200 --top_k 10` |

Table 5: Workload details for inference
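
The inference script invoked in Table 5 (vmw_peft-tuned-inference.py) is not reproduced in this blog. The snippet below is a minimal sketch of what loading a PEFT checkpoint on top of the Falcon-7B base model for generation could look like; the checkpoint path comes from the command line above, while the base model ID, prompt, and sampling settings are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "tiiuae/falcon-7b"  # assumed base checkpoint
adapter_path = "/mnt/data/llm/aws_best_model_dist1_aws/checkpoint-1100/"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Attach the fine-tuned PEFT/LoRA weights saved during the tuning phase.
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

inputs = tokenizer("What is generative AI?", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_length=200, do_sample=True, top_k=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```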

Inference Results:

The effectiveness of the Falcon-7B model for inference was tested before and after tuning. With the untuned Falcon-7B model, the chatbot produced short answers to user queries with very little detail, as shown in Figure 5.

 


Figure 5: Falcon-7B chatbot responses pre-tuning

The tuned model was then deployed, and the chatbot was tested with questions similar to those asked before.

 


Figure 6: Falcon-7B chatbot responses post tuning

The responses from the tuned Falcon-7B chatbot show a more comprehensive understanding of the topics queried than those of the untuned model. All inference queries met the application's response-time SLA.

Conclusion:

Intel and AWS customers can use Xeon processors to tune small- to medium-sized LLMs for their specific use cases. The Falcon-7B large language model was tuned on 4th Generation Intel Xeon based Amazon EC2 C7i instances in a few hours, an acceptable turnaround for tuning. This shows that enterprise customers can effectively take open-source LLMs such as Falcon-7B and tune them for their domain-specific use cases on Intel Xeon based cloud infrastructure.

Many of these Generative AI applications are deployed at the edge, where the amount of available compute is limited. A reasonably sized instance that can be made available at the edge, such as the c7i.8xlarge with 16 cores and 64 GB of RAM, combining the latest Xeon hardware with software optimizations, was able to meet the inference SLA for an LLM like Falcon-7B.

Through this blog series we have demonstrated that Intel Xeon based cloud instances can be used effectively for tuning and inference with publicly available LLMs such as Falcon-7B.

References:

[1] https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors-product-brief.html : With the most built-in accelerators of any CPU on the market, Intel® Xeon® Scalable processors offer the most choice and flexibility in cloud selection with smooth application portability.

[2] https://aws.amazon.com/ec2/instance-types/c7i/ : Amazon Elastic Compute Cloud (Amazon EC2) C7i instances are next-generation compute-optimized instances powered by custom 4th Generation Intel Xeon Scalable processors (code named Sapphire Rapids) and feature a 2:1 ratio of memory to vCPU. EC2 instances powered by these custom processors, available only on AWS, offer the best performance among comparable Intel processors in the cloud - up to 15% better performance than Intel processors utilized by other cloud providers.

[3] https://huggingface.co/tiiuae/falcon-7b : Falcon-7B is a 7B-parameter causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. It is made available under the Apache 2.0 license.

[4] https://huggingface.co/datasets/timdettmers/openassistant-guanaco : The Guanaco dataset is a subset of the Open Assistant dataset. This subset of the data only contains the highest-rated paths in the conversation tree, with a total of 9,846 samples.

[5] https://www.intel.com/content/www/us/en/developer/articles/technical/fine-tune-falcon-llm-with-hugging-face-oneapi.html : Fine-tuning Falcon-7B with Hugging Face and Intel oneAPI.

[6] https://www.youtube.com/watch?v=JNMVulH7fCo : Video demonstrating the techniques.

About the Author
Mohan Potheri is a Cloud Solutions Architect with more than 20 years in IT infrastructure and in-depth experience in cloud architecture. He currently focuses on educating customers and partners on Intel capabilities and optimizations available on Amazon AWS, and is actively engaged with the Intel and AWS partner communities to develop compelling solutions with Intel and AWS. He is a VMware vExpert (VCDX #98) with extensive knowledge of on-premises and hybrid cloud, and has extensive experience with business-critical applications such as SAP, Oracle, SQL, and Java across UNIX, Linux, and Windows environments. Mohan is an expert in AI/ML and HPC and has been a speaker at conferences such as VMworld, GTC, ISC, and other partner events.