
Tuning and Inference for Generative AI with 4th Generation Intel Xeon Processors (Part 2 of 3)


We introduced Generative AI tuning and inference concepts in part 1 of this blog series. In part 2, we take an open-source large language model and tune it for a specific use case using the latest Intel Xeon processors.

Base Model: The Falcon-7B

The Falcon series is a cutting-edge collection of language models developed by the Technology Innovation Institute in Abu Dhabi. Released under the Apache 2.0 license, Falcon-7B[i] stands out as the inaugural "truly open" model, with capabilities that rival many existing closed-source models. This is highly promising for practitioners, enthusiasts, and industry alike, as it opens the door to a wide range of use cases.

Falcon-7B operates effectively with only around 15 GB of memory, making it practical for inference and fine-tuning even on consumer-grade hardware.
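As a rough illustration (not part of the benchmark runs described later), the checkpoint can be pulled from the Hugging Face Hub and loaded in bfloat16 on a Xeon host; the model ID below comes from the Falcon-7B model card[i]:

# Minimal sketch: load Falcon-7B in bfloat16 on CPU with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # roughly 2 bytes per parameter, ~14-15 GB of weights
    trust_remote_code=True,       # Falcon originally shipped custom modeling code
)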

 


 

Figure 1: Falcon Open-Source Generative AI LLM (Image Source: Voicebot)

Additionally, the Technology Innovation Institute (TII) has introduced instruct versions of these models, namely Falcon-7B-Instruct and Falcon-40B-Instruct. These experimental variants have been fine-tuned on instructional and conversational data, making them well suited to assistant-style tasks, and they are the quickest way to start experimenting with the models. It is also possible to craft custom instruct versions from the many datasets cultivated by the community, which is what the rest of this post demonstrates.
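For a quick taste of the instruct variant, a minimal sketch using the Hugging Face text-generation pipeline is shown below; the prompt and generation settings are purely illustrative:

# Minimal sketch: run an assistant-style prompt through Falcon-7B-Instruct on CPU.
import torch
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
result = generator("Explain what fine-tuning a language model means.",
                   max_new_tokens=64, do_sample=False)
print(result[0]["generated_text"])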

The training regimen for Falcon-7B involved a substantial 1.5 trillion tokens, aligning with contemporary models optimized for inference. A pivotal factor contributing to the superior quality of Falcon models lies in their training data, with over 80% derived from RefinedWeb—an innovative, expansive web dataset rooted in Common Crawl. Departing from the traditional approach of compiling data from various curated sources, TII prioritized scaling and enhancing the quality of web data.

Fine-tuning dataset:

The Guanaco[ii] dataset is a subset of the Open Assistant dataset that contains only the highest-rated paths in the conversation tree, for a total of 9,846 samples. It was used to train the original Guanaco models with QLoRA; here it serves as the fine-tuning data for Falcon-7B using IPEX with Intel AMX and AMP with bfloat16, as described on the Intel site.
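The dataset can be pulled directly from the Hugging Face Hub; a minimal sketch using the datasets library, with the dataset ID from reference [ii], is shown below:

# Minimal sketch: load the Guanaco subset of Open Assistant used for fine-tuning.
from datasets import load_dataset

dataset = load_dataset("timdettmers/openassistant-guanaco")
print(dataset)                      # DatasetDict with train and test splits
print(dataset["train"][0]["text"])  # each record is a "### Human ... ### Assistant ..." dialogue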

Tuning Falcon 7B with Amazon EC2 c7i instances:

The pre-existing Falcon-7B model was used for the tuning exercise. The example given in this article[iii] was followed for the Generative AI tuning, leveraging Amazon EC2 c7i.metal-24xl instances.

Fine-tuning Falcon-7B is streamlined by combining SFTTrainer, Intel Extension for PyTorch (IPEX) with Intel AMX, and AMP with bfloat16. SFTTrainer provides a higher-level abstraction over the intricate steps of supervised fine-tuning, which simplifies the overall workflow.
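A condensed sketch of that stack is shown below. It mirrors the flow of the Intel article[iii], but the hyperparameters are illustrative and argument names may differ across trl and transformers releases:

# Sketch of SFTTrainer-based fine-tuning on CPU with IPEX and bfloat16 (AMP).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             trust_remote_code=True)
train_data = load_dataset("timdettmers/openassistant-guanaco", split="train")

args = TrainingArguments(
    output_dir="./model_dist1_aws",   # matches the --output_dir in Table 2
    num_train_epochs=1,
    bf16=True,                        # AMP with bfloat16, accelerated by Intel AMX
    use_ipex=True,                    # apply Intel Extension for PyTorch optimizations
    no_cuda=True,                     # CPU-only run on the Xeon instance
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_data,
    tokenizer=tokenizer,
    dataset_text_field="text",        # Guanaco stores each dialogue under "text"
    max_seq_length=512,               # matches the --max_seq_length in Table 2
)
trainer.train()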

IPEX and AMX optimize the fine-tuning process by exploiting hardware features built into 4th Gen Intel Xeon processors. IPEX adds support for the latest optimizations and devices before they land in open-source PyTorch*, and it enables AMP training and inference in which parameters and operations are cast to bfloat16, letting Intel AMX accelerate them while full 32-bit accuracy is preserved where needed.
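To make the IPEX and AMP pieces concrete, the following sketch (illustrative, not the benchmark script) passes a bfloat16 model through ipex.optimize and runs it under a CPU autocast context:

# Sketch: IPEX optimization plus bfloat16 autocast on an AMX-capable Xeon CPU.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             trust_remote_code=True)
model.eval()

# Let IPEX fuse and re-layout operators for bfloat16 execution on Xeon.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Fine-tuning Falcon-7B on 4th Gen Xeon", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))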

 

Category | Attribute | c7i
Run Info | Benchmark | Fine-Tuning the Falcon 7-Billion Parameter Model with Hugging Face accelerate, PyTorch 2.0.1, Intel Extensions for PyTorch 2.0.100
Run Info | Date | Nov 1-20, 2023
Run Info | Test by | Intel
CSP and VM Config | Cloud | AWS
CSP and VM Config | Region | us-east-1
CSP and VM Config | Instance Type | c7i.metal-24xl
CSP and VM Config | CPU(s) | 48 cores
CSP and VM Config | Microarchitecture | AWS Nitro
CSP and VM Config | Instance Cost | 4.414 USD per hour
CSP and VM Config | Number of Instances or VMs (if cluster) | -
CSP and VM Config | Iterations and result choice (median, average, min, max) | -
Memory | Memory | 192 GB
Memory | DIMM Config | -
Memory | Memory Capacity / Instance | -
Network Info | Network BW / Instance | 37.5 Gbps
Network Info | NIC Summary | -
Storage Info | Storage: NW or Direct Att / Instance | SSD GP2
Storage Info | Drive Summary | 1 volume, 200 GB

Table 1: Tuning infrastructure components

We then ran PEFT-based fine-tuning using the configuration and code shown in the demonstration.[iv]
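A typical PEFT (LoRA) configuration for Falcon looks like the sketch below; the rank, alpha, dropout, and target modules shown are illustrative defaults rather than the exact values used in the demonstration, and 'model' is assumed to be the Falcon-7B model loaded in the earlier sketch.

# Sketch of a LoRA adapter configuration for Falcon-7B with Hugging Face peft.
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],  # Falcon's fused QKV projection layer
)

model = get_peft_model(model, peft_config)  # 'model' as loaded in the earlier sketch
model.print_trainable_parameters()          # only the small LoRA adapters are trainable

The resulting peft_config can also be handed directly to SFTTrainer through its peft_config argument, so that only the adapter weights are trained.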

 

Category | Attribute | c7i
Run Info | Benchmark | Fine-Tuning Falcon 7-Billion Parameter Model with Hugging Face accelerate, PyTorch 2.0.1, Intel Extensions for PyTorch 2.0.100
Run Info | Dates | Nov 1-20, 2023
Run Info | Test by | Intel
Software | Workload | Generative AI Fine Tuning
Workload Specific Details | Command Line | # Fine-tuning Falcon 7B model training: python vmw_tr_falcon.py --bf16 True --use_ipex True --max_seq_length 512 --num_train_epochs 1 --output_dir "./model_dist1_aws"

Table 2: Tuning software run components

The code used for the training was leveraged from this GitHub repository.

Tuning Results:

The Amazon EC2 c7i.metal-24xl instance was used to tune the model with the Guanaco dataset. The tuning was repeated multiple times and the average time was recorded, as shown in Table 3.

 

Instance | Time taken for fine-tuning (hh:mm:ss)
c7i.metal-24xl | 05:42:41 (20,561 seconds)

Table 3: Average time for Falcon-7B tuning on c7i.metal-24xl

A snippet of the actual tuning run is shown in Figure 2. The results show that larger 4th Gen Intel Xeon based Amazon EC2 instances can be used effectively to fine-tune LLMs such as Falcon-7B in a reasonable amount of time.

 


Figure 2: Snapshot from the tuning of Falcon 7B with Xeon

In this part, we took an open-source large language model and tuned it for a specific use case using the latest Intel Xeon processors. In part 3, we will look at leveraging the latest Intel Xeon based AWS instances for large language model inference.

References:

[i] https://huggingface.co/tiiuae/falcon-7b Falcon-7B is a 7B parameters causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. It is made available under the Apache 2.0 license.

[ii] https://huggingface.co/datasets/timdettmers/openassistant-guanaco: The Guanaco dataset is a subset of the Open Assistant dataset. This subset contains only the highest-rated paths in the conversation tree, with a total of 9,846 samples.

[iii] https://www.intel.com/content/www/us/en/developer/articles/technical/fine-tune-falcon-llm-with-hugging-face-oneapi.html: Fine-tuning Falcon-7B with Hugging Face and Intel oneAPI.

[iv] https://www.youtube.com/watch?v=JNMVulH7fCo: Video showing techniques and code used for tuning the model

About the Author
Mohan Potheri is a Cloud Solutions Architect with more than 20 years in IT infrastructure and in-depth experience in cloud architecture. He currently focuses on educating customers and partners on Intel capabilities and optimizations available on Amazon AWS, and he is actively engaged with the Intel and AWS partner communities to develop compelling solutions with Intel and AWS. He is a VMware vExpert (VCDX#98) with extensive knowledge of on-premises and hybrid cloud, and has deep experience with business-critical applications such as SAP, Oracle, SQL, and Java across UNIX, Linux, and Windows environments. Mohan is an expert in AI/ML and HPC and has been a speaker at multiple conferences, including VMworld, GTC, ISC, and other partner events.