
Tuning and Inference for Generative AI with 4th Generation Intel Xeon Processors (Part 2 of 3)


We introduced Generative AI tuning and inference concepts in part 1 of this blog series. In part 2, we take an open-source large language model and tune it for a specific use case using the latest Intel Xeon processors.

Base Model: The Falcon-7B

The Falcon series is a cutting-edge collection of language models developed by the Technology Innovation Institute in Abu Dhabi. Released under the Apache 2.0 license, Falcon-7B[i] stands out as the inaugural "truly open" model, with capabilities that rival many existing closed-source models. This is highly promising for practitioners, enthusiasts, and industry alike, as it opens the door to a wide range of use cases.

Falcon-7B operates effectively with only around 15 GB of memory, making it practical for inference and fine-tuning even on consumer-grade hardware.
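As a rough illustration (not part of the benchmark runs described later), the checkpoint can be pulled from the Hugging Face Hub and loaded in bfloat16 on a Xeon host; the model ID below comes from the Falcon-7B model card[i]:

# Minimal sketch: load Falcon-7B in bfloat16 on CPU with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # roughly 2 bytes per parameter, ~14-15 GB of weights
    trust_remote_code=True,       # Falcon originally shipped custom modeling code
)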

 


 

Figure 1: Falcon Open-Source Generative AI LLM (Image Source: Voicebot)

Additionally, the Technology Innovation Institute (TII) has introduced instruct versions of these models, namely Falcon-7B-Instruct and Falcon-40B-Instruct. These experimental variants have been fine-tuned on instructional and conversational data, making them well suited to assistant-style tasks, and they are the quickest way to start experimenting with the models. It is also possible to craft custom instruct versions from the many datasets cultivated by the community, which is what the rest of this post demonstrates.
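For a quick taste of the instruct variant, a minimal sketch using the Hugging Face text-generation pipeline is shown below; the prompt and generation settings are purely illustrative:

# Minimal sketch: run an assistant-style prompt through Falcon-7B-Instruct on CPU.
import torch
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
result = generator("Explain what fine-tuning a language model means.",
                   max_new_tokens=64, do_sample=False)
print(result[0]["generated_text"])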

The training regimen for Falcon-7B involved a substantial 1.5 trillion tokens, aligning with contemporary models optimized for inference. A pivotal factor contributing to the superior quality of Falcon models lies in their training data, with over 80% derived from RefinedWeb—an innovative, expansive web dataset rooted in Common Crawl. Departing from the traditional approach of compiling data from various curated sources, TII prioritized scaling and enhancing the quality of web data.

Fine-tuning dataset:

The Guanaco[ii] dataset is a subset of the Open Assistant dataset that contains only the highest-rated paths in the conversation tree, for a total of 9,846 samples. It was used to train the original Guanaco models with QLoRA; here it serves as the fine-tuning data for Falcon-7B using IPEX with Intel AMX and AMP with bfloat16, as described on the Intel site.
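The dataset can be pulled directly from the Hugging Face Hub; a minimal sketch using the datasets library, with the dataset ID from reference [ii], is shown below:

# Minimal sketch: load the Guanaco subset of Open Assistant used for fine-tuning.
from datasets import load_dataset

dataset = load_dataset("timdettmers/openassistant-guanaco")
print(dataset)                      # DatasetDict with train and test splits
print(dataset["train"][0]["text"])  # each record is a "### Human ... ### Assistant ..." dialogue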

Tuning Falcon 7B with Amazon EC2 c7i instances:

The pre-existing Falcon-7B model was used for the tuning exercise. The example given in this article[iii] was followed for the Generative AI tuning, leveraging Amazon EC2 c7i.metal-24xl instances.

Fine-tuning Falcon-7B is streamlined by combining SFTTrainer, Intel Extension for PyTorch (IPEX) with Intel AMX, and AMP with bfloat16. SFTTrainer provides a higher-level abstraction over the intricate steps of supervised fine-tuning, which simplifies the overall workflow.
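A condensed sketch of that stack is shown below. It mirrors the flow of the Intel article[iii], but the hyperparameters are illustrative and argument names may differ across trl and transformers releases:

# Sketch of SFTTrainer-based fine-tuning on CPU with IPEX and bfloat16 (AMP).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             trust_remote_code=True)
train_data = load_dataset("timdettmers/openassistant-guanaco", split="train")

args = TrainingArguments(
    output_dir="./model_dist1_aws",   # matches the --output_dir in Table 2
    num_train_epochs=1,
    bf16=True,                        # AMP with bfloat16, accelerated by Intel AMX
    use_ipex=True,                    # apply Intel Extension for PyTorch optimizations
    no_cuda=True,                     # CPU-only run on the Xeon instance
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_data,
    tokenizer=tokenizer,
    dataset_text_field="text",        # Guanaco stores each dialogue under "text"
    max_seq_length=512,               # matches the --max_seq_length in Table 2
)
trainer.train()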

IPEX and AMX optimize the fine-tuning process by exploiting hardware features built into 4th Gen Intel Xeon processors. IPEX adds support for the latest optimizations and devices before they land in open-source PyTorch*, and it enables AMP training and inference in which parameters and operations are cast to bfloat16, letting Intel AMX accelerate them while full 32-bit accuracy is preserved where needed.
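To make the IPEX and AMP pieces concrete, the following sketch (illustrative, not the benchmark script) passes a bfloat16 model through ipex.optimize and runs it under a CPU autocast context:

# Sketch: IPEX optimization plus bfloat16 autocast on an AMX-capable Xeon CPU.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             trust_remote_code=True)
model.eval()

# Let IPEX fuse and re-layout operators for bfloat16 execution on Xeon.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Fine-tuning Falcon-7B on 4th Gen Xeon", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))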

 

Category | Attribute | c7i
Run Info | Benchmark | Fine-Tuning the Falcon 7-Billion Parameter Model with Hugging Face accelerate, PyTorch 2.0.1, Intel Extensions for PyTorch 2.0.100
Run Info | Date | Nov 1-20, 2023
Run Info | Test by | Intel
CSP and VM Config | Cloud | AWS
CSP and VM Config | Region | us-east-1
CSP and VM Config | Instance Type | c7i.metal-24xl
CSP and VM Config | CPU(s) | 48 cores
CSP and VM Config | Microarchitecture | AWS Nitro
CSP and VM Config | Instance Cost | 4.414 USD per hour
CSP and VM Config | Number of Instances or VMs (if cluster) | -
CSP and VM Config | Iterations and result choice (median, average, min, max) | -
Memory | Memory | 192 GB
Memory | DIMM Config | -
Memory | Memory Capacity / Instance | -
Network Info | Network BW / Instance | 37.5 Gbps
Network Info | NIC Summary | -
Storage Info | Storage: NW or Direct Att / Instance | SSD GP2
Storage Info | Drive Summary | 1 volume, 200 GB

Table 1: Tuning infrastructure components

We then ran PEFT-based fine-tuning using the configuration and code shown in the demonstration.[iv]
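A typical PEFT (LoRA) configuration for Falcon looks like the sketch below; the rank, alpha, dropout, and target modules shown are illustrative defaults rather than the exact values used in the demonstration, and 'model' is assumed to be the Falcon-7B model loaded in the earlier sketch.

# Sketch of a LoRA adapter configuration for Falcon-7B with Hugging Face peft.
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],  # Falcon's fused QKV projection layer
)

model = get_peft_model(model, peft_config)  # 'model' as loaded in the earlier sketch
model.print_trainable_parameters()          # only the small LoRA adapters are trainable

The resulting peft_config can also be handed directly to SFTTrainer through its peft_config argument, so that only the adapter weights are trained.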

 

Category | Attribute | c7i
Run Info | Benchmark | Fine-Tuning Falcon 7-Billion Parameter Model with Hugging Face accelerate, PyTorch 2.0.1, Intel Extensions for PyTorch 2.0.100
Run Info | Dates | Nov 1-20, 2023
Run Info | Test by | Intel
Software | Workload | Generative AI Fine Tuning
Workload Specific Details | Command Line | # Fine-tuning Falcon 7B model training: python vmw_tr_falcon.py --bf16 True --use_ipex True --max_seq_length 512 --num_train_epochs 1 --output_dir "./model_dist1_aws"

Table 2: Tuning software run components

The code used for the training was leveraged from this GitHub repository.

Tuning Results:

The Amazon EC2 c7i.metal-24xl instance was used to tune the model with the Guanaco dataset. The tuning was repeated multiple times and the average time was recorded, as shown in Table 3.

 

Instance | Time taken for fine-tuning (hh:mm:ss)
c7i.metal-24xl | 05:42:41 (20,561 seconds)

Table 3: Average time for Falcon-7B tuning on c7i.metal-24xl

A snippet of the actual tuning run is shown in Figure 2. The results show that larger 4th Gen Intel Xeon based Amazon EC2 instances can be used effectively to fine-tune LLMs such as Falcon-7B in a reasonable amount of time.

 


Figure 2: Snapshot from the tuning of Falcon 7B with Xeon

In this part, we took an open-source large language model and tuned it for a specific use case using the latest Intel Xeon processors. In part 3, we will look at leveraging the latest Intel Xeon based AWS instances for large language model inference.

References:

[i] https://huggingface.co/tiiuae/falcon-7b Falcon-7B is a 7B parameters causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. It is made available under the Apache 2.0 license.

[ii] https://huggingface.co/datasets/timdettmers/openassistant-guanaco: The Guanaco dataset is a subset of the Open Assistant dataset. This subset contains only the highest-rated paths in the conversation tree, with a total of 9,846 samples.

[iii] https://www.intel.com/content/www/us/en/developer/articles/technical/fine-tune-falcon-llm-with-hugging-face-oneapi.html: Fine-tuning Falcon-7B with Hugging Face and Intel oneAPI.

[iv] https://www.youtube.com/watch?v=JNMVulH7fCo: Video showing techniques and code used for tuning the model

About the Author
Mohan Potheri is a Cloud Solutions Architect with more than 20 years in IT infrastructure and in-depth experience in cloud architecture. He currently focuses on educating customers and partners on Intel capabilities and optimizations available on Amazon AWS, and he is actively engaged with the Intel and AWS partner communities to develop compelling solutions with Intel and AWS. He is a VMware vExpert (VCDX#98) with extensive knowledge of on-premises and hybrid cloud, and has deep experience with business-critical applications such as SAP, Oracle, SQL, and Java across UNIX, Linux, and Windows environments. Mohan is an expert in AI/ML and HPC and has been a speaker at multiple conferences, including VMworld, GTC, ISC, and other partner events.