
Accelerating Codegen training and inference on Habana Gaudi2


By Tiep Le, Ke Ding, Vasudev Lal, Yi Wang, Matrix Yao, and Phillip Howard

 

Optimum Habana makes it easy to achieve fast training and inference of large language models (LLMs) on Habana Gaudi2 accelerators. In this blog, we will walk through the process of performing Low-Rank Adaptation (LoRA) training of Codegen, an open-source LLM for program synthesis. We will also benchmark the training and inference efficiency of Habana Gaudi2 using Codegen.

 

Codegen

Codegen is a family of transformer-based autoregressive language models trained with the standard next-token prediction language modeling objective. Developed and released by Salesforce AI Research, Codegen is available in a range of sizes (350M, 2.7B, 6.1B, and 16B parameters) and variants which differ in the datasets used for training.

Codegen-NL was trained on The Pile, an 825 GB natural language dataset comprising 22 smaller datasets. Starting from Codegen-NL, Codegen-Multi was further trained on a subset of the BigQuery dataset, which contains open-source code in six programming languages (C, C++, Go, Java, JavaScript, and Python). Finally, Codegen-Mono was initialized from Codegen-Multi and then trained on a large dataset of Python code called BigPython.

We will use the largest variant of Codegen-Mono in this blog (Codegen-Mono-16B), which we hereafter refer to simply as Codegen.
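
All Codegen checkpoints are hosted on the Hugging Face Hub and load through the standard Transformers API. As a quick illustration, here is a minimal sketch that completes a Python snippet; it uses the much smaller Salesforce/codegen-350M-mono checkpoint only so it runs on modest hardware, while the rest of this blog uses the 16B Mono variant:

# Minimal sketch: load a small Codegen checkpoint and complete a Python snippet.
# The 350M Mono variant is used here only to keep memory requirements low; the
# rest of this blog uses Salesforce/codegen-16B-mono.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "# return the squares of a list of numbers\ndef squares(xs):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))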

 

LoRA

Through additional finetuning, pre-trained LLMs are often adapted to tasks that differ from those represented in their training datasets. However, finetuning all 16B parameters in a model such as Codegen is resource intensive and is unnecessary in most cases. Instead of fully finetuning all model parameters, parameter-efficient finetuning (PEFT) methods seek to adapt pre-trained LLMs by only learning a small number of incremental weight updates necessary for the adaptation task.

Low-Rank Adaptation (LoRA) is one PEFT approach that has recently gained considerable popularity. LoRA parameterizes incremental weight updates to an LLM using two low-rank matrices, where the rank controls the total number of trainable parameters during finetuning. As depicted in the figure below, the incremental weight updates represented by the low-rank matrices are added to the pre-trained LLM weights during each forward pass.

[Figure: Low-rank decomposition utilized by LoRA]
Source: LoRA: Low-Rank Adaptation of Large Language Models
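
To make the low-rank update concrete, below is a minimal, self-contained PyTorch sketch of the idea. It is illustrative only; the finetuning later in this blog uses the PEFT integration in Optimum Habana rather than this hypothetical LoRALinear wrapper.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = W0 x + (alpha / r) * B A x, with W0 frozen."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # pre-trained weights stay frozen
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        # incremental low-rank update added on top of the frozen projection
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(1024, 1024), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16,384 trainable parameters vs. ~1M in the frozen base layer

At rank 8, each 1024x1024 projection gains only about 16K trainable parameters, which is why only a tiny fraction of Codegen's 16B weights need gradients during finetuning.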

 

LoRA enables models such as Codegen to be efficiently adapted to new datasets and tasks without requiring distributed training across multiple accelerators or GPUs.

 

Hardware

Gaudi2 is an AI accelerator developed by Habana Labs that delivers state-of-the-art deep learning training and inference performance. Gaudi2 accelerators (referred to in code as "HPUs") have 96 GB of integrated memory and are available in servers containing 8 Gaudi2 mezzanine cards via the Intel Developer Cloud (IDC), as well as for on-premises infrastructure from Supermicro and IEI. For detailed instructions on getting started with Gaudi2 on IDC, check out this Hugging Face blog.

In this blog, we use Gaudi2 accelerators running on an IDC server for both training and inference.

 

Training

Our training and inference examples use Optimum Habana, an interface between the Hugging Face Transformers and Diffusers libraries and Habana Gaudi accelerators. To get started, we clone the Optimum Habana repository and install it along with the dependencies required by our examples:

 

 

git clone https://github.com/huggingface/optimum-habana.git
pip install -e optimum-habana/
pip install -r optimum-habana/examples/language-modeling/requirements.txt
pip install -r optimum-habana/examples/text-generation/requirements.txt

 

 

We will use samples from the sql-create-context dataset for finetuning. The sql-create-context dataset contains 78,577 examples of natural language questions, SQL CREATE TABLE statements, and corresponding SQL queries which answer the question using the CREATE statement as context. An example of a single instance from this dataset is provided below (reproduced verbatim, including the typo in the question):

 

 

{
"question": "The 'firs park' stadium had the lowest average attendence of what?",
"context": "CREATE TABLE table_11206916_1 (average INTEGER, stadium VARCHAR)",
"answer": "SELECT MIN(average) FROM table_11206916_1 WHERE stadium = 'Firs Park'"
}

 

 

We will finetune Codegen-Mono using a random 20% sample of the sql-create-context dataset, resulting in a total of 15,716 samples for finetuning. First, we download, split, and write the training set to a file named "train-sql-create-context.json":

 

 

from datasets import load_dataset

# Load the full dataset and split off a random 20% sample.
dataset = load_dataset('b-mc2/sql-create-context')
ds_train_test = dataset['train'].train_test_split(test_size=0.2)
# The 20% "test" split (15,716 examples) becomes our finetuning training file.
ds_train_test['test'].to_json('./data-for-finetune/train-sql-create-context.json')

 

 

After preparing the training dataset, we can begin LoRA finetuning. To finetune using a single Gaudi2 accelerator, we can call run_lora_clm.py in Optimum Habana as follows:

 

 

cd optimum-habana/examples/language-modeling/
python run_lora_clm.py     \
    --model_name_or_path Salesforce/codegen-16B-mono \
    --train_file "./data-for-finetune/train-sql-create-context.json" \
    --report_to "tensorboard" \
    --bf16 True \
    --output_dir ./finetuned-models/codegen-on-sql-create-context-hpu1-lora8-bs4 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --dataset_concatenation \
    --do_train --use_habana --use_lazy_mode \
    --throughput_warmup_steps 3 \
    --lora_target_modules "qkv_proj" \
    --lora_rank 8 \
    --cache_dir /codegen/cache/

 

 

For faster training, we can train the model on multiple Gaudi2 accelerators using DeepSpeed. For example, the same training job can be launched on 8 Gaudi2 cards simply by calling gaudi_spawn.py:

 

 

cd optimum-habana/examples/language-modeling/
python ../gaudi_spawn.py \
    --world_size 8 --use_deepspeed run_lora_clm.py \
    --model_name_or_path Salesforce/codegen-16B-mono \
    --train_file "./data-for-finetune/train-sql-create-context.json" \
    --report_to "tensorboard" \
    --bf16 True \
    --output_dir ./finetuned-models/codegen-finetune-on-sql-create-context-hpu8-lora8-bs4 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --dataset_concatenation \
    --do_train \
    --use_habana \
    --use_lazy_mode \
    --throughput_warmup_steps 3 \
    --cache_dir /codegen/cache/ \
    --use_hpu_graphs_for_inference \
    --lora_target_modules "qkv_proj" \
    --lora_rank 8 \
    --deepspeed deepspeed_config.json

 

 

For the above example of finetuning on 8 Gaudi2 cards, we used the following DeepSpeed configuration (saved as deepspeed_config.json):

 

 

{
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {
        "enabled": true
    },
    "gradient_clipping": 1.0,
    "zero_optimization": 
        "stage": 2,
        "overlap_comm": false,
        "reduce_scatter": false,
        "contiguous_gradients": false
    }
}

 

 

Leveraging multiple Gaudi2s with DeepSpeed significantly accelerates training. Below we show the relationship between the number of Gaudi2s used and the total training time for the finetuning experiment described above.

 

[Figure: Total training time versus number of Gaudi2 accelerators used for LoRA finetuning]

 

Inference

Now that we have adapted Codegen using LoRA, we can investigate how our training has changed the model's generations. We will evaluate Codegen before and after LoRA finetuning using the following query (reproduced verbatim):

 

 

You are a text-to-SQL model. Your job is to answer questions about a database. You are given a question and a context regarding one or more tables in the database.

You must output the SQL query that answers the question. The SQL query must be between [SQL] and [/SQL] tags.

### Question:
The 'firs park' stadium had the lowest average attendence of what?

### Context:
CREATE TABLE table_11206916_1 (average INTEGER, stadium VARCHAR)

### Response:
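
The prompt above can also be constructed programmatically from any sql-create-context record. Here is a small hypothetical helper (not part of the Optimum Habana examples) that applies the same template:

# Hypothetical helper: build the text-to-SQL evaluation prompt from a
# sql-create-context record using the template shown above.
def build_prompt(record: dict) -> str:
    return (
        "You are a text-to-SQL model. Your job is to answer questions about a database. "
        "You are given a question and a context regarding one or more tables in the database.\n\n"
        "You must output the SQL query that answers the question. "
        "The SQL query must be between [SQL] and [/SQL] tags.\n\n"
        f"### Question:\n{record['question']}\n\n"
        f"### Context:\n{record['context']}\n\n"
        "### Response:"
    )

example = {
    "question": "The 'firs park' stadium had the lowest average attendence of what?",
    "context": "CREATE TABLE table_11206916_1 (average INTEGER, stadium VARCHAR)",
}
print(build_prompt(example))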

 

 

We can use run_generation.py in Optimum Habana to generate a completion for this test query with our LoRA-finetuned Codegen model:

 

 

cd optimum-habana/examples/text-generation
python run_generation.py \
--model_name_or_path "Salesforce/codegen-16B-mono" \
--peft_model "../language-modeling/finetuned-models/codegen-on-sql-create-context-hpu1-lora8-bs4" \
--max_new_tokens 100 --bf16 --use_hpu_graphs --use_kv_cache \
--prompt "You are a text-to-SQL model. Your job is to answer questions about a database. You are given a question and a context regarding one or more tables in the database.

You must output the SQL query that answers the question. The SQL query must be between [SQL] and [/SQL] tags.

### Question:
The 'firs park' stadium had the lowest average attendence of what?

### Context:
CREATE TABLE table_11206916_1 (average INTEGER, stadium VARCHAR)

### Response:"

 

 

This results in the following generated response, which correctly answers the question:

 

 

### Response:
[SQL]SELECT MIN(average) FROM table_11206916_1 WHERE stadium = "Firs Park"[/SQL]

 

 

Now let’s try the same test query using the original Codegen:

 

 

cd optimum-habana/examples/text-generation
python run_generation.py \
--model_name_or_path "Salesforce/codegen-16B-mono" \
--max_new_tokens 100 --bf16 --use_hpu_graphs --use_kv_cache \
--prompt "You are a text-to-SQL model. Your job is to answer questions about a database. You are given a question and a context regarding one or more tables in the database.

You must output the SQL query that answers the question. The SQL query must be between [SQL] and [/SQL] tags.

### Question:
The 'firs park' stadium had the lowest average attendence of what?

### Context:
CREATE TABLE table_11206916_1 (average INTEGER, stadium VARCHAR)

### Response:"

 

 

As we can see in its generated response below, the original Codegen model fails to produce correct SQL code for answering the question:

 

 

### Response:
SELECT stadium, AVG(average)
FROM table_11206916_1
GROUP BY stadium
HAVING AVG(average) = (SELECT MIN(average)
FROM table_11206916_1)

 

 

This example shows how our LoRA adaptation successfully improved Codegen’s ability to generate completions in a programming language not seen in its pre-training datasets.
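
For programmatic use outside of run_generation.py, the saved LoRA adapter can also be attached to the base model with the PEFT library. The sketch below is illustrative only: it omits the Habana-specific device placement and HPU-graph optimizations that run_generation.py handles, and assumes the adapter was written to the --output_dir used during finetuning.

# Illustrative sketch: attach the LoRA adapter produced above to the base model.
# Assumes the adapter was saved to the --output_dir used during finetuning.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Salesforce/codegen-16B-mono"
adapter_dir = "./finetuned-models/codegen-on-sql-create-context-hpu1-lora8-bs4"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, adapter_dir)  # wrap the base model with the LoRA adapter
model.eval()

# Same text-to-SQL prompt used with run_generation.py above
# (instruction header omitted here for brevity).
prompt = (
    "### Question:\nThe 'firs park' stadium had the lowest average attendence of what?\n\n"
    "### Context:\nCREATE TABLE table_11206916_1 (average INTEGER, stadium VARCHAR)\n\n"
    "### Response:"
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))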

In addition to LoRA finetuning of LLMs, Optimum Habana also includes optimizations of models such as Codegen for fast inference on Habana Gaudi accelerators. Using our LoRA-finetuned Codegen model and the run_generation.py script referenced above, we can achieve the following throughput for various batch sizes:

 

[Figure: Inference throughput of the LoRA-finetuned Codegen model at various batch sizes]
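
As a rough sketch of how such a sweep can be run, the same generation command can be repeated with different batch sizes. The invocation below assumes run_generation.py accepts a --batch_size argument alongside the options used earlier, and is not an exact reproduction of our benchmark setup:

cd optimum-habana/examples/text-generation
# Repeat with --batch_size 1, 2, 4, 8, ... to sweep batch sizes.
# (Prompt abbreviated here; use the full text-to-SQL prompt from above.)
python run_generation.py \
--model_name_or_path "Salesforce/codegen-16B-mono" \
--peft_model "../language-modeling/finetuned-models/codegen-on-sql-create-context-hpu1-lora8-bs4" \
--max_new_tokens 100 --bf16 --use_hpu_graphs --use_kv_cache \
--batch_size 8 \
--prompt "You are a text-to-SQL model. Your job is to answer questions about a database."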

Conclusion

In this blog, we showed how finetuning LLMs with LoRA is fast and easy using Habana Gaudi2 accelerators with Optimum Habana. Although the examples in this blog used Codegen, the LoRA finetuning and inference scripts in Optimum Habana are broadly compatible with other LLMs. Check out Intel Developer Cloud to deploy your future AI workloads on Habana Gaudi2.

 

This blog was a collaboration between SATG's AIA (Advanced AI and Analytics) team and Intel Labs' Cognitive AI team.

About the Author
Vasudev is an AI Research Scientist at Intel Labs where he leads the Multimodal Cognitive AI team. His team develops AI systems that can synthesize concept-level understanding from multiple modalities: vision, language, video and audio. His current research interests include equipping deep learning with mechanisms to inject external knowledge; self-supervised training at scale for continuous and high dimensional modalities like images, video and audio; mechanisms to combine deep learning with symbolic compute. Prior to joining Intel, Vasudev obtained his PhD in Electrical and Computer Engineering from the University of Michigan, Ann Arbor.