
GenAI Essentials: Inference with Falcon-7B and Zephyr-7B


Author: Benjamin Consolvo
Date: November 28, 2023

Published first on Medium.com

Cover Photo: Generated from prompt “Zephyr bird in a future tech world with blue hues” with Stable Diffusion XL (https://huggingface.co/spaces/google/sdxl).

It is no secret that open-source large language models (LLMs) like Falcon-7B and Zephyr-7B have made it possible to build your own conversational AI system that is performant, efficient on smaller hardware platforms, and accessible to the broader AI developer community. The “7B” tag indicates that these are 7-billion-parameter models, on the smaller end of LLMs compared to 13-billion and 70-billion+ parameter models. Currently, GPT-4 outperforms these smaller models, but I believe that in 2024 we will see the gap close between small open-source models and large closed-source models. In this article, I briefly introduce the two aforementioned open-source 7B models and show how to get started with them on the latest Intel GPU.

Falcon-7B

Falcon-7B was built by the Technology Innovation Institute and is a raw pre-trained model. On its own, it is not directly suitable for chat, but after fine-tuning on your dataset, it is a highly capable LLM suitable for chat, text classification, question answering, and other text generation tasks. It is an open-source model released under the Apache License, Version 2.0, making it accessible to the broader public for both academic and commercial use.

Falcon-7B was trained on a dataset called RefinedWeb, in English and French only. It is a whopping 2.8 TB web-only dataset consisting of 968M individual web pages, filtered from a large-scale CommonCrawl dataset. RefinedWeb was also used to train Falcon-40B, a larger and more powerful LLM from the same family. For more information on the training data, check out the RefinedWeb dataset card on Hugging Face.
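If you want to experiment with the base model before fine-tuning, here is a minimal sketch of loading Falcon-7B with transformers. The Hub ID tiiuae/falcon-7b is the public model name; the prompt and generation settings below are illustrative choices of mine, not values from the notebook:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # smaller data type for faster inference
    trust_remote_code=True,
)

# Base models continue text rather than chat, so prompt accordingly
inputs = tokenizer("The three main benefits of open-source LLMs are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))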

Zephyr-7B

Zephyr-7B, a model fine-tuned from Mistral-7B, was built by Hugging Face and is distributed under the MIT License. Unlike Falcon-7B, Zephyr is already fine-tuned for chat and is ready out of the box for a chat application. Zephyr-7B was fine-tuned using the UltraChat and UltraFeedback datasets, described in more detail below.
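Because Zephyr is already chat-tuned, you can talk to it directly through the standard transformers chat-template API. Here is a minimal sketch, assuming the public Hub ID HuggingFaceH4/zephyr-7b-beta; the messages and generation settings are illustrative:

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
)
messages = [
    {"role": "system", "content": "You are a friendly, concise assistant."},
    {"role": "user", "content": "Explain bfloat16 in one sentence."},
]
# apply_chat_template formats the messages with Zephyr's expected chat markup
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = pipe(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])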

UltraChat

The UltraChat dataset is a synthetic dataset of dialogues generated by ChatGPT. For example, one synthetic dialogue sequence has:

How do you think innovation differs between large and small companies?

As an AI language model, I don’t have personal views or opinions — but I can provide information based on my training data. Innovation can differ between large and small companies in several ways: Resources: Large companies often have more resources available to invest in research and development, experimentation, and prototyping. On the other hand, small companies may have limited resources and must prioritize innovation efforts more carefully. …

This is only a snippet of the dialogue, as each training example is quite lengthy. There are 774K such dialogue examples, and they span a wide variety of topics. As pointed out in the dataset description, the dialogue includes “rewriting, continuation, summarization, and inference.”

UltraFeedback

The UltraFeedback dataset is a collection of 64K prompts. For each prompt, four responses are generated by models drawn from a pool that includes GPT-3.5 Turbo, MPT-30B-Chat, Alpaca 7B, Pythia-12B, StarChat, and others, for a total of 256K samples. GPT-4 is then used to annotate the collected responses.
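Both datasets are available on the Hugging Face Hub, so you can inspect a few samples yourself with the datasets library. A hedged sketch follows; the Hub IDs and field names below are my assumptions about the released versions, so check the dataset cards if they have changed:

from datasets import load_dataset

# Stream a single UltraChat dialogue without downloading the full dataset
ultrachat = load_dataset("stingning/ultrachat", split="train", streaming=True)
print(next(iter(ultrachat)))  # one multi-turn dialogue record

# Stream a single UltraFeedback record
ultrafeedback = load_dataset("openbmb/UltraFeedback", split="train", streaming=True)
sample = next(iter(ultrafeedback))
print(sample["instruction"])       # the prompt (assumed field name)
print(len(sample["completions"]))  # the GPT-4-annotated responses (assumed field name)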

Getting started on the Intel Developer Cloud

You can get started for free with a Jupyter* Notebook hosted on the Intel® Developer Cloud, running the LLM examples yourself on the latest Intel AI hardware together with Intel-optimized AI software. I added the two models above to the existing Simple LLM Inference notebook so you can try them immediately. Just click the Launch button under “Simple LLM Inference: Playing with Language Models” on the home page to open the Jupyter Notebook (Figure 1).


Figure 1: Launching the LLM Inference Jupyter Notebook on the Intel Developer Cloud home page. Image by Author.

Notes on the code

All required Python* frameworks come pre-installed on the Intel Developer Cloud instance, including transformers, torch, and intel_extension_for_pytorch. Load the Zephyr-7B and Falcon-7B models with the usual transformers imports:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

Here is where the tokenizer and model are instantiated:

# Load the tokenizer, caching weights in a shared directory
self.tokenizer = AutoTokenizer.from_pretrained(
    model_id_or_path,
    trust_remote_code=True,
    cache_dir="/home/common/data/Big_Data/GenAI/",
)
# Load the model in bfloat16, move it to the target device, and set eval mode
self.model = (
    AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        cache_dir="/home/common/data/Big_Data/GenAI/",
    )
    .to(self.device)  # e.g. "xpu" for an Intel GPU
    .eval()           # disable dropout and other training-only behavior
)

To get the most out of the latest Intel® Data Center GPU Max 1100, both PyTorch and Intel® Extension for PyTorch come pre-installed in the conda environment pytorch-gpu that is loaded with the notebook. You can visit the GitHub links to install these on your own instances if needed.

The two key functions that are used with Intel Extension for PyTorch are:

ipex.optimize_transformers(self.model, dtype=torch.bfloat16)

and

ipex.optimize(self.model, dtype=torch.bfloat16)

Here, self.model is the loaded LLM, and the torch.bfloat16 data type boosts performance by using a smaller data type on the Intel GPU. The nice thing about this extension is that very little code needs to change when coming from another platform: switching the device to xpu and making these small edits should be all you need.
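Putting it all together, a minimal end-to-end sketch on an Intel GPU might look like the following, assuming Intel Extension for PyTorch is installed and registers the xpu device; the model ID, prompt, and generation settings are illustrative:

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "xpu"  # the device string registered by Intel Extension for PyTorch
model_id = "HuggingFaceH4/zephyr-7b-beta"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.to(device).eval()

# Apply the transformer-specific optimization described above
model = ipex.optimize_transformers(model, dtype=torch.bfloat16)

inputs = tokenizer("What is a zephyr?", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))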

Summary

Falcon-7B and Zephyr-7B are small LLMs compared to their much larger equivalents (e.g., Falcon-180B), but they deliver performant and efficient inference. Falcon-7B is an example of a model that can be fine-tuned for many text tasks, including chat, text classification, and question answering. Zephyr-7B was fine-tuned from another model called Mistral-7B and works great out of the box for chat. Both models can be used on the Intel Developer Cloud with the provided sample Jupyter Notebook by clicking “Simple LLM Inference: Playing with Language Models” on the home page after registering. You are welcome to try these models and bring your own models from Hugging Face. I look forward to hearing about your experience with these models on the Intel Developer Cloud.

You can reach me on the Intel DevHub Discord server, LinkedIn, or Twitter.
Thank you for reading.

Disclaimer for Using Large Language Models

Please be aware that while LLMs like Falcon-7B and Zephyr-7B are powerful tools for text generation, they may sometimes produce results that are unexpected, biased, or inconsistent with the given prompt. It’s advisable to carefully review the generated text and consider the context and application in which these models are used.
Usage of these models must also adhere to their licensing agreements and be in accordance with ethical guidelines and best practices for AI. If you have any concerns or encounter issues with the models, please refer to the respective model cards and documentation provided in the links above.


About the Author
I am an AI Solutions Engineer Manager at Intel. I have a team of AI Engineers who are dedicated to building community around Intel's AI software and hardware offerings. I have a strong background in deep learning (DL), particularly in computer vision, and now have expanded my expertise to include generative AI and large language models (LLMs). I have experience in applying my engineering skills in the cybersecurity industry to automatically identify phishing websites, and in the oil and gas industry to identify subsurface features in geophysical imaging.