Learn how to run inference with the 7-billion and 40-billion parameter versions of Falcon on a 4th Gen Xeon CPU with Hugging Face Pipelines.
It’s easy to assume that a GPU is the only way to perform inference with LLMs that have billions of parameters. While GPUs do provide significant speedups over CPUs in deep learning, hardware should always be selected based on the use case. For example, suppose your end users only need a response every 30 seconds. In that case, there are diminishing returns if you’re struggling (financially and logistically) to reserve accelerators that give you answers in under 30 seconds.
Figure 1: Working backward from end-users to hardware and software stack — thinking like a “Compute Aware AI Developer” — Image by Author.
This all comes back to a fundamental principle, being a “Compute Aware AI Developer” — working backward from the goals of your application to the right software and hardware to use. Imagine starting a home project like hanging a new shelf and going straight for the sledgehammer without even considering that a smaller and more precise hammer would be the right tool for the project.
In this article, we will perform inference with Falcon-7b and Falcon-40b on a 4th Generation Xeon CPU using Hugging Face Pipelines. Falcon-40b is a 40-billion parameter decoder-only model developed by the Technology Innovation Institute (TII) in Abu Dhabi. It outperforms several models, including LLaMA, StableLM, RedPajama, and MPT, and it uses the FlashAttention method to speed up inference across different tasks.
Environment Setup
Once you have accessed your Xeon compute instance, you must secure enough storage to download the checkpoints and model shards for Falcon. We recommend securing at least 150 GB of storage if you want to test both the 7-billion and 40-billion Falcon versions. You must also provide enough RAM to load the model into memory and enough cores to run the workload efficiently. We successfully ran the 7-billion and 40-billion Falcon versions on a 32-core, 64 GB RAM VM (4th Gen Xeon) on the Intel Developer Cloud. However, this is only one of many valid compute specifications, and further testing would likely improve performance.
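As a rough way to reason about these storage and memory needs, the weights-only footprint can be estimated as parameter count times bytes per parameter. The sketch below assumes nominal parameter counts of 7 and 40 billion and 2 bytes per parameter (a 16-bit format such as bfloat16). Both are illustrative assumptions, not measurements from our setup, and real usage also includes activations and framework overhead.

```python
# Back-of-the-envelope estimate of weights-only size.
# Assumptions (not from the article): nominal parameter counts and
# 2 bytes per parameter (16-bit weights such as bfloat16).

NOMINAL_PARAMS = {"7b": 7e9, "40b": 40e9}


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the approximate size of the weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


for version, params in NOMINAL_PARAMS.items():
    print(f"falcon-{version}: ~{weight_memory_gb(params):.0f} GB at 16-bit, "
          f"~{weight_memory_gb(params, 4):.0f} GB at 32-bit")
```

Under these assumptions, the two 16-bit checkpoints together come to roughly 94 GB, which leaves headroom within the 150 GB storage recommendation for caches and the conda environment.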
- Install the latest version of miniconda.
- Create a conda environment: conda create -n falcon python==3.8.10
- Activate your conda environment: conda activate falcon
- Install dependencies: pip install -r requirements.txt. The contents of the requirements.txt file are listed below.
transformers==4.29.2
torch==2.0.1
accelerate==0.19.0
einops==0.6.1
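If you want to confirm that the active environment actually matches these pins, a small check along the following lines may help. It is a minimal sketch that uses only the standard library; the helper names `parse_pins` and `check_pins` are hypothetical and are not part of the tutorial's script.

```python
# Compare installed package versions against the pinned requirements.
# parse_pins and check_pins are hypothetical helper names for this sketch.
from importlib import metadata

PINS = """
transformers==4.29.2
torch==2.0.1
accelerate==0.19.0
einops==0.6.1
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            name, _, version = line.partition("==")
            pins[name] = version
    return pins


def check_pins(pins, lookup=metadata.version):
    """Return {name: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            # Drop local build suffixes such as '+cpu' before comparing.
            installed = lookup(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


problems = check_pins(parse_pins(PINS))
for name, (expected, installed) in problems.items():
    print(f"{name}: expected {expected}, found {installed}")
if not problems:
    print("All pinned packages match.")
```

Running it inside the activated falcon environment should report either a clean match or the packages that differ.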
Running Falcon with Hugging Face Pipelines
Hugging Face pipelines provide a simple and high-level interface for applying pre-trained models to various natural language processing (NLP) tasks, such as text classification, named entity recognition, text generation, and more. These pipelines abstract away the complexities of model loading, tokenization, and inference, allowing users to quickly utilize state-of-the-art models for NLP tasks with just a few lines of code.
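To illustrate how little code a pipeline needs, here is a minimal sketch of building a text-generation pipeline for Falcon. The helper names `model_id_for` and `build_generator` are hypothetical, and the choices of bfloat16 weights and `trust_remote_code=True` are assumptions for this example rather than settings confirmed above.

```python
# Sketch: build a text-generation pipeline for a Falcon checkpoint.
# Assumptions: bfloat16 weights and trust_remote_code=True are illustrative
# choices; model_id_for and build_generator are hypothetical helper names.


def model_id_for(falcon_version: str) -> str:
    """Map '7b' or '40b' to the corresponding Hugging Face Hub model id."""
    if falcon_version not in ("7b", "40b"):
        raise ValueError("falcon_version must be '7b' or '40b'")
    return f"tiiuae/falcon-{falcon_version}"


def build_generator(falcon_version: str = "7b"):
    """Download (on first use) and load the model, returning a pipeline."""
    # Imported lazily so model_id_for works without the heavy packages.
    import torch
    import transformers

    model_id = model_id_for(falcon_version)
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=True
    )
    return transformers.pipeline(
        "text-generation",
        model=model_id,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )
```

With the dependencies installed, `generator = build_generator("7b")` followed by `generator("hello, how are you?", max_length=25)` would return a list of dictionaries, each holding a `generated_text` entry.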
Below is a convenient script you can run in the cmd/terminal to experiment with the raw pre-trained Falcon models.
# falcon-demo.py
from transformers import AutoTokenizer, AutoModelForCausalLM
# ...
user_input = "start"
while user_input != "stop":
    user_input = input(f"Provide Input to {model} parameter Falcon (not tuned): ")
    if user_input != "stop":
        sequences = generator(
        # ...
    inference_time = time.time() - start
# ...
if __name__ == "__main__":
    # ...
To run the script (falcon-demo.py), pass it to python along with any parameters you want to set:
python falcon-demo.py --falcon_version "7b" --max_length 25 --top_k 5
The script has 3 optional parameters to help control the execution of the Hugging Face pipeline:
- falcon_version: allows you to select from Falcon’s 7 billion or 40 billion parameter versions.
- max_length: used to control the maximum length of the generated text in text generation tasks.
- top_k: specifies the number of highest probability tokens to consider at each step.
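To make the top_k idea concrete, the sketch below applies top-k filtering to a toy next-token distribution: it keeps the k most probable tokens and renormalizes them. The vocabulary and probabilities are made up for illustration, and the function is a simplified stand-in, not the library's implementation.

```python
# Sketch of top-k filtering over a toy next-token distribution.
# The tokens and probabilities below are invented for illustration.


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


toy_probs = {"you": 0.5, "things": 0.25, "we": 0.125, "they": 0.125}
filtered = top_k_filter(toy_probs, k=2)
print(filtered)
```

Note that in Hugging Face generation, top_k only influences the output when sampling is enabled; with greedy decoding the single most probable token is always chosen.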
You can hack the script to add/remove/edit the parameters. What is important is that you now have access to one of the most powerful open-source models ever released!
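If you do want to edit the parameters, the three flags could be declared with argparse roughly as follows. This is a sketch, not the script's actual argument handling: the default values are assumptions taken from the example command, and `build_parser` is a hypothetical helper name.

```python
# Sketch of the script's three optional flags using argparse.
# The defaults are assumptions taken from the example command.
import argparse


def build_parser():
    """Create a parser exposing falcon_version, max_length, and top_k."""
    parser = argparse.ArgumentParser(description="Falcon CPU inference demo")
    parser.add_argument("--falcon_version", type=str, default="7b",
                        choices=["7b", "40b"],
                        help="which Falcon checkpoint to load")
    parser.add_argument("--max_length", type=int, default=25,
                        help="maximum length of the generated text")
    parser.add_argument("--top_k", type=int, default=5,
                        help="number of highest-probability tokens to consider")
    return parser


# Parse an explicit argument list here; a script would call parse_args().
flags = build_parser().parse_args(["--falcon_version", "40b", "--max_length", "50"])
print(flags.falcon_version, flags.max_length, flags.top_k)
```

Adding a new option is then a matter of one more `add_argument` call and passing the parsed value through to the pipeline.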
Playing with Raw Falcon
Raw Falcon is not tuned for any particular purpose, so it will likely spew nonsense (Figure 2). Still, this doesn’t stop us from asking a few questions to test it out. When the script is done downloading the model and creating the pipeline, you will be prompted to provide input to the model. When you’re ready to stop, type “stop”.
Setting ‘pad_token_id’ to ‘eos_token_id’:11 for open-end generation.
Result: hello, how are you? (I’m fine, thank you.) - ¿Cómo está usted
Provide Input to tiiuae/falcon-7b parameter Falcon (not tuned):
Figure 2. Command line interface inference test of 7 Billion Parameter Falcon Model on Intel 4th Gen Xeon with default script parameters — Image by Author
The script prints the inference time to give you an idea of how long the model takes to respond based on the current parameters provided to the pipeline and the compute you have made available to this workload.
Tip: You can significantly alter the inference time by adjusting the max_length parameter.
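One way to see this effect is to time the same prompt at several max_length values. The sketch below wraps a generation call with a timer; `fake_generator` is a hypothetical stand-in that simply sleeps in proportion to max_length so the example runs without downloading a model. Swap in a real pipeline to measure actual latency.

```python
# Sketch: time one generation call and sweep max_length.
# fake_generator is a hypothetical stand-in, not a real model.
import time


def timed_call(generator, prompt, **kwargs):
    """Run generator(prompt, **kwargs) and return (result, seconds)."""
    start = time.perf_counter()
    result = generator(prompt, **kwargs)
    return result, time.perf_counter() - start


def fake_generator(prompt, max_length=25):
    """Pretend each generated token costs a fixed amount of time."""
    time.sleep(0.001 * max_length)
    return [{"generated_text": prompt + " ..."}]


for max_length in (10, 25, 50):
    sequences, seconds = timed_call(fake_generator, "hello", max_length=max_length)
    print(f"max_length={max_length}: {seconds:.3f} s")
```

Because `timed_call` accepts any callable, the same helper can time a real Hugging Face pipeline without changes.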
This tutorial shows how to get Falcon running on a CPU with Hugging Face Transformers but does not explore further optimizations on Intel CPUs. Libraries like the Intel Extension for Transformers offer capabilities to accelerate Transformer-based models through techniques like quantization, distillation, and pruning. Quantization is a widely used model compression technique that can reduce model size and improve inference latency, which makes it a valuable next step for improving the performance of this workflow.
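To give a feel for what quantization does, the sketch below applies symmetric int8 quantization to a handful of example weights in plain Python. It illustrates the general idea only; it is not the API of the Intel Extension for Transformers or any other library, and the weight values are made up.

```python
# Sketch of symmetric int8 quantization on a few example weights.
# This illustrates the idea only; it is not the API of any Intel library.


def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in q_weights]


weights = [0.5, -1.27, 0.031, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print("codes:   ", q)
print("restored:", restored)
```

Each weight is now stored as a small integer plus one shared scale, trading a bounded rounding error (at most half the scale per weight) for a smaller memory footprint.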
Summary and Discussion
Foundational LLMs create opportunities for developers to build exciting AI applications. However, half the battle is usually finding a model with a license that allows for commercial derivatives. Falcon presents a rare opportunity because it combines strong performance with license flexibility.
Although Falcon is fairly democratized from an open-source perspective, its size creates new challenges for engineers/enthusiasts. This tutorial helped address this by combining Falcon’s “truly open” license, Hugging Face Pipelines, and the availability/accessibility of CPUs to give developers more access to this powerful model.
A few exciting things to try would be:
- Fine-tune Falcon to a specific task by leveraging the Intel Extension for PyTorch.
- Use model compression tools available in Intel Neural Compressor (INC) and Intel Extension for Transformers.
- Play with the parameters of Hugging Face pipelines to optimize performance for your particular use case.
If you are interested in testing out the tutorial in this blog, visit the Intel Developer Cloud and get started on the free Jupyter instances under the "Workshops and Tutorials" section. Also consider spinning up a VM for more control over your compute.
Notices and Disclaimers
Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.