Authors: Rajashekar Kasturi (Senior AI Engineer) and Rahul Unnikrishnan Nair (Head of Engineering, Intel® Liftoff)
The rise of Large Language Models (LLMs) has dramatically expanded what's possible in natural language applications, but deploying these massive models for production inference remains a significant challenge. As models like Llama 3.3-70B push the boundaries of natural language understanding and generation, organizations need efficient, scalable, and cost-effective inference solutions.
Text Generation Inference (TGI) from Hugging Face has emerged as a leading solution for deploying LLMs in production environments. TGI is a production-ready inference system with support for features like:
- Continuous batching for maximizing throughput
- Token streaming for responsive user experiences
- Tensor parallelism for distributing model weights across multiple accelerators
- Quantization support for reduced memory footprint
- Flash Attention and other optimizations for faster inference
This guide shows how to use TGI on Intel® Gaudi® 2 AI Accelerators to deploy and serve Meta's Llama 3.3-70B model efficiently. Gaudi accelerators provide a viable hardware option for LLM inference, with competitive performance and cost.
We'll demonstrate a practical implementation using 4 cards of an 8-card Gaudi 2 node, showcasing how to effectively partition resources for multiple workloads or users on a single system.
Understanding Llama 3.3-70B and Its Capabilities
Llama 3.3-70B is Meta’s instruction-tuned large language model from the Llama 3 family, building on the foundation of Llama 3.1 with additional post-training improvements. It is optimized for instruction following, multilingual reasoning, and structured output generation.
Key capabilities include:
- Strong performance on academic benchmarks, including 86.0% accuracy on MMLU (Massive Multitask Language Understanding)
- 128K token context window, enabling long-sequence processing
- Improved instruction tuning, leading to more reliable responses
- Structured output support, including JSON and function call formats
The model maintains the same architecture as Llama 3.1-70B but incorporates the latest advancements in post-training techniques, resulting in significantly better evaluation performance while maintaining efficient inference characteristics.
Step 1: Configure Instance and Access
To begin our implementation, we need to provision and access an Intel® Gaudi® 2 AI accelerator instance. The Intel® Tiber™ AI Cloud provides easy access to Intel® Gaudi® 2 AI accelerators.
Launching an Intel® Gaudi® 2 AI accelerator Instance
- Create an account on Intel Tiber AI Cloud
  - Visit the Intel® Tiber™ AI Cloud console
  - Sign up for an account if you don't already have one
- Generate SSH Keys
  - Create an SSH key pair for secure access to your instance
  - Save the private key securely on your local machine
  - Add the public key to your Intel Tiber AI Cloud account
- Launch an Intel® Gaudi® 2 AI accelerator Instance
  - Select the instance type with 8 cards
  - Choose the OS image with the latest SynapseAI* Software Suite version (v1.20.1 or newer)
  - Select your SSH key for authentication
  - Optionally enable Jupyter Notebook access if needed
- Connect to Your Instance
  - Once the instance is in the "Ready" state, copy the SSH command
  - Connect from your terminal
- Verify Accelerator Availability
  - Run `hl-smi` to check the status of your Gaudi cards
  - You should see 8 Gaudi 2 cards available
Intel® Gaudi® 2 AI accelerator cards status via hl-smi command
Understanding Intel® Gaudi® 2 AI accelerator Architecture
Intel® Gaudi® 2 accelerators are purpose-built for deep learning workloads with a specialized architecture designed for AI training and inference. Key features include:
- 24 Tensor Processor Cores (TPCs) for efficient matrix operations
- 96GB of HBM2E memory with 2.45TB/sec bandwidth
- Integrated 100GbE RoCE RDMA NICs for scalable multi-card configurations
- Dedicated Matrix Multiplication Engines (MME) for accelerating deep learning operations
- Support for mixed precision (FP32, BF16, FP16, INT8) computation
This architecture makes Intel® Gaudi® 2 AI accelerator a strong fit for LLM inference workloads, offering a balance of performance and cost-effectiveness.
Step 2: Setting Up Text Generation Inference for Gaudi
Hugging Face's Text Generation Inference (TGI) has been optimized to run on Gaudi hardware through a dedicated backend, enabling efficient LLM serving on Intel's AI accelerators.
Understanding TGI on Gaudi
The Gaudi backend for TGI provides several key optimizations:
- HPU Graph optimization for faster execution
- Flash Attention support for efficient attention computation
- FP8 precision inference for reduced memory footprint
- Tensor parallelism for distributing model weights across multiple cards
- Continuous batching for maximizing throughput
Preparing Your Environment
The easiest way to run TGI on Gaudi is to use the official Docker image. Let's set up our environment variables:
# Specify the TGI Gaudi image version
export IMAGE_NAME=ghcr.io/huggingface/text-generation-inference:latest-gaudi
# Name your container
export CONTAINER_NAME=llama3-70b
# Specify the model to deploy
export MODEL=meta-llama/Llama-3.3-70B-Instruct
# Set your Hugging Face token (required for accessing gated models)
export HF_TOKEN=<hf_token>
# Create a volume for persistent storage
export VOLUME=$PWD/data
**Note:** You can find various TGI image tags [here](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference). For production deployments, it's recommended to use a specific version tag rather than `latest-gaudi`.
Supported Models
TGI on Gaudi supports a wide range of LLMs, including:
- Llama 2/3/3.1/3.3 (various sizes)
- Mixtral-8x7B
- Mistral-7B
- Qwen2 models
- Gemma models
- Falcon models
- And many more
Check the complete list of supported models in the TGI documentation.
Authentication Setup
To access gated models like Llama 3.3-70B, you'll need a Hugging Face token:
- Create or log in to your Hugging Face account
- Navigate to [Settings > Access Tokens](https://huggingface.co/settings/tokens)
- Generate a new token with read permissions
- Set this token as the `HF_TOKEN` environment variable
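Before launching the container, you can optionally confirm that the token has actually been granted access to the gated repository. The snippet below is a minimal sketch using the `huggingface_hub` library (assumed to be installed with `pip install huggingface_hub`); it fails with a clear error if access has not been approved yet:
# Optional sanity check: verify the token can access the gated model repo
# (assumes huggingface_hub is installed and HF_TOKEN is exported as above)
import os
from huggingface_hub import model_info

info = model_info("meta-llama/Llama-3.3-70B-Instruct", token=os.environ["HF_TOKEN"])
print(f"Access OK, latest revision: {info.sha}")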
Volume Mounting Strategy
To avoid downloading the model weights every time you run the container, we'll mount a local directory (the $VOLUME path defined earlier) into the container at /data, alongside the local Hugging Face cache.
Understanding TGI Parameters
Before deploying the model, it's important to understand the key parameters that could affect performance:
- Sequence length parameters:
  - `--max-input-length`: Maximum allowed input prompt length in tokens (default: 4095)
  - `--max-total-tokens`: Maximum total sequence length (input + output) in tokens (default: 4096)
- Batch size parameters:
  - `--max-batch-prefill-tokens`: Upper bound on tokens processed in a prefill batch; typically set to batch_size × max-input-length
  - `--max-batch-size`: Maximum batch size for decode operations
- Performance and memory parameters:
  - `ENABLE_HPU_GRAPH`: Enables HPU graphs usage (crucial for performance)
  - `LIMIT_HPU_GRAPH`: Limits HPU graph memory usage
  - `USE_FLASH_ATTENTION`: Enables optimized attention implementation
  - `FLASH_ATTENTION_RECOMPUTE`: Controls memory vs. computation tradeoff
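As a quick illustration of how these values fit together (illustrative numbers only, not a tuning recommendation), the total token limit is the prompt budget plus the generation budget, and the prefill budget scales with batch size:
# Illustrative arithmetic linking the sequence and batch parameters (example values)
max_input_length = 1024                                   # longest allowed prompt
generation_budget = 1024                                  # room reserved for output tokens
max_total_tokens = max_input_length + generation_budget   # -> 2048
batch_size = 128
max_batch_prefill_tokens = batch_size * max_input_length  # -> 131072
print(max_total_tokens, max_batch_prefill_tokens)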
Step 3: Deploying Llama 3.3-70B on Intel® Gaudi® 2 AI accelerator
Now we're ready to deploy the Llama 3.3-70B model on our Intel® Gaudi® 2 AI accelerator instance. In this example, we'll use only 4 of the 8 available Gaudi cards, demonstrating how to partition resources for multiple workloads on a single system.
Resource Allocation Strategy
By using the `HABANA_VISIBLE_DEVICES` environment variable, we can control which Gaudi cards are accessible to our container. This approach allows for:
- Resource partitioning for multiple independent workloads
- Isolation between different model deployments
- Efficient utilization of available hardware
Model Sharding Configuration
For large models like Llama 3.3-70B, we need to distribute the model weights across multiple accelerators using tensor parallelism. This is configured with:
- `--sharded true`: Enables model sharding
- `--num-shard 4`: Distributes the model across 4 Gaudi cards
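As a rough back-of-the-envelope check (ignoring the KV cache, activations, and runtime overhead): 70B parameters in BF16 amount to roughly 70 × 2 bytes ≈ 140 GB of weights. Sharded across 4 cards, that is about 35 GB of weights per card, comfortably within each card's 96 GB of HBM2E and leaving headroom for the KV cache and HPU graphs.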
Deployment Command
Here's the complete command to deploy Llama 3.3-70B on 4 Gaudi cards:
docker run -it --runtime=habana \
-e HABANA_VISIBLE_DEVICES=0,1,2,3 \
--cap-add=sys_nice \
--net=host \
--ipc=host \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e HF_TOKEN=$HF_TOKEN \
-e ENABLE_HPU_GRAPH=true \
-e LIMIT_HPU_GRAPH=true \
-e USE_FLASH_ATTENTION=true \
-e FLASH_ATTENTION_RECOMPUTE=true \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $VOLUME:/data \
--name $CONTAINER_NAME \
$IMAGE_NAME \
--model-id $MODEL \
--port 8000 \
--sharded true --num-shard 4 \
--max-input-length 1024 \
--max-total-tokens 2048 \
--max-batch-size 128 \
--max-waiting-tokens 10 \
--max-concurrent-requests 128 \
--cuda-graphs 0
TGI server starting up
Understanding Key Parameters
Let's break down the important parameters in this deployment command:
Docker and Gaudi Configuration
- `--runtime=habana`: Specifies the Habana runtime for Gaudi support
- `-e HABANA_VISIBLE_DEVICES=0,1,2,3`: Limits the container to use only the first 4 Gaudi cards
- `--cap-add=sys_nice`: Allows the container to adjust process priorities
- `--net=host`: Uses the host network stack for simplified networking
- `--ipc=host`: Uses the host IPC namespace for shared memory
TGI Optimization Parameters
- `-e ENABLE_HPU_GRAPH=true`: Enables HPU graph optimization for better performance
- `-e LIMIT_HPU_GRAPH=true`: Prevents excessive memory usage by HPU graphs
- `-e USE_FLASH_ATTENTION=true`: Enables optimized attention implementation
- `-e FLASH_ATTENTION_RECOMPUTE=true`: Trades computation for memory efficiency
Model Serving Parameters
- `--port 8000`: Sets the HTTP port for the TGI server
- `--max-input-length 1024`: Limits input prompts to 1024 tokens
- `--max-total-tokens 2048`: Sets the maximum combined input+output tokens
- `--max-batch-size 128`: Controls the maximum batch size for inference
- `--max-concurrent-requests 128`: Limits the number of concurrent requests
- `--max-waiting-tokens 10`: Limits how many tokens the running batch generates before waiting requests are merged in, balancing latency for new requests against overall throughput
TGI 3.0+ Zero-Config Approach
With TGI version 3.0 and above, Hugging Face has introduced a "zero-config" approach that automatically optimizes parameters based on the hardware and model. For many deployments, you can simplify the command to:
# Simplified deployment with automatic optimizations
docker run -it --runtime=habana \
-e HABANA_VISIBLE_DEVICES=0,1,2,3 \
--cap-add=sys_nice \
--net=host \
-e HF_TOKEN=$HF_TOKEN \
-v $VOLUME:/data \
--name $CONTAINER_NAME \
$IMAGE_NAME \
--model-id $MODEL \
--sharded true --num-shard 4
TGI now carefully evaluates the hardware and model to select optimal values for parameters like batch sizes and token limits. According to Hugging Face, removing most configuration flags often results in the best performance for most scenarios. The system dynamically adjusts to provide optimal throughput and memory usage.
Deployment Progress
When you run this command, TGI will:
- Download the model weights (if not already cached)
- Load the model across the 4 Gaudi cards
- Perform warmup operations to optimize performance
- Start the HTTP server on port 8000
The entire process may take several minutes, especially for the first run.
TGI server successfully connected
Monitoring the Deployment
You can monitor the Gaudi cards during the deployment process using the `hl-smi` command in a separate terminal window. This will show you:
- Memory utilization per card
- Power consumption
- Temperature
- Utilization percentages
Once the server is running, you'll see a message indicating that the model is loaded and the server is ready to accept requests.
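If you prefer to script this check, TGI exposes `/health` and `/info` endpoints. The sketch below polls them until the server is ready (it assumes the server runs locally on port 8000, as configured above):
# Poll the TGI server until it reports healthy, then print basic model info
# (assumes the server was started with --port 8000 as in the command above)
import time
import requests

base_url = "http://127.0.0.1:8000"

while True:
    try:
        if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
            break
    except requests.ConnectionError:
        pass
    time.sleep(10)  # model loading and warmup can take several minutes

info = requests.get(f"{base_url}/info", timeout=5).json()
print("Serving:", info.get("model_id"), "| max_total_tokens:", info.get("max_total_tokens"))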
Step 4: Running Inference with Llama 3.3-70B
Once the TGI server is running, you can start sending inference requests to generate text with Llama 3.3-70B. TGI provides a simple HTTP API that makes it easy to integrate with any application.
Basic Inference Request
Here's a simple example using curl to send a text generation request:
curl 127.0.0.1:8000/generate \
-X POST \
-d '{
"inputs":"What is Deep learning?",
"parameters":{
"max_new_tokens": 20
}
}' \
-H 'Content-Type: application/json'
This will return a JSON response containing the generated text:
Llama 3.3-70B model response via curl
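The same request can be sent from Python with the `requests` library; the response body is a JSON object whose `generated_text` field contains the completion. A minimal sketch, assuming the server runs locally on port 8000:
# Minimal Python equivalent of the curl request above
import requests

resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={"inputs": "What is Deep learning?", "parameters": {"max_new_tokens": 20}},
    timeout=60,
)
print(resp.json()["generated_text"])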
Advanced Inference Parameters
TGI supports a wide range of parameters to control the text generation process. Here are some of the most useful ones:
curl 127.0.0.1:8000/generate \
-X POST \
-d '{
"inputs":"Explain the concept of transfer learning in AI:",
"parameters":{
"max_new_tokens": 256,
"temperature": 0.7,
"top_p": 0.95,
"top_k": 50,
"repetition_penalty": 1.1,
"do_sample": true
}
}' \
-H 'Content-Type: application/json'
Parameter Explanation:
- **max_new_tokens**: Maximum number of tokens to generate
- **temperature**: Controls randomness (lower = more deterministic)
- **top_p**: Nucleus sampling parameter (higher = more diverse)
- **top_k**: Limits vocabulary to top k tokens
- **repetition_penalty**: Discourages repetition (higher = less repetition)
- **do_sample**: Whether to use sampling (true) or greedy decoding (false)
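Recent TGI releases also support grammar-guided generation, which pairs naturally with Llama 3.3's structured-output capabilities. The sketch below passes a small, hypothetical JSON schema through the `grammar` parameter; check your TGI version's documentation for the exact level of support:
# Grammar-constrained generation (assumes a TGI version with guided decoding support)
import requests

schema = {  # hypothetical schema for illustration
    "type": "object",
    "properties": {"city": {"type": "string"}, "country": {"type": "string"}},
    "required": ["city", "country"],
}

resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={
        "inputs": "Return the capital of France as JSON:",
        "parameters": {"max_new_tokens": 64, "grammar": {"type": "json", "value": schema}},
    },
    timeout=60,
)
print(resp.json()["generated_text"])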
Streaming Responses
For a more responsive user experience, you can use TGI's streaming API to receive tokens as they're generated:
curl 127.0.0.1:8000/generate_stream \
-X POST \
-d '{
"inputs":"Write a short poem about artificial intelligence:",
"parameters":{
"max_new_tokens": 100
}
}' \
-H 'Content-Type: application/json'
This will return a stream of Server-Sent Events (SSE) containing the generated tokens as they become available.
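If you are not using the dedicated Python client shown next, you can consume the stream directly: each event is a line beginning with `data:` whose JSON payload carries the token text. A minimal sketch, with error handling omitted:
# Consume the SSE stream from /generate_stream without the dedicated client
import json
import requests

with requests.post(
    "http://127.0.0.1:8000/generate_stream",
    json={"inputs": "Write a short poem about artificial intelligence:",
          "parameters": {"max_new_tokens": 100}},
    stream=True,
    timeout=300,
) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue
        event = json.loads(line[len(b"data:"):].decode("utf-8"))
        print(event["token"]["text"], end="", flush=True)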
Python Client Integration
For more complex applications, you can use the TGI Python client (install it with `pip install text-generation`):
from text_generation import Client
# Initialize the client
client = Client("http://127.0.0.1:8000")
# Generate text
response = client.generate(
"Explain quantum computing in simple terms:",
max_new_tokens=150,
temperature=0.8
)
print(response.generated_text)
# Or use streaming
for response in client.generate_stream(
"List 5 applications of machine learning in healthcare:",
max_new_tokens=200
):
if not response.token.special:
print(response.token.text, end="")
Performance Optimization and Benchmarking
To get the best performance from Llama 3.3-70B on Gaudi 2, consider the following optimization strategies:
Batch Size Tuning
The batch size significantly impacts throughput. For Llama 3.3-70B on 4 Gaudi 2 cards:
- Start with a moderate batch size (32-64)
- Gradually increase until you hit memory limits or performance plateaus
- Monitor memory usage with `hl-smi` during testing
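A quick way to find where throughput plateaus is to fire a fixed number of concurrent requests and measure generated tokens per second. The following is a rough load-test sketch, not a rigorous benchmark; the concurrency level and token counts are illustrative and assume the local deployment from Step 3:
# Rough throughput probe: N concurrent requests against the local TGI server
import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://127.0.0.1:8000/generate"
CONCURRENCY = 32   # illustrative; sweep this value
NEW_TOKENS = 128

def one_request(i):
    payload = {"inputs": f"Write a fact about the number {i}:",
               "parameters": {"max_new_tokens": NEW_TOKENS}}
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    return NEW_TOKENS

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    total_tokens = sum(pool.map(one_request, range(CONCURRENCY)))
elapsed = time.time() - start
print(f"~{total_tokens / elapsed:.1f} generated tokens/sec at concurrency {CONCURRENCY}")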
Sequence Length Considerations
Longer sequences require more memory and computation:
- Set `--max-input-length` and `--max-total-tokens` based on your use case
- For chat applications, 2048-4096 tokens is often sufficient
- For document processing, you might need 8192 tokens or more
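To see why sequence length drives memory usage, consider the KV cache: with the published Llama 3 70B configuration (80 layers, 8 KV heads, head dimension 128), each token of context stores roughly 2 × 80 × 8 × 128 × 2 bytes ≈ 0.3 MB in BF16. A single 2048-token sequence therefore holds about 0.6 GB of cache, and a full batch of 128 such sequences roughly 80 GB, spread across the sharded cards.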
Quantization Options
TGI on Gaudi supports FP8 quantization for improved performance through Intel Neural Compressor (INC):
# Enable FP8 quantization by setting the QUANT_CONFIG environment variable
export QUANT_CONFIG=/path/to/quant_config.json
FP8 quantization offers significant benefits:
- Reduces memory bandwidth requirements roughly by half compared to BF16
- Provides up to 2x faster compute performance
- Maintains accuracy for most LLM workloads
Popular LLMs, including the Llama family, have been validated with FP8 using INC. For detailed implementation instructions, refer to the Gaudi FP8 inference documentation.
Conclusion
Testing Environment
Evaluation was done on Intel® Tiber™ AI Cloud (ITAC), which provides access to Intel’s full portfolio of compute platforms for AI workloads—from general-purpose CPUs to specialized AI accelerators, including preview hardware.
ITAC offers pre-configured environments with optimized software stacks, integrated tools, and platform-specific documentation. Learn more at cloud.intel.com.
Intel® Liftoff for Startups
Startups building AI solutions globally can benefit from the Intel® Liftoff program through:
- Compute Access: Project-based credits for ITAC and early access to Intel hardware and software
- Engineering and GTM Support: Technical guidance from Intel engineers and optional co-marketing opportunities
- Zero Equity Model: Intel Liftoff is a no-equity program focused on technical enablement
Apply or learn more at developer.intel.com/liftoff.
Resources
Intel® Tiber™ AI Cloud - Cloud platform for AI development and deployment
Intel® Gaudi® 2 AI accelerator - High-performance AI training processor designed for deep learning workloads