Authors: Rajashekar Kasturi (Senior AI Engineer) and Rahul Unnikrishnan Nair (Head of Engineering, Intel® Liftoff)
The rise of Large Language Models (LLMs) has dramatically expanded what's possible in natural language applications, but deploying these massive models for production inference remains a significant challenge. As models like Llama 3.3-70B push the boundaries of natural language understanding and generation, organizations need efficient, scalable, and cost-effective inference solutions.
Text Generation Inference (TGI) from Hugging Face has emerged as a leading solution for deploying LLMs in production environments. TGI is a production-ready inference system with support for features like:
- Continuous batching for maximizing throughput
- Token streaming for responsive user experiences
- Tensor parallelism for distributing model weights across multiple accelerators
- Quantization support for reduced memory footprint
- Flash Attention and other optimizations for faster inference
This guide shows how to use TGI on Intel® Gaudi® 2 AI Accelerators to deploy and serve Meta's Llama 3.3-70B model efficiently. Gaudi accelerators provide a viable hardware option for LLM inference, with competitive performance and cost.
We'll demonstrate a practical implementation using 4 cards of an 8-card Gaudi 2 node, showcasing how to effectively partition resources for multiple workloads or users on a single system.
Understanding Llama 3.3-70B and Its Capabilities
Llama 3.3-70B is Meta’s instruction-tuned large language model from the Llama 3 family, building on the foundation of Llama 3.1 with additional post-training improvements. It is optimized for instruction following, multilingual reasoning, and structured output generation.
Key capabilities include:
- Strong performance on academic benchmarks, including 86.0% accuracy on MMLU (Massive Multitask Language Understanding)
- 128K token context window, enabling long-sequence processing
- Improved instruction tuning, leading to more reliable responses
- Structured output support, including JSON and function call formats
The model maintains the same architecture as Llama 3.1-70B but incorporates the latest advancements in post-training techniques, resulting in significantly better evaluation performance while maintaining efficient inference characteristics.
Step 1: Configure Instance and Access
To begin our implementation, we need to provision and access an Intel® Gaudi® 2 AI accelerator instance. The Intel® Tiber™ AI Cloud provides easy access to Intel® Gaudi® 2 AI accelerators.
Launching an Intel® Gaudi® 2 AI accelerator Instance
- Create an account on Intel Tiber AI Cloud
  - Visit the Intel® Tiber™ AI Cloud console
  - Sign up for an account if you don't already have one
- Generate SSH Keys
  - Create an SSH key pair for secure access to your instance
  - Save the private key securely on your local machine
  - Add the public key to your Intel Tiber AI Cloud account
- Launch an Intel® Gaudi® 2 AI accelerator Instance
  - Select the instance type with 8 cards
  - Choose the OS image with the latest SynapseAI* Software Suite version (v1.20.1 or newer)
  - Select your SSH key for authentication
  - Optionally enable Jupyter Notebook access if needed
- Connect to Your Instance
  - Once the instance is in the "Ready" state, copy the SSH command
  - Connect from your terminal
- Verify Accelerator Availability
  - Run `hl-smi` to check the status of your Gaudi cards
  - You should see 8 Gaudi 2 cards available
Intel® Gaudi® 2 AI accelerator cards status via hl-smi command
Understanding Intel® Gaudi® 2 AI accelerator Architecture
Intel® Gaudi® 2 accelerators are purpose-built for deep learning workloads with a specialized architecture designed for AI training and inference. Key features include:
- 24 Tensor Processor Cores (TPCs) for efficient matrix operations
- 96GB of HBM2E memory with 2.45TB/sec bandwidth
- Integrated 100GbE RoCE RDMA NICs for scalable multi-card configurations
- Dedicated Matrix Multiplication Engines (MME) for accelerating deep learning operations
- Support for mixed precision (FP32, BF16, FP16, INT8) computation
This architecture makes Intel® Gaudi® 2 AI accelerator a strong fit for LLM inference workloads, offering a balance of performance and cost-effectiveness.
Step 2: Setting Up Text Generation Inference for Gaudi
Hugging Face's Text Generation Inference (TGI) has been optimized to run on Gaudi hardware through a dedicated backend, enabling efficient LLM serving on Intel's AI accelerators.
Understanding TGI on Gaudi
The Gaudi backend for TGI provides several key optimizations:
- HPU Graph optimization for faster execution
- Flash Attention support for efficient attention computation
- FP8 precision inference for reduced memory footprint
- Tensor parallelism for distributing model weights across multiple cards
- Continuous batching for maximizing throughput
Preparing Your Environment
The easiest way to run TGI on Gaudi is to use the official Docker image. Let's set up our environment variables:
# Specify the TGI Gaudi image version
export IMAGE_NAME=ghcr.io/huggingface/text-generation-inference:latest-gaudi
# Name your container
export CONTAINER_NAME=llama3-70b
# Specify the model to deploy
export MODEL=meta-llama/Llama-3.3-70B-Instruct
# Set your Hugging Face token (required for accessing gated models)
export HF_TOKEN=<hf_token>
# Create a volume for persistent storage
export VOLUME=$PWD/data
**Note:** You can find various TGI image tags [here](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference). For production deployments, it's recommended to use a specific version tag rather than `latest-gaudi`.
Supported Models
TGI on Gaudi supports a wide range of LLMs, including:
- Llama 2/3/3.1/3.3 (various sizes)
- Mixtral-8x7B
- Mistral-7B
- Qwen2 models
- Gemma models
- Falcon models
- And many more
Check the complete list of supported models in the TGI documentation.
Authentication Setup
To access gated models like Llama 3.3-70B, you'll need a Hugging Face token:
- Create or log in to your Hugging Face account
- Navigate to [Settings > Access Tokens](https://huggingface.co/settings/tokens)
- Generate a new token with read permissions
- Set this token as the `HF_TOKEN` environment variable
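Before launching the container, you can optionally confirm that the token has actually been granted access to the gated repository. The snippet below is a minimal sketch using the `huggingface_hub` library (assumed to be installed with `pip install huggingface_hub`); it fails with a clear error if access has not been approved yet:
# Optional sanity check: verify the token can access the gated model repo
# (assumes huggingface_hub is installed and HF_TOKEN is exported as above)
import os
from huggingface_hub import model_info

info = model_info("meta-llama/Llama-3.3-70B-Instruct", token=os.environ["HF_TOKEN"])
print(f"Access OK, latest revision: {info.sha}")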
Volume Mounting Strategy
To avoid downloading the model weights every time you run the container, we'll mount a local directory (the $VOLUME path defined earlier) into the container at /data, alongside the local Hugging Face cache.
Understanding TGI Parameters
Before deploying the model, it's important to understand the key parameters that could affect performance:
- Sequence length parameters:
  - `--max-input-length`: Maximum allowed input prompt length in tokens (default: 4095)
  - `--max-total-tokens`: Maximum total sequence length (input + output) in tokens (default: 4096)
- Batch size parameters:
  - `--max-batch-prefill-tokens`: Upper bound on tokens processed in a prefill batch; typically set to batch_size × max-input-length
  - `--max-batch-size`: Maximum batch size for decode operations
- Performance and memory parameters:
  - `ENABLE_HPU_GRAPH`: Enables HPU graphs usage (crucial for performance)
  - `LIMIT_HPU_GRAPH`: Limits HPU graph memory usage
  - `USE_FLASH_ATTENTION`: Enables optimized attention implementation
  - `FLASH_ATTENTION_RECOMPUTE`: Controls memory vs. computation tradeoff
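As a quick illustration of how these values fit together (illustrative numbers only, not a tuning recommendation), the total token limit is the prompt budget plus the generation budget, and the prefill budget scales with batch size:
# Illustrative arithmetic linking the sequence and batch parameters (example values)
max_input_length = 1024                                   # longest allowed prompt
generation_budget = 1024                                  # room reserved for output tokens
max_total_tokens = max_input_length + generation_budget   # -> 2048
batch_size = 128
max_batch_prefill_tokens = batch_size * max_input_length  # -> 131072
print(max_total_tokens, max_batch_prefill_tokens)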
Step 3: Deploying Llama 3.3-70B on Intel® Gaudi® 2 AI accelerator
Now we're ready to deploy the Llama 3.3-70B model on our Intel® Gaudi® 2 AI accelerator instance. In this example, we'll use only 4 of the 8 available Gaudi cards, demonstrating how to partition resources for multiple workloads on a single system.
Resource Allocation Strategy
By using the `HABANA_VISIBLE_DEVICES` environment variable, we can control which Gaudi cards are accessible to our container. This approach allows for:
- Resource partitioning for multiple independent workloads
- Isolation between different model deployments
- Efficient utilization of available hardware
Model Sharding Configuration
For large models like Llama 3.3-70B, we need to distribute the model weights across multiple accelerators using tensor parallelism. This is configured with:
- `--sharded true`: Enables model sharding
- `--num-shard 4`: Distributes the model across 4 Gaudi cards
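As a rough back-of-the-envelope check (ignoring the KV cache, activations, and runtime overhead): 70B parameters in BF16 amount to roughly 70 × 2 bytes ≈ 140 GB of weights. Sharded across 4 cards, that is about 35 GB of weights per card, comfortably within each card's 96 GB of HBM2E and leaving headroom for the KV cache and HPU graphs.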
Deployment Command
Here's the complete command to deploy Llama 3.3-70B on 4 Gaudi cards:
docker run -it --runtime=habana \
-e HABANA_VISIBLE_DEVICES=0,1,2,3 \
--cap-add=sys_nice \
--net=host \
--ipc=host \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e HF_TOKEN=$HF_TOKEN \
-e ENABLE_HPU_GRAPH=true \
-e LIMIT_HPU_GRAPH=true \
-e USE_FLASH_ATTENTION=true \
-e FLASH_ATTENTION_RECOMPUTE=true \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $VOLUME:/data \
--name $CONTAINER_NAME \
$IMAGE_NAME \
--model-id $MODEL \
--port 8000 \
--sharded true --num-shard 4 \
--max-input-length 1024 \
--max-total-tokens 2048 \
--max-batch-size 128 \
--max-waiting-tokens 10 \
--max-concurrent-requests 128 \
--cuda-graphs 0
TGI server starting up
Understanding Key Parameters
Let's break down the important parameters in this deployment command:
Docker and Gaudi Configuration
- `--runtime=habana`: Specifies the Habana runtime for Gaudi support
- `-e HABANA_VISIBLE_DEVICES=0,1,2,3`: Limits the container to use only the first 4 Gaudi cards
- `--cap-add=sys_nice`: Allows the container to adjust process priorities
- `--net=host`: Uses the host network stack for simplified networking
- `--ipc=host`: Uses the host IPC namespace for shared memory
TGI Optimization Parameters
- `-e ENABLE_HPU_GRAPH=true`: Enables HPU graph optimization for better performance
- `-e LIMIT_HPU_GRAPH=true`: Prevents excessive memory usage by HPU graphs
- `-e USE_FLASH_ATTENTION=true`: Enables optimized attention implementation
- `-e FLASH_ATTENTION_RECOMPUTE=true`: Trades computation for memory efficiency
Model Serving Parameters
- `--port 8000`: Sets the HTTP port for the TGI server
- `--max-input-length 1024`: Limits input prompts to 1024 tokens
- `--max-total-tokens 2048`: Sets the maximum combined input+output tokens
- `--max-batch-size 128`: Controls the maximum batch size for inference
- `--max-concurrent-requests 128`: Limits the number of concurrent requests
- `--max-waiting-tokens 10`: Limits how many tokens the running batch generates before waiting requests are merged in, balancing latency for new requests against overall throughput
TGI 3.0+ Zero-Config Approach
With TGI version 3.0 and above, Hugging Face has introduced a "zero-config" approach that automatically optimizes parameters based on the hardware and model. For many deployments, you can simplify the command to:
# Simplified deployment with automatic optimizations
docker run -it --runtime=habana \
-e HABANA_VISIBLE_DEVICES=0,1,2,3 \
--cap-add=sys_nice \
--net=host \
-e HF_TOKEN=$HF_TOKEN \
-v $VOLUME:/data \
--name $CONTAINER_NAME \
$IMAGE_NAME \
--model-id $MODEL \
--sharded true --num-shard 4
TGI now carefully evaluates the hardware and model to select optimal values for parameters like batch sizes and token limits. According to Hugging Face, removing most configuration flags often results in the best performance for most scenarios. The system dynamically adjusts to provide optimal throughput and memory usage.
Deployment Progress
When you run this command, TGI will:
- Download the model weights (if not already cached)
- Load the model across the 4 Gaudi cards
- Perform warmup operations to optimize performance
- Start the HTTP server on port 8000
The entire process may take several minutes, especially for the first run.
TGI server successfully connected
Monitoring the Deployment
You can monitor the Gaudi cards during the deployment process using the `hl-smi` command in a separate terminal window. This will show you:
- Memory utilization per card
- Power consumption
- Temperature
- Utilization percentages
Once the server is running, you'll see a message indicating that the model is loaded and the server is ready to accept requests.
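If you prefer to script this check, TGI exposes `/health` and `/info` endpoints. The sketch below polls them until the server is ready (it assumes the server runs locally on port 8000, as configured above):
# Poll the TGI server until it reports healthy, then print basic model info
# (assumes the server was started with --port 8000 as in the command above)
import time
import requests

base_url = "http://127.0.0.1:8000"

while True:
    try:
        if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
            break
    except requests.ConnectionError:
        pass
    time.sleep(10)  # model loading and warmup can take several minutes

info = requests.get(f"{base_url}/info", timeout=5).json()
print("Serving:", info.get("model_id"), "| max_total_tokens:", info.get("max_total_tokens"))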
Step 4: Running Inference with Llama 3.3-70B
Once the TGI server is running, you can start sending inference requests to generate text with Llama 3.3-70B. TGI provides a simple HTTP API that makes it easy to integrate with any application.
Basic Inference Request
Here's a simple example using curl to send a text generation request:
curl 127.0.0.1:8000/generate \
-X POST \
-d '{
"inputs":"What is Deep learning?",
"parameters":{
"max_new_tokens": 20
}
}' \
-H 'Content-Type: application/json'
This will return a JSON response containing the generated text:
Llama 3.3-70B model response via curl
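The same request can be sent from Python with the `requests` library; the response body is a JSON object whose `generated_text` field contains the completion. A minimal sketch, assuming the server runs locally on port 8000:
# Minimal Python equivalent of the curl request above
import requests

resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={"inputs": "What is Deep learning?", "parameters": {"max_new_tokens": 20}},
    timeout=60,
)
print(resp.json()["generated_text"])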
Advanced Inference Parameters
TGI supports a wide range of parameters to control the text generation process. Here are some of the most useful ones:
curl 127.0.0.1:8000/generate \
-X POST \
-d '{
"inputs":"Explain the concept of transfer learning in AI:",
"parameters":{
"max_new_tokens": 256,
"temperature": 0.7,
"top_p": 0.95,
"top_k": 50,
"repetition_penalty": 1.1,
"do_sample": true
}
}' \
-H 'Content-Type: application/json'
Parameter Explanation:
- **max_new_tokens**: Maximum number of tokens to generate
- **temperature**: Controls randomness (lower = more deterministic)
- **top_p**: Nucleus sampling parameter (higher = more diverse)
- **top_k**: Limits vocabulary to top k tokens
- **repetition_penalty**: Discourages repetition (higher = less repetition)
- **do_sample**: Whether to use sampling (true) or greedy decoding (false)
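Recent TGI releases also support grammar-guided generation, which pairs naturally with Llama 3.3's structured-output capabilities. The sketch below passes a small, hypothetical JSON schema through the `grammar` parameter; check your TGI version's documentation for the exact level of support:
# Grammar-constrained generation (assumes a TGI version with guided decoding support)
import requests

schema = {  # hypothetical schema for illustration
    "type": "object",
    "properties": {"city": {"type": "string"}, "country": {"type": "string"}},
    "required": ["city", "country"],
}

resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={
        "inputs": "Return the capital of France as JSON:",
        "parameters": {"max_new_tokens": 64, "grammar": {"type": "json", "value": schema}},
    },
    timeout=60,
)
print(resp.json()["generated_text"])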
Streaming Responses
For a more responsive user experience, you can use TGI's streaming API to receive tokens as they're generated:
curl 127.0.0.1:8000/generate_stream \
-X POST \
-d '{
"inputs":"Write a short poem about artificial intelligence:",
"parameters":{
"max_new_tokens": 100
}
}' \
-H 'Content-Type: application/json'
This will return a stream of Server-Sent Events (SSE) containing the generated tokens as they become available.
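If you are not using the dedicated Python client shown next, you can consume the stream directly: each event is a line beginning with `data:` whose JSON payload carries the token text. A minimal sketch, with error handling omitted:
# Consume the SSE stream from /generate_stream without the dedicated client
import json
import requests

with requests.post(
    "http://127.0.0.1:8000/generate_stream",
    json={"inputs": "Write a short poem about artificial intelligence:",
          "parameters": {"max_new_tokens": 100}},
    stream=True,
    timeout=300,
) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue
        event = json.loads(line[len(b"data:"):].decode("utf-8"))
        print(event["token"]["text"], end="", flush=True)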
Python Client Integration
For more complex applications, you can use the TGI Python client (install it with `pip install text-generation`):
from text_generation import Client
# Initialize the client
client = Client("http://127.0.0.1:8000")
# Generate text
response = client.generate(
"Explain quantum computing in simple terms:",
max_new_tokens=150,
temperature=0.8
)
print(response.generated_text)
# Or use streaming
for response in client.generate_stream(
"List 5 applications of machine learning in healthcare:",
max_new_tokens=200
):
if not response.token.special:
print(response.token.text, end="")
Performance Optimization and Benchmarking
To get the best performance from Llama 3.3-70B on Gaudi 2, consider the following optimization strategies:
Batch Size Tuning
The batch size significantly impacts throughput. For Llama 3.3-70B on 4 Gaudi 2 cards:
- Start with a moderate batch size (32-64)
- Gradually increase until you hit memory limits or performance plateaus
- Monitor memory usage with `hl-smi` during testing
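A quick way to find where throughput plateaus is to fire a fixed number of concurrent requests and measure generated tokens per second. The following is a rough load-test sketch, not a rigorous benchmark; the concurrency level and token counts are illustrative and assume the local deployment from Step 3:
# Rough throughput probe: N concurrent requests against the local TGI server
import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://127.0.0.1:8000/generate"
CONCURRENCY = 32   # illustrative; sweep this value
NEW_TOKENS = 128

def one_request(i):
    payload = {"inputs": f"Write a fact about the number {i}:",
               "parameters": {"max_new_tokens": NEW_TOKENS}}
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    return NEW_TOKENS

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    total_tokens = sum(pool.map(one_request, range(CONCURRENCY)))
elapsed = time.time() - start
print(f"~{total_tokens / elapsed:.1f} generated tokens/sec at concurrency {CONCURRENCY}")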
Sequence Length Considerations
Longer sequences require more memory and computation:
- Set `--max-input-length` and `--max-total-tokens` based on your use case
- For chat applications, 2048-4096 tokens is often sufficient
- For document processing, you might need 8192 tokens or more
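To see why sequence length drives memory usage, consider the KV cache: with the published Llama 3 70B configuration (80 layers, 8 KV heads, head dimension 128), each token of context stores roughly 2 × 80 × 8 × 128 × 2 bytes ≈ 0.3 MB in BF16. A single 2048-token sequence therefore holds about 0.6 GB of cache, and a full batch of 128 such sequences roughly 80 GB, spread across the sharded cards.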
Quantization Options
TGI on Gaudi supports FP8 quantization for improved performance through Intel Neural Compressor (INC):
# Enable FP8 quantization by setting the QUANT_CONFIG environment variable
export QUANT_CONFIG=/path/to/quant_config.json
FP8 quantization offers significant benefits:
- Reduces memory bandwidth requirements roughly by half compared to BF16
- Provides up to 2x faster compute performance
- Maintains accuracy for most LLM workloads
Popular LLMs, including the Llama family, have been validated with FP8 using INC. For detailed implementation instructions, refer to the Gaudi FP8 inference documentation.
Conclusion
Testing Environment
Evaluation was done on Intel® Tiber™ AI Cloud (ITAC), which provides access to Intel’s full portfolio of compute platforms for AI workloads—from general-purpose CPUs to specialized AI accelerators, including preview hardware.
ITAC offers pre-configured environments with optimized software stacks, integrated tools, and platform-specific documentation. Learn more at cloud.intel.com.
Intel® Liftoff for Startups
Startups building AI solutions globally can benefit from the Intel® Liftoff program through:
- Compute Access: Project-based credits for ITAC and early access to Intel hardware and software
- Engineering and GTM Support: Technical guidance from Intel engineers and optional co-marketing opportunities
- Zero Equity Model: Intel Liftoff is a no-equity program focused on technical enablement
Apply or learn more at developer.intel.com/liftoff.
Resources
Intel® Tiber™ AI Cloud - Cloud platform for AI development and deployment
Intel® Gaudi® 2 AI accelerator - High-performance AI training processor designed for deep learning workloads