Deploying Llama 4 Scout and Maverick Models on Intel® Gaudi® 3 with vLLM

Eugenie_Wirz

Authors: Rajashekar Kasturi (Senior AI Engineer, Intel® Liftoff), Jaideep Kamisetti (Senior AI Engineer, Intel® Liftoff), and Rahul Unnikrishnan Nair (Head of Engineering, Intel® Liftoff)

Llama 4 is a major leap for open-weight large language models, achieving state-of-the-art results in open-domain language modeling and domain-specific applications. This guide shows how to deploy the Llama 4 models and run inference on Intel® Gaudi® 3 accelerators using vLLM, a high-performance inference engine.

In this tutorial, we’ll focus on two key models from the Llama 4 family:

  1. Llama 4 Scout: A 17B active parameter model (109B total parameters with 16 experts) designed for efficiency while maintaining strong capabilities across a wide range of tasks.
  2. Llama 4 Maverick: A 17B active parameter model (400B total parameters with 128 experts) that delivers exceptional performance on complex reasoning, coding, and multimodal tasks.

 

TL;DR: Key Takeaways

  • Llama 4 MoE Models: Test two Mixture-of-Experts models (Scout: 16 experts, Maverick: 128 experts) with 17B active parameters per token
  • Intel Gaudi 3: Leverage Gaudi 3’s matrix multiplication engines, TPC kernel optimizations, and 128 GB of HBM for high-performance, cost-effective LLM inference (more FLOPS per dollar)

Trivia: Llama 4’s Architectural Innovations

Llama 4 introduces several key architectural improvements over Llama 3, as detailed in Meta’s official blog:

Mixture of Experts (MoE) Architecture

  • First Meta model to use MoE, with alternating dense and sparse expert layers
  • Scout: 16 experts with 109B total parameters (17B active per token)
  • Maverick: 128 experts with 400B total parameters (17B active per token)
  • Each token activates only a fraction of the total parameters, improving efficiency
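To make "only a fraction of the parameters is active" concrete, here is a toy routing sketch in plain Python. It is not Meta's implementation: the softmax gate, the identity shared expert, and the scalar "experts" are all assumptions chosen for illustration. It only shows the idea that a shared expert always runs while a router picks one of the routed experts per token.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_scores, experts, shared_expert):
    """Toy MoE layer: the shared expert always runs, plus the single
    highest-scoring routed expert, weighted by its router probability."""
    probs = softmax(router_scores)
    top = max(range(len(experts)), key=lambda i: probs[i])
    return shared_expert(x) + probs[top] * experts[top](x), top

# Sixteen toy "experts": expert i just multiplies its input by (i + 1).
experts = [lambda x, k=i + 1: k * x for i in range(16)]
shared = lambda x: x  # hypothetical shared expert (identity)

scores = [0.0] * 16
scores[3] = 5.0  # the router strongly prefers expert 3 for this token
y, chosen = moe_layer(2.0, scores, experts, shared)
print(f"token routed to expert {chosen}; 1 of {len(experts)} routed experts ran")

# Active fraction for Scout, using the 17B / 109B figures quoted in the text.
print(f"Scout active fraction: {17 / 109:.1%}")
```

All 16 experts still have to exist in memory, which is why the total parameter count, not the active count, drives the hardware requirements discussed later.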

Native Multimodality

  • Early fusion technology integrates text, image, and video tokens into a unified model backbone
  • Improved vision encoder based on MetaCLIP, trained to better adapt to the LLM

Context Window Capabilities

  • Official specs: Scout supports 10M tokens, Maverick supports 1M tokens
  • While the theoretical limits are high, many cloud providers and practical implementations typically configure for 512K tokens for optimal performance and resource utilization
  • Achieved through specialized mid-training on long-context datasets

 

Quick Reference: Key Terms & Acronyms

Model Architecture Terms:

  • MoE: Mixture of Experts - Neural network architecture where different “expert” networks specialize in different inputs
  • Token: Basic unit of text processing (roughly 4 characters in English)
  • Context Window: Maximum number of tokens a model can process in a single prompt
  • Tensor Parallelism (TP): A model parallelism technique that shards individual layers of a neural network across multiple accelerator (or GPU) cards, making it possible to run models that are too large to fit on a single card.
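The sharding idea behind tensor parallelism can be sketched in a few lines of plain Python. This is a toy illustration, not how vLLM or the Gaudi software stack implements it: the helper names are hypothetical and the "devices" are just list slices. It shows that splitting a weight matrix by columns and concatenating the partial results reproduces the full matrix product.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a k x n matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def shard_columns(w, num_devices):
    """Split the columns of w into num_devices equal contiguous shards."""
    n = len(w[0])
    assert n % num_devices == 0, "columns must divide evenly across devices"
    width = n // num_devices
    return [[row[d * width:(d + 1) * width] for row in w] for d in range(num_devices)]

def tensor_parallel_matmul(x, w, num_devices):
    """Each 'device' multiplies x by its own column shard; concatenating the
    partial outputs (an all-gather on real hardware) rebuilds the full result."""
    out = []
    for shard in shard_columns(w, num_devices):
        out.extend(matmul(x, shard))
    return out

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, num_devices=2)
print(full, sharded)
```

Each device only ever holds its own shard of the weights, which is what lets a model larger than one card's memory run at all.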

Hardware & Performance Terms:

  • HPU: Habana Processing Unit - Intel’s AI accelerator architecture
  • MME: Matrix Multiplication Engine - Specialized hardware for matrix operations
  • TPC: Tensor Processing Core - Programmable compute core for AI workloads

Precision Formats:

  • BF16: Brain Float 16 - 16-bit format with good dynamic range
  • FP8: 8-bit floating point - Lower precision format for efficient computation

 

Understanding Llama 4 Models

Llama 4 Scout (17B-16E-Instruct)

Llama 4 Scout is a 17 billion active parameter model with 16 experts, designed to deliver strong multimodal capabilities while maintaining efficiency. Some of the key technical features are:

  • Architecture: Mixture-of-Experts (MoE) with 16 experts
  • Active Parameters: 17 billion
  • Total Number of Parameters: ~109 billion (with all experts)
  • Context Window: 10 million tokens
  • Multimodal Capabilities: Native support for text and image inputs

The Scout model excels at visual understanding, reasoning, and instruction following tasks, suitable for applications that need to process both text and images efficiently.

Llama 4 Maverick (17B-128E-Instruct)

Llama 4 Maverick represents a more powerful variant with expanded expert capacity. Some of the key technical features are:

  • Architecture: Mixture-of-Experts (MoE) with 128 experts
  • Active Parameters: 17 billion
  • Total Number of Parameters: ~400 billion (with all experts)
  • Context Window: 1 million tokens
  • Multimodal Capabilities: Enhanced visual processing and reasoning
  • Hardware Requirements: Fits on a single host with multiple accelerators

The increased expert count in Maverick enables more specialized handling of different types of queries and inputs, resulting in improved performance across complex reasoning, coding, and multimodal tasks.

Llama 4 Scout vs Maverick: Side-by-Side Comparison

| Feature | Llama 4 Scout | Llama 4 Maverick |
|---|---|---|
| Active Parameters | 17 billion | 17 billion |
| Total Parameters | ~109 billion | ~400 billion |
| Expert Count | 16 experts | 128 experts |
| Context Window | 10M tokens | 1M tokens |
| Hardware Requirements | Two Gaudi 3 cards (tp=2-4) | Multiple Gaudi 3 cards (tp=4-8) |
| Inference Speed | Faster | Slower but more capable |
| Optimal Use Cases | General-purpose tasks, content generation, summarization, multimodal | Complex reasoning, advanced coding, multimodal understanding, specialized domains |
| Tensor Parallelism | tp=2-4 | tp=4-8 |
| Recommended Precision | BF16 or FP8 | FP8 |

 

Deployment Tip: Scout provides an excellent balance of performance and capability for most applications, while Maverick excels at complex tasks requiring deeper reasoning. For production deployments, start with Scout and only move to Maverick if you need the additional capabilities.

 

Intel Gaudi 3 Architecture

Intel® Gaudi® 3 represents the third generation of Intel’s purpose-built AI accelerators, offering significant improvements over previous generations. The architecture is specifically optimized for large language model training and inference workloads.

Intel® Gaudi® 3 Architecture showing dual compute dies with MMEs, TPCs, on-die SRAM caches, 128GB HBM2e memory, network interfaces, and PCIe connectivity

Key Technical Specifications

  • Compute Capacity: 1.8 PFlops of FP8 and BF16 compute
  • Memory: 128 GB of HBM2e with 3.7 TB/s bandwidth
  • Architecture: Dual compute die design with 8 Matrix Multiplication Engines (MMEs) and 64 Tensor Processing Cores (TPCs)
  • Networking: 24x 200 Gbps RDMA NIC ports (4.8 Tbps total bandwidth)
  • Host Interface: PCIe Gen5 x16
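A quick back-of-the-envelope calculation shows what these numbers imply. The sketch below uses only the figures from the list above plus the 17B active-parameter count. The decode bound is a deliberately rough assumption: it models one sequence on one card, assumes each active weight is read from HBM once per generated token, and ignores KV cache traffic, batching, and multi-card communication.

```python
PEAK_FLOPS = 1.8e15        # 1.8 PFLOPS (FP8/BF16), from the spec list above
HBM_BANDWIDTH = 3.7e12     # 3.7 TB/s, from the spec list above
NIC_PORTS, PORT_GBPS = 24, 200

# Aggregate network bandwidth: 24 x 200 Gbps.
total_tbps = NIC_PORTS * PORT_GBPS / 1000

# Machine balance: FLOPs the chip can execute per byte fetched from HBM.
# Kernels with lower arithmetic intensity than this are memory-bound.
flops_per_byte = PEAK_FLOPS / HBM_BANDWIDTH

def decode_tokens_per_second_bound(active_params, bytes_per_param):
    """Upper bound for single-sequence decode on one card, assuming every
    active weight is read from HBM once per generated token."""
    return HBM_BANDWIDTH / (active_params * bytes_per_param)

print(f"network: {total_tbps} Tbps")
print(f"machine balance: {flops_per_byte:.0f} FLOPs/byte")
print(f"BF16 bound: {decode_tokens_per_second_bound(17e9, 2):.0f} tokens/s")
print(f"FP8 bound:  {decode_tokens_per_second_bound(17e9, 1):.0f} tokens/s")
```

Because token-by-token decoding has low arithmetic intensity, memory bandwidth rather than peak FLOPS tends to set the ceiling, which is one reason halving the bytes per weight with FP8 is attractive.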

Compute Engine Architecture

Gaudi 3’s heterogeneous compute architecture consists of two complementary engines that work together to optimize different aspects of AI workloads:

Matrix Multiplication Engines (MMEs)

The 8 MMEs are designed for efficient, compute-dense matrix multiplication operations, which form the backbone of transformer-based LLMs:

  • Optimized for: Dense matrix operations in attention mechanisms and feed-forward networks
  • Precision support: Native FP8 (both E4M3 and E5M2 formats), BF16, FP16, and FP32
  • Key ops: Linear layers, attention projections, and MLP blocks

Tensor Processing Cores (TPCs)

The 64 TPCs are programmable VLIW (Very Long Instruction Word) processors that complement the MMEs with:

  • Programmability: Fully programmable cores that can execute custom kernels
  • Versatility: Handles custom activation functions and data preprocessing
  • Precision support: FP32, BF16, INT32, INT16, INT8 and other typical formats
  • Key ops: Layer norm, Softmax, GELU activations, and custom operators

Trivia: The VLIW Architecture in Gaudi TPCs

The VLIW (Very Long Instruction Word) architecture used in Gaudi’s TPCs represents a fascinating intersection of classical computer architecture principles and modern AI acceleration needs:

Historical Context

VLIW architectures emerged in the 1980s as an alternative approach to instruction-level parallelism, one that moves instruction scheduling out of the hardware and into the compiler.

Why VLIW Works for Gaudi’s AI Workloads

While VLIW struggled in general-purpose computing, it found a suitable home in Gaudi’s domain-specific workloads:

  • Predictable workloads: AI kernels have predictable execution patterns that compilers can optimize effectively
  • Computational density: More functional units can be packed into the same silicon area

VLIW in Gaudi TPCs

Gaudi’s implementation has several distinctive characteristics:

  • 32 SIMD lanes per TPC for vector operations
  • Multiple functional units (ALUs, load/store, transcendental functions)
  • Hardware loops to minimize control overhead
  • Direct access to local SRAM with deterministic latency

This choice of VLIW for Gaudi demonstrates how classical architectures can be adapted for modern AI workloads when matched with appropriate compiler technology.

Graph Compilation Mode

One of Gaudi 3’s key performance advantages comes from its graph compilation capabilities, known as HPU Graphs. This feature significantly improves inference performance by:

  1. Pre-compilation: The entire computational graph is analyzed and optimized before execution
  2. Kernel fusion: Multiple operations are fused into optimized kernels to reduce memory transfers
  3. Memory planning: Allocations are pre-planned to minimize fragmentation and maximize reuse
  4. Execution scheduling: Operations are scheduled to maximize hardware utilization

For LLM inference, graph compilation is particularly beneficial for the prefill phase, where the entire prompt is processed at once. The environment variable ENABLE_HPU_GRAPH=true enables this optimization, which can considerably reduce latency compared to eager execution.

Trivia: From PyTorch to HPU - The Compilation Journey

The journey from a PyTorch model to optimized execution on Gaudi hardware involves a multi-stage compilation pipeline:

  1. PyTorch Model Ingestion
  2. Graph-Level Optimizations
    • Operation fusion: Combining adjacent operations (e.g., Conv+ReLU, Linear+GELU)
    • Constant folding: Pre-computing operations with constant inputs
    • Memory layout optimization: Transforming tensor layouts for optimal access patterns
  3. Hardware-Aware Partitioning

    The optimized graph is partitioned based on hardware affinity:

    • MME-targeted operations: Dense matrix multiplications and regular operations
    • TPC-targeted operations: Activations, normalizations, and custom kernels
    • Host CPU operations: Operations not supported on Gaudi or more efficient on CPU (profiling the workload to understand which ops are run on the host CPU and switching these to appropriate HPU kernels can help in minimizing host to device transfer cycles)
  4. Runtime Execution

    During inference, the Synapse Runtime manages execution with features like:

    • Just-in-time compilation: For dynamic shapes or first-time execution
    • Graph caching: Reusing compiled graphs for similar inputs
    • Asynchronous execution: Overlapping host and device computation

This compilation pipeline is key to achieving optimal performance by translating PyTorch ops to efficient hardware instructions to leverage the full capabilities of Gaudi’s architecture.
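The interplay between just-in-time compilation and graph caching can be pictured with a small toy model. The sketch below is not the Synapse Runtime: the `GraphCache` class, the bucket sizes, and the idea of rounding sequence lengths up to a bucket are assumptions used to illustrate why "similar inputs" can share one compiled graph.

```python
import bisect

class GraphCache:
    """Toy model of compile-once, replay-many execution keyed by input shape."""

    def __init__(self, seq_buckets):
        self.seq_buckets = sorted(seq_buckets)
        self.compiled = {}      # (batch_size, padded_len) -> "compiled graph"
        self.compile_count = 0

    def bucket_for(self, seq_len):
        """Round seq_len up to the nearest configured bucket."""
        i = bisect.bisect_left(self.seq_buckets, seq_len)
        if i == len(self.seq_buckets):
            raise ValueError(f"sequence length {seq_len} exceeds largest bucket")
        return self.seq_buckets[i]

    def run(self, batch_size, seq_len):
        key = (batch_size, self.bucket_for(seq_len))
        if key not in self.compiled:
            # Stand-in for the expensive just-in-time compilation step.
            self.compile_count += 1
            self.compiled[key] = f"graph{key}"
        return self.compiled[key]

cache = GraphCache(seq_buckets=[128, 256, 512, 1024])
for length in (100, 120, 128, 300, 500):
    cache.run(batch_size=1, seq_len=length)
print(f"5 requests, {cache.compile_count} compilations")
```

The trade-off in a scheme like this is padding waste versus compilation count: fewer, coarser buckets mean fewer graphs to compile but more padded tokens per request.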

 

vLLM Inference Engine and Habana Integration

vLLM is a high-performance inference engine designed specifically for large language models. Working closely with the open source community, Intel’s teams have made significant contributions to the vLLM project to enable optimized execution on Gaudi hardware (and also on Xeon and XPUs), creating a specialized fork with Gaudi-specific enhancements.

Core vLLM Features

  • PagedAttention: Manages the key-value (KV) cache as a paged virtual memory structure, so each sequence sees a logically contiguous cache even though it is stored in non-contiguous physical memory blocks. This reduces fragmentation and makes KV cache use more memory-efficient
  • Continuous Batching: Dynamically processes incoming requests without waiting for a full batch, improving throughput
  • Tensor Parallelism: Distributes model weights across multiple accelerators, enabling efficient inference of large models that wouldn’t fit on a single device
  • Quantization Support: Includes support for various precision formats, including FP8 which is natively supported by Gaudi 3
  • OpenAI-compatible API: Provides a familiar interface for application integration
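The paging idea behind PagedAttention can be sketched with a small block table in plain Python. This is a conceptual toy, not vLLM's data structures: the `PagedKVCache` class and its method names are hypothetical, and no tensors are stored. It shows how logically contiguous token positions map onto whatever physical blocks happen to be free.

```python
class PagedKVCache:
    """Toy block table: each sequence maps logical blocks to physical blocks."""

    def __init__(self, num_physical_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block when needed."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:          # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = n + 1

    def physical_slot(self, seq_id, token_index):
        """Translate a logical token position into (physical block, offset)."""
        block = self.block_tables[seq_id][token_index // self.block_size]
        return block, token_index % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]

cache = PagedKVCache(num_physical_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("a")      # interleave two sequences so their
    cache.append_token("b")      # physical blocks are not contiguous
print(cache.block_tables)
print(cache.physical_slot("a", 4))
```

Because blocks are handed out on demand and returned when a sequence finishes, a scheme like this pairs naturally with continuous batching, where requests enter and leave the batch at different times.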

Testing Environment

  • Intel® Tiber™ AI Cloud
  • Hardware: 8-way Gaudi 3 Deep Learning Server
  • Synapse AI: v1.20.1

 

Step 1: Setting Up the Environment

Deploying Llama 4 models on Gaudi 3 requires a properly configured environment with the appropriate software stack. We’ll use Intel’s optimized container images and the Habana-specific fork of vLLM that includes optimizations for Gaudi hardware.

Container Configuration and Performance Optimization

The first step is to launch a container with the Habana Synapse AI software stack.

# Define container parameters
export IMAGE_NAME=vault.habana.ai/gaudi-docker/1.20.1/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
export CONTAINER_NAME=llama4-vllm-1.20

The container image includes:

  • PyTorch 2.6.0 with Habana-specific optimizations and custom operators
  • Synapse AI 1.20.1 runtime with the latest compiler (at the time of testing) and runtime optimizations
  • Ubuntu 22.04 base OS with necessary system libraries

Next, we launch the container with specific configurations that are crucial for maximizing inference performance:

docker run -d -it --runtime=habana \
--cap-add=sys_nice \
--ipc=host \
--net=host \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e HF_HOME=/data/huggingface \
-v /software:/software \
--name $CONTAINER_NAME \
$IMAGE_NAME /bin/bash

# Connect to the running container
docker exec -it $CONTAINER_NAME /bin/bash

Critical Docker Parameters and Their Performance Impact

  • --runtime=habana: Enables the Habana container runtime, which provides direct access to Gaudi devices through a custom OCI runtime hook. This is essential for the container to communicate with the Habana Driver and Synapse Runtime.
  • HABANA_VISIBLE_DEVICES=all: Controls which Gaudi devices are visible to the container. Setting this to “all” makes all Gaudi accelerators available, enabling tensor parallelism across the full system. For isolation, you can specify specific device indices (e.g., “0,1,2,3”).
  • OMPI_MCA_btl_vader_single_copy_mechanism=none: This setting prevents potential deadlocks in the MPI communication layer and improves stability for long-running inference servers.

These configuration parameters work together to create an environment that minimizes overhead between the application and the hardware, allowing the Gaudi accelerators to operate at peak efficiency for LLM inference workloads.

Installing vLLM with Gaudi Support

Let’s clone and set up vLLM with Gaudi support (inside the container).

# Create a working directory
mkdir -p /software/users/ && cd /software/users

# Clone the Habana-optimized vLLM fork with Llama 4 support
git clone https://github.com/HabanaAI/vllm-fork -b llama4
cd vllm-fork

# Install dependencies
pip install -r requirements-hpu.txt

# Ensure compatible numpy version
pip install numpy==1.26.4

# Install the HPU extension package
pip install git+https://github.com/HabanaAI/vllm-hpu-extension.git@145c63d

# Install additional dependencies for the OpenAI-compatible API server
pip install pydantic msgspec cachetools cloudpickle psutil zmq blake3 py-cpuinfo \
    aiohttp openai uvloop fastapi uvicorn watchfiles partial_json_parser \
    python-multipart gguf llguidance prometheus_client numba compressed_tensors

 

Step 2: Deploying Llama 4 Scout

Hardware Requirements

Important Note: Llama 4 Scout has 17B active parameters per token (the shared expert, one routed expert, and router networks), but as an MoE model, the full ~109B parameters (all 16 experts) must be loaded into memory. This requires more substantial hardware than 17B dense models. Based on our observations:

  • Minimum Configuration: At least 2 cards of Gaudi 3 accelerators with tensor parallelism
  • Recommended Configuration: 4 Gaudi 3 accelerators for production workloads (for extended context window and high concurrency)
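The two-card minimum follows from simple arithmetic on the weights alone. The sketch below is an estimate under stated assumptions: it counts only weight storage (no KV cache, activations, or graph workspace), treats 1 GB as 1e9 bytes, and reuses the 0.95 memory utilization value from the server commands later in this guide. The function names are illustrative.

```python
import math

HBM_PER_CARD_GB = 128  # Gaudi 3 HBM capacity, from the spec list earlier

def weight_memory_gb(total_params_billion, bytes_per_param):
    """Memory needed just for the weights, in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bytes_per_param

def min_cards_for_weights(total_params_billion, bytes_per_param, utilization=0.95):
    """Smallest card count whose usable HBM holds the weights alone."""
    usable = HBM_PER_CARD_GB * utilization
    return math.ceil(weight_memory_gb(total_params_billion, bytes_per_param) / usable)

for name, nbytes in (("BF16", 2), ("FP8", 1)):
    gb = weight_memory_gb(109, nbytes)
    cards = min_cards_for_weights(109, nbytes)
    print(f"Scout {name}: ~{gb} GB of weights, at least {cards} card(s)")
```

This is a lower bound. The KV cache grows with context length and concurrency, which is why more cards are recommended for production workloads than the weights alone would require.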

In this section, we’ll deploy and evaluate the Llama 4 Scout model (17B-16E-Instruct), which balances capabilities with hardware requirements. This model is particularly well-suited for applications that need multimodal understanding without requiring the full capacity of the larger Maverick model.

Downloading the Model

First, we need to download the model weights from Hugging Face. Since Llama 4 models are gated, you will need to accept the terms of use and get a valid Hugging Face token with appropriate access permissions.

# Set environment variables
export MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export HF_TOKEN="<hf_token>"  # Replace <hf_token> with your actual token

# Create directory for model storage (absolute path, matching the /data/models paths used below)
mkdir -p /data/models && cd /data/models

# Install required packages
pip install huggingface-hub datasets

# Download model weights
huggingface-cli download --local-dir Llama-4-Scout-17B-16E-Instruct ${MODEL} --token ${HF_TOKEN}

Quick Start: 5-Minute Validation

Want to quickly verify your setup works? Here’s a minimal Python script to test Llama 4 Scout on Gaudi:

import os
import time
import subprocess
from vllm import LLM, SamplingParams

# First, check if Gaudi cards are visible and properly configured
print("Checking Gaudi hardware configuration...")
try:
    hl_smi_output = subprocess.check_output(["hl-smi"], text=True)
    print(hl_smi_output)
    # Count available devices
    device_count = hl_smi_output.count("Device ID")
    print(f"Found {device_count} Gaudi devices")
    if device_count < 2:
        print("WARNING: Llama 4 Scout performs best with at least 2 Gaudi cards")
except Exception as e:
    print(f"Error checking Gaudi hardware: {e}")
    print("Please ensure Habana drivers are properly installed")
    exit(1)

# Set environment variables for optimal performance
os.environ["HABANA_VISIBLE_DEVICES"] = "0,1"  # Use first two cards
os.environ["ENABLE_HPU_GRAPH"] = "1"         # Enable HPU Graph for better performance
os.environ["PT_HPU_LAZY_MODE"] = "1"         # Enable lazy mode for better memory management

# Initialize the model with tensor parallelism
print("\nInitializing Llama 4 Scout model with tensor parallelism...")
start_time = time.time()
model_path = "/data/models/Llama-4-Scout-17B-16E-Instruct/"

llm = LLM(
    model=model_path,
    tensor_parallel_size=2,  # Use 2 Gaudi cards in parallel
    dtype="bfloat16",       # Use BF16 precision for optimal performance
    max_model_len=2048,      # Small context for a quick test; must fit the prompt plus max_tokens
    enforce_eager=False,     # Use HPU Graph compilation
)
print(f"Model loaded in {time.time() - start_time:.2f} seconds")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=64
)

# Run a simple test prompt
prompt = """You are an AI assistant built by Meta. Please explain in 1 sentence how Mixture of Experts architecture works in LLMs."""

print("\nRunning inference...")
start_time = time.time()
outputs = llm.generate([prompt], sampling_params)
inference_time = time.time() - start_time

# Print results
generated_text = outputs[0].outputs[0].text
print(f"\nGenerated response:\n{generated_text}")
print(f"\nInference completed in {inference_time:.2f} seconds")
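# Count generated tokens (not whitespace-separated words) for the throughput figure
num_tokens = len(outputs[0].outputs[0].token_ids)
print(f"Generation speed: {num_tokens / inference_time:.2f} tokens/second")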

This minimal example lets you quickly verify that:

  1. The Gaudi hardware is properly detected
  2. The model loads correctly
  3. Inference works end-to-end
  4. You’re getting sensible outputs

Deploying as an OpenAI-compatible API Server

One of vLLM’s key features is its ability to serve models through an OpenAI-compatible API, making it easy to integrate with existing applications that use the OpenAI SDK.

# Set port for the API server
export PORT=8000

# Set environment variables for optimal performance
# (the server below uses tensor parallelism across 4 Gaudi devices)
export HABANA_VISIBLE_DEVICES="ALL"
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPU_WEIGHT_SHARING=0

# Launch the API server
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/models/Llama-4-Scout-17B-16E-Instruct/ \
    --tensor-parallel-size 4 \
    --max-model-len 2048 \
    --port $PORT \
    --host :: \
    --dtype bfloat16 \
    --use-v2-block-manager \
    --block-size 128 \
    --distributed_executor_backend ray \
    --gpu_memory_utilization 0.95 \
    --trust_remote_code

Key parameters explained:

  • Environment variables:
    • HABANA_VISIBLE_DEVICES="ALL": Makes all Gaudi accelerators available to the application
    • PT_HPU_ENABLE_LAZY_COLLECTIVES=true: Optimizes collective operations for better performance
    • PT_HPU_WEIGHT_SHARING=0: Disables weight sharing for improved throughput
  • API server parameters:
    • --model: Path to the model directory containing weights and configuration
    • --tensor-parallel-size 4: Enables tensor parallelism across 4 Gaudi devices
    • --max-model-len 2048: Sets the maximum sequence length to 2048 tokens
    • --dtype bfloat16: Uses BF16 precision for model weights and activations
    • --use-v2-block-manager: Enables the improved block allocation algorithm
    • --block-size 128: Sets the KV cache block size to 128 tokens
    • --distributed_executor_backend ray: Uses Ray for distributed execution
    • --gpu_memory_utilization 0.95: Allocates 95% of available HBM memory for the model graph and KV cache

Testing Multimodal Capabilities

Llama 4 Scout’s multimodal capabilities allow it to process both text and images. We can test this functionality using simple curl commands to the API server.

Text-only Query

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "India or Bharat is a",
        "max_tokens": 256,
        "temperature": 0
    }'
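The same text-only request can be issued from Python. The sketch below uses only the standard library and assumes the server started above is listening on localhost:8000; the helper names `build_completion_request` and `complete` are hypothetical, and the response parsing assumes the usual OpenAI-style completions shape (`choices[0].text`).

```python
import json
import urllib.request

def build_completion_request(prompt, max_tokens=256, temperature=0.0,
                             base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request and return the first completion's text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]

request = build_completion_request("India or Bharat is a")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# With the server running: print(complete("India or Bharat is a"))
```

Separating request construction from sending makes it easy to inspect or log the payload before any network traffic happens.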

Image Understanding

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
      "prompt": "![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?",
      "max_tokens": 256,
      "seed": 42
  }'

Visual Analysis with Reasoning

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
      "prompt": "![](https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/europe.png)what countries are on the map, specify only those that are with a text",
      "seed": 42,
      "temperature": 0.6,
      "top_p": 0.9,
      "max_tokens": 128
  }'

The model demonstrates its ability to process and understand images, as shown in the sample output below:

Llama 4 Scout multimodal responses

Advanced Configuration for Llama 4 Scout

When deploying Llama 4 Scout for production use cases, several configuration parameters can be tuned to optimize performance based on your specific requirements. Here we explore key configuration options and their impact on the model’s behavior.

Tensor Parallelism Optimization

Tensor parallelism is a critical technique for distributing model weights across multiple accelerators. For the Llama 4 Scout model on Gaudi 3, we can adjust the tensor parallelism degree based on the available hardware:

# For maximum throughput on a full Gaudi 3 system (8 cards)
export HABANA_VISIBLE_DEVICES="ALL"
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPU_WEIGHT_SHARING=0
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/models/Llama-4-Scout-17B-16E-Instruct/ \
    --tensor-parallel-size 8 \
    --distributed_executor_backend ray \
    --gpu_memory_utilization 0.95


# For scenarios where you need to run multiple models on the same system
export HABANA_VISIBLE_DEVICES="ALL"
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPU_WEIGHT_SHARING=0
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/models/Llama-4-Scout-17B-16E-Instruct/ \
    --tensor-parallel-size 2 \
    --distributed_executor_backend ray \
    --gpu_memory_utilization 0.95

The optimal tensor parallelism (tp) degree depends on the model size, input sequence length, concurrency, context length, and batch size requirements.

Context Length Management

Llama 4 models support extremely long context windows (up to 10 million tokens), but processing very long sequences requires significant memory and compute resources. The --max-model-len parameter allows you to balance context length capabilities with resource efficiency:

# For standard conversational AI (2K context)
export HABANA_VISIBLE_DEVICES="ALL"
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPU_WEIGHT_SHARING=0
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/models/Llama-4-Scout-17B-16E-Instruct/ \
    --max-model-len 2048 \
    --distributed_executor_backend ray \
    --gpu_memory_utilization 0.95


# For long-context applications (32K context)
export HABANA_VISIBLE_DEVICES="ALL"
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPU_WEIGHT_SHARING=0
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/models/Llama-4-Scout-17B-16E-Instruct/ \
    --max-model-len 32768 \
    --distributed_executor_backend ray \
    --gpu_memory_utilization 0.95

Longer context lengths enable the model to process more information at once but increase memory usage and may impact throughput.
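How much memory a longer context costs can be estimated from the size of the KV cache. The sketch below is a rough estimate only. The model shape (48 layers, 8 KV heads, head dimension 128) is an assumption for illustration and should be checked against the model's config.json, and the calculation assumes every concurrent sequence fills the whole context with a BF16 cache.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Keys and values (factor 2) for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def kv_cache_gib(context_len, concurrent_seqs, **model):
    """KV cache needed if every sequence fills the whole context."""
    total = kv_cache_bytes_per_token(**model) * context_len * concurrent_seqs
    return total / 2**30

# ASSUMED model shape for illustration; read the real values from config.json.
scout_like = dict(num_layers=48, num_kv_heads=8, head_dim=128, bytes_per_value=2)

for context_len in (2048, 32768):
    gib = kv_cache_gib(context_len, concurrent_seqs=16, **scout_like)
    print(f"max-model-len {context_len}: {gib:.1f} GiB of KV cache for 16 sequences")
```

Under these assumptions the cache grows linearly with both context length and concurrency, so a 16x longer context needs 16x the KV cache for the same number of sequences. Setting `bytes_per_value=1` models an FP8 KV cache, which halves the figure.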

Quantization Options

vllm-hpu-extension

Gaudi 3 accelerators support FP8 precision, which can significantly improve inference performance while maintaining model quality. Enabling FP8 quantization requires a calibration process followed by proper configuration:

# Step 1: Calibrate the model using the calibrate_model.sh script from vllm-hpu-extension
# This generates a quantization configuration file (e.g., maxabs_quant_g3.json)

# Step 2: Set the environment variable to point to the quantization config file
export QUANT_CONFIG=/path/to/quant/config/inc/llama-4-scout/maxabs_quant_g3.json

# Step 3: Enable FP8 quantization for Llama 4 Scout using the INC approach
export HABANA_VISIBLE_DEVICES="ALL"
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPU_WEIGHT_SHARING=0
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/models/Llama-4-Scout-17B-16E-Instruct/ \
    --tensor-parallel-size 8 \
    --quantization inc \
    --kv-cache-dtype fp8_inc \
    --distributed_executor_backend ray \
    --gpu_memory_utilization 0.95 \
    --trust_remote_code

This approach uses Intel Neural Compressor (INC) for FP8 quantization, which can provide up to a 2x performance improvement over BF16 (depending on the model), with minimal impact on model quality for most use cases.
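The quantization config file name (maxabs_quant_g3.json) suggests max-abs scaling, so here is a generic sketch of that idea in plain Python. It is not INC's implementation: the per-tensor granularity, the helper names, and the software rounding routine are assumptions for illustration. It scales a tensor so its largest magnitude lands on the E4M3 maximum of 448, rounds each value to the nearest E4M3-representable number, and scales back.

```python
import math

E4M3_MAX = 448.0          # largest finite E4M3 value
E4M3_MIN_EXP = -6         # exponent of the smallest normal number
MANTISSA_BITS = 3

def round_to_e4m3(x):
    """Round x to the nearest value representable in E4M3 (saturating)."""
    if x == 0.0:
        return 0.0
    _, e = math.frexp(abs(x))                 # abs(x) = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, E4M3_MIN_EXP)       # subnormals share the minimum exponent
    step = 2.0 ** (exponent - MANTISSA_BITS)  # spacing between neighbouring values
    rounded = round(abs(x) / step) * step
    return math.copysign(min(rounded, E4M3_MAX), x)

def maxabs_quantize(values):
    """Per-tensor max-abs scaling: map the largest magnitude onto E4M3_MAX."""
    scale = (max(abs(v) for v in values) / E4M3_MAX) or 1.0  # avoid a zero scale
    quantized = [round_to_e4m3(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.02, -0.75, 0.31, 1.4, -0.003]
q, scale = maxabs_quantize(weights)
restored = dequantize(q, scale)
for w, r in zip(weights, restored):
    print(f"{w:+.4f} -> {r:+.4f} (abs error {abs(w - r):.5f})")
```

Calibration matters because the scale is derived from observed maxima: a scale that is too small saturates large values, while one that is too large pushes small values into the coarse subnormal range.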

Trivia: Numerical Precision in AI Acceleration

The numerical precision formats used in AI hardware represent fascinating trade-offs between computational efficiency, memory usage, and numerical accuracy:

FP8 (8-bit floating point) - A balanced approach to AI computation efficiency

While lower precision formats exist (such as FP4 and even FP2), FP8 represents a practical balance between efficiency and accuracy. Gaudi 3 supports two FP8 variants:

  • E4M3: 4 exponent bits, 3 mantissa bits, 1 sign bit
    • Range: ±448
    • Smallest positive normal: 2^-6 = 0.015625
    • Well-suited for weight storage, where its extra mantissa bit gives higher precision over a narrower range

  • E5M2: 5 exponent bits, 2 mantissa bits, 1 sign bit
    • Range: ±57,344
    • Smallest positive normal: 2^-14 ≈ 0.000061
    • Better for activation values that can span many orders of magnitude, thanks to its wider dynamic range
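The range and smallest-normal figures for both FP8 variants can be checked by decoding the relevant bit patterns directly. The sketch below assumes the common encodings: E4M3 with bias 7, where only the all-ones pattern is NaN (consistent with the ±448 range), and E5M2 with bias 15 and an IEEE-style reserved top exponent. The `decode_fp8` helper is illustrative and ignores NaN and infinity.

```python
def decode_fp8(bits, exp_bits, man_bits, bias):
    """Decode an 8-bit pattern (sign | exponent | mantissa) into a float.
    Special values (NaN, infinity) are not handled in this sketch."""
    sign = -1.0 if bits >> 7 else 1.0
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    if exponent == 0:   # subnormal: no implicit leading 1
        return sign * mantissa / (1 << man_bits) * 2.0 ** (1 - bias)
    return sign * (1 + mantissa / (1 << man_bits)) * 2.0 ** (exponent - bias)

# E4M3: only S.1111.111 is NaN, so 0.1111.110 is the largest finite value.
e4m3_max = decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3, bias=7)
e4m3_min_normal = decode_fp8(0b0_0001_000, exp_bits=4, man_bits=3, bias=7)

# E5M2: exponent 11111 is reserved for inf/NaN, so 0.11110.11 is the largest.
e5m2_max = decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2, bias=15)
e5m2_min_normal = decode_fp8(0b0_00001_00, exp_bits=5, man_bits=2, bias=15)

print(e4m3_max, e4m3_min_normal)
print(e5m2_max, e5m2_min_normal)
```

The decoded values show the trade-off directly: E5M2 spends one more bit on the exponent and reaches a much larger range, while E4M3 spends it on the mantissa and resolves values more finely.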

BF16 (Brain Floating Point) - The workhorse of modern AI computation

  • 8 exponent bits, 7 mantissa bits, 1 sign bit
  • Same exponent size as FP32, but with reduced mantissa precision
  • Range: ±3.39×10^38 (nearly identical to FP32)
  • Preserves the dynamic range of FP32 while reducing memory footprint by 50%
  • Developed by Google for their TPUs and now widely adopted in AI hardware
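Since BF16 is simply the top 16 bits of an FP32 value, the conversion is easy to demonstrate with the standard library. The sketch below rounds to nearest-even when dropping the low 16 bits; the `to_bf16` helper is illustrative and does not special-case NaN.

```python
import struct

def to_bf16(x):
    """Round a Python float to BF16 (round-to-nearest-even) and return it as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # FP32 bit pattern
    # Add the rounding bias, then drop the low 16 mantissa bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bf16_bits = ((bits + rounding_bias) >> 16) & 0xFFFF
    return struct.unpack("<f", struct.pack("<I", bf16_bits << 16))[0]

for value in (1.0, 3.14159265, 1e-30, 65504.0, 1e38):
    print(f"{value!r:>14} -> {to_bf16(value)!r}")
```

Very small and very large inputs such as 1e-30 and 1e38 survive the conversion because the 8-bit exponent is kept intact. What is lost is precision: only about two to three significant decimal digits remain.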

FP16 (16-bit floating point) - The IEEE 754 half-precision standard

  • 5 exponent bits, 10 mantissa bits, 1 sign bit
  • Range: ±65,504
  • Limited dynamic range compared to BF16, but higher precision within that range
  • Still useful for certain inference workloads with controlled numerical ranges

FP32 (32-bit floating point) - The traditional standard

  • 8 exponent bits, 23 mantissa bits, 1 sign bit
  • Range: ±3.4×10^38
  • The baseline for numerical accuracy in deep learning
  • Used for accumulation in Gaudi 3’s MMEs (internally) to prevent precision loss

The ability to use different precision formats for different parts of the model is a key advantage of modern AI hardware, allowing developers to select the appropriate format for each operation based on accuracy and performance requirements.

 

Step 3: Deploying Llama 4 Maverick

Hardware Requirements

Note: Llama 4 Maverick has 17B active parameters per token (the shared expert, one routed expert, and router networks), but as an MoE model with 128 experts, the full ~400B parameters must be loaded into memory. This demands substantial hardware resources:

  • Tested Configuration: Full 8-card Gaudi 3 system with tensor parallelism
  • Accelerator Memory: All 8 cards with 128GB HBM2e memory each
  • Network: High-speed inter-card communication leveraging Gaudi’s integrated RDMA

Llama 4 Maverick represents a significant step up in capability from Scout, with its 128-expert architecture providing enhanced reasoning, coding, and multimodal understanding. Deploying this model follows a similar process but requires careful consideration of its larger resource requirements.

Model Configuration for Maverick

The Maverick model benefits from FP8 quantization on Gaudi 3, which helps manage its larger parameter count while maintaining performance. Similar to Scout, this requires proper calibration:

# Step 1: Calibrate the Maverick model using the calibrate_model.sh script
# This generates a quantization configuration file specific for Maverick

# Step 2: Set the environment variable to point to the Maverick quantization config
export QUANT_CONFIG=/path/to/quant/config/inc/llama-4-maverick/maxabs_quant_g3.json

# Step 3: Deploy Maverick with FP8 quantization using INC
export MODEL_PATH=/data/models/Llama-4-Maverick-17B-128E-Instruct

export HABANA_VISIBLE_DEVICES="ALL"
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true
export PT_HPU_WEIGHT_SHARING=0
python3 -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --tensor-parallel-size 8 \
    --quantization inc \
    --kv-cache-dtype fp8_inc \
    --max-model-len 4096 \
    --host :: \
    --port 8000 \
    --distributed_executor_backend ray \
    --gpu_memory_utilization 0.95 \
    --use-v2-block-manager \
    --block-size 128 \
    --trust_remote_code

Key considerations for Maverick deployment:

  1. Memory Management: The larger expert count requires more careful memory management
  2. Tensor Parallelism: Full tensor parallelism (tp=8) is recommended for optimal performance (for long-context and high-concurrency use cases)
  3. Quantization: FP8 quantization with proper calibration is particularly beneficial for this model

 

Conclusion

Deploying Llama 4 Scout and Maverick models on Intel Gaudi 3 accelerators with vLLM provides an efficient path to leveraging these powerful multimodal models for a wide range of applications. The combination of Gaudi 3’s specialized AI architecture and vLLM’s optimized inference engine enables high-performance, cost-effective deployment of these state-of-the-art models.

Intel® Tiber™ AI Cloud

All testing was done on Intel® Tiber™ AI Cloud (ITAC), which provides access to Intel’s full portfolio of compute platforms for AI workloads - from general-purpose CPUs to specialized AI accelerators, including Gaudi 3.

ITAC offers several advantages for AI development and deployment:

  • Pre-configured Environments: Ready-to-use software stacks optimized for AI workloads
  • Diverse Hardware Options: Access to the latest Intel AI accelerators, including preview hardware

To learn more about Intel® Tiber™ AI Cloud and how to access its resources, visit cloud.intel.com.

Intel® Liftoff Program

Startups building AI solutions globally can benefit from the Intel® Liftoff program through:

  • Compute Access: Project-based credits for ITAC and early access to Intel hardware and software
  • Engineering and GTM Support: Technical guidance from Intel engineers and optional co-marketing opportunities
  • Zero Equity Model: Intel® Liftoff is a no-equity program focused on technical enablement

Apply or learn more at developer.intel.com/liftoff

About the Author
I'm a proud team member of the Intel® Liftoff for Startups, an innovative, free virtual program dedicated to accelerating the growth of early-stage AI startups.