Authors: Rajashekar Kasturi (Senior AI Engineer, Intel® Liftoff) and Rahul Unnikrishnan Nair (Head of Engineering, Intel® Liftoff)
This technical guide, written by the Intel® Liftoff technical mentors, demonstrates how to deploy and serve state-of-the-art VLMs using Text Generation Inference (TGI) with hardware-specific optimizations for performant inference on Intel® XPUs. The examples in this guide use the Intel® Tiber™ AI Cloud environment, but the techniques and optimizations can be applied to Intel® XPUs in various deployment scenarios.
Intel®'s Data Center GPU Max Series (XPU) features up to 128 Xe cores and over 100 billion transistors in a single package. These GPUs are designed to accelerate AI workloads such as vision-language models, using features like Intel® Xe Matrix Extensions (XMX) and large L2 cache configurations (up to 408MB).
In this technical walk-through, we'll guide you through the complete workflow of setting up your Intel® Max Series GPU virtual machine, launching a TGI container specifically optimized for Intel® XPUs with Intel® oneAPI Base Toolkit acceleration, and serving a production-ready VLM like Qwen2.5-VL. You'll learn how TGI supports efficient inference through hardware-aware optimizations for Intel® XPUs.
For more information on products, pricing and solutions visit: https://ai.cloud.intel.com/
Why Text Generation Inference (TGI) for VLMs on Intel® Data Center GPU Max Series?
Text Generation Inference (TGI) by Hugging Face is a versatile, production-ready serving solution engineered specifically for Large Language Models (LLMs) and, increasingly, Vision-Language Models (VLMs). When deployed on Intel® GPUs, TGI delivers exceptional performance through hardware-specific optimizations:
Intel® Data Center GPU Max Series Architecture
| Feature | Specification | Benefit for VLM Inference |
|---|---|---|
| Xe Cores | Up to 128 Xe cores | Parallel processing of vision and language components |
| Memory | Up to 128GB HBM2e | Enables loading of larger vision-language models |
| Memory Bandwidth | Up to 3.2 TB/s | Faster data transfer for image processing and token generation |
| L2 Cache | Up to 408MB | Reduces memory access latency for attention mechanisms |
| XMX Engines | 16 per Xe core | Accelerates matrix multiplications in transformer architectures |
| Int8 Operations | Up to 256 ops/clock | Enables efficient quantized inference for VLMs |
TGI + Intel® XPU Technical Synergies
- Hardware-Aware Optimizations: The official TGI Docker images for Intel® XPUs are built with Intel® oneAPI Base Toolkit optimizations, leveraging SYCL and oneDNN for efficient tensor parallel operations that fully utilize the Xe architecture's capabilities.
- Specialized Memory Management: TGI's memory management is optimized for Intel®'s unified memory architecture, enabling efficient handling of both vision feature extraction and language generation tasks.
- Advanced Quantization Support: TGI on Intel® XPUs supports BF16 and INT8 quantization, using the XMX engines for substantial throughput gains while maintaining model accuracy (the exact impact depends on the model architecture).
- Continuous Batching Architecture: TGI implements a token-based scheduling system that maximizes GPU utilization by dynamically batching requests, achieving up to 3x higher throughput compared to static batching approaches.
- Optimized Attention Mechanisms: TGI uses Intel®-optimized attention implementations that improve performance for long sequences, especially when using advanced kernels like FlashAttention.
- Tensor Parallelism: For multi-GPU setups, TGI can distribute model layers across multiple Intel® GPUs, enabling inference for models larger than single-GPU memory capacity.
- VLM-Specific Pipeline Optimization: TGI's architecture efficiently handles the multimodal data flow required by VLMs, with optimized pipelines for image encoding and subsequent text generation.
This technical guide focuses on deploying TGI with VLMs on Intel® XPUs. For other vision model architectures (e.g., image classification, object detection) or when advanced graph compilation via OpenVINO™ is required, alternative serving solutions might be more appropriate depending on your specific performance requirements and deployment constraints.
Step 1: Prepare Your Environment
For this guide, we'll use an Intel® Max Series GPU environment. If you're using Intel® Tiber™ AI Cloud, follow these steps:
- Visit the Intel® Tiber™ AI Cloud
- Log into your account.
- Click Compute -> Instances -> Launch Instance from the menu at left.
- Select the instance type: Intel® Max Series GPU VM.
- Complete the instance configuration:
  - For Machine image, use the default.
  - Add an Instance name.
- Choose an option to connect:
  - One-Click connection (recommended)
  - Public Keys
- Click Launch to launch your instance.
If you're using Intel® Max Series GPUs in another environment, ensure you have proper access to the GPU and the required drivers installed.
Video: Intel® Tiber AI Cloud interface showing the instance launch process
Step 2: Connect to the instance
- Once the instance is ready, click Instance Name -> How to Connect via SSH -> Copy the SSH Command.
- Connect to the instance and check for available devices.
source /opt/intel/oneapi/setvars.sh  # activates the oneAPI environment
sycl-ls  # lists the available devices
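As an optional cross-check, the devices can also be enumerated from Python. The snippet below is a minimal sketch that assumes an environment with PyTorch XPU support (or Intel® Extension for PyTorch) is available; `sycl-ls` remains the authoritative listing.
# Optional cross-check: list XPU devices from Python.
# Assumes PyTorch with XPU support (or Intel® Extension for PyTorch)
# is installed; sycl-ls above is the authoritative device listing.
import torch

if hasattr(torch, "xpu") and torch.xpu.is_available():
    for i in range(torch.xpu.device_count()):
        print(f"XPU {i}: {torch.xpu.get_device_name(i)}")
else:
    print("No XPU devices visible from PyTorch; check drivers and the oneAPI environment.")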
Video: Terminal session showing device discovery and oneAPI environment setup on Intel® Max Series GPU
Step 3: Launch Container and Serve the Model
The Text Generation Inference (TGI) framework provides Intel® XPU-optimized containers that leverage the full capabilities of Intel® Data Center GPU Max Series hardware. For the latest version information, refer to the TGI Installation Guide for Intel.
1. Configure Environment Variables
The TGI team maintains Docker images specifically optimized for Intel® XPUs with Intel® oneAPI Base Toolkit acceleration libraries. We'll use a specific version for reproducibility in this guide, though you can also use the latest-intel-xpu tag for the most recent builds with the latest optimizations.
# Define container image and name
export DOCKER_IMAGE=ghcr.io/huggingface/text-generation-inference:3.2.0-intel-xpu
export CONTAINER_NAME=tgi-xpu-qwen-vl
# Optional: Configure XPU-specific environment variables
export SYCL_CACHE_PERSISTENT=1 # Enable persistent SYCL kernel cache for faster startup
export SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file" # Optimize register allocation
2. Launch the Container with Intel XPU Configuration
docker run -it \
--privileged \
--device=/dev/dri \
--network=host \
--shm-size=16g \
--env PREFIX_CACHING=0 \
--env SYCL_CACHE_PERSISTENT=1 \
--env SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file" \
--name $CONTAINER_NAME \
-v ${HF_CACHE_DIR:-$HOME/.cache/huggingface}:/root/.cache/huggingface:rw \
-v /tmp/sycl-cache:/tmp/sycl-cache:rw \
-e HF_HOME=/root/.cache/huggingface \
--entrypoint=/bin/bash \
$DOCKER_IMAGE
Technical Breakdown of Container Configuration
| Parameter | Value | Technical Purpose |
|---|---|---|
| `--privileged` | - | Grants extended privileges to access Intel® GPU hardware directly (for testing) |
| `--device=/dev/dri` | - | Exposes the Direct Rendering Infrastructure for GPU access |
| `--network=host` | - | Uses the host network stack for optimal performance without NAT overhead |
| `--shm-size` | 16g | Allocates shared memory for inter-process communication in TGI's worker architecture |
| `PREFIX_CACHING` | 0 | Disables KV-cache prefix optimization for more predictable latency |
| `SYCL_CACHE_PERSISTENT` | 1 | Enables persistent kernel caching to avoid JIT compilation overhead |
| `SYCL_PROGRAM_COMPILE_OPTIONS` | `-ze-opt-large-register-file` | Optimizes register allocation for transformer workloads |
| `-v /tmp/sycl-cache` | `/tmp/sycl-cache:rw` | Persists the SYCL kernel cache between container restarts |
The container configuration is specifically tuned for Intel XPU architecture. The /dev/dri device mapping provides direct access to the Intel GPU, while the SYCL environment variables optimize the oneAPI DPC++ compiler's behavior for transformer model inference. The shared memory allocation (16GB) is sized to accommodate the KV cache requirements for handling multiple concurrent requests with the Qwen2.5-VL model.
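To get a feel for how these limits relate to memory, the sketch below estimates the KV-cache footprint of a single request. The layer count, KV-head count, and head dimension are illustrative assumptions for a 7B-class decoder with grouped-query attention rather than exact Qwen2.5-VL values; substitute the figures from the model's config.json for a precise number.
# Back-of-the-envelope KV-cache sizing for a 7B-class decoder with
# grouped-query attention. The architecture numbers are illustrative
# assumptions, not exact Qwen2.5-VL values; read them from config.json.
NUM_LAYERS = 28       # decoder layers (assumed)
NUM_KV_HEADS = 4      # KV heads under grouped-query attention (assumed)
HEAD_DIM = 128        # per-head dimension (assumed)
BYTES_PER_VALUE = 2   # bfloat16

def kv_cache_bytes(total_tokens: int) -> int:
    # 2x for keys and values, per layer, per KV head, per head dimension.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * total_tokens

if __name__ == "__main__":
    per_request_gib = kv_cache_bytes(8192) / 1024**3
    print(f"~{per_request_gib:.2f} GiB of KV cache for one 8192-token request")
Multiplying this per-request figure by the number of long concurrent requests you expect gives a rough sense of how `--max-total-tokens` and `--max-concurrent-requests` interact with the available memory.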
3. Configure and Launch TGI Server with Intel-Optimized Parameters
Inside the container's bash prompt, we'll configure the model and launch parameters optimized for Intel XPU architecture:
# Define model and configuration
export MODEL_ID=Qwen/Qwen2.5-VL-7B-Instruct
export XPU_VISIBLE_DEVICES=0 # Target specific XPU if multiple are available
# Monitor XPU utilization in a separate terminal
xpu-smi stats -d 0 -b
Now, launch the TGI server with Intel® XPU-optimized parameters:
text-generation-launcher \
--model-id ${MODEL_ID} \
--dtype bfloat16 \
--max-batch-prefill-tokens 2048 \
--max-input-length 4096 \
--max-total-tokens 8192 \
--max-concurrent-requests 128 \
--sharded false \
--cuda-graphs 0 \
--port 8888
Technical Explanation of TGI Launch Parameters
| Parameter | Value | Technical Significance |
|---|---|---|
| `--model-id` | Qwen/Qwen2.5-VL-7B-Instruct | Specifies the multimodal model architecture |
| `--dtype` | bfloat16 | Uses Intel® XPU's native BF16 support for an optimal performance/accuracy tradeoff |
| `--max-batch-prefill-tokens` | 2048 | Configures the token batch size for the prefill phase to maximize XMX utilization |
| `--max-input-length` | 4096 | Sets the maximum context window for input (including image embeddings) |
| `--max-total-tokens` | 8192 | Defines the total token limit (input + output) per request |
| `--max-concurrent-requests` | 128 | Sets the request queue depth for Intel® XPU's parallel execution units |
| `--sharded` | false | Disables model sharding for single-GPU deployment |
| `--cuda-graphs` | 0 | Disables CUDA-specific optimizations that aren't applicable to Intel® XPUs |
The bfloat16 data type is particularly important for Intel® XPUs as it leverages the native BF16 support in the XMX matrix engines, providing up to 4x throughput improvement compared to FP32 while maintaining model accuracy. The batch configuration parameters are tuned to maximize utilization of the 128 Xe cores and 408MB L2 cache available in the Intel® Data Center GPU Max Series.
Wait for the model to download (if it's the first time) and for TGI to indicate it's ready to accept connections. You can monitor the XPU utilization in a separate terminal using the xpu-smi tool to ensure the hardware is being efficiently utilized.
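Rather than watching the logs by hand, you can also poll the server until it reports ready. The snippet below is a minimal sketch that assumes TGI's standard `/health` endpoint is reachable on the port configured above (8888).
# Minimal readiness probe for the TGI server launched above.
# Assumes the standard TGI /health endpoint is exposed on port 8888.
import time
import requests

def wait_for_tgi(base_url="http://127.0.0.1:8888", timeout_s=1800):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                print("TGI is ready to accept requests.")
                return True
        except requests.RequestException:
            pass  # Server still starting or model still downloading/loading.
        time.sleep(10)
    print("Timed out waiting for TGI to become ready.")
    return False

if __name__ == "__main__":
    wait_for_tgi()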
Video: Terminal session showing the TGI container launch and model loading process
Step 4: Test Model Outputs
Once the TGI server is running and the model is loaded, you can test it.
Using `curl`
The inputs field for VLMs in TGI typically expects image URLs or Base64-encoded images embedded in the prompt string using Markdown-like image syntax (`![](image_url_or_base64)`).
curl -N 127.0.0.1:8888/generate_stream \
-X POST \
-d '{"inputs":"What](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What) is this a picture of?\n\n","parameters":{"max_new_tokens":256, "seed": 42}}' \
-H 'Content-Type: application/json'
You should see a streamed JSON output with the model's description of the image.
Sample rabbit image used for testing the Vision-Language Model
# Actual model response (truncated for brevity)
data: {"index":64,"token":{"id":6109,"text":" scene","logprob":-1.15625,"special":false},"generated_text":null,"details":null}
data: {"index":65,"token":{"id":794,"text":" is","logprob":-1.0625,"special":false},"generated_text":null,"details":null}
data: {"index":66,"token":{"id":50005,"text":" reminiscent","logprob":-2.515625,"special":false},"generated_text":null,"details":null}
data: {"index":67,"token":{"id":319,"text":" of","logprob":-0.000000202656,"special":false},"generated_text":null,"details":null}
data: {"index":68,"token":{"id":260,"text":" a","logprob":-0.0078125,"special":false},"generated_text":null,"details":null}
data: {"index":69,"token":{"id":8038,"text":" science","logprob":-0.69921875,"special":false},"generated_text":null,"details":null}
data: {"index":70,"token":{"id":16909,"text":" fiction","logprob":-0.00418906,"special":false},"generated_text":null,"details":null}
...
data: {"index":86,"token":{"id":191083,"text":"<|endoftext|>","logprob":-0.859375,"special":true},"generated_text":"This is a picture of a rabbit dressed as an astronaut on the surface of Mars. The rabbit is wearing a detailed space suit with blue and white accents, complete with a helmet and various buttons and panels. The background features a reddish-brown landscape typical of Mars, with rocky formations and a desolate terrain. The image has a whimsical and fantastical quality, blending elements of science fiction with cute animal exploration.","details":null}
Example of the streaming JSON response from the TGI server
Using a Python Client (Optional)
For application integration, you'll likely use a Python client. Here's a simple example using the requests library; you can try it by running the script on the host:
#!/usr/bin/env python3
"""TGI Vision-Language Model Client for Intel XPUs

A simple client for interacting with Text Generation Inference (TGI)
serving Vision-Language Models on Intel GPUs.

Usage:
    python tgi_vlm_client.py
"""
import json
import requests


def query_vision_model(image_url, question, endpoint="http://127.0.0.1:8888"):
    """Query the deployed vision-language model with an image and question."""
    # Embed the image via TGI's Markdown-style image syntax, as in the curl example
    prompt = f"![]({image_url}){question}\n\n"

    # Set up the request
    url = f"{endpoint.rstrip('/')}/generate_stream"
    headers = {"Content-Type": "application/json"}
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 256,
            "seed": 42,
            "temperature": 0.7
        }
    }

    # Make the request
    with requests.post(url, json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()

        print(f"\nAnalyzing image: {image_url}")
        print(f"Question: {question}\n")
        print("Response: ", end="")

        # Track if we've received the final generated text
        full_response = ""

        for line in response.iter_lines():
            if not line:
                continue

            line = line.decode('utf-8')
            if not line.startswith("data:"):
                continue

            try:
                # Parse the JSON data
                json_data = json.loads(line[5:])  # Skip "data:" prefix

                # Check for token text
                if "token" in json_data and "text" in json_data["token"]:
                    token_text = json_data["token"]["text"]
                    print(token_text, end="", flush=True)

                # Check for final generated text (comes with the last token)
                if "generated_text" in json_data and json_data["generated_text"]:
                    full_response = json_data["generated_text"]
            except json.JSONDecodeError:
                pass

        print("\n\n--- Generation complete ---")

        # Print the full response if available
        if full_response:
            print(f"\nFull response: {full_response}")


# Simple example usage
if __name__ == "__main__":
    # Sample image and question
    image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
    question = "What is this a picture of?"

    # Query the model
    query_vision_model(image_url, question)
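If your test image lives on local disk rather than at a public URL, the same prompt syntax accepts Base64-encoded images, as noted earlier. The helper below is a small sketch of building a data URI for that purpose; the exact `data:<mime>;base64,...` form accepted may vary by TGI version, so verify it against the TGI documentation for your deployment.
# Sketch: embed a local image as a Base64 data URI in the prompt.
# The data URI form is an assumption; confirm against the TGI docs
# for the version you deployed.
import base64
import mimetypes

def image_to_data_uri(path):
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Example usage with the client above (hypothetical local file):
# query_vision_model(image_to_data_uri("rabbit.png"), "What is this a picture of?")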
Advanced Troubleshooting & Performance Optimization
Common Issues and Technical Solutions
| Issue | Technical Diagnosis | Resolution |
|---|---|---|
| Docker errors (pulling or running) | Container registry authentication or network connectivity issues | Verify the image name (`ghcr.io/huggingface/text-generation-inference:3.2.0-intel-xpu`) and check Docker registry connectivity with `docker info` |
| `sycl-ls` shows no devices | Intel GPU driver initialization failure or Intel® oneAPI Base Toolkit runtime issues | 1. Verify Intel Max Series GPU VM allocation<br>2. Check driver status: `ls -la /dev/dri`<br>3. Source the oneAPI environment: `source /opt/intel/oneapi/setvars.sh`<br>4. Verify the Level Zero driver: `ze_info` |
| Container fails to start | Resource allocation or device access permission issues | 1. Check Docker logs: `docker logs $CONTAINER_NAME`<br>2. Verify device permissions: `ls -la /dev/dri`<br>3. Check system resource limits: `ulimit -a` |
| Out of Memory (OOM) | Insufficient GPU memory or shared memory allocation | 1. Increase `--shm-size` to at least 16GB<br>2. Monitor memory with `xpu-smi dump -d 0 -m 8`<br>3. Consider model quantization or sharding<br>4. Try smaller batch sizes or sequence lengths |
| Model fails to load | Model architecture compatibility or resource constraints | 1. Check TGI logs within the container<br>2. Verify disk space: `df -h $HOME/.cache/huggingface`<br>3. Check model compatibility with Intel XPUs<br>4. Verify HF_TOKEN if using gated models |
| Slow inference performance | Suboptimal configuration or resource contention | 1. Ensure `--dtype bfloat16` is used<br>2. Monitor XPU utilization: `xpu-smi stats -d 0 -b`<br>3. Check memory bandwidth: `xpu-smi dump -d 0 -m 18`<br>4. Optimize batch size and prefill parameters<br>5. Enable kernel caching with `SYCL_CACHE_PERSISTENT=1` |
| SYCL compilation errors | Kernel compilation issues with the Intel® oneAPI Base Toolkit runtime | 1. Check SYCL compilation logs<br>2. Verify oneAPI version compatibility<br>3. Clear the SYCL cache: `rm -rf /tmp/sycl-cache/*`<br>4. Update to the latest Intel GPU driver |
Performance Optimization Techniques
- XPU Profiling: Use `xpu-smi` with custom metrics to identify bottlenecks:
xpu-smi dump -d 0 -m 0,1,10,13,14,15,8 -i 1
- Memory Hierarchy Optimization: Tune batch sizes to maximize L2 cache utilization (408MB on Max Series):
# Adjust these parameters based on your model size and request patterns
--max-batch-prefill-tokens 2048 \
--max-input-length 4096
- Kernel Optimization: Enable persistent kernel caching to eliminate JIT compilation overhead:
export SYCL_CACHE_PERSISTENT=1
export SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file"
- Concurrent Request Tuning: Optimize for Intel XPU's parallel execution units:
--max-concurrent-requests 128
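To confirm that a configuration change actually helps, measure end-to-end latency and generated tokens per second before and after tuning. The sketch below issues a single non-streaming request to TGI's `/generate` endpoint on the port configured earlier; it assumes the `details.generated_tokens` field is returned when `"details": true` is requested, and falls back to the requested token budget otherwise.
# Minimal before/after throughput check against the TGI /generate endpoint.
# Assumes the server from Step 3 is listening on port 8888.
import time
import requests

def measure(prompt, max_new_tokens=128, url="http://127.0.0.1:8888/generate"):
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "details": True, "seed": 42},
    }
    start = time.perf_counter()
    response = requests.post(url, json=payload, timeout=300)
    response.raise_for_status()
    elapsed = time.perf_counter() - start
    details = response.json().get("details") or {}
    tokens = details.get("generated_tokens", max_new_tokens)  # fall back if details absent
    print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tokens/s")

if __name__ == "__main__":
    measure("Describe the architecture of a vision-language model.\n\n")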
Conclusion: Accelerating AI Innovation with Intel® Technologies
Technical Advantages of Intel® Data Center GPU Max Series for VLM Inference
The Intel® Data Center GPU Max Series provides a compelling platform for Vision-Language Model inference workloads. With its architecture featuring up to 128 Xe cores, 408MB of L2 cache, and native BF16 support through XMX engines, these GPUs deliver the computational density required for complex multimodal AI tasks. The Intel® oneAPI Base Toolkit software stack, including SYCL and oneDNN, enables frameworks like TGI to use Intel hardware effectively while maintaining a standard programming model.
Properly configured Intel® XPUs can offer strong inference performance, particularly when the large L2 cache is leveraged for attention computations and the XMX engines handle the matrix operations that dominate transformer architectures.
Intel® Tiber™ AI Cloud
The Intel® Tiber™ AI Cloud provides access to Intel® hardware and software technologies, offering:
- Advanced Compute Options: Access to various compute instances including those with Intel® Data Center GPU Max Series accelerators
- Optimized Software Stack: Pre-configured environments with Intel®-optimized frameworks and libraries
- Comprehensive Resources: Tools and resources to help you work with Intel® technologies
- Flexible Usage Options: Various options to leverage Intel® hardware capabilities for your AI workloads
You can learn more about the platform at cloud.intel.com.
Intel® Liftoff Program: Accelerating AI Startups
For startups developing AI solutions, the Intel® Liftoff program provides specialized support to accelerate innovation:
- Technical Resources: Access to Intel®'s latest hardware and software technologies, including preferential project-based credits for Intel® Tiber AI Cloud
- Expert Mentorship: Guidance from Intel® engineers and AI specialists to optimize solutions
- Go-to-Market Support: Opportunities for co-marketing and ecosystem integration
- Community Access: Connection to a network of AI innovators and potential partners
The program helps AI startups access technical resources and optimize their solutions on Intel® hardware. Unlike traditional accelerators, Intel® Liftoff takes no equity and focuses on providing technical and infrastructure support.
For startups working on AI products: We invite you to apply to the Intel® Liftoff program. The program offers preferential project-based credits for Intel® Tiber AI Cloud and other benefits specifically designed to help AI startups scale their technical capabilities.
Next Steps
This technical guide demonstrates how to deploy VLMs using TGI on Intel® XPUs. You can explore further:
- Model Quantization: Further optimize performance with INT8 quantization using Intel®'s quantization tools
- Multi-GPU Scaling: Configure TGI for distributed inference across multiple Intel® GPUs
- Custom Vision Pipelines: Utilize OpenVINO™ inference engine for specialized vision preprocessing
- Production Deployment: Implement monitoring, rate-limiting, and high-availability configurations
By leveraging Intel®'s comprehensive AI stack, from hardware acceleration to optimized software, developers can build sophisticated vision-language applications with the performance and efficiency needed for their specific use cases.
Related resources
Intel® Tiber™ AI Cloud - Cloud platform for AI development and deployment
Intel® Gaudi® 2 AI accelerator - High-performance AI training processor designed for deep learning workloads
Intel® Data Center GPU Max Series - High-performance GPUs tailored for intense data center applications, designed to accelerate AI and HPC workloads
Intel® Xeon® CPU Max Series - A high-performance variant of Intel Xeon processors, designed to handle memory-intensive and compute-heavy tasks