Authors: Rajashekar Kasturi (Senior AI Engineer, Intel® Liftoff) and Rahul Unnikrishnan Nair (Head of Engineering, Intel® Liftoff)
This technical guide, written by the Intel® Liftoff technical mentors, demonstrates how to deploy and serve state-of-the-art VLMs using Text Generation Inference (TGI) with hardware-specific optimizations for performant inference on Intel® XPUs. The examples in this guide use the Intel® Tiber™ AI Cloud environment, but the techniques and optimizations can be applied to Intel® XPUs in various deployment scenarios.
Intel®'s Data Center GPU Max Series (XPU) features up to 128 Xe cores and over 100 billion transistors in a single package. These GPUs are designed to accelerate AI workloads such as vision-language models, using features like Intel® Xe Matrix Extensions (XMX) and large L2 cache configurations (up to 408MB).
In this technical walk-through, we'll guide you through the complete workflow of setting up your Intel® Max Series GPU virtual machine, launching a TGI container specifically optimized for Intel® XPUs with Intel® oneAPI Base Toolkit acceleration, and serving a production-ready VLM like Qwen2.5-VL. You'll learn how TGI supports efficient inference through hardware-aware optimizations for Intel® XPUs.
For more information on products, pricing and solutions visit: https://ai.cloud.intel.com/
Why Text Generation Inference (TGI) for VLMs on Intel® Data Center GPU Max Series?
Text Generation Inference (TGI) by Hugging Face is a versatile, production-ready serving solution engineered specifically for Large Language Models (LLMs) and, increasingly, Vision-Language Models (VLMs). When deployed on Intel® GPUs, TGI delivers exceptional performance through hardware-specific optimizations:
Intel® Data Center GPU Max Series Architecture
| Feature | Specification | Benefit for VLM Inference |
|---|---|---|
| Xe Cores | Up to 128 Xe cores | Parallel processing of vision and language components |
| Memory | Up to 128GB HBM2e | Enables loading of larger vision-language models |
| Memory Bandwidth | Up to 3.2 TB/s | Faster data transfer for image processing and token generation |
| L2 Cache | Up to 408MB | Reduces memory access latency for attention mechanisms |
| XMX Engines | 16 per Xe core | Accelerates matrix multiplications in transformer architectures |
| Int8 Operations | Up to 256 ops/clock | Enables efficient quantized inference for VLMs |
TGI + Intel® XPU Technical Synergies
- Hardware-Aware Optimizations: The official TGI Docker images for Intel® XPUs are built with Intel® oneAPI Base Toolkit optimizations, leveraging SYCL and oneDNN for efficient tensor parallel operations that fully utilize the Xe architecture's capabilities.
- Specialized Memory Management: TGI's memory management is optimized for Intel®'s unified memory architecture, enabling efficient handling of both vision feature extraction and language generation tasks.
- Advanced Quantization Support: TGI on Intel® XPUs supports BF16 and INT8 quantization, using the XMX engines for substantial throughput gains while maintaining model accuracy (the exact impact depends on the model architecture).
- Continuous Batching Architecture: TGI implements a token-based scheduling system that maximizes GPU utilization by dynamically batching requests, achieving up to 3x higher throughput compared to static batching approaches.
- Optimized Attention Mechanisms: TGI uses Intel®-optimized attention implementations that improve performance for long sequences, especially when using advanced kernels like FlashAttention.
- Tensor Parallelism: For multi-GPU setups, TGI can distribute model layers across multiple Intel® GPUs, enabling inference for models larger than single-GPU memory capacity.
- VLM-Specific Pipeline Optimization: TGI's architecture efficiently handles the multimodal data flow required by VLMs, with optimized pipelines for image encoding and subsequent text generation.
This technical guide focuses on deploying TGI with VLMs on Intel® XPUs. For other vision model architectures (e.g., image classification, object detection) or when advanced graph compilation via OpenVINO™ is required, alternative serving solutions might be more appropriate depending on your specific performance requirements and deployment constraints.
Step 1: Prepare Your Environment
For this guide, we'll use an Intel® Max Series GPU environment. If you're using Intel® Tiber™ AI Cloud, follow these steps:
- Visit the Intel® Tiber™ AI Cloud
- Log into your account.
- Click Compute -> Instances -> Launch Instance from the menu at left.
- Select the instance type: Intel® Max Series GPU VM.
- Complete the instance configuration:
  - For Machine image, use the default.
  - Add an Instance name.
- Choose an option to connect:
  - One-Click connection (recommended)
  - Public Keys
- Click Launch to launch your instance.
If you're using Intel® Max Series GPUs in another environment, ensure you have proper access to the GPU and the required drivers installed.
Video: Intel® Tiber AI Cloud interface showing the instance launch process
Step 2: Connect to the instance
- Once the instance is ready, click Instance Name -> How to Connect via SSH -> Copy the SSH Command.
- Connect to the instance and check for available devices.
source /opt/intel/oneapi/setvars.sh  # activates the oneAPI environment
sycl-ls  # lists the available devices
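As an optional cross-check, the devices can also be enumerated from Python. The snippet below is a minimal sketch that assumes an environment with PyTorch XPU support (or Intel® Extension for PyTorch) is available; `sycl-ls` remains the authoritative listing.
# Optional cross-check: list XPU devices from Python.
# Assumes PyTorch with XPU support (or Intel® Extension for PyTorch)
# is installed; sycl-ls above is the authoritative device listing.
import torch

if hasattr(torch, "xpu") and torch.xpu.is_available():
    for i in range(torch.xpu.device_count()):
        print(f"XPU {i}: {torch.xpu.get_device_name(i)}")
else:
    print("No XPU devices visible from PyTorch; check drivers and the oneAPI environment.")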
Video: Terminal session showing device discovery and oneAPI environment setup on Intel® Max Series GPU
Step 3: Launch Container and Serve the Model
The Text Generation Inference (TGI) framework provides Intel® XPU-optimized containers that leverage the full capabilities of Intel® Data Center GPU Max Series hardware. For the latest version information, refer to the TGI Installation Guide for Intel.
1. Configure Environment Variables
The TGI team maintains Docker images specifically optimized for Intel® XPUs with Intel® oneAPI Base Toolkit acceleration libraries. We'll use a specific version for reproducibility in this guide, though you can also use the latest-intel-xpu tag for the most recent builds with the latest optimizations.
# Define container image and name
export DOCKER_IMAGE=ghcr.io/huggingface/text-generation-inference:3.2.0-intel-xpu
export CONTAINER_NAME=tgi-xpu-qwen-vl
# Optional: Configure XPU-specific environment variables
export SYCL_CACHE_PERSISTENT=1 # Enable persistent SYCL kernel cache for faster startup
export SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file" # Optimize register allocation
2. Launch the Container with Intel XPU Configuration
docker run -it \
--privileged \
--device=/dev/dri \
--network=host \
--shm-size=16g \
--env PREFIX_CACHING=0 \
--env SYCL_CACHE_PERSISTENT=1 \
--env SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file" \
--name $CONTAINER_NAME \
-v ${HF_CACHE_DIR:-$HOME/.cache/huggingface}:/root/.cache/huggingface:rw \
-v /tmp/sycl-cache:/tmp/sycl-cache:rw \
-e HF_HOME=/root/.cache/huggingface \
--entrypoint=/bin/bash \
$DOCKER_IMAGE
Technical Breakdown of Container Configuration
| Parameter | Value | Technical Purpose |
|---|---|---|
| `--privileged` | - | Grants extended privileges to access Intel® GPU hardware directly (for testing) |
| `--device=/dev/dri` | - | Exposes the Direct Rendering Infrastructure for GPU access |
| `--network=host` | - | Uses the host network stack for optimal performance without NAT overhead |
| `--shm-size` | 16g | Allocates shared memory for inter-process communication in TGI's worker architecture |
| `PREFIX_CACHING` | 0 | Disables KV-cache prefix optimization for more predictable latency |
| `SYCL_CACHE_PERSISTENT` | 1 | Enables persistent kernel caching to avoid JIT compilation overhead |
| `SYCL_PROGRAM_COMPILE_OPTIONS` | `-ze-opt-large-register-file` | Optimizes register allocation for transformer workloads |
| `-v /tmp/sycl-cache` | `/tmp/sycl-cache:rw` | Persists the SYCL kernel cache between container restarts |
The container configuration is specifically tuned for Intel XPU architecture. The /dev/dri device mapping provides direct access to the Intel GPU, while the SYCL environment variables optimize the oneAPI DPC++ compiler's behavior for transformer model inference. The shared memory allocation (16GB) is sized to accommodate the KV cache requirements for handling multiple concurrent requests with the Qwen2.5-VL model.
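To get a feel for how these limits relate to memory, the sketch below estimates the KV-cache footprint of a single request. The layer count, KV-head count, and head dimension are illustrative assumptions for a 7B-class decoder with grouped-query attention rather than exact Qwen2.5-VL values; substitute the figures from the model's config.json for a precise number.
# Back-of-the-envelope KV-cache sizing for a 7B-class decoder with
# grouped-query attention. The architecture numbers are illustrative
# assumptions, not exact Qwen2.5-VL values; read them from config.json.
NUM_LAYERS = 28       # decoder layers (assumed)
NUM_KV_HEADS = 4      # KV heads under grouped-query attention (assumed)
HEAD_DIM = 128        # per-head dimension (assumed)
BYTES_PER_VALUE = 2   # bfloat16

def kv_cache_bytes(total_tokens: int) -> int:
    # 2x for keys and values, per layer, per KV head, per head dimension.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * total_tokens

if __name__ == "__main__":
    per_request_gib = kv_cache_bytes(8192) / 1024**3
    print(f"~{per_request_gib:.2f} GiB of KV cache for one 8192-token request")
Multiplying this per-request figure by the number of long concurrent requests you expect gives a rough sense of how `--max-total-tokens` and `--max-concurrent-requests` interact with the available memory.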
3. Configure and Launch TGI Server with Intel-Optimized Parameters
Inside the container's bash prompt, we'll configure the model and launch parameters optimized for Intel XPU architecture:
# Define model and configuration
export MODEL_ID=Qwen/Qwen2.5-VL-7B-Instruct
export XPU_VISIBLE_DEVICES=0 # Target specific XPU if multiple are available
# Monitor XPU utilization in a separate terminal
xpu-smi stats -d 0 -b
Now, launch the TGI server with Intel® XPU-optimized parameters:
text-generation-launcher \
--model-id ${MODEL_ID} \
--dtype bfloat16 \
--max-batch-prefill-tokens 2048 \
--max-input-length 4096 \
--max-total-tokens 8192 \
--max-concurrent-requests 128 \
--sharded false \
--cuda-graphs 0 \
--port 8888
Technical Explanation of TGI Launch Parameters
| Parameter | Value | Technical Significance |
|---|---|---|
| `--model-id` | Qwen/Qwen2.5-VL-7B-Instruct | Specifies the multimodal model architecture |
| `--dtype` | bfloat16 | Uses Intel® XPU's native BF16 support for an optimal performance/accuracy tradeoff |
| `--max-batch-prefill-tokens` | 2048 | Configures the token batch size for the prefill phase to maximize XMX utilization |
| `--max-input-length` | 4096 | Sets the maximum context window for input (including image embeddings) |
| `--max-total-tokens` | 8192 | Defines the total token limit (input + output) per request |
| `--max-concurrent-requests` | 128 | Sets the request queue depth for Intel® XPU's parallel execution units |
| `--sharded` | false | Disables model sharding for single-GPU deployment |
| `--cuda-graphs` | 0 | Disables CUDA-specific optimizations that aren't applicable to Intel® XPUs |
The bfloat16 data type is particularly important for Intel® XPUs as it leverages the native BF16 support in the XMX matrix engines, providing up to 4x throughput improvement compared to FP32 while maintaining model accuracy. The batch configuration parameters are tuned to maximize utilization of the 128 Xe cores and 408MB L2 cache available in the Intel® Data Center GPU Max Series.
Wait for the model to download (if it's the first time) and for TGI to indicate it's ready to accept connections. You can monitor the XPU utilization in a separate terminal using the xpu-smi tool to ensure the hardware is being efficiently utilized.
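Rather than watching the logs by hand, you can also poll the server until it reports ready. The snippet below is a minimal sketch that assumes TGI's standard `/health` endpoint is reachable on the port configured above (8888).
# Minimal readiness probe for the TGI server launched above.
# Assumes the standard TGI /health endpoint is exposed on port 8888.
import time
import requests

def wait_for_tgi(base_url="http://127.0.0.1:8888", timeout_s=1800):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                print("TGI is ready to accept requests.")
                return True
        except requests.RequestException:
            pass  # Server still starting or model still downloading/loading.
        time.sleep(10)
    print("Timed out waiting for TGI to become ready.")
    return False

if __name__ == "__main__":
    wait_for_tgi()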
Video: Terminal session showing the TGI container launch and model loading process
Step 4: Test Model Outputs
Once the TGI server is running and the model is loaded, you can test it.
Using `curl`
The inputs field for VLMs in TGI typically expects image URLs or Base64-encoded images embedded in the prompt string using Markdown-like image syntax (`![](image_url_or_base64)`).
curl -N 127.0.0.1:8888/generate_stream \
-X POST \
-d '{"inputs":"What](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What) is this a picture of?\n\n","parameters":{"max_new_tokens":256, "seed": 42}}' \
-H 'Content-Type: application/json'
You should see a streamed JSON output with the model's description of the image.
Sample rabbit image used for testing the Vision-Language Model
# Actual model response (truncated for brevity)
data: {"index":64,"token":{"id":6109,"text":" scene","logprob":-1.15625,"special":false},"generated_text":null,"details":null}
data: {"index":65,"token":{"id":794,"text":" is","logprob":-1.0625,"special":false},"generated_text":null,"details":null}
data: {"index":66,"token":{"id":50005,"text":" reminiscent","logprob":-2.515625,"special":false},"generated_text":null,"details":null}
data: {"index":67,"token":{"id":319,"text":" of","logprob":-0.000000202656,"special":false},"generated_text":null,"details":null}
data: {"index":68,"token":{"id":260,"text":" a","logprob":-0.0078125,"special":false},"generated_text":null,"details":null}
data: {"index":69,"token":{"id":8038,"text":" science","logprob":-0.69921875,"special":false},"generated_text":null,"details":null}
data: {"index":70,"token":{"id":16909,"text":" fiction","logprob":-0.00418906,"special":false},"generated_text":null,"details":null}
...
data: {"index":86,"token":{"id":191083,"text":"<|endoftext|>","logprob":-0.859375,"special":true},"generated_text":"This is a picture of a rabbit dressed as an astronaut on the surface of Mars. The rabbit is wearing a detailed space suit with blue and white accents, complete with a helmet and various buttons and panels. The background features a reddish-brown landscape typical of Mars, with rocky formations and a desolate terrain. The image has a whimsical and fantastical quality, blending elements of science fiction with cute animal exploration.","details":null}
Example of the streaming JSON response from the TGI server
Using a Python Client (Optional)
For application integration, you'll likely use a Python client. Here's a simple example using the requests library; you can try it by running the script on the host:
#!/usr/bin/env python3
"""TGI Vision-Language Model Client for Intel XPUs

A simple client for interacting with Text Generation Inference (TGI)
serving Vision-Language Models on Intel GPUs.

Usage:
    python tgi_vlm_client.py
"""
import json
import requests


def query_vision_model(image_url, question, endpoint="http://127.0.0.1:8888"):
    """Query the deployed vision-language model with an image and question."""
    # Embed the image via TGI's Markdown-style image syntax, as in the curl example
    prompt = f"![]({image_url}){question}\n\n"

    # Set up the request
    url = f"{endpoint.rstrip('/')}/generate_stream"
    headers = {"Content-Type": "application/json"}
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 256,
            "seed": 42,
            "temperature": 0.7
        }
    }

    # Make the request
    with requests.post(url, json=payload, headers=headers, stream=True) as response:
        response.raise_for_status()

        print(f"\nAnalyzing image: {image_url}")
        print(f"Question: {question}\n")
        print("Response: ", end="")

        # Track if we've received the final generated text
        full_response = ""

        for line in response.iter_lines():
            if not line:
                continue

            line = line.decode('utf-8')
            if not line.startswith("data:"):
                continue

            try:
                # Parse the JSON data
                json_data = json.loads(line[5:])  # Skip "data:" prefix

                # Check for token text
                if "token" in json_data and "text" in json_data["token"]:
                    token_text = json_data["token"]["text"]
                    print(token_text, end="", flush=True)

                # Check for final generated text (comes with the last token)
                if "generated_text" in json_data and json_data["generated_text"]:
                    full_response = json_data["generated_text"]
            except json.JSONDecodeError:
                pass

        print("\n\n--- Generation complete ---")

        # Print the full response if available
        if full_response:
            print(f"\nFull response: {full_response}")


# Simple example usage
if __name__ == "__main__":
    # Sample image and question
    image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
    question = "What is this a picture of?"

    # Query the model
    query_vision_model(image_url, question)
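If your test image lives on local disk rather than at a public URL, the same prompt syntax accepts Base64-encoded images, as noted earlier. The helper below is a small sketch of building a data URI for that purpose; the exact `data:<mime>;base64,...` form accepted may vary by TGI version, so verify it against the TGI documentation for your deployment.
# Sketch: embed a local image as a Base64 data URI in the prompt.
# The data URI form is an assumption; confirm against the TGI docs
# for the version you deployed.
import base64
import mimetypes

def image_to_data_uri(path):
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# Example usage with the client above (hypothetical local file):
# query_vision_model(image_to_data_uri("rabbit.png"), "What is this a picture of?")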
Advanced Troubleshooting & Performance Optimization
Common Issues and Technical Solutions
| Issue | Technical Diagnosis | Resolution |
|---|---|---|
| Docker errors (pulling or running) | Container registry authentication or network connectivity issues | Verify the image name (`ghcr.io/huggingface/text-generation-inference:3.2.0-intel-xpu`) and check Docker registry connectivity with `docker info` |
| `sycl-ls` shows no devices | Intel GPU driver initialization failure or Intel® oneAPI Base Toolkit runtime issues | 1. Verify Intel Max Series GPU VM allocation<br>2. Check driver status: `ls -la /dev/dri`<br>3. Source the oneAPI environment: `source /opt/intel/oneapi/setvars.sh`<br>4. Verify the Level Zero driver: `ze_info` |
| Container fails to start | Resource allocation or device access permission issues | 1. Check Docker logs: `docker logs $CONTAINER_NAME`<br>2. Verify device permissions: `ls -la /dev/dri`<br>3. Check system resource limits: `ulimit -a` |
| Out of Memory (OOM) | Insufficient GPU memory or shared memory allocation | 1. Increase `--shm-size` to at least 16GB<br>2. Monitor memory with `xpu-smi dump -d 0 -m 8`<br>3. Consider model quantization or sharding<br>4. Try smaller batch sizes or sequence lengths |
| Model fails to load | Model architecture compatibility or resource constraints | 1. Check TGI logs within the container<br>2. Verify disk space: `df -h $HOME/.cache/huggingface`<br>3. Check model compatibility with Intel XPUs<br>4. Verify HF_TOKEN if using gated models |
| Slow inference performance | Suboptimal configuration or resource contention | 1. Ensure `--dtype bfloat16` is used<br>2. Monitor XPU utilization: `xpu-smi stats -d 0 -b`<br>3. Check memory bandwidth: `xpu-smi dump -d 0 -m 18`<br>4. Optimize batch size and prefill parameters<br>5. Enable kernel caching with `SYCL_CACHE_PERSISTENT=1` |
| SYCL compilation errors | Kernel compilation issues with the Intel® oneAPI Base Toolkit runtime | 1. Check SYCL compilation logs<br>2. Verify oneAPI version compatibility<br>3. Clear the SYCL cache: `rm -rf /tmp/sycl-cache/*`<br>4. Update to the latest Intel GPU driver |
Performance Optimization Techniques
- XPU Profiling: Use `xpu-smi` with custom metrics to identify bottlenecks:
xpu-smi dump -d 0 -m 0,1,10,13,14,15,8 -i 1
- Memory Hierarchy Optimization: Tune batch sizes to maximize L2 cache utilization (408MB on Max Series):
# Adjust these parameters based on your model size and request patterns
--max-batch-prefill-tokens 2048 \
--max-input-length 4096
- Kernel Optimization: Enable persistent kernel caching to eliminate JIT compilation overhead:
export SYCL_CACHE_PERSISTENT=1
export SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file"
- Concurrent Request Tuning: Optimize for Intel XPU's parallel execution units:
--max-concurrent-requests 128
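To confirm that a configuration change actually helps, measure end-to-end latency and generated tokens per second before and after tuning. The sketch below issues a single non-streaming request to TGI's `/generate` endpoint on the port configured earlier; it assumes the `details.generated_tokens` field is returned when `"details": true` is requested, and falls back to the requested token budget otherwise.
# Minimal before/after throughput check against the TGI /generate endpoint.
# Assumes the server from Step 3 is listening on port 8888.
import time
import requests

def measure(prompt, max_new_tokens=128, url="http://127.0.0.1:8888/generate"):
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "details": True, "seed": 42},
    }
    start = time.perf_counter()
    response = requests.post(url, json=payload, timeout=300)
    response.raise_for_status()
    elapsed = time.perf_counter() - start
    details = response.json().get("details") or {}
    tokens = details.get("generated_tokens", max_new_tokens)  # fall back if details absent
    print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tokens/s")

if __name__ == "__main__":
    measure("Describe the architecture of a vision-language model.\n\n")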
Conclusion: Accelerating AI Innovation with Intel® Technologies
Technical Advantages of Intel® Data Center GPU Max Series for VLM Inference
The Intel® Data Center GPU Max Series provides a compelling platform for Vision-Language Model inference workloads. With its architecture featuring up to 128 Xe cores, 408MB of L2 cache, and native BF16 support through XMX engines, these GPUs deliver the computational density required for complex multimodal AI tasks. The Intel® oneAPI Base Toolkit software stack, including SYCL and oneDNN, enables frameworks like TGI to use Intel hardware effectively while maintaining a standard programming model.
Properly configured Intel® XPUs can offer strong inference performance, particularly when the large L2 cache is leveraged for attention computations and the XMX engines handle the matrix operations that dominate transformer architectures.
Intel® Tiber™ AI Cloud
The Intel® Tiber™ AI Cloud provides access to Intel® hardware and software technologies, offering:
- Advanced Compute Options: Access to various compute instances including those with Intel® Data Center GPU Max Series accelerators
- Optimized Software Stack: Pre-configured environments with Intel®-optimized frameworks and libraries
- Comprehensive Resources: Tools and resources to help you work with Intel® technologies
- Flexible Usage Options: Various options to leverage Intel® hardware capabilities for your AI workloads
You can learn more about the platform at cloud.intel.com.
Intel® Liftoff Program: Accelerating AI Startups
For startups developing AI solutions, the Intel® Liftoff program provides specialized support to accelerate innovation:
- Technical Resources: Access to Intel®'s latest hardware and software technologies, including preferential project-based credits for Intel® Tiber AI Cloud
- Expert Mentorship: Guidance from Intel® engineers and AI specialists to optimize solutions
- Go-to-Market Support: Opportunities for co-marketing and ecosystem integration
- Community Access: Connection to a network of AI innovators and potential partners
The program helps AI startups access technical resources and optimize their solutions on Intel® hardware. Unlike traditional accelerators, Intel® Liftoff takes no equity and focuses on providing technical and infrastructure support.
For startups working on AI products: We invite you to apply to the Intel® Liftoff program. The program offers preferential project-based credits for Intel® Tiber AI Cloud and other benefits specifically designed to help AI startups scale their technical capabilities.
Next Steps
This technical guide demonstrates how to deploy VLMs using TGI on Intel® XPUs. You can explore further:
- Model Quantization: Further optimize performance with INT8 quantization using Intel®'s quantization tools
- Multi-GPU Scaling: Configure TGI for distributed inference across multiple Intel® GPUs
- Custom Vision Pipelines: Utilize OpenVINO™ inference engine for specialized vision preprocessing
- Production Deployment: Implement monitoring, rate-limiting, and high-availability configurations
By leveraging Intel®'s comprehensive AI stack, from hardware acceleration to optimized software, developers can build sophisticated vision-language applications with the performance and efficiency needed for their specific use cases.
Related resources
Intel® Tiber™ AI Cloud - Cloud platform for AI development and deployment
Intel® Gaudi® 2 AI accelerator - High-performance AI training processor designed for deep learning workloads
Intel® Data Center GPU Max Series - High-performance GPUs tailored for intense data center applications, designed to accelerate AI and HPC workloads
Intel® Xeon® CPU Max Series - A high-performance variant of Intel Xeon processors, designed to handle memory-intensive and compute-heavy tasks