
Deploying Deepseek models on Intel® Gaudi® accelerators using vLLM

Krzysztof_W_Intel

vLLM has supported the DeepSeek architecture since Intel Gaudi software release 1.21.0, and we want to show you how to use it.

  • DeepSeek is a model that combines Mixture of Experts (MoE) with Multi-Head Latent Attention (MLA). Its weights are stored natively in FP8 format with block quantization scales for efficiency.
  • It is available in two variants, both of which share the same architecture and memory footprint:
    • V3: Standard model.
    • R1: Reasoning-optimized model.
  • Compatible with both Intel® Gaudi® 3 and Gaudi® 2 accelerators.

Single-Node Setup Guide for DeepSeek

Step 1: Install firmware and software stack
These instructions are based on release 1.22.0.

How to update the Gaudi driver

How to update the Gaudi firmware



Step 2: Start the Docker image

docker run -it --name deepseek_server --runtime=habana \
   -e HABANA_VISIBLE_DEVICES=all \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   --cap-add=sys_nice \
   --net=host \
   --ipc=host \
   vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
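
Optionally, once you are inside the container, you can confirm that all Gaudi devices are visible. The hl-smi tool ships with the Gaudi software stack; this is just a quick sanity check, not a required step:

hl-smi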

Step 3: Download and install vLLM

git clone -b "v1.22.0" https://github.com/HabanaAI/vllm-fork.git
pip install vllm-fork/
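
To quickly verify the installation (a minimal sanity check, assuming the step above finished without errors):

python3 -c "import vllm; print(vllm.__version__)"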

Step 4: Prepare the environment
A) Mandatory setup:

export PT_HPU_ENABLE_LAZY_COLLECTIVES="true"
export PT_HPU_WEIGHT_SHARING=0
export PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1
export PT_HPU_LAZY_MODE=1

Note:  As of Intel Gaudi software version 1.22.0, DeepSeek supports torch.compile mode, so setting PT_HPU_LAZY_MODE=1 is no longer required.

B) Optional setup (for advanced users):

For better performance and a shorter warmup time, consider tuning the key performance knobs outlined in README_GAUDI.md.

C) Gaudi 2-specific:

VLLM_HPU_CONVERT_TO_FP8UZ – only applicable to Gaudi 2. It rescales the weights to accommodate the different FP8 limits on Gaudi 2. A more detailed explanation can be found in the Intel Gaudi software documentation. Alternatively, the conversion script can be used to convert the weights offline.
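
For example, it can be set before launching the server on Gaudi 2 (assuming, as with the other flags in this guide, that a value of 1 enables it):

export VLLM_HPU_CONVERT_TO_FP8UZ=1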

D) Environment variables and flags introduced in v1.22.0:

VLLM_HPU_FORCE_CHANNEL_FP8 – forces per-channel quantization (using dynamic quantization) instead of block quantization. Enabled by default.

--enable-expert-parallel – enables expert parallelism (EP) instead of tensor parallelism (TP) for the MoE part. Omitting this flag reduces performance and may cause graph compilation failures.


Step 5: Start vLLM

Note: The --model argument can be either a Hugging Face model name, in which case the model is downloaded, or a local path, which speeds up loading.

Option 1: Run with dynamic quantization:

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8688 \
   --block-size 128 \
   --model deepseek-ai/DeepSeek-R1 \
   --tensor-parallel-size 8 \
   --trust-remote-code  \
   --max-model-len 2048 \
   --max-num-seqs 128 \
   --gpu_memory_utilization 0.9 \
   --enable-expert-parallel

The original model with block quantization can be run similarly by adding VLLM_HPU_FORCE_CHANNEL_FP8=0, as shown below.
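
For example (the same flags as in Option 1, with only the environment variable added):

VLLM_HPU_FORCE_CHANNEL_FP8=0 \
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8688 \
   --block-size 128 \
   --model deepseek-ai/DeepSeek-R1 \
   --tensor-parallel-size 8 \
   --trust-remote-code \
   --max-model-len 2048 \
   --max-num-seqs 128 \
   --gpu_memory_utilization 0.9 \
   --enable-expert-parallel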

Parameter explanation:

  • --block-size 128 – block size 128 is recommended due to the Gaudi architecture.
  • --model – either a model name in Hugging Face format (which is downloaded) or a local path.
  • --tensor-parallel-size 8 – specifies the type of parallelism to use and the number of devices.
    Note: Although the flag refers to tensor parallelism, expert parallelism (EP) is used for the MoE components.
  • --max-model-len – the maximum context length. Additionally, it sets --max-num-batched-tokens to the same value when a model with MLA is run.
  • --max-num-seqs – sets the maximum number of sequences. *
  • --max-num-batched-tokens – decides how many tokens can be processed simultaneously across all sequences. This differs from --max-num-seqs, particularly during the prompt phase, where all input tokens across the batch are processed at once. For example, if every request has 1024 input tokens, setting this parameter to 8192 limits the number of prompts processed simultaneously to 8, since 8192/1024 = 8. It does not affect the decode “batch size”, because --max-num-seqs is smaller than --max-num-batched-tokens (see the short sanity check after this list).
  • --gpu_memory_utilization – specifies the memory margin left after loading the model and running a profiling pass.

* Strictly speaking, this should not be called batch size, since vLLM uses continuous batching, but to keep things simple I call it batch size here.
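
As a quick sanity check on the arithmetic above (the prompt length and token budget are the example values, not recommendations):

# Example values from the explanation above
MAX_NUM_BATCHED_TOKENS=8192
PROMPT_LEN=1024
# Maximum number of prompts prefilled together during the prompt phase
echo $(( MAX_NUM_BATCHED_TOKENS / PROMPT_LEN ))   # prints 8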

Option 2: Run with static quantization:

A) Get the calibration file
Refer to the vllm-hpu-extension calibration README to obtain the measurement files.

B) Start vLLM

QUANT_CONFIG=<path to quant config>  \
VLLM_HPU_FORCE_CHANNEL_FP8=0 \
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8688 \
   --block-size 128 \
   --model deepseek-ai/DeepSeek-R1 \
   --tensor-parallel-size 8 \
   --trust-remote-code  \
   --max-model-len 8192 \
   --max-num-seqs 128 \
   --gpu_memory_utilization 0.9 \
   --enable-expert-parallel

The vLLM server is ready to serve when the log below is displayed:

Starting vLLM API server on http://0.0.0.0:8688
Available routes are:
Route: /openapi.json, Methods: HEAD, GET

Validating vLLM Deployment

To check if vLLM processes requests correctly, you can send a sample inference request using cURL.


Step 1: Get a list of models

curl http://localhost:8688/v1/models

The output should look similar to the following:

{
  "object": "list",
  "data": [
    {
      "id": "/path/to /DeepSeek-R1/",
      "object": "model",
...

Step 2: Send an example request using the model from the previous step

curl http://localhost:8688/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": “<model name>",
        "prompt": "Gdansk is a",
        "max_tokens": 32,
        "temperature": 0
    }'
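
If you prefer not to copy the model name by hand, a small sketch that pulls it from the /v1/models endpoint first (assumes the jq tool is available):

MODEL=$(curl -s http://localhost:8688/v1/models | jq -r '.data[0].id')
curl http://localhost:8688/v1/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${MODEL}\", \"prompt\": \"Gdansk is a\", \"max_tokens\": 32, \"temperature\": 0}"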

Step 3: Check the vLLM performance with the benchmark_serving.py script

A) Log in to the same container

docker exec -it deepseek_server /bin/bash

B) Change directory

cd vllm-fork/benchmarks

C) Run the following command

python3 benchmark_serving.py --backend vllm \
   --model <model path> \
   --trust-remote-code \
   --host 0.0.0.0  \
   --port 8688  \
   --dataset-name random \
   --random-input-len 1024 \
   --random-output-len 1024 \
   --random-range-ratio 0 \
   --max_concurrency 128 \
   --num-prompts 128 \
   --request-rate inf \
   --seed 0 \
   --ignore_eos 

The output should look similar to the following:

============ Serving Benchmark Result ============
Successful requests:                     256
Benchmark duration (s):                  x
Total input tokens:                      261888
Total generated tokens:                  262144
Request throughput (req/s):              x
Output token throughput (tok/s):         x
Total Token throughput (tok/s):          x

This command calls the standard vLLM serving benchmark to check vLLM throughput. Both the input and output lengths are 1K tokens.

Step 4: Check the vLLM accuracy

We use lm_eval to measure model accuracy. The following example demonstrates how to run it on the GSM8K dataset and can be adapted to any other supported dataset.

A) Install lm_eval

pip install "lm_eval[api]"

B) Run lm_eval

Change the model path, vLLM IP address, or port in the command below if required:

lm_eval --model local-completions \
   --tasks gsm8k \
   --model_args model=<model path>,base_url=http://127.0.0.1:8688/v1/completions \
   --batch_size 16 \
   --log_samples \
   --output_path ./lm_eval_output

The batch size affects execution time and memory utilization.

Multi-Node Setup Guide for DeepSeek-R1 671B with Pipeline Parallelism

This guide shows how to deploy a multi-node setup using pipeline parallelism. This approach splits the model layer-wise and places half of it on each node. Alternatively, the same goal can be accomplished with Disaggregated Prefilling, where one node produces the Key-Value (KV) cache and the other node consumes it.

Prerequisites:

Set up the environment as in steps 1-3 in the Single-Node setup.

Step 1 (optional): Run the HCCL demo test.
Using the assigned IPs on the two nodes (16 HPUs), ensure that the HCCL demo test passes. Follow the instructions provided in the HCCL Demo Guide.

Step 2: Install vLLM on both nodes

git clone -b "v1.22.0" https://github.com/HabanaAI/vllm-fork.git
pip install vllm-fork/

Step 3: Configure the environment

A) Set the IP address and NIC interface name

export GLOO_SOCKET_IFNAME=eth0
export HCCL_SOCKET_IFNAME=eth0
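
If you are not sure which NIC name to use, list the IPv4 addresses per interface and pick the one that carries the node's IP (eth0 above is only an example):

ip -o -4 addr show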

B) Configure the environment as in the Single-Node scenario, step 4. Make sure that you have the same configuration on both nodes.

Delayed sampling is not supported with Pipeline parallelism. It can be turned off by using:

export VLLM_DELAYED_SAMPLING=false

Step 4: Start the Ray cluster

A) Start Ray on the head node

ray start --head --node-ip-address=<HEAD_NODE_IP> --port=<PORT>

Example:

ray start --head --node-ip-address=192.168.1.101 --port=8850

B) Start Ray on the worker node

ray start --address='<HEAD_NODE_IP>:<PORT>'

Example:

ray start --address='192.168.1.101:8850'
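
C) Optionally, verify that both nodes have joined the cluster by running the following on the head node:

ray status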

Step 5: Start vLLM on the head node

python -m vllm.entrypoints.openai.api_server \
    --host 192.168.1.101 \
    --port 8688 \
    --model deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --max-num-seqs 128 \
    --block-size 128 \
    --max-model-len 8192 \
    --distributed_executor_backend ray \
    --gpu_memory_utilization 0.9 \
    --enable-expert-parallel \
    --trust_remote_code

The vLLM server is ready to serve when the log below is displayed:

Starting vLLM API server on http://0.0.0.0:8688
Available routes are:
Route: /openapi.json, Methods: HEAD, GET
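
The deployment can then be validated in the same way as in the single-node case, this time pointing at the head node's address (IP taken from the example above):

curl http://192.168.1.101:8688/v1/models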


Additional Information

How to update the Gaudi driver

How to update the Gaudi firmware

vLLM Gaudi readme


Notices & Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure. 

Your costs and results may vary. 

© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.