I am trying to use vLLM with a Gaudi2 accelerator for LLM inference, using the vLLM code from the HabanaAI/vllm-fork repository. Following the instructions in the README, I pulled the Docker image and then built and installed vLLM inside the container.
When I tried to run vllm serve, a RuntimeError: synStatus=26 [Generic failure] Device acquire failed occurred. I will provide the detailed installation steps and the full error output below; please help me solve this problem.
These are the commands I used to pull the Docker image and run the container:
$ docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
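As a sanity check before building vLLM, the accelerators can be listed from inside the container (assuming hl-smi is included in this image, as it normally is in the Habana PyTorch images); the Gaudi2 devices should show up here:
$ hl-smi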
These are the commands I used to build vLLM:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ git checkout habana_main
$ pip install -r requirements-hpu.txt
$ python setup.py develop
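To confirm that the editable install is picked up, a quick import check can be run (a minimal sketch; the exact version string printed will depend on the checkout):
$ python -c "import vllm; print(vllm.__version__)"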
After that, I ran vllm serve with this command:
vllm serve Qwen2.5
This is the full output I got:
root@idc-training-gaudi-compute-01:/tmp# vllm serve Qwen2.5
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
return isinstance(object, types.FunctionType)
Detected capabilities: [-cpu -gaudi +gaudi2 -gaudi3 -index_reduce]
INFO 11-26 01:28:01 api_server.py:592] vLLM API server version 0.6.3.dev1139+g5eb8b1f7
INFO 11-26 01:28:01 api_server.py:593] args: Namespace(subparser='serve', model_tag='Qwen2.5', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen2.5', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', weights_load_device=None, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=128, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, use_padding_aware_scheduling=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_num_prefill_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f5e8ca29a20>)
INFO 11-26 01:28:01 __init__.py:31] No plugins found.
INFO 11-26 01:28:01 api_server.py:176] Multiprocessing frontend to use ipc:///tmp/6868efba-350e-46f7-80f0-c1504757c6fb for IPC Path.
INFO 11-26 01:28:01 api_server.py:195] Started engine process with PID 272
/usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
return isinstance(object, types.FunctionType)
Detected capabilities: [-cpu -gaudi +gaudi2 -gaudi3 -index_reduce]
INFO 11-26 01:28:06 __init__.py:31] No plugins found.
INFO 11-26 01:28:06 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
WARNING 11-26 01:28:06 arg_utils.py:1092] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-26 01:28:11 config.py:350] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
WARNING 11-26 01:28:11 arg_utils.py:1092] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 11-26 01:28:11 llm_engine.py:250] Initializing an LLM engine (v0.6.3.dev1139+g5eb8b1f7) with config: model='Qwen2.5', speculative_config=None, tokenizer='Qwen2.5', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, weights_load_device=hpu, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None, pooler_config=None)
INFO 11-26 01:28:12 __init__.py:31] No plugins found.
WARNING 11-26 01:28:12 utils.py:754] Pin memory is not supported on HPU.
INFO 11-26 01:28:12 selector.py:174] Using HPUAttention backend.
VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
VLLM_PROMPT_BS_BUCKET_MAX=256 (default:256)
VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
VLLM_DECODE_BS_BUCKET_MAX=256 (default:256)
VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_MAX=4096 (default:4096)
Prompt bucket config (min, step, max_warmup) bs:[1, 32, 256], seq:[128, 128, 1024]
Decode bucket config (min, step, max_warmup) bs:[1, 32, 256], block:[128, 128, 4096]
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi_common/hcl_device_control_factory.cpp::84(initDevice): The condition [ g_ibv.init(deviceConfig) == hcclSuccess ] failed. ibv initialization failed
ERROR 11-26 01:28:14 engine.py:369] synStatus=26 [Generic failure] Device acquire failed.
ERROR 11-26 01:28:14 engine.py:369] Traceback (most recent call last):
ERROR 11-26 01:28:14 engine.py:369] File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 360, in run_mp_engine
ERROR 11-26 01:28:14 engine.py:369] engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 11-26 01:28:14 engine.py:369] File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 120, in from_engine_args
ERROR 11-26 01:28:14 engine.py:369] return cls(ipc_path=ipc_path,
ERROR 11-26 01:28:14 engine.py:369] File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 72, in __init__
ERROR 11-26 01:28:14 engine.py:369] self.engine = LLMEngine(*args, **kwargs)
ERROR 11-26 01:28:14 engine.py:369] File "/root/vllm-fork/vllm/engine/llm_engine.py", line 347, in __init__
ERROR 11-26 01:28:14 engine.py:369] self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 11-26 01:28:14 engine.py:369] File "/root/vllm-fork/vllm/executor/executor_base.py", line 36, in __init__
ERROR 11-26 01:28:14 engine.py:369] self._init_executor()
ERROR 11-26 01:28:14 engine.py:369] File "/root/vllm-fork/vllm/executor/hpu_executor.py", line 28, in _init_executor
ERROR 11-26 01:28:14 engine.py:369] self._init_worker()
ERROR 11-26 01:28:14 engine.py:369] File "/root/vllm-fork/vllm/executor/hpu_executor.py", line 74, in _init_worker
ERROR 11-26 01:28:14 engine.py:369] self.driver_worker.init_device()
ERROR 11-26 01:28:14 engine.py:369] File "/root/vllm-fork/vllm/worker/hpu_worker.py", line 130, in init_device
ERROR 11-26 01:28:14 engine.py:369] torch.hpu.set_device(self.device)
ERROR 11-26 01:28:14 engine.py:369] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 308, in set_device
ERROR 11-26 01:28:14 engine.py:369] device_idx = _get_device_index(device, optional=True)
ERROR 11-26 01:28:14 engine.py:369] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/_utils.py", line 42, in _get_device_index
ERROR 11-26 01:28:14 engine.py:369] device_idx = hpu.current_device()
ERROR 11-26 01:28:14 engine.py:369] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 132, in current_device
ERROR 11-26 01:28:14 engine.py:369] init()
ERROR 11-26 01:28:14 engine.py:369] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 71, in init
ERROR 11-26 01:28:14 engine.py:369] _hpu_C.init()
ERROR 11-26 01:28:14 engine.py:369] RuntimeError: synStatus=26 [Generic failure] Device acquire failed.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 371, in run_mp_engine
raise e
File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 360, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 120, in from_engine_args
return cls(ipc_path=ipc_path,
File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 72, in __init__
self.engine = LLMEngine(*args, **kwargs)
File "/root/vllm-fork/vllm/engine/llm_engine.py", line 347, in __init__
self.model_executor = executor_class(vllm_config=vllm_config, )
File "/root/vllm-fork/vllm/executor/executor_base.py", line 36, in __init__
self._init_executor()
File "/root/vllm-fork/vllm/executor/hpu_executor.py", line 28, in _init_executor
self._init_worker()
File "/root/vllm-fork/vllm/executor/hpu_executor.py", line 74, in _init_worker
self.driver_worker.init_device()
File "/root/vllm-fork/vllm/worker/hpu_worker.py", line 130, in init_device
torch.hpu.set_device(self.device)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 308, in set_device
device_idx = _get_device_index(device, optional=True)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/_utils.py", line 42, in _get_device_index
device_idx = hpu.current_device()
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 132, in current_device
init()
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 71, in init
_hpu_C.init()
RuntimeError: synStatus=26 [Generic failure] Device acquire failed.
Traceback (most recent call last):
File "/usr/local/bin/vllm", line 33, in <module>
sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')())
File "/root/vllm-fork/vllm/scripts.py", line 201, in main
args.dispatch_function(args)
File "/root/vllm-fork/vllm/scripts.py", line 42, in serve
uvloop.run(run_server(args))
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/root/vllm-fork/vllm/entrypoints/openai/api_server.py", line 616, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/root/vllm-fork/vllm/entrypoints/openai/api_server.py", line 114, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/root/vllm-fork/vllm/entrypoints/openai/api_server.py", line 211, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
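For reference, device initialization can also be tested outside of vLLM with a minimal one-liner (a sketch assuming habana_frameworks.torch.hpu exposes is_available() and device_count(), mirroring the module that appears in the traceback above); this helps isolate whether the failure comes from the Habana runtime itself rather than from vLLM:
$ python -c "import habana_frameworks.torch.hpu as hthpu; print(hthpu.is_available(), hthpu.device_count())"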
Hi kunger,
Thank you for reaching out to us.
We apologize for the inconvenience you are experiencing. I have escalated your case to the appropriate team for further investigation. Once I receive their feedback, I will provide you with an update. We greatly appreciate your patience.
Regards,
Faiz
Hi Kunger,
According to our business team, the current state of the IDC console does not require a driver change; a driver upgrade is only required if you plan to pull code from the Habana Labs GitHub repository. Can we support you further?
Kind regards,