I have been following the Intel Gaudi driver and software installation guide, and I am interested in the Kubernetes operator installation path, where it is specifically mentioned that driver and software installation is not required if you are using the Intel Gaudi Base Operator for Kubernetes or OpenShift.
I have set up a k8s cluster on a Gaudi VM, deployed the operator successfully, and all of its pods are running as expected.
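For reference, the workload I am deploying looks roughly like the sketch below (the image and model name are placeholders, not my exact values); the key part is that the container requests a habana.ai/gaudi resource, which the operator's device plugin should expose on the node:

apiVersion: v1
kind: Pod
metadata:
  name: vllm-gaudi
spec:
  containers:
    - name: vllm-server
      image: <my-vllm-gaudi-image>   # placeholder for the vLLM-for-Gaudi image
      command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
      args: ["--model", "<model-name>"]   # placeholder model identifier
      resources:
        limits:
          habana.ai/gaudi: 1   # Gaudi accelerator advertised by the Habana device plugin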
When I try to deploy an LLM model, the pod fails with the error below:
Prompt bucket config (min, step, max_warmup) bs:[1, 32, 16], seq:[128, 128, 33024]
Decode bucket config (min, step, max_warmup) bs:[1, 32, 32], block:[128, 256, 8256]
ERROR 05-23 09:23:44 engine.py:381] synStatus=8 [Device not found] Device acquire failed. No devices found.
ERROR 05-23 09:23:44 engine.py:381] Traceback (most recent call last):
RuntimeError: synStatus=8 [Device not found] Device acquire failed. No devices found.
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 372, in run_mp_engine
ERROR 05-23 09:23:44 engine.py:381] engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 120, in from_engine_args
ERROR 05-23 09:23:44 engine.py:381] return cls(ipc_path=ipc_path,
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 72, in __init__
ERROR 05-23 09:23:44 engine.py:381] self.engine = LLMEngine(*args, **kwargs)
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/engine/llm_engine.py", line 271, in __init__
ERROR 05-23 09:23:44 engine.py:381] self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/executor/executor_base.py", line 43, in __init__
ERROR 05-23 09:23:44 engine.py:381] self._init_executor()
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/executor/uniproc_executor.py", line 39, in _init_executor
ERROR 05-23 09:23:44 engine.py:381] self.collective_rpc("init_device")
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
ERROR 05-23 09:23:44 engine.py:381] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/utils.py", line 2288, in run_method
ERROR 05-23 09:23:44 engine.py:381] return func(*args, **kwargs)
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/worker/hpu_worker.py", line 203, in init_device
ERROR 05-23 09:23:44 engine.py:381] torch.hpu.set_device(self.device)
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 337, in set_device
ERROR 05-23 09:23:44 engine.py:381] device_idx = _get_device_index(device, optional=True)
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/_utils.py", line 60, in _get_device_index
ERROR 05-23 09:23:44 engine.py:381] device_idx = hpu.current_device()
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 153, in current_device
ERROR 05-23 09:23:44 engine.py:381] _lazy_init()
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 81, in _lazy_init
ERROR 05-23 09:23:44 engine.py:381] _hpu_C.init()
ERROR 05-23 09:23:44 engine.py:381] RuntimeError: synStatus=8 [Device not found] Device acquire failed. No devices found.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 832, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 796, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 125, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 219, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
Note: If I also install the driver and software explicitly on the node itself, then it works fine.
What explains this behavior, and can I run my pod with only the operator deployed?
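In case it helps narrow this down, this is how I am checking whether the node actually advertises the Gaudi devices when only the operator is deployed (the node name is a placeholder); if habana.ai/gaudi does not show up under the node's Capacity/Allocatable, the scheduler has no devices to hand to the pod:

# does the device plugin expose habana.ai/gaudi on the node?
kubectl describe node <gaudi-node> | grep -i habana.ai/gaudi
# full allocatable resource list for the node
kubectl get node <gaudi-node> -o jsonpath='{.status.allocatable}'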