I have been following the Intel Gaudi driver and software installation guide, and I am interested in the Kubernetes operator installation path, where it is specifically mentioned that driver and software installation is not required if you are using the Intel Gaudi Base Operator for Kubernetes or OpenShift.
I have set up a Kubernetes cluster on a Gaudi VM and deployed the operator successfully, and all of its pods are running as expected.
However, when I try to deploy an LLM model, the pod fails with the error below:
Prompt bucket config (min, step, max_warmup) bs:[1, 32, 16], seq:[128, 128, 33024]
Decode bucket config (min, step, max_warmup) bs:[1, 32, 32], block:[128, 256, 8256]
ERROR 05-23 09:23:44 engine.py:381] synStatus=8 [Device not found] Device acquire failed. No devices found.
ERROR 05-23 09:23:44 engine.py:381] Traceback (most recent call last):
RuntimeError: synStatus=8 [Device not found] Device acquire failed. No devices found.
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 372, in run_mp_engine
ERROR 05-23 09:23:44 engine.py:381] engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 120, in from_engine_args
ERROR 05-23 09:23:44 engine.py:381] return cls(ipc_path=ipc_path,
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 72, in __init__
ERROR 05-23 09:23:44 engine.py:381] self.engine = LLMEngine(*args, **kwargs)
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/engine/llm_engine.py", line 271, in __init__
ERROR 05-23 09:23:44 engine.py:381] self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/executor/executor_base.py", line 43, in __init__
ERROR 05-23 09:23:44 engine.py:381] self._init_executor()
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/executor/uniproc_executor.py", line 39, in _init_executor
ERROR 05-23 09:23:44 engine.py:381] self.collective_rpc("init_device")
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
ERROR 05-23 09:23:44 engine.py:381] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/utils.py", line 2288, in run_method
ERROR 05-23 09:23:44 engine.py:381] return func(*args, **kwargs)
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/worker/hpu_worker.py", line 203, in init_device
ERROR 05-23 09:23:44 engine.py:381] torch.hpu.set_device(self.device)
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 337, in set_device
ERROR 05-23 09:23:44 engine.py:381] device_idx = _get_device_index(device, optional=True)
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/_utils.py", line 60, in _get_device_index
ERROR 05-23 09:23:44 engine.py:381] device_idx = hpu.current_device()
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 153, in current_device
ERROR 05-23 09:23:44 engine.py:381] _lazy_init()
ERROR 05-23 09:23:44 engine.py:381] File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 81, in _lazy_init
ERROR 05-23 09:23:44 engine.py:381] _hpu_C.init()
ERROR 05-23 09:23:44 engine.py:381] RuntimeError: synStatus=8 [Device not found] Device acquire failed. No devices found.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 832, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 796, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 125, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 219, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
Note: if I also install the driver and software explicitly on the node itself, everything works fine.
Why does this happen, and can I run my pod with only the operator deployed?
It appears that the pod doesn't have access to the Gaudi devices. This could be down to how the pod is being launched. Can you execute the following command in the pod to check whether the devices are exposed:

kubectl exec <pod-name> -- hl-smi

It would also be helpful to see the pod's configuration, which shows how it was launched:

kubectl describe pod <pod-name> -n <namespace>
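
A common cause of this symptom is that the pod spec never requests a Gaudi resource, so the device plugin deployed by the operator never mounts the accelerator devices into the container. A minimal sketch of what to look for in the `kubectl describe pod` output is below; the pod name and image are illustrative placeholders, and the exact resource name should be verified against what your node advertises in `kubectl describe node`:

```yaml
# Hypothetical example pod spec; image name and pod name are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-gaudi-example
spec:
  containers:
    - name: vllm
      image: my-vllm-gaudi-image:latest   # placeholder image
      resources:
        limits:
          # Requests one Gaudi accelerator from the Habana device plugin.
          # Without an entry like this, the container sees no HPU devices
          # and vLLM fails with "Device acquire failed. No devices found."
          habana.ai/gaudi: 1
```

If the `resources` section in your pod's spec has no `habana.ai/gaudi` entry, that would explain why it only works when the drivers are installed directly on the node.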
