Intel® Gaudi® AI Accelerator

RuntimeError: synStatus=8 [Device not found] Device acquire failed. No devices found.

vkumar4
Employee

I have been following the Intel Gaudi driver and software installation guide, and I am interested in the k8s operator installation. The guide specifically states that driver and software installation is not required if you are using the Intel Gaudi Base Operator for Kubernetes or OpenShift.

I have now set up a k8s cluster on a Gaudi VM and deployed the operator successfully. All of its pods are running as expected:

[Screenshot: vkumar4_0-1747993915095.png]

When I try to deploy an LLM, the pod fails with the error below:

Prompt bucket config (min, step, max_warmup) bs:[1, 32, 16], seq:[128, 128, 33024]
Decode bucket config (min, step, max_warmup) bs:[1, 32, 32], block:[128, 256, 8256]
ERROR 05-23 09:23:44 engine.py:381] synStatus=8 [Device not found] Device acquire failed. No devices found.
ERROR 05-23 09:23:44 engine.py:381] Traceback (most recent call last):
RuntimeError: synStatus=8 [Device not found] Device acquire failed. No devices found.
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 372, in run_mp_engine
ERROR 05-23 09:23:44 engine.py:381]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 120, in from_engine_args
ERROR 05-23 09:23:44 engine.py:381]     return cls(ipc_path=ipc_path,
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/engine/multiprocessing/engine.py", line 72, in __init__
ERROR 05-23 09:23:44 engine.py:381]     self.engine = LLMEngine(*args, **kwargs)
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/engine/llm_engine.py", line 271, in __init__
ERROR 05-23 09:23:44 engine.py:381]     self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/executor/executor_base.py", line 43, in __init__
ERROR 05-23 09:23:44 engine.py:381]     self._init_executor()
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/executor/uniproc_executor.py", line 39, in _init_executor
ERROR 05-23 09:23:44 engine.py:381]     self.collective_rpc("init_device")
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
ERROR 05-23 09:23:44 engine.py:381]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/utils.py", line 2288, in run_method
ERROR 05-23 09:23:44 engine.py:381]     return func(*args, **kwargs)
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/worker/hpu_worker.py", line 203, in init_device
ERROR 05-23 09:23:44 engine.py:381]     torch.hpu.set_device(self.device)
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 337, in set_device
ERROR 05-23 09:23:44 engine.py:381]     device_idx = _get_device_index(device, optional=True)
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/_utils.py", line 60, in _get_device_index
ERROR 05-23 09:23:44 engine.py:381]     device_idx = hpu.current_device()
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 153, in current_device
ERROR 05-23 09:23:44 engine.py:381]     _lazy_init()
ERROR 05-23 09:23:44 engine.py:381]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py", line 81, in _lazy_init
ERROR 05-23 09:23:44 engine.py:381]     _hpu_C.init()
ERROR 05-23 09:23:44 engine.py:381] RuntimeError: synStatus=8 [Device not found] Device acquire failed. No devices found.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 832, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 796, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 125, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post1+gaudi000-py3.10.egg/vllm/entrypoints/openai/api_server.py", line 219, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

Note: If I also install the driver and software explicitly on the node itself, it works fine.

Why does this happen, and can I run my pod with only the operator deployed?

James_Edwards
Employee

It appears that the pod doesn't have access to the Gaudi devices. This could be due to how the pod is being launched. Can you run the following command on the pod? It checks whether the devices are exposed:

kubectl exec <pod-name> -- hl-smi

Also, it would be helpful to get the pod's configuration information, which shows how it was launched:

kubectl describe pod <pod-name> -n <namespace>
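
A related check that may help narrow this down is whether the node itself advertises any Gaudi devices as an allocatable resource. The sketch below assumes the device plugin registers them under the resource name habana.ai/gaudi (an assumption, not confirmed in this thread). The helper name gaudi_allocatable is hypothetical. It only parses the "Allocatable" section of `kubectl describe node` output read from stdin.

```shell
# Hypothetical helper: prints the habana.ai/gaudi count found in the
# "Allocatable" section of `kubectl describe node` output (read on stdin).
# Prints 0 when the resource is not listed there.
# Assumption: the Gaudi device plugin uses the resource name habana.ai/gaudi.
gaudi_allocatable() {
  awk '
    /^Allocatable:/ { in_alloc = 1; next }   # start of the Allocatable section
    /^[^ \t]/       { in_alloc = 0 }         # any other top-level key ends it
    in_alloc && $1 == "habana.ai/gaudi:" { print $2; found = 1 }
    END { if (!found) print 0 }
  '
}

# Usage against a live cluster (the node name is a placeholder):
#   kubectl describe node <node-name> | gaudi_allocatable
```

If this prints 0, the node is not exposing any Gaudi devices to the scheduler. If it prints a non-zero count, the next thing to look at in the `kubectl describe pod` output is whether the container actually requests that resource under its limits.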
