Intel® Gaudi® AI Accelerator
Support for the Intel® Gaudi® AI Accelerator

Error starting containers using habana-container-runtime

Gera_Dmz
Employee

I am experiencing an issue where I am unable to access Gaudi accelerators when creating a Docker container using the Habana runtime.

 

Steps to reproduce:

  1. Installed Gaudi drivers & Software.
  2. Built binaries.
  3. Configured both /etc/docker/daemon.json & /etc/containerd/config.toml.
  4. Ran docker run --rm --runtime=habana -e HABANA_VISIBLE_DEVICES=all ubuntu:22.04 /bin/bash -c "ls /dev/accel/*" and got:

 

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exposing interfaces: failed creating temporary link on host: invalid argument
exit status 1: unknown.

 

  • Tried docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest but also got:

 

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exposing interfaces: failed creating temporary link on host: invalid argument
exit status 1: unknown.

 

  • After removing -e HABANA_VISIBLE_DEVICES=all, I'm able to exec into the container, but the accelerators are not visible inside it:

 

# hl-smi
habanalabs driver is not loaded or no AIPs available, aborting...
# ls /dev/accel
ls: cannot access '/dev/accel': No such file or directory

 

OS
Ubuntu 22.04.4 LTS

Kernel Version
5.15.0-117-generic

Container Runtime Type/Version
1.19.1

K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS)
Docker version 27.5.0

Extra logs and files
From the host machine:

 

$ hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.19.1-fw-57.2.2.0          |
| Driver Version:                                     1.19.1-6f47ddd          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncor-Events|
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:33:00.0     N/A |                   0  |
| N/A   24C   N/A  88W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:9a:00.0     N/A |                   0  |
| N/A   25C   N/A  92W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   2  HL-225              N/A  | 0000:34:00.0     N/A |                   0  |
| N/A   26C   N/A  76W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   3  HL-225              N/A  | 0000:9b:00.0     N/A |                   0  |
| N/A   27C   N/A 102W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   4  HL-225              N/A  | 0000:4d:00.0     N/A |                   0  |
| N/A   27C   N/A  90W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   5  HL-225              N/A  | 0000:4e:00.0     N/A |                   0  |
| N/A   25C   N/A  82W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   6  HL-225              N/A  | 0000:b4:00.0     N/A |                   0  |
| N/A   25C   N/A  65W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   7  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   27C   N/A  84W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+

$ tail -n 1 /var/log/habana-container-runtime.log
{"time":"2025-01-22T22:10:20.545416796Z","level":"INFO","msg":"file does not exist on host: /etc/habanalabs/gaudinet.json"}

$ tail -n 1 /var/log/habana-container-hook.log
{"time":"2025-01-22T22:10:20.569909471Z","level":"ERROR","msg":"exposing interfaces: failed creating temporary link on host: invalid argument"}

 

22 Replies
James_Edwards
Employee

This looks like an issue with your installation or configuration of the habana-container-runtime. On the system, what is the output of:

`dpkg -l | grep habanalabs-container-runtime`

I am particularly interested in the version.

 

Also, if you are using Docker, remove the containerd toml file and check your Docker configuration file again. The /etc/docker/daemon.json file should look similar to this:

 

{
   "default-runtime": "habana",
   "runtimes": {
      "habana": {
         "path": "/usr/bin/habana-container-runtime",
         "runtimeArgs": []
      }
   }
}

Then restart the Docker service:

sudo systemctl restart docker
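Before restarting, a quick sanity check of the file's shape can catch typos; this is a minimal sketch (assuming the path and keys shown in the config above):

```python
import json

def check_habana_runtime(config_text):
    """Verify a Docker daemon.json registers habana as a runtime (and as the default)."""
    cfg = json.loads(config_text)
    habana = cfg.get("runtimes", {}).get("habana", {})
    return {
        "registered": habana.get("path") == "/usr/bin/habana-container-runtime",
        "is_default": cfg.get("default-runtime") == "habana",
    }

sample = '''{
   "default-runtime": "habana",
   "runtimes": {
      "habana": {
         "path": "/usr/bin/habana-container-runtime",
         "runtimeArgs": []
      }
   }
}'''
print(check_habana_runtime(sample))
```

In practice you would pass it the contents of /etc/docker/daemon.json; json.loads will also raise immediately on any syntax error in the file.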

 

Gera_Dmz
Employee

Thanks @James_Edwards for replying. The installed version is:

 

$ dpkg -l | grep habanalabs-container-runtime
ii  habanalabs-container-runtime              1.19.1-26                                   amd64        HABANA container runtime

 

 

/etc/docker/daemon.json is as follows:

 

$ cat /etc/docker/daemon.json
{"runtimes": {"habana": {"path": "/usr/bin/habana-container-runtime", "runtimeArgs": []}}, "default-runtime": "habana"}

 

 

Removing /etc/containerd/config.toml did help; I'm now able to exec into the container using docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest. But inside the container I'm still getting:

 

# hl-smi
habanalabs driver is not loaded or no AIPs available, aborting...

 

 

James_Edwards
Employee

1) Check the output of the habana-container-hook.log and the habana-container-runtime.log again to see if there are any other errors associated with the Docker server. (By the way, the gaudinet.json file is not required for single-Gaudi nodes, and that INFO message is not important. The ERROR message in the habana-container-hook.log was a problem, however.)

2) Start the Docker container with the -d option and then run 'docker logs <container id>' to see if there are any container errors on startup.

3) On the host, get the permissions on the Gaudi accelerator devices: 'ls -l /dev/accel'
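For step 3, the check can also be scripted; this is an illustrative helper (not part of the Habana tooling) that lists the nodes under /dev/accel that are character devices with world read/write access (the crw-rw-rw- pattern the runtime expects):

```python
import os
import stat

def accessible_accel_devices(dev_dir="/dev/accel"):
    """List character devices under dev_dir that are world-readable and writable."""
    if not os.path.isdir(dev_dir):
        return []
    ok = []
    for name in sorted(os.listdir(dev_dir)):
        st = os.stat(os.path.join(dev_dir, name))
        # require a character device with rw permission for "other" (crw-rw-rw-)
        if stat.S_ISCHR(st.st_mode) and (st.st_mode & 0o006) == 0o006:
            ok.append(name)
    return ok

print(accessible_accel_devices())
```

On a healthy 8-device host this should list accel0..accel7 plus the accel_controlD* nodes; an empty list means the devices are missing or locked down.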


Gera_Dmz
Employee

1) Checking habana-container-hook.log, I now see this message (INFO level rather than an error):

$ tail -n 1 /var/log/habana-container-hook.log
{"time":"2025-02-10T22:44:50.45307364Z","level":"INFO","msg":"device already exists in namespace. Host network used?"}

 2) I don't get any logs when using the detached option:

$ docker run -d --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest sleep infinity
b8ca5e99989bef171fde1355f9e40c78f77914c96184cfa52e3f877c95990981
$ docker ps
CONTAINER ID   IMAGE                                                                                       COMMAND            CREATED         STATUS         PORTS     NAMES
b8ca5e99989b   vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest   "sleep infinity"   3 seconds ago   Up 3 seconds             romantic_williams
$ docker logs b8ca5e99989b

3) The permissions are:

$ ls -l /dev/accel/
total 0
crw-rw-rw- 1 root root 509,  0 Jan 14 23:23 accel0
crw-rw-rw- 1 root root 509,  2 Jan 14 23:23 accel1
crw-rw-rw- 1 root root 509,  4 Jan 14 23:23 accel2
crw-rw-rw- 1 root root 509,  6 Jan 14 23:23 accel3
crw-rw-rw- 1 root root 509,  8 Jan 14 23:23 accel4
crw-rw-rw- 1 root root 509, 10 Jan 14 23:23 accel5
crw-rw-rw- 1 root root 509, 12 Jan 14 23:23 accel6
crw-rw-rw- 1 root root 509, 14 Jan 14 23:23 accel7
crw-rw-rw- 1 root root 509,  1 Jan 14 23:23 accel_controlD0
crw-rw-rw- 1 root root 509,  3 Jan 14 23:23 accel_controlD1
crw-rw-rw- 1 root root 509,  5 Jan 14 23:23 accel_controlD2
crw-rw-rw- 1 root root 509,  7 Jan 14 23:23 accel_controlD3
crw-rw-rw- 1 root root 509,  9 Jan 14 23:23 accel_controlD4
crw-rw-rw- 1 root root 509, 11 Jan 14 23:23 accel_controlD5
crw-rw-rw- 1 root root 509, 13 Jan 14 23:23 accel_controlD6
crw-rw-rw- 1 root root 509, 15 Jan 14 23:23 accel_controlD7

 

James_Edwards
Employee

The device permissions all seem good. I was also able to run your docker command and get all devices (I used the 1.19.0 version of the container runtime). This still seems like an issue with the configuration of the container runtime. What does this command show:

docker inspect <container_id> | grep Runtime

If you don't get a string that says "Runtime": "habana", there is still a configuration error. Otherwise, post the entire output of docker inspect so I can look at it.
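Instead of grepping, the same check can be done by parsing the inspect output as JSON; a sketch (field names as emitted by docker inspect):

```python
import json

def container_runtime(inspect_json):
    """Extract HostConfig.Runtime from `docker inspect <id>` output."""
    data = json.loads(inspect_json)
    containers = data if isinstance(data, list) else [data]
    return [c.get("HostConfig", {}).get("Runtime") for c in containers]

# abbreviated sample in the shape docker inspect emits (a JSON array of containers)
sample = '[{"Id": "eb0bf40f6b1d", "HostConfig": {"Runtime": "habana"}}]'
print(container_runtime(sample))  # ['habana']
```

This avoids false matches on unrelated fields such as CpuRealtimeRuntime that a plain grep for "Runtime" also picks up.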

Gera_Dmz
Employee

I do see "Runtime": "habana":

$ docker ps
CONTAINER ID   IMAGE                                                                                       COMMAND            CREATED              STATUS              PORTS     NAMES
eb0bf40f6b1d   vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest   "sleep infinity"   About a minute ago   Up About a minute             boring_lehmann
$ docker inspect eb0bf40f6b1d | grep Runtime
            "Runtime": "habana",
            "CpuRealtimeRuntime": 0,

Inspecting the container:

$ docker inspect eb0bf40f6b1d
[
    {
        "Id": "eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce",
        "Created": "2025-02-11T16:06:30.690094085Z",
        "Path": "sleep",
        "Args": [
            "infinity"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 31171,
            "ExitCode": 0,11:30 AM 2/11/2025
            "Error": "",
            "StartedAt": "2025-02-11T16:06:33.601305388Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },
        "Image": "sha256:1d0c9dbfacfdffb8ed752b9f56519c3290645bb4ec81542a7256feb18f65c9a3",
        "ResolvConfPath": "/var/lib/docker/containers/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce/hostname",
        "HostsPath": "/var/lib/docker/containers/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce/hosts",
        "LogPath": "/var/lib/docker/containers/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce-json.log",
        "Name": "/boring_lehmann",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "docker-default",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": null,
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "host",
            "PortBindings": {},
            "RestartPolicy": {
                "Name": "no",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "ConsoleSize": [
                49,
                189
            ],
            "CapAdd": [
                "sys_nice"
            ],
            "CapDrop": null,
            "CgroupnsMode": "private",
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "host",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": [
                "label=disable"
            ],
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "habana",
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": [],
            "BlkioDeviceReadBps": [],
            "BlkioDeviceWriteBps": [],
            "BlkioDeviceReadIOps": [],
            "BlkioDeviceWriteIOps": [],
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": [],
            "DeviceCgroupRules": null,
            "DeviceRequests": null,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": null,
            "OomKillDisable": null,
            "PidsLimit": null,
            "Ulimits": [],
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": [
                "/proc/asound",
                "/proc/acpi",
                "/proc/kcore",
                "/proc/keys",
                "/proc/latency_stats",
                "/proc/timer_list",
                "/proc/timer_stats",
                "/proc/sched_debug",
                "/proc/scsi",
                "/sys/firmware",
                "/sys/devices/virtual/powercap"
            ],
            "ReadonlyPaths": [
                "/proc/bus",
                "/proc/fs",
                "/proc/irq",
                "/proc/sys",
                "/proc/sysrq-trigger"
            ]
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/99f176c4457ecbc3eb35beb6ae2e7c8c98c5897c6a74ed95d4a617f239051f84-init/diff:/var/lib/docker/overlay2/84a1dd5c7734190f9d1d61036a16e11c84e61574e09e64eec206d0c32b48054a/diff:/var/lib/docker/overlay2/f927604ccd4f437148a7c90d3e4e6f42266d84bd4ef415e807e18ccaf67362d5/diff:/var/lib/docker/overlay2/4ef1a9882c291c23760de9b0a74e196e230f1f7c37f61ec38db07539f4bd7f78/diff:/var/lib/docker/overlay2/720dd7a239d971dfc9243be25f090392ab5d48825897aef448d1b3a867c397d5/diff:/var/lib/docker/overlay2/65ff9290ebd144007a656959a1bbf2aa2c147fef0bc0f2af441bcd7077c53a18/diff:/var/lib/docker/overlay2/9b6f21580d4adc9da99557c895d6f271116421007d4a7a1c23d7f238ce41dd57/diff:/var/lib/docker/overlay2/3436c6314fa980e97b2f1fad0df95cb41caf626a7f776852405964dbc76d8d6a/diff:/var/lib/docker/overlay2/e426528d3b3300f3087fb6d1a59f72a243ed57a722f576b4f94f8574ef6e9e28/diff:/var/lib/docker/overlay2/169b6c543d8a5bd1ca87e26c1c2050f8e95635eb8e26122c97a6811cefa434d5/diff:/var/lib/docker/overlay2/1333522faba816e0b61ac33b484328ad8d8422c797c48470b3adabba890660eb/diff:/var/lib/docker/overlay2/d646b22eeb7590a13ed22a6fbb909ff702580e6e728609a597e451af0d5e03cf/diff:/var/lib/docker/overlay2/1062b70b19dff8afe284d195299e81826f0ce5f20004b42406320a05e48d9ca4/diff:/var/lib/docker/overlay2/93a7eb0237d95369d64caa402907dc4b16f82ccced8423bafbc9a8aa26e5eeda/diff:/var/lib/docker/overlay2/32a07cfc772d82b895a6fcd4019d007501d51f3bbc91ede16f3a7592a207e63e/diff",
                "MergedDir": "/var/lib/docker/overlay2/99f176c4457ecbc3eb35beb6ae2e7c8c98c5897c6a74ed95d4a617f239051f84/merged",
                "UpperDir": "/var/lib/docker/overlay2/99f176c4457ecbc3eb35beb6ae2e7c8c98c5897c6a74ed95d4a617f239051f84/diff",
                "WorkDir": "/var/lib/docker/overlay2/99f176c4457ecbc3eb35beb6ae2e7c8c98c5897c6a74ed95d4a617f239051f84/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [],
        "Config": {
            "Hostname": "ng-nx6n7s4yyi-6f812",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "HABANA_VISIBLE_DEVICES=all",
                "OMPI_MCA_btl_vader_single_copy_mechanism=none",
                "PATH=/opt/habanalabs/libfabric-1.22.0/bin:/opt/amazon/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "DEBIAN_FRONTEND=noninteractive",
                "GC_KERNEL_PATH=/usr/lib/habanalabs/libtpc_kernels.so",
                "HABANA_LOGS=/var/log/habana_logs/",
                "OS_NUMBER=2204",
                "HABANA_SCAL_BIN_PATH=/opt/habanalabs/engines_fw",
                "HABANA_PLUGINS_LIB_PATH=/opt/habanalabs/habana_plugins",
                "PIP_NO_CACHE_DIR=on",
                "PIP_DEFAULT_TIMEOUT=1000",
                "PIP_DISABLE_PIP_VERSION_CHECK=1",
                "LIBFABRIC_VERSION=1.22.0",
                "LIBFABRIC_ROOT=/opt/habanalabs/libfabric-1.22.0",
                "MPI_ROOT=/opt/amazon/openmpi",
                "LD_LIBRARY_PATH=/opt/habanalabs/libfabric-1.22.0/lib:/opt/amazon/openmpi/lib:/usr/lib/habanalabs:",
                "OPAL_PREFIX=/opt/amazon/openmpi",
                "MPICC=/opt/amazon/openmpi/bin/mpicc",
                "RDMAV_FORK_SAFE=1",
                "FI_EFA_USE_DEVICE_RDMA=1",
                "RDMA_CORE_ROOT=/opt/habanalabs/rdma-core/src",
                "RDMA_CORE_LIB=/opt/habanalabs/rdma-core/src/build/lib",
                "PYTHONPATH=/root:/usr/lib/habanalabs/",
                "LD_PRELOAD=/lib/x86_64-linux-gnu/libtcmalloc.so.4",
                "TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=7516192768"
            ],
            "Cmd": [
                "sleep",
                "infinity"
            ],
            "Image": "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": null,
            "OnBuild": null,
            "Labels": {
                "org.opencontainers.image.ref.name": "ubuntu",
                "org.opencontainers.image.version": "22.04"
            }
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "2663a081212073036fdafcefd1d0d214b1f4d0023adfc9c20f49e9ff3f4dbec4",
            "SandboxKey": "/var/run/docker/netns/default",
            "Ports": {},
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "host": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "MacAddress": "",
                    "DriverOpts": null,
                    "NetworkID": "528b711e870bfc399f943e2db9499cd9993d5441e6ac2bec38e54a4862f39af1",
                    "EndpointID": "ff60fa70e73a35564bb4f73bd2bc34c86c7d0466f27bd90f346aa2d969604538",
                    "Gateway": "",
                    "IPAddress": "",
                    "IPPrefixLen": 0,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "DNSNames": null
                }
            }
        }
    }
]

Thanks again for the support.

James_Edwards
Employee

I have been unable to find anything wrong with the docker container configuration you sent me; it is nearly identical to that of a working container I have executed. The only remaining step I can think of is to install the 1.19.1 habanalabs-container-runtime and see if I can reproduce your issue. In the meantime, the only possible solution I can suggest is to install the new 1.19.2 Gaudi software stack (released yesterday) and see if that resolves your issue.

Gera_Dmz
Employee

Hello @James_Edwards , we've upgraded the Gaudi SW stack to 1.19.2. Unfortunately, the behavior is the same; we're still getting:

# hl-smi
habanalabs driver is not loaded or no AIPs available, aborting...
Gera_Dmz
Employee

Okay, it'd be great if you can try to replicate the error. From our side, we can upgrade to 1.19.2.

James_Edwards
Employee

I updated a system to the 1.19.2 version of Intel Gaudi software and configured the habanalabs-container-runtime (at version 1.19.2-32) as specified in the comments above. I started a docker container with the following command:

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.2/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest

All devices were available in the container and hl-smi gave no errors. In short, I was unable to reproduce the problem with the latest software.

Try executing the docker command with the --privileged flag, without using the habana runtime:

docker run -it --privileged --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.2/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest

I know this is brute force, but it will tell us whether the runtime is the issue.

 

Gera_Dmz
Employee

Interesting. Running:

docker run -it  --privileged  --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.2/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest

I'm actually able to run hl-smi inside the container:

# hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.19.2-fw-57.2.4.0          |
| Driver Version:                                     1.19.2-ff37fea          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncor-Events|
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:33:00.0     N/A |                   0  |
| N/A   20C   N/A  86W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:9a:00.0     N/A |                   0  |
| N/A   21C   N/A  92W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   2  HL-225              N/A  | 0000:9b:00.0     N/A |                   0  |
| N/A   24C   N/A 100W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   3  HL-225              N/A  | 0000:34:00.0     N/A |                   0  |
| N/A   23C   N/A  75W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   4  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   23C   N/A  82W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   5  HL-225              N/A  | 0000:4d:00.0     N/A |                   0  |
| N/A   23C   N/A  93W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   6  HL-225              N/A  | 0000:4e:00.0     N/A |                   0  |
| N/A   22C   N/A  80W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   7  HL-225              N/A  | 0000:b4:00.0     N/A |                   0  |
| N/A   21C   N/A  62W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+

Could this be related to how the SW stack was installed? We didn't use sudo for it, though.

James_Edwards
Employee

This tells me that your Intel Gaudi software drivers are installed correctly, have the correct permissions, and are available to Docker. It seems like your problem is with the habana container runtime and how it initializes the devices in the container's environment. It could be how the container runtime was installed and configured, but we reinstalled it and checked the configuration of a container; nothing looked wrong. I will check with the ACE team for next steps.

Gera_Dmz
Employee

Hello @James_Edwards , do you have any update or new insight on the matter? Did the ACE team provide support?

James_Edwards
Employee

Sorry for not updating the post sooner. The ACE team had never encountered this issue before and was unable to reproduce it either. I will see if I can get debugging guidance from them.

James_Edwards
Employee

Can we confirm that your system is using cgroups version 2? Run:

`grep cgroup /proc/filesystems`

and paste the output.

Gera_Dmz
Employee

What I get is:

$ grep cgroup /proc/filesystems
nodev   cgroup
nodev   cgroup2
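Note that /proc/filesystems only shows that the kernel supports both cgroup versions; which hierarchy is actually mounted is what matters to the runtime hooks. A quick way to check is the following sketch (the marker is the standard cgroup v2 cgroup.controllers file at the mount root):

```python
import os

# cgroup v2's unified hierarchy exposes this file at the mount root
CGROUP2_MARKER = "/sys/fs/cgroup/cgroup.controllers"

def mounted_cgroup_version(marker=CGROUP2_MARKER):
    """Return 2 if the unified cgroup v2 hierarchy is mounted, else assume v1."""
    return 2 if os.path.exists(marker) else 1

print(mounted_cgroup_version())
```

On the host, `stat -fc %T /sys/fs/cgroup` reporting `cgroup2fs` confirms the same thing from the shell.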
Gera_Dmz
Employee

Thanks. Same as 1663854. I don't have access to Habana's Jira, so please keep me posted if there's any update on the issue.

AungSan
Employee

Hi @Gera_Dmz ,

Please share all installed packages:

sudo apt list --installed | grep habana

Uninstall all Docker and any k8s packages, if installed.

Reboot the system.

Reinstall the Docker packages and the Habana packages, then try again.

AungSan
Employee

Hi @Gera_Dmz ,
Please install the prebuilt habana-container-runtime from the repo.

