Intel® Gaudi® AI Accelerator
Support for the Intel® Gaudi® AI Accelerator

Error starting containers using habana-container-runtime

Gera_Dmz
Employee

I am experiencing an issue where I am unable to access Gaudi accelerators when creating a Docker container using the Habana runtime.

 

Steps to reproduce:

  1. Installed Gaudi drivers & Software.
  2. Built binaries.
  3. Configured both /etc/docker/daemon.json & /etc/containerd/config.toml.
  4. Ran docker run --rm --runtime=habana -e HABANA_VISIBLE_DEVICES=all ubuntu:22.04 /bin/bash -c "ls /dev/accel/*" and got:

 

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exposing interfaces: failed creating temporary link on host: invalid argument
exit status 1: unknown.

 

  • Tried docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest but also got:

 

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exposing interfaces: failed creating temporary link on host: invalid argument
exit status 1: unknown.

 

  • After removing -e HABANA_VISIBLE_DEVICES=all, I'm able to exec into the container, but the accelerators are not visible inside it:

 

# hl-smi
habanalabs driver is not loaded or no AIPs available, aborting...
# ls /dev/accel
ls: cannot access '/dev/accel': No such file or directory

 

OS
Ubuntu 22.04.4 LTS

Kernel Version
5.15.0-117-generic

Container Runtime Type/Version
1.19.1

K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS)
Docker version 27.5.0

Extra logs and files
From the host machine:

 

$ hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.19.1-fw-57.2.2.0          |
| Driver Version:                                     1.19.1-6f47ddd          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncor-Events|
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:33:00.0     N/A |                   0  |
| N/A   24C   N/A  88W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:9a:00.0     N/A |                   0  |
| N/A   25C   N/A  92W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   2  HL-225              N/A  | 0000:34:00.0     N/A |                   0  |
| N/A   26C   N/A  76W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   3  HL-225              N/A  | 0000:9b:00.0     N/A |                   0  |
| N/A   27C   N/A 102W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   4  HL-225              N/A  | 0000:4d:00.0     N/A |                   0  |
| N/A   27C   N/A  90W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   5  HL-225              N/A  | 0000:4e:00.0     N/A |                   0  |
| N/A   25C   N/A  82W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   6  HL-225              N/A  | 0000:b4:00.0     N/A |                   0  |
| N/A   25C   N/A  65W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   7  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   27C   N/A  84W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+

$ tail -n 1 /var/log/habana-container-runtime.log
{"time":"2025-01-22T22:10:20.545416796Z","level":"INFO","msg":"file does not exist on host: /etc/habanalabs/gaudinet.json"}

$ tail -n 1 /var/log/habana-container-hook.log
{"time":"2025-01-22T22:10:20.569909471Z","level":"ERROR","msg":"exposing interfaces: failed creating temporary link on host: invalid argument"}

 

12 Replies
James_Edwards
Employee

This looks like an issue with your installation or configuration of the habana-container-runtime. On the system, what is the output of:

`dpkg -l | grep habanalabs-container-runtime`

I am particularly interested in the version.

 

Also, if you are using Docker, remove the containerd TOML file and check your Docker configuration file again. The /etc/docker/daemon.json file should look similar to this:

 

{
   "default-runtime": "habana",
   "runtimes": {
      "habana": {
         "path": "/usr/bin/habana-container-runtime",
         "runtimeArgs": []
      }
   }
}
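(Editor's note: before restarting Docker, it can be worth verifying that the file actually parses and declares the runtime as intended. A minimal, hypothetical sketch; the path simply mirrors the example configuration above:)

```python
import json

def habana_runtime_ok(daemon_json_text):
    """Check that a daemon.json string declares the habana runtime
    and makes it the default, as in the example configuration above."""
    cfg = json.loads(daemon_json_text)  # raises on malformed JSON
    habana = cfg.get("runtimes", {}).get("habana", {})
    return (
        cfg.get("default-runtime") == "habana"
        and habana.get("path") == "/usr/bin/habana-container-runtime"
    )

example = '''
{
   "default-runtime": "habana",
   "runtimes": {
      "habana": {
         "path": "/usr/bin/habana-container-runtime",
         "runtimeArgs": []
      }
   }
}
'''
ok = habana_runtime_ok(example)
```

A stray comma or a typo in the path makes this return False (or raise), which is cheaper to catch than a failed container start.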

 Then restart the Docker service:

sudo systemctl restart docker

 

Gera_Dmz
Employee

Thanks @James_Edwards for replying. The installed version is:

 

$ dpkg -l | grep habanalabs-container-runtime
ii  habanalabs-container-runtime              1.19.1-26                                   amd64        HABANA container runtime

 

 

/etc/docker/daemon.json is as follows:

 

$ cat /etc/docker/daemon.json
{"runtimes": {"habana": {"path": "/usr/bin/habana-container-runtime", "runtimeArgs": []}}, "default-runtime": "habana"}

 

 

Removing /etc/containerd/config.toml did help: I'm now able to exec into the container using docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest. But inside the container I'm still getting:

 

# hl-smi
habanalabs driver is not loaded or no AIPs available, aborting...

 

 

James_Edwards
Employee

1) Check the output of the habana-container-hook.log and the habana-container-runtime.log again to see if there are any other errors associated with the Docker server. (By the way, the gaudinet.json file is not required for single-Gaudi nodes, so that INFO message is not important. The ERROR message in the habana-container-hook.log was a problem, however.)

2) Start the docker container with the -d option and then run 'docker logs <container id>' to see if there are any errors during container startup.

3) On the host, get the permissions on the Gaudi accelerator devices: 'ls -l /dev/accel'

Gera_Dmz
Employee

1) Checking habana-container-hook.log, I now see this new message (an INFO, not an ERROR, this time):

$ tail -n 1 /var/log/habana-container-hook.log
{"time":"2025-02-10T22:44:50.45307364Z","level":"INFO","msg":"device already exists in namespace. Host network used?"}

 2) I don't get any logs when using the detached option:

$ docker run -d --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest sleep infinity
b8ca5e99989bef171fde1355f9e40c78f77914c96184cfa52e3f877c95990981
$ docker ps
CONTAINER ID   IMAGE                                                                                       COMMAND            CREATED         STATUS         PORTS     NAMES
b8ca5e99989b   vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest   "sleep infinity"   3 seconds ago   Up 3 seconds             romantic_williams
$ docker logs b8ca5e99989b

3) The permissions are:

$ ls -l /dev/accel/
total 0
crw-rw-rw- 1 root root 509,  0 Jan 14 23:23 accel0
crw-rw-rw- 1 root root 509,  2 Jan 14 23:23 accel1
crw-rw-rw- 1 root root 509,  4 Jan 14 23:23 accel2
crw-rw-rw- 1 root root 509,  6 Jan 14 23:23 accel3
crw-rw-rw- 1 root root 509,  8 Jan 14 23:23 accel4
crw-rw-rw- 1 root root 509, 10 Jan 14 23:23 accel5
crw-rw-rw- 1 root root 509, 12 Jan 14 23:23 accel6
crw-rw-rw- 1 root root 509, 14 Jan 14 23:23 accel7
crw-rw-rw- 1 root root 509,  1 Jan 14 23:23 accel_controlD0
crw-rw-rw- 1 root root 509,  3 Jan 14 23:23 accel_controlD1
crw-rw-rw- 1 root root 509,  5 Jan 14 23:23 accel_controlD2
crw-rw-rw- 1 root root 509,  7 Jan 14 23:23 accel_controlD3
crw-rw-rw- 1 root root 509,  9 Jan 14 23:23 accel_controlD4
crw-rw-rw- 1 root root 509, 11 Jan 14 23:23 accel_controlD5
crw-rw-rw- 1 root root 509, 13 Jan 14 23:23 accel_controlD6
crw-rw-rw- 1 root root 509, 15 Jan 14 23:23 accel_controlD7
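(Editor's note: a listing like the one above can also be checked mechanically: every index 0-7 should have both an accelN and an accel_controlDN character device with world read/write access. A rough, hypothetical sketch, with the expected names inferred from the listing above:)

```python
def missing_devices(ls_lines, count=8):
    """Given `ls -l /dev/accel` output lines, return the expected
    device names (accelN / accel_controlDN for N < count) that are
    absent, or present but not world-writable character devices."""
    present = set()
    for line in ls_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and "total 0"
        perms, name = fields[0], fields[-1]
        # character device ("c" prefix) with world rw- at the end
        if perms.startswith("c") and perms.endswith("rw-"):
            present.add(name)
    expected = {f"accel{i}" for i in range(count)}
    expected |= {f"accel_controlD{i}" for i in range(count)}
    return sorted(expected - present)

# Demo on a two-line excerpt of the listing above:
listing = """\
crw-rw-rw- 1 root root 509,  0 Jan 14 23:23 accel0
crw-rw-rw- 1 root root 509,  1 Jan 14 23:23 accel_controlD0
""".splitlines()
gaps = missing_devices(listing, count=1)
```

An empty result means all expected nodes are present with open permissions, matching what the listing above shows for all eight devices.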

 

James_Edwards
Employee

The device permissions all look good. I was also able to run your docker command and see all the devices (I used the 1.19.0 version of the container runtime), so this still seems like a configuration issue with the container runtime. What does this command report?

docker inspect <container_id> | grep Runtime

If you don't get a string that says "Runtime": "habana", there is still a configuration error. Otherwise, post the entire output of docker inspect so I can look at it.
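(Editor's note: the same check can be done on the structured output rather than with grep; docker inspect prints a JSON array, and the runtime lives under HostConfig.Runtime. A hypothetical sketch, demoed on a trimmed stand-in for the real output:)

```python
import json

def container_runtime(inspect_json_text):
    """Return HostConfig.Runtime for the first container in
    `docker inspect` output (docker inspect prints a JSON array)."""
    data = json.loads(inspect_json_text)
    return data[0].get("HostConfig", {}).get("Runtime")

# Trimmed stand-in for real `docker inspect <container_id>` output:
sample = '[{"Id": "eb0bf40f6b1d", "HostConfig": {"Runtime": "habana"}}]'
rt = container_runtime(sample)
```

Unlike grep, this won't be fooled by unrelated fields such as "CpuRealtimeRuntime".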

Gera_Dmz
Employee

I do see "Runtime": "habana":

$ docker ps
CONTAINER ID   IMAGE                                                                                       COMMAND            CREATED              STATUS              PORTS     NAMES
eb0bf40f6b1d   vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest   "sleep infinity"   About a minute ago   Up About a minute             boring_lehmann
$ docker inspect eb0bf40f6b1d | grep Runtime
            "Runtime": "habana",
            "CpuRealtimeRuntime": 0,

Inspecting the container:

$ docker inspect eb0bf40f6b1d
[
    {
        "Id": "eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce",
        "Created": "2025-02-11T16:06:30.690094085Z",
        "Path": "sleep",
        "Args": [
            "infinity"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 31171,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2025-02-11T16:06:33.601305388Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },
        "Image": "sha256:1d0c9dbfacfdffb8ed752b9f56519c3290645bb4ec81542a7256feb18f65c9a3",
        "ResolvConfPath": "/var/lib/docker/containers/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce/hostname",
        "HostsPath": "/var/lib/docker/containers/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce/hosts",
        "LogPath": "/var/lib/docker/containers/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce-json.log",
        "Name": "/boring_lehmann",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "docker-default",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": null,
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "host",
            "PortBindings": {},
            "RestartPolicy": {
                "Name": "no",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "ConsoleSize": [
                49,
                189
            ],
            "CapAdd": [
                "sys_nice"
            ],
            "CapDrop": null,
            "CgroupnsMode": "private",
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "host",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": [
                "label=disable"
            ],
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "habana",
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": [],
            "BlkioDeviceReadBps": [],
            "BlkioDeviceWriteBps": [],
            "BlkioDeviceReadIOps": [],
            "BlkioDeviceWriteIOps": [],
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": [],
            "DeviceCgroupRules": null,
            "DeviceRequests": null,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": null,
            "OomKillDisable": null,
            "PidsLimit": null,
            "Ulimits": [],
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": [
                "/proc/asound",
                "/proc/acpi",
                "/proc/kcore",
                "/proc/keys",
                "/proc/latency_stats",
                "/proc/timer_list",
                "/proc/timer_stats",
                "/proc/sched_debug",
                "/proc/scsi",
                "/sys/firmware",
                "/sys/devices/virtual/powercap"
            ],
            "ReadonlyPaths": [
                "/proc/bus",
                "/proc/fs",
                "/proc/irq",
                "/proc/sys",
                "/proc/sysrq-trigger"
            ]
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/99f176c4457ecbc3eb35beb6ae2e7c8c98c5897c6a74ed95d4a617f239051f84-init/diff:/var/lib/docker/overlay2/84a1dd5c7734190f9d1d61036a16e11c84e61574e09e64eec206d0c32b48054a/diff:/var/lib/docker/overlay2/f927604ccd4f437148a7c90d3e4e6f42266d84bd4ef415e807e18ccaf67362d5/diff:/var/lib/docker/overlay2/4ef1a9882c291c23760de9b0a74e196e230f1f7c37f61ec38db07539f4bd7f78/diff:/var/lib/docker/overlay2/720dd7a239d971dfc9243be25f090392ab5d48825897aef448d1b3a867c397d5/diff:/var/lib/docker/overlay2/65ff9290ebd144007a656959a1bbf2aa2c147fef0bc0f2af441bcd7077c53a18/diff:/var/lib/docker/overlay2/9b6f21580d4adc9da99557c895d6f271116421007d4a7a1c23d7f238ce41dd57/diff:/var/lib/docker/overlay2/3436c6314fa980e97b2f1fad0df95cb41caf626a7f776852405964dbc76d8d6a/diff:/var/lib/docker/overlay2/e426528d3b3300f3087fb6d1a59f72a243ed57a722f576b4f94f8574ef6e9e28/diff:/var/lib/docker/overlay2/169b6c543d8a5bd1ca87e26c1c2050f8e95635eb8e26122c97a6811cefa434d5/diff:/var/lib/docker/overlay2/1333522faba816e0b61ac33b484328ad8d8422c797c48470b3adabba890660eb/diff:/var/lib/docker/overlay2/d646b22eeb7590a13ed22a6fbb909ff702580e6e728609a597e451af0d5e03cf/diff:/var/lib/docker/overlay2/1062b70b19dff8afe284d195299e81826f0ce5f20004b42406320a05e48d9ca4/diff:/var/lib/docker/overlay2/93a7eb0237d95369d64caa402907dc4b16f82ccced8423bafbc9a8aa26e5eeda/diff:/var/lib/docker/overlay2/32a07cfc772d82b895a6fcd4019d007501d51f3bbc91ede16f3a7592a207e63e/diff",
                "MergedDir": "/var/lib/docker/overlay2/99f176c4457ecbc3eb35beb6ae2e7c8c98c5897c6a74ed95d4a617f239051f84/merged",
                "UpperDir": "/var/lib/docker/overlay2/99f176c4457ecbc3eb35beb6ae2e7c8c98c5897c6a74ed95d4a617f239051f84/diff",
                "WorkDir": "/var/lib/docker/overlay2/99f176c4457ecbc3eb35beb6ae2e7c8c98c5897c6a74ed95d4a617f239051f84/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [],
        "Config": {
            "Hostname": "ng-nx6n7s4yyi-6f812",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "HABANA_VISIBLE_DEVICES=all",
                "OMPI_MCA_btl_vader_single_copy_mechanism=none",
                "PATH=/opt/habanalabs/libfabric-1.22.0/bin:/opt/amazon/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "DEBIAN_FRONTEND=noninteractive",
                "GC_KERNEL_PATH=/usr/lib/habanalabs/libtpc_kernels.so",
                "HABANA_LOGS=/var/log/habana_logs/",
                "OS_NUMBER=2204",
                "HABANA_SCAL_BIN_PATH=/opt/habanalabs/engines_fw",
                "HABANA_PLUGINS_LIB_PATH=/opt/habanalabs/habana_plugins",
                "PIP_NO_CACHE_DIR=on",
                "PIP_DEFAULT_TIMEOUT=1000",
                "PIP_DISABLE_PIP_VERSION_CHECK=1",
                "LIBFABRIC_VERSION=1.22.0",
                "LIBFABRIC_ROOT=/opt/habanalabs/libfabric-1.22.0",
                "MPI_ROOT=/opt/amazon/openmpi",
                "LD_LIBRARY_PATH=/opt/habanalabs/libfabric-1.22.0/lib:/opt/amazon/openmpi/lib:/usr/lib/habanalabs:",
                "OPAL_PREFIX=/opt/amazon/openmpi",
                "MPICC=/opt/amazon/openmpi/bin/mpicc",
                "RDMAV_FORK_SAFE=1",
                "FI_EFA_USE_DEVICE_RDMA=1",
                "RDMA_CORE_ROOT=/opt/habanalabs/rdma-core/src",
                "RDMA_CORE_LIB=/opt/habanalabs/rdma-core/src/build/lib",
                "PYTHONPATH=/root:/usr/lib/habanalabs/",
                "LD_PRELOAD=/lib/x86_64-linux-gnu/libtcmalloc.so.4",
                "TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=7516192768"
            ],
            "Cmd": [
                "sleep",
                "infinity"
            ],
            "Image": "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": null,
            "OnBuild": null,
            "Labels": {
                "org.opencontainers.image.ref.name": "ubuntu",
                "org.opencontainers.image.version": "22.04"
            }
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "2663a081212073036fdafcefd1d0d214b1f4d0023adfc9c20f49e9ff3f4dbec4",
            "SandboxKey": "/var/run/docker/netns/default",
            "Ports": {},
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "host": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "MacAddress": "",
                    "DriverOpts": null,
                    "NetworkID": "528b711e870bfc399f943e2db9499cd9993d5441e6ac2bec38e54a4862f39af1",
                    "EndpointID": "ff60fa70e73a35564bb4f73bd2bc34c86c7d0466f27bd90f346aa2d969604538",
                    "Gateway": "",
                    "IPAddress": "",
                    "IPPrefixLen": 0,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "DNSNames": null
                }
            }
        }
    }
]

Thanks again for the support.

James_Edwards
Employee

I have been unable to find anything wrong with the docker container configuration you sent me; it is nearly identical to that of a working container I have run. The next thing I can do is install the 1.19.1 habanalabs-container-runtime and see if I can reproduce your issue. In the meantime, the only possible solution I can suggest is to install the new 1.19.2 Gaudi software stack (released yesterday) and see if that resolves your issue.

Gera_Dmz
Employee

Hello @James_Edwards , we've upgraded the Gaudi SW stack to 1.19.2. Unfortunately, the behavior is the same; we're still getting:

# hl-smi
habanalabs driver is not loaded or no AIPs available, aborting...
Gera_Dmz
Employee

Okay, it'd be great if you could try to replicate the error. On our side, we can upgrade to 1.19.2.

James_Edwards
Employee

I updated a system to the 1.19.2 version of the Intel Gaudi software and configured the habanalabs-container-runtime (at version 1.19.2-32) as specified in the comments above. I started a docker container with the following command:

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.2/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest

All devices were available in the container and hl-smi gave no errors; basically, I was unable to reproduce the problem with the latest software.

Try executing the docker command with the --privileged flag, without using the habana runtime:


docker run -it  --privileged  --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.2/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest


I know this is brute force, but it will tell us if the runtime is the issue.
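(Editor's note: whichever way the container comes up, device visibility can also be probed with a few lines of Python instead of hl-smi. A hypothetical sketch, parameterized on the directory so it can be demoed outside a Gaudi box; /dev/accel is the expected location from the listings earlier in the thread:)

```python
import os
import tempfile

def visible_accels(dev_dir="/dev/accel"):
    """Return the sorted accelN device indices visible under dev_dir."""
    if not os.path.isdir(dev_dir):
        return []  # directory absent: no accelerators exposed
    idx = []
    for name in os.listdir(dev_dir):
        # match accel0..accelN but skip accel_controlDN entries
        if name.startswith("accel") and name[5:].isdigit():
            idx.append(int(name[5:]))
    return sorted(idx)

# Demo on a throwaway directory standing in for /dev/accel:
tmp = tempfile.mkdtemp()
for i in (0, 1):
    open(os.path.join(tmp, f"accel{i}"), "w").close()
open(os.path.join(tmp, "accel_controlD0"), "w").close()
found = visible_accels(tmp)
```

Inside a healthy 8-card container, `visible_accels()` should report [0, 1, 2, 3, 4, 5, 6, 7]; an empty list matches the "No such file or directory" symptom above.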

 

Gera_Dmz
Employee

Interesting. Running:

docker run -it  --privileged  --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.2/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest

I'm actually able to run hl-smi inside the container:

# hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.19.2-fw-57.2.4.0          |
| Driver Version:                                     1.19.2-ff37fea          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncor-Events|
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:33:00.0     N/A |                   0  |
| N/A   20C   N/A  86W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:9a:00.0     N/A |                   0  |
| N/A   21C   N/A  92W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   2  HL-225              N/A  | 0000:9b:00.0     N/A |                   0  |
| N/A   24C   N/A 100W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   3  HL-225              N/A  | 0000:34:00.0     N/A |                   0  |
| N/A   23C   N/A  75W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   4  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   23C   N/A  82W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   5  HL-225              N/A  | 0000:4d:00.0     N/A |                   0  |
| N/A   23C   N/A  93W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   6  HL-225              N/A  | 0000:4e:00.0     N/A |                   0  |
| N/A   22C   N/A  80W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   7  HL-225              N/A  | 0000:b4:00.0     N/A |                   0  |
| N/A   21C   N/A  62W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+

Could this be related to how the SW stack was installed? We didn't use sudo for it, though.

James_Edwards
Employee

This tells me that your Intel Gaudi software drivers are installed correctly, have the correct permissions, and are available to Docker. The problem seems to be with the habana container runtime and how it exposes the devices in the container's environment. It could be how the container runtime was installed and configured, but we reinstalled it and checked the configuration of a container, and nothing looked wrong. I will check with the ACE team for next steps.
