I am experiencing an issue where I am unable to access Gaudi accelerators when creating a Docker container using the Habana runtime.
Steps to reproduce:
- Installed Gaudi drivers & Software.
- Built binaries.
- Configured both /etc/docker/daemon.json & /etc/containerd/config.toml.
- Ran docker run --rm --runtime=habana -e HABANA_VISIBLE_DEVICES=all ubuntu:22.04 /bin/bash -c "ls /dev/accel/*" and got:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exposing interfaces: failed creating temporary link on host: invalid argument
exit status 1: unknown.
- Tried docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest but also got:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exposing interfaces: failed creating temporary link on host: invalid argument
exit status 1: unknown.
- Removing -e HABANA_VISIBLE_DEVICES=all, I'm able to exec into the container, but the accelerators are not visible inside it:
# hl-smi
habanalabs driver is not loaded or no AIPs available, aborting...
# ls /dev/accel
ls: cannot access '/dev/accel': No such file or directory
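A quick host-side sanity check can help separate driver problems from container-runtime problems: if the habanalabs kernel modules are loaded on the host, the failure is almost certainly on the container side. A minimal sketch (the habanalabs module-name prefix is an assumption, and habana_modules_loaded is an illustrative helper, not part of the Gaudi tooling):

```python
def habana_modules_loaded(proc_modules_text: str):
    """Return the habanalabs-prefixed module names found in /proc/modules text."""
    names = {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}
    return sorted(n for n in names if n.startswith("habanalabs"))

# On a real host you would read the live file:
#   text = open("/proc/modules").read()
sample = "habanalabs 3000000 2 - Live 0x0\nigb 200000 0 - Live 0x0\n"
print(habana_modules_loaded(sample))  # ['habanalabs']
```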
OS
Ubuntu 22.04.4 LTS
Kernel Version
5.15.0-117-generic
Container Runtime Type/Version
1.19.1
K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS)
Docker version 27.5.0
Extra logs and files
From the host machine:
$ hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.19.1-fw-57.2.2.0 |
| Driver Version: 1.19.1-6f47ddd |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncor-Events|
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-225 N/A | 0000:33:00.0 N/A | 0 |
| N/A 24C N/A 88W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 1 HL-225 N/A | 0000:9a:00.0 N/A | 0 |
| N/A 25C N/A 92W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 2 HL-225 N/A | 0000:34:00.0 N/A | 0 |
| N/A 26C N/A 76W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 3 HL-225 N/A | 0000:9b:00.0 N/A | 0 |
| N/A 27C N/A 102W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 4 HL-225 N/A | 0000:4d:00.0 N/A | 0 |
| N/A 27C N/A 90W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 5 HL-225 N/A | 0000:4e:00.0 N/A | 0 |
| N/A 25C N/A 82W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 6 HL-225 N/A | 0000:b4:00.0 N/A | 0 |
| N/A 25C N/A 65W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 7 HL-225 N/A | 0000:b3:00.0 N/A | 0 |
| N/A 27C N/A 84W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 N/A N/A N/A N/A |
| 1 N/A N/A N/A N/A |
| 2 N/A N/A N/A N/A |
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 N/A N/A N/A N/A |
| 6 N/A N/A N/A N/A |
| 7 N/A N/A N/A N/A |
+=============================================================================+
$ tail -n 1 /var/log/habana-container-runtime.log
{"time":"2025-01-22T22:10:20.545416796Z","level":"INFO","msg":"file does not exist on host: /etc/habanalabs/gaudinet.json"}
$ tail -n 1 /var/log/habana-container-hook.log
{"time":"2025-01-22T22:10:20.569909471Z","level":"ERROR","msg":"exposing interfaces: failed creating temporary link on host: invalid argument"}
This looks like an issue with your installation or configuration of the habana-container-runtime. On the system, what is the output of:
`dpkg -l | grep habanalabs-container-runtime`
I am particularly interested in the version.
Also, if you are using Docker, remove the containerd toml file and check your Docker configuration file again. The /etc/docker/daemon.json file should look similar to this:
{
  "default-runtime": "habana",
  "runtimes": {
    "habana": {
      "path": "/usr/bin/habana-container-runtime",
      "runtimeArgs": []
    }
  }
}
Then restart the docker service:
sudo systemctl restart docker
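Before restarting Docker, you can double-check that the config parses and declares the runtime. A minimal sketch (the key names follow the config above; habana_runtime_configured is an illustrative helper of mine):

```python
import json

def habana_runtime_configured(config_text: str) -> bool:
    """True if a Docker daemon.json declares the habana runtime binary."""
    cfg = json.loads(config_text)  # raises ValueError on malformed JSON
    runtime = cfg.get("runtimes", {}).get("habana", {})
    return runtime.get("path", "").endswith("habana-container-runtime")

# On a real host: config_text = open("/etc/docker/daemon.json").read()
sample = ('{"default-runtime": "habana", "runtimes": {"habana": '
          '{"path": "/usr/bin/habana-container-runtime", "runtimeArgs": []}}}')
print(habana_runtime_configured(sample))  # True
```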
Thanks @James_Edwards for replying. The installed version is:
$ dpkg -l | grep habanalabs-container-runtime
ii habanalabs-container-runtime 1.19.1-26 amd64 HABANA container runtime
/etc/docker/daemon.json is as follows:
$ cat /etc/docker/daemon.json
{"runtimes": {"habana": {"path": "/usr/bin/habana-container-runtime", "runtimeArgs": []}}, "default-runtime": "habana"}
Removing /etc/containerd/config.toml did help: I can now exec into the container using docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest. But inside the container I'm still getting:
# hl-smi
habanalabs driver is not loaded or no AIPs available, aborting...
1) Check the output of habana-container-hook.log and habana-container-runtime.log again to see if there are any other errors associated with the docker server. (By the way, the gaudinet.json file is not required for single-Gaudi nodes, so that INFO message is not important. The ERROR message in habana-container-hook.log was a problem, however.)
2) Start the docker container with the -d option and then run 'docker logs <container id>' to see if there are any errors on container startup.
3) On the host, get the permissions on the Gaudi accelerator devices: 'ls -l /dev/accel'
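For step 3, a small sketch of what "good" permissions look like, assuming the accel nodes should be world read/write character devices as in a default install (device_node_ok is an illustrative helper, not a Gaudi tool):

```python
import stat

def device_node_ok(mode: int) -> bool:
    """A /dev/accel node should be a character device that is world read/write."""
    world_rw = stat.S_IROTH | stat.S_IWOTH
    return stat.S_ISCHR(mode) and (mode & world_rw) == world_rw

# crw-rw-rw- in ls -l output corresponds to a char device with mode 0666:
print(device_node_ok(stat.S_IFCHR | 0o666))  # True
# A regular file with the same bits would fail the check:
print(device_node_ok(stat.S_IFREG | 0o666))  # False
# On a real host you would test os.stat("/dev/accel/accel0").st_mode
```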
1) Checking habana-container-hook.log, I now see this message instead:
$ tail -n 1 /var/log/habana-container-hook.log
{"time":"2025-02-10T22:44:50.45307364Z","level":"INFO","msg":"device already exists in namespace. Host network used?"}
2) I don't get any logs when using the detached option:
$ docker run -d --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest sleep infinity
b8ca5e99989bef171fde1355f9e40c78f77914c96184cfa52e3f877c95990981
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b8ca5e99989b vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest "sleep infinity" 3 seconds ago Up 3 seconds romantic_williams
$ docker logs b8ca5e99989b
3) The permissions are:
$ ls -l /dev/accel/
total 0
crw-rw-rw- 1 root root 509, 0 Jan 14 23:23 accel0
crw-rw-rw- 1 root root 509, 2 Jan 14 23:23 accel1
crw-rw-rw- 1 root root 509, 4 Jan 14 23:23 accel2
crw-rw-rw- 1 root root 509, 6 Jan 14 23:23 accel3
crw-rw-rw- 1 root root 509, 8 Jan 14 23:23 accel4
crw-rw-rw- 1 root root 509, 10 Jan 14 23:23 accel5
crw-rw-rw- 1 root root 509, 12 Jan 14 23:23 accel6
crw-rw-rw- 1 root root 509, 14 Jan 14 23:23 accel7
crw-rw-rw- 1 root root 509, 1 Jan 14 23:23 accel_controlD0
crw-rw-rw- 1 root root 509, 3 Jan 14 23:23 accel_controlD1
crw-rw-rw- 1 root root 509, 5 Jan 14 23:23 accel_controlD2
crw-rw-rw- 1 root root 509, 7 Jan 14 23:23 accel_controlD3
crw-rw-rw- 1 root root 509, 9 Jan 14 23:23 accel_controlD4
crw-rw-rw- 1 root root 509, 11 Jan 14 23:23 accel_controlD5
crw-rw-rw- 1 root root 509, 13 Jan 14 23:23 accel_controlD6
crw-rw-rw- 1 root root 509, 15 Jan 14 23:23 accel_controlD7
The device permissions all seem good. I was also able to run your docker command and get all devices (I used the 1.19.0 version of the container runtime). This still seems like an issue with the configuration of the container runtime. What does this command say:
docker inspect <container_id> | grep Runtime
If you don't get a string that says "Runtime": "habana", there is still a configuration error. Otherwise, post the entire output of docker inspect so I can look at it.
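The same check can be done programmatically on the inspect JSON, which also pulls out the HABANA_VISIBLE_DEVICES setting in one pass. A minimal sketch (runtime_and_visible_devices is an illustrative helper; the sample is a trimmed stand-in for real docker inspect output):

```python
import json

def runtime_and_visible_devices(inspect_output: str):
    """Extract the runtime name and HABANA_VISIBLE_DEVICES from docker inspect JSON."""
    info = json.loads(inspect_output)[0]  # docker inspect returns a JSON array
    runtime = info["HostConfig"]["Runtime"]
    env = info["Config"]["Env"] or []
    visible = next((e.split("=", 1)[1] for e in env
                    if e.startswith("HABANA_VISIBLE_DEVICES=")), None)
    return runtime, visible

sample = json.dumps([{"HostConfig": {"Runtime": "habana"},
                      "Config": {"Env": ["HABANA_VISIBLE_DEVICES=all"]}}])
print(runtime_and_visible_devices(sample))  # ('habana', 'all')
```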
I do see "Runtime": "habana":
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
eb0bf40f6b1d vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest "sleep infinity" About a minute ago Up About a minute boring_lehmann
$ docker inspect eb0bf40f6b1d | grep Runtime
"Runtime": "habana",
"CpuRealtimeRuntime": 0,
Inspecting the container:
$ docker inspect eb0bf40f6b1d
[
{
"Id": "eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce",
"Created": "2025-02-11T16:06:30.690094085Z",
"Path": "sleep",
"Args": [
"infinity"
],
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 31171,
"ExitCode": 0,
"Error": "",
"StartedAt": "2025-02-11T16:06:33.601305388Z",
"FinishedAt": "0001-01-01T00:00:00Z"
},
"Image": "sha256:1d0c9dbfacfdffb8ed752b9f56519c3290645bb4ec81542a7256feb18f65c9a3",
"ResolvConfPath": "/var/lib/docker/containers/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce/resolv.conf",
"HostnamePath": "/var/lib/docker/containers/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce/hostname",
"HostsPath": "/var/lib/docker/containers/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce/hosts",
"LogPath": "/var/lib/docker/containers/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce/eb0bf40f6b1d0392e38e9e1542650fab9df37e851d4d74388dfe2d904fcabdce-json.log",
"Name": "/boring_lehmann",
"RestartCount": 0,
"Driver": "overlay2",
"Platform": "linux",
"MountLabel": "",
"ProcessLabel": "",
"AppArmorProfile": "docker-default",
"ExecIDs": null,
"HostConfig": {
"Binds": null,
"ContainerIDFile": "",
"LogConfig": {
"Type": "json-file",
"Config": {}
},
"NetworkMode": "host",
"PortBindings": {},
"RestartPolicy": {
"Name": "no",
"MaximumRetryCount": 0
},
"AutoRemove": false,
"VolumeDriver": "",
"VolumesFrom": null,
"ConsoleSize": [
49,
189
],
"CapAdd": [
"sys_nice"
],
"CapDrop": null,
"CgroupnsMode": "private",
"Dns": [],
"DnsOptions": [],
"DnsSearch": [],
"ExtraHosts": null,
"GroupAdd": null,
"IpcMode": "host",
"Cgroup": "",
"Links": null,
"OomScoreAdj": 0,
"PidMode": "",
"Privileged": false,
"PublishAllPorts": false,
"ReadonlyRootfs": false,
"SecurityOpt": [
"label=disable"
],
"UTSMode": "",
"UsernsMode": "",
"ShmSize": 67108864,
"Runtime": "habana",
"Isolation": "",
"CpuShares": 0,
"Memory": 0,
"NanoCpus": 0,
"CgroupParent": "",
"BlkioWeight": 0,
"BlkioWeightDevice": [],
"BlkioDeviceReadBps": [],
"BlkioDeviceWriteBps": [],
"BlkioDeviceReadIOps": [],
"BlkioDeviceWriteIOps": [],
"CpuPeriod": 0,
"CpuQuota": 0,
"CpuRealtimePeriod": 0,
"CpuRealtimeRuntime": 0,
"CpusetCpus": "",
"CpusetMems": "",
"Devices": [],
"DeviceCgroupRules": null,
"DeviceRequests": null,
"MemoryReservation": 0,
"MemorySwap": 0,
"MemorySwappiness": null,
"OomKillDisable": null,
"PidsLimit": null,
"Ulimits": [],
"CpuCount": 0,
"CpuPercent": 0,
"IOMaximumIOps": 0,
"IOMaximumBandwidth": 0,
"MaskedPaths": [
"/proc/asound",
"/proc/acpi",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/proc/scsi",
"/sys/firmware",
"/sys/devices/virtual/powercap"
],
"ReadonlyPaths": [
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
},
"GraphDriver": {
"Data": {
"LowerDir": "/var/lib/docker/overlay2/99f176c4457ecbc3eb35beb6ae2e7c8c98c5897c6a74ed95d4a617f239051f84-init/diff:/var/lib/docker/overlay2/84a1dd5c7734190f9d1d61036a16e11c84e61574e09e64eec206d0c32b48054a/diff:/var/lib/docker/overlay2/f927604ccd4f437148a7c90d3e4e6f42266d84bd4ef415e807e18ccaf67362d5/diff:/var/lib/docker/overlay2/4ef1a9882c291c23760de9b0a74e196e230f1f7c37f61ec38db07539f4bd7f78/diff:/var/lib/docker/overlay2/720dd7a239d971dfc9243be25f090392ab5d48825897aef448d1b3a867c397d5/diff:/var/lib/docker/overlay2/65ff9290ebd144007a656959a1bbf2aa2c147fef0bc0f2af441bcd7077c53a18/diff:/var/lib/docker/overlay2/9b6f21580d4adc9da99557c895d6f271116421007d4a7a1c23d7f238ce41dd57/diff:/var/lib/docker/overlay2/3436c6314fa980e97b2f1fad0df95cb41caf626a7f776852405964dbc76d8d6a/diff:/var/lib/docker/overlay2/e426528d3b3300f3087fb6d1a59f72a243ed57a722f576b4f94f8574ef6e9e28/diff:/var/lib/docker/overlay2/169b6c543d8a5bd1ca87e26c1c2050f8e95635eb8e26122c97a6811cefa434d5/diff:/var/lib/docker/overlay2/1333522faba816e0b61ac33b484328ad8d8422c797c48470b3adabba890660eb/diff:/var/lib/docker/overlay2/d646b22eeb7590a13ed22a6fbb909ff702580e6e728609a597e451af0d5e03cf/diff:/var/lib/docker/overlay2/1062b70b19dff8afe284d195299e81826f0ce5f20004b42406320a05e48d9ca4/diff:/var/lib/docker/overlay2/93a7eb0237d95369d64caa402907dc4b16f82ccced8423bafbc9a8aa26e5eeda/diff:/var/lib/docker/overlay2/32a07cfc772d82b895a6fcd4019d007501d51f3bbc91ede16f3a7592a207e63e/diff",
"MergedDir": "/var/lib/docker/overlay2/99f176c4457ecbc3eb35beb6ae2e7c8c98c5897c6a74ed95d4a617f239051f84/merged",
"UpperDir": "/var/lib/docker/overlay2/99f176c4457ecbc3eb35beb6ae2e7c8c98c5897c6a74ed95d4a617f239051f84/diff",
"WorkDir": "/var/lib/docker/overlay2/99f176c4457ecbc3eb35beb6ae2e7c8c98c5897c6a74ed95d4a617f239051f84/work"
},
"Name": "overlay2"
},
"Mounts": [],
"Config": {
"Hostname": "ng-nx6n7s4yyi-6f812",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"HABANA_VISIBLE_DEVICES=all",
"OMPI_MCA_btl_vader_single_copy_mechanism=none",
"PATH=/opt/habanalabs/libfabric-1.22.0/bin:/opt/amazon/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"DEBIAN_FRONTEND=noninteractive",
"GC_KERNEL_PATH=/usr/lib/habanalabs/libtpc_kernels.so",
"HABANA_LOGS=/var/log/habana_logs/",
"OS_NUMBER=2204",
"HABANA_SCAL_BIN_PATH=/opt/habanalabs/engines_fw",
"HABANA_PLUGINS_LIB_PATH=/opt/habanalabs/habana_plugins",
"PIP_NO_CACHE_DIR=on",
"PIP_DEFAULT_TIMEOUT=1000",
"PIP_DISABLE_PIP_VERSION_CHECK=1",
"LIBFABRIC_VERSION=1.22.0",
"LIBFABRIC_ROOT=/opt/habanalabs/libfabric-1.22.0",
"MPI_ROOT=/opt/amazon/openmpi",
"LD_LIBRARY_PATH=/opt/habanalabs/libfabric-1.22.0/lib:/opt/amazon/openmpi/lib:/usr/lib/habanalabs:",
"OPAL_PREFIX=/opt/amazon/openmpi",
"MPICC=/opt/amazon/openmpi/bin/mpicc",
"RDMAV_FORK_SAFE=1",
"FI_EFA_USE_DEVICE_RDMA=1",
"RDMA_CORE_ROOT=/opt/habanalabs/rdma-core/src",
"RDMA_CORE_LIB=/opt/habanalabs/rdma-core/src/build/lib",
"PYTHONPATH=/root:/usr/lib/habanalabs/",
"LD_PRELOAD=/lib/x86_64-linux-gnu/libtcmalloc.so.4",
"TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=7516192768"
],
"Cmd": [
"sleep",
"infinity"
],
"Image": "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest",
"Volumes": null,
"WorkingDir": "",
"Entrypoint": null,
"OnBuild": null,
"Labels": {
"org.opencontainers.image.ref.name": "ubuntu",
"org.opencontainers.image.version": "22.04"
}
},
"NetworkSettings": {
"Bridge": "",
"SandboxID": "2663a081212073036fdafcefd1d0d214b1f4d0023adfc9c20f49e9ff3f4dbec4",
"SandboxKey": "/var/run/docker/netns/default",
"Ports": {},
"HairpinMode": false,
"LinkLocalIPv6Address": "",
"LinkLocalIPv6PrefixLen": 0,
"SecondaryIPAddresses": null,
"SecondaryIPv6Addresses": null,
"EndpointID": "",
"Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"IPAddress": "",
"IPPrefixLen": 0,
"IPv6Gateway": "",
"MacAddress": "",
"Networks": {
"host": {
"IPAMConfig": null,
"Links": null,
"Aliases": null,
"MacAddress": "",
"DriverOpts": null,
"NetworkID": "528b711e870bfc399f943e2db9499cd9993d5441e6ac2bec38e54a4862f39af1",
"EndpointID": "ff60fa70e73a35564bb4f73bd2bc34c86c7d0466f27bd90f346aa2d969604538",
"Gateway": "",
"IPAddress": "",
"IPPrefixLen": 0,
"IPv6Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"DNSNames": null
}
}
}
}
]
Thanks again for the support.
I have been unable to find anything wrong with the container configuration you sent; it is nearly identical to that of a working container I have run. The only thing I can think of to do is install the 1.19.1 habanalabs-container-runtime and see if I can reproduce your issue. In the meantime, the only possible solution I can suggest is to install the new 1.19.2 Gaudi software stack (released yesterday) and see if that resolves your issue.
Hello @James_Edwards, we've upgraded the Gaudi SW stack to 1.19.2. Unfortunately, the behavior is the same; we're still getting:
# hl-smi
habanalabs driver is not loaded or no AIPs available, aborting...
Okay, it'd be great if you could try to replicate the error. On our side, we can upgrade to 1.19.2.
I updated a system to the 1.19.2 version of Intel Gaudi software and configured the habanalabs-container-runtime (at version 1.19.2-32), as specified in the comments above. I started a docker container with the following command:
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.2/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest
All devices were available in the container and hl-smi gave no errors. In short, I was unable to reproduce the problem with the latest software.
Try executing the docker command with the --privileged flag, without using the habana runtime:
docker run -it --privileged --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.2/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest
I know this is brute force, but it will tell us if the runtime is the issue.
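If --privileged works, another brute-force variant worth trying is passing the accel nodes explicitly with --device instead of relying on the runtime. This is a hypothetical sketch that only builds the command string; the device names follow the ls output earlier in the thread, and explicit --device mounts may still miss setup the habana runtime normally performs:

```python
def device_flags(indices):
    """Build --device flags for the accel and accel_controlD nodes."""
    flags = []
    for i in indices:
        flags += [f"--device=/dev/accel/accel{i}",
                  f"--device=/dev/accel/accel_controlD{i}"]
    return flags

cmd = (["docker", "run", "-it", "--cap-add=sys_nice", "--net=host", "--ipc=host"]
       + device_flags(range(8))  # 8 Gaudi cards, per the hl-smi output above
       + ["vault.habana.ai/gaudi-docker/1.19.2/ubuntu24.04/"
          "habanalabs/pytorch-installer-2.5.1:latest"])
print(" ".join(cmd))
```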
Interesting. Running:
docker run -it --privileged --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.2/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest
I'm actually able to run hl-smi inside the container:
# hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.19.2-fw-57.2.4.0 |
| Driver Version: 1.19.2-ff37fea |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncor-Events|
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-225 N/A | 0000:33:00.0 N/A | 0 |
| N/A 20C N/A 86W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 1 HL-225 N/A | 0000:9a:00.0 N/A | 0 |
| N/A 21C N/A 92W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 2 HL-225 N/A | 0000:9b:00.0 N/A | 0 |
| N/A 24C N/A 100W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 3 HL-225 N/A | 0000:34:00.0 N/A | 0 |
| N/A 23C N/A 75W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 4 HL-225 N/A | 0000:b3:00.0 N/A | 0 |
| N/A 23C N/A 82W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 5 HL-225 N/A | 0000:4d:00.0 N/A | 0 |
| N/A 23C N/A 93W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 6 HL-225 N/A | 0000:4e:00.0 N/A | 0 |
| N/A 22C N/A 80W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 7 HL-225 N/A | 0000:b4:00.0 N/A | 0 |
| N/A 21C N/A 62W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 N/A N/A N/A N/A |
| 1 N/A N/A N/A N/A |
| 2 N/A N/A N/A N/A |
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 N/A N/A N/A N/A |
| 6 N/A N/A N/A N/A |
| 7 N/A N/A N/A N/A |
+=============================================================================+
Could this be related to how the SW stack was installed? We didn't use sudo for the installation, though.
This tells me that your Intel Gaudi software drivers are installed correctly, have the correct permissions, and are available to Docker. Your problem seems to be with the habana container runtime and how it initializes the devices in the container's environment. It could be how the container runtime was installed and configured, but we reinstalled it and checked the configuration of a container, and nothing looked wrong. I will check with the ACE team for next steps.
