Based on the Multi-Gaudi Workloads Example, I am trying to run an MPIJob with the following configuration:
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mpijob
spec:
  slotsPerWorker: 2
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          hostIPC: true
          containers:
            - name: mpijob-container
              image: "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest"
              imagePullPolicy: Always
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                  echo "HOSTSFILE=${HOSTSFILE}";
                  MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p)";
                  echo "MASTER_ADDR=${MASTER_ADDR}";
                  NUM_NODES=$(wc -l < $HOSTSFILE);
                  echo "NUM_NODES=${NUM_NODES}";
                  CARDS_PER_NODE=2;
                  N_CARDS=$((NUM_NODES*CARDS_PER_NODE));
                  echo "N_CARDS=${N_CARDS}";
                  SETUP_CMD="git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git; \
                    pip install -r optimum-habana/examples/language-modeling/requirements.txt; \
                    pip install --no-cache-dir optimum-habana==1.15.0; \
                    pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0";
                  eval $SETUP_CMD;
                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    -mca routed direct \
                    git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git;
                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    -mca routed direct \
                    pip install -r optimum-habana/examples/language-modeling/requirements.txt;
                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    -mca routed direct \
                    pip install --no-cache-dir optimum-habana==1.15.0;
                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    -mca routed direct \
                    pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0;
                  MODEL_PATH=/optimum-habana/examples/language-modeling;
                  cd $MODEL_PATH;
                  mpirun -np ${N_CARDS} \
                    --allow-run-as-root \
                    --bind-to core \
                    --map-by ppr:4:socket:PE=6 \
                    -rank-by core --report-bindings \
                    --tag-output \
                    --merge-stderr-to-stdout --prefix $MPI_ROOT \
                    -x MASTER_ADDR=$MASTER_ADDR \
                    -mca btl_tcp_if_include eth0 \
                    -mca oob_tcp_if_include eth0 \
                    -mca plm_rsh_no_tree_spawn 1 \
                    python $MODEL_PATH/run_lora_clm.py \
                    --model_name_or_path huggyllama/llama-7b \
                    --dataset_name tatsu-lab/alpaca \
                    --bf16 \
                    --output_dir /tmp/pvc-mount \
                    --num_train_epochs 1 \
                    --per_device_train_batch_size 12 \
                    --evaluation_strategy no \
                    --save_strategy no \
                    --learning_rate 1e-4 \
                    --warmup_ratio 0.03 \
                    --lr_scheduler_type constant \
                    --max_grad_norm 0.3 \
                    --logging_steps 1 \
                    --do_train \
                    --do_eval \
                    --use_habana \
                    --use_lazy_mode \
                    --throughput_warmup_steps 3 \
                    --lora_rank 8 \
                    --lora_alpha 16 \
                    --lora_dropout 0.05 \
                    --lora_target_modules q_proj v_proj \
                    --dataset_concatenation \
                    --max_seq_length 512 \
                    --low_cpu_mem_usage True \
                    --validation_split_percentage 4 \
                    --adam_epsilon 1e-08;
              resources:
                limits:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
              volumeMounts:
                - name: hf-token
                  mountPath: /tmp/hf_token
                - name: pvc-storage
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: hf-token
              secret:
                secretName: hf-token
            - name: pvc-storage
              persistentVolumeClaim:
                claimName: pvc-storage
    Worker:
      replicas: 2
      template:
        spec:
          hostIPC: true
          containers:
            - name: mpijob-container
              image: "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest"
              imagePullPolicy: Always
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  sleep 365d;
              resources:
                limits:
                  habana.ai/gaudi: 2
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  habana.ai/gaudi: 2
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
              volumeMounts:
                - name: hf-token
                  mountPath: /tmp/hf_token
                - name: pvc-storage
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: hf-token
              secret:
                secretName: hf-token
            - name: pvc-storage
              persistentVolumeClaim:
                claimName: pvc-storage
When I run this configuration, I encounter the following error:
There are not enough slots available in the system to satisfy the 4
slots that were requested by the application:
python
Either request fewer slots for your application, or make more slots
available for use.
Observations:
- The example works fine when using either 1 worker pod with 2 Gaudi cards, or 2 worker pods with 1 Gaudi card each.
- Using the --oversubscribe flag results in the following error:
RuntimeError: synStatus=8 [Device not found] Device acquire failed.
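For reference, the launcher script above already echoes the hostfile path and the derived values, so they can be read back from the launcher pod. A minimal sketch (pod and namespace names are placeholders):
# Values echoed by the launcher script (HOSTSFILE, MASTER_ADDR, NUM_NODES, N_CARDS)
kubectl logs -n <namespace> <launcher-pod> | grep -E 'HOSTSFILE|MASTER_ADDR|NUM_NODES|N_CARDS'
# The hostfile mpirun reads, via the same environment variable used in the script
kubectl exec -n <namespace> <launcher-pod> -- bash -c 'cat "$OMPI_MCA_orte_default_hostfile"'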
Gaudi devices can only be acquired by one process at a time and therefore cannot be oversubscribed. The device-acquire failure indicates that two processes tried to acquire the same device, causing one of them to fail. I have some questions:
1) Do your Gaudi systems really only have 2 devices per node, or are you just specifying 2 in the yaml? Usually slotsPerWorker is set to 8, indicating 8 Gaudi devices are available. Running 'hl-smi' will show how many devices are present.
2) Can you add the '--display-allocation' option to the mpirun command and rerun? This will show how the MPI processes are being distributed and which slots are being consumed on each node (a quick standalone check is sketched below).
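For example, this kind of standalone check from inside the launcher pod prints the allocation without starting the training run (a sketch; 'hostname' is just a placeholder program, and the default hostfile set up for the job is used):
# Request the same 4 ranks as N_CARDS, print the detected node/slot allocation, run a no-op
mpirun -np 4 --allow-run-as-root --display-allocation --tag-output hostname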
Hello @James_Edwards,
1) The systems have 8 devices; we're configuring two nodes with two cards each for flexibility while developing.
2) Adding --display-allocation, we get:
====================== ALLOCATED NODES ======================
mpijob-worker-0.mpijob.<namespace>.svc: flags=0x13 slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
mpijob-worker-1.mpijob.<namespace>.svc: flags=0x13 slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4
slots that were requested by the application:
=================================================================
python
Either request fewer slots for your application, or make more slots
available for use.
And exec'ing into a worker pod, I can see two cards available:
root@mpijob-worker-1:/# hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.19.1-fw-57.2.2.0 |
| Driver Version: 1.19.1-6f47ddd |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncor-Events|
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-225 N/A | 0000:34:00.0 N/A | 0 |
| N/A 26C N/A 78W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 1 HL-225 N/A | 0000:b3:00.0 N/A | 0 |
| N/A 26C N/A 85W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 N/A N/A N/A N/A |
| 1 N/A N/A N/A N/A |
+=============================================================================+
This doesn't seem like a Gaudi platform problem but rather a problem with the MPI Operator or the MPI command. Can you try a run without the -np option? If you do not specify -np, the MPI implementation will typically use all available slots across the worker nodes; the number of slots is determined by the slotsPerWorker value and the number of worker nodes.
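As a quick illustration of that default behaviour (a sketch run from the launcher pod; 'hostname' is again only a placeholder program):
# Without -np, mpirun derives the rank count from the slots listed in the default hostfile
mpirun --allow-run-as-root --display-allocation --tag-output hostname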
Tested without the -np option and got:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8
slots that were requested by the application:
python
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
====================== ALLOCATED NODES ======================
mpijob-worker-0.mpijob.<namespace>.svc: flags=0x13 slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
mpijob-worker-1.mpijob.<namespace>.svc: flags=0x13 slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
For some reason it's looking for 8 slots. Requesting slotsPerWorker: 8 wouldn't help, right? We're trying to use 2 Gaudi cards per node.
When you run with the -np option, the number of slots requested is N_CARDS = 4, given the values in the yaml. With two running worker pods there should be enough slots across the two systems to execute 4 processes, yet the failure says not enough slots are available even though 4 slots clearly exist. Removing the -np option should have made mpirun calculate the process count from the slotsPerWorker value (2) and the number of worker nodes (2), but it overestimates by a factor of 2 (only 4 slots are available, yet 8 are requested). Something is clearly wrong in how MPI derives the total number of slots from these parameters.

It would be interesting to change slotsPerWorker to 4 and then 8, set the limits and requests stanzas to habana.ai/gaudi: 4 and then 8, and rerun the original yaml with the -np option and the --display-allocation option for both settings. The two nodes would then have 8 available slots (and then 16) while only 4 are requested, and the slot allocation for a successful run with more than one device per node could be determined. All the Gaudi samples specify slotsPerWorker: 8; this should work, but it may be the only value that works for that parameter.
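Before changing those values, it may also be worth confirming what each node actually advertises to Kubernetes (a sketch; the node name is a placeholder):
# Capacity/Allocatable lines show how many Gaudi devices the device plugin exposes on the node
kubectl describe node <node-name> | grep -i "habana.ai/gaudi"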
Trying with "slotsPerWorker: 4" and "habana.ai/gaudi: 4", I'm getting:
====================== ALLOCATED NODES ======================
mpijob-worker-0.mpijob.<namespace>.svc: flags=0x13 slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
mpijob-worker-1.mpijob.<namespace>.svc: flags=0x13 slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8
slots that were requested by the application:
python
Either request fewer slots for your application, or make more slots
available for use.
even though there are 8 cards available across the two nodes. Setting both to 8 seems to have a different effect:
[1,0]<stderr>:[INFO|modeling_utils.py:1622] 2025-02-12 17:46:30,321 >> Instantiating GaudiLlamaForCausalLM model under default dtype torch.bfloat16.
[1,0]<stderr>:[INFO|configuration_utils.py:1099] 2025-02-12 17:46:30,322 >> Generate config GaudiGenerationConfig {
[1,0]<stderr>: "bos_token_id": 1,
[1,0]<stderr>: "eos_token_id": 2,
[1,0]<stderr>: "pad_token_id": 0
[1,0]<stderr>:}
[1,0]<stderr>:
Downloading shards: 100%|██████████| 2/2 [01:30<00:00, 45.40s/it]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[9729,0],0] on node mpijob-launcher
Remote daemon: [[9729,0],1] on node mpijob-worker-0.mpijob.<namespace>.svc
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
It gives a daemon-related error.
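To dig into the lost-daemon failure, a few generic checks on the worker that dropped out might help (a sketch; the namespace is a placeholder):
# Worker-side log around the time the launcher lost contact
kubectl logs -n <namespace> mpijob-worker-0
# Pod status, restart count, and termination reason (e.g. OOMKilled)
kubectl describe pod -n <namespace> mpijob-worker-0
# Recent events recorded for that pod
kubectl get events -n <namespace> --field-selector involvedObject.name=mpijob-worker-0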
I will open an issue for you with Habana support. Please provide the following information:
- The Habana plugins that are running: kubectl get pods -n habana-system
- The MPI version: mpirun --version
- The version of the MPI Operator you deployed (one way to check it is sketched below)
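A sketch of one way to read the MPI Operator version off its image tag, assuming the default kubeflow/mpi-operator deployment name and namespace:
kubectl get deployment mpi-operator -n mpi-operator -o jsonpath='{.spec.template.spec.containers[0].image}'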
Sounds good.
Habana plugin running:
$ kubectl get pods -n habana-system
NAME READY STATUS RESTARTS AGE
habanalabs-device-plugin-daemonset-txsd4 1/1 Running 0 104d
habanalabs-device-plugin-daemonset-x2jfj 1/1 Running 14 (96d ago) 119d
MPI version:
mpirun (Open MPI) 4.1.6
MPI Operator version: 0.5.0
Let me know if there's any update or more info is needed. Thanks.
Issue opened with the Habana Service Desk: https://habana.atlassian.net/jira/servicedesk/projects/HS/issues/HS-5134
Thanks! Since I don't have access to Habana's Jira, please let me know here if there's any update on the issue.
I have checked the ticket and no response was given. I have updated the ticket and increased its priority.