
Insufficient Slots for MPIJob with 2 Worker Pods and 2 Gaudi Cards Each

Gera_Dmz
Employee

Based on the Multi-Gaudi Workloads example, I am trying to run an MPIJob with the following configuration:

 

 

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mpijob
spec:
  slotsPerWorker: 2
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          hostIPC: true
          containers:
            - name: mpijob-container
              image: "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest"
              imagePullPolicy: Always
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;

                  HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                  echo "HOSTSFILE=${HOSTSFILE}";
                  MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p)";
                  echo "MASTER_ADDR=${MASTER_ADDR}";
                  NUM_NODES=$(wc -l < $HOSTSFILE);
                  echo "NUM_NODES=${NUM_NODES}";
                  CARDS_PER_NODE=2;
                  N_CARDS=$((NUM_NODES*CARDS_PER_NODE));
                  echo "N_CARDS=${N_CARDS}";

                  SETUP_CMD="git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git; \
                             pip install -r optimum-habana/examples/language-modeling/requirements.txt; \
                             pip install --no-cache-dir optimum-habana==1.15.0; \
                             pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0";

                  eval $SETUP_CMD;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     pip install -r optimum-habana/examples/language-modeling/requirements.txt;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     pip install --no-cache-dir optimum-habana==1.15.0;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0;

                  MODEL_PATH=/optimum-habana/examples/language-modeling;
                  cd $MODEL_PATH;
                  mpirun -np ${N_CARDS} \
                    --allow-run-as-root \
                    --bind-to core \
                    --map-by ppr:4:socket:PE=6 \
                    -rank-by core --report-bindings \
                    --tag-output \
                    --merge-stderr-to-stdout --prefix $MPI_ROOT \
                    -x MASTER_ADDR=$MASTER_ADDR \
                    -mca btl_tcp_if_include eth0 \
                    -mca oob_tcp_if_include eth0 \
                    -mca plm_rsh_no_tree_spawn 1 \
                    python $MODEL_PATH/run_lora_clm.py \
                    --model_name_or_path huggyllama/llama-7b \
                    --dataset_name tatsu-lab/alpaca \
                    --bf16 \
                    --output_dir /tmp/pvc-mount \
                    --num_train_epochs 1 \
                    --per_device_train_batch_size 12 \
                    --evaluation_strategy no \
                    --save_strategy no \
                    --learning_rate 1e-4 \
                    --warmup_ratio 0.03 \
                    --lr_scheduler_type constant \
                    --max_grad_norm 0.3 \
                    --logging_steps 1 \
                    --do_train \
                    --do_eval \
                    --use_habana \
                    --use_lazy_mode \
                    --throughput_warmup_steps 3 \
                    --lora_rank 8 \
                    --lora_alpha 16 \
                    --lora_dropout 0.05 \
                    --lora_target_modules q_proj v_proj \
                    --dataset_concatenation \
                    --max_seq_length 512 \
                    --low_cpu_mem_usage True \
                    --validation_split_percentage 4 \
                    --adam_epsilon 1e-08;
              resources:
                limits:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
              volumeMounts:
                - name: hf-token
                  mountPath: /tmp/hf_token
                - name: pvc-storage
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: hf-token
              secret:
                secretName: hf-token
            - name: pvc-storage
              persistentVolumeClaim:
                claimName: pvc-storage
    Worker:
      replicas: 2
      template:
        spec:
          hostIPC: true
          containers:
            - name: mpijob-container
              image: "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest"
              imagePullPolicy: Always
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  sleep 365d;
              resources:
                limits:
                  habana.ai/gaudi: 2
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  habana.ai/gaudi: 2
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
              volumeMounts:
                - name: hf-token
                  mountPath: /tmp/hf_token
                - name: pvc-storage
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: hf-token
              secret:
                secretName: hf-token
            - name: pvc-storage
              persistentVolumeClaim:
                claimName: pvc-storage

 

When I run this configuration, I encounter the following error:

There are not enough slots available in the system to satisfy the 4
slots that were requested by the application:

  python

Either request fewer slots for your application, or make more slots
available for use.
 

Observations:

  1. The example works fine when using either 1 worker pod with 2 Gaudi cards, or 2 worker pods with 1 Gaudi card each (the relevant fields are excerpted after this list).
  2. Using the --oversubscribe flag results in the following error:
RuntimeError: synStatus=8 [Device not found] Device acquire failed.
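
For reference, these are the fields that control the layout in the spec above (a trimmed excerpt, not a complete manifest); the working combinations mentioned in observation 1 only differ in these values (plus CARDS_PER_NODE in the launcher script):

spec:
  slotsPerWorker: 2              # slots advertised per worker pod
  mpiReplicaSpecs:
    Worker:
      replicas: 2                # number of worker pods
      template:
        spec:
          containers:
            - name: mpijob-container
              resources:
                limits:
                  habana.ai/gaudi: 2   # Gaudi cards per worker pod
                requests:
                  habana.ai/gaudi: 2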
11 Replies
James_Edwards
Employee

Gaudi devices can only be acquired by one process at a time and therefore cannot be oversubscribed. The 'Device acquire' failure indicates that two processes are trying to acquire the same device, causing one of them to fail. I have some questions:

1) Do your Gaudi systems really only have 2 devices per node, or are you just specifying 2 in the YAML? Usually slotsPerWorker is set to 8, indicating that 8 Gaudi devices are available. Running 'hl-smi' will show how many devices are available.

2) Can you add the '--display-allocation' option to the mpirun command and rerun? This will show how the MPI processes are being distributed and the slots being consumed on each node (both checks are sketched below).
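
For example, something along these lines (a sketch; hl-smi runs inside a worker pod, and --display-allocation is added to the final mpirun in your launcher args):

# Inside a worker pod: list the Gaudi devices the pod can actually see
hl-smi

# In the launcher's training command: print the node/slot allocation
# Open MPI computed before it launches any ranks
mpirun -np ${N_CARDS} \
   --display-allocation \
   --allow-run-as-root \
   ... \
   python $MODEL_PATH/run_lora_clm.py ...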

 

Gera_Dmz
Employee

Hello @James_Edwards,

1) The systems have 8 devices; we're specifying two nodes with two cards each for flexibility while developing.

2) Adding --display-allocation we get:

======================   ALLOCATED NODES   ======================
        mpijob-worker-0.mpijob.<namespace>.svc: flags=0x13 slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
        mpijob-worker-1.mpijob.<namespace>.svc: flags=0x13 slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4
slots that were requested by the application:
=================================================================

  python

Either request fewer slots for your application, or make more slots
available for use.

And when I exec into a worker pod, I can see two cards available:

root@mpijob-worker-1:/# hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.19.1-fw-57.2.2.0          |
| Driver Version:                                     1.19.1-6f47ddd          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncor-Events|
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:34:00.0     N/A |                   0  |
| N/A   26C   N/A  78W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   26C   N/A  85W /  600W  |   768MiB /  98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
+=============================================================================+
James_Edwards
Employee

This doesn't seem like a Gaudi platform problem but rather a problem with the MPI Operator or the MPI command. Can you try a run without the -np option? If you do not specify -np in the mpirun command, the MPI implementation will typically use all available slots across the worker nodes; the number of slots is determined by the slotsPerWorker value and the number of worker nodes.
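
In other words, something like this for the final launch in the launcher args (a trimmed sketch; keep the rest of your flags as they are):

# Same command as before, just without -np ${N_CARDS};
# Open MPI then sizes the job from the hostfile slots instead.
mpirun --allow-run-as-root \
   --tag-output \
   --prefix $MPI_ROOT \
   -x MASTER_ADDR=$MASTER_ADDR \
   ... \
   python $MODEL_PATH/run_lora_clm.py ...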

Gera_Dmz
Employee

Tested without the -np option and got:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8
slots that were requested by the application:

  python

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

======================   ALLOCATED NODES   ======================
        mpijob-worker-0.mpijob.<namespace>.svc: flags=0x13 slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
        mpijob-worker-1.mpijob.<namespace>.svc: flags=0x13 slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================

For some reason it's looking for 8 slots. Requesting slotsPerWorker: 8 wouldn't help, right? We're only trying to use 2 Gaudi cards per node.

James_Edwards
Employee

When you run with the -np option, the number of slots requested is N_CARDS = 4, given the values provided in the YAML. With two running pods there should be enough slots across the two systems to execute 4 processes, yet the failure says there are not enough slots available, even though 4 slots clearly are. Removing the -np option should make the system calculate the maximum number of slots from the slotsPerWorker parameter (2) and the number of worker nodes (2); instead it overestimates by a factor of 2 (only 4 are available but 8 are requested). Something is clearly wrong in how MPI derives the total number of slots for an execution from these parameters.
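
For reference, if you cat the operator-generated hostfile in the launcher pod (the $OMPI_MCA_orte_default_hostfile / HOSTSFILE your script already echoes), I would expect two lines matching the --display-allocation output, something like:

mpijob-worker-0.mpijob.<namespace>.svc slots=2
mpijob-worker-1.mpijob.<namespace>.svc slots=2

i.e. 2 workers x slotsPerWorker (2) = 4 slots in total, which is exactly what -np ${N_CARDS} asks for.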


It would be interesting to change slotsPerWorker to 4 and then 8, set the limits and requests stanzas to habana.ai/gaudi: 4 and then 8, and rerun the original YAML file with both the -np option and the --display-allocation option for each setting. The 2 nodes would then advertise 8 available slots (and then 16) while only 4 are requested, so we could see how the slots are allocated for a successful run with more than 1 device per node. All the Gaudi samples specify slotsPerWorker: 8; that setting should work, but it may turn out to be the only value that does.
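
Concretely, the experiment would only change these fields in your spec (an excerpt, first with 4 and then with 8), while the launcher still passes -np ${N_CARDS} = 4:

spec:
  slotsPerWorker: 4                  # then 8
  mpiReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: mpijob-container
              resources:
                limits:
                  habana.ai/gaudi: 4   # then 8
                requests:
                  habana.ai/gaudi: 4   # keep requests in sync with limits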

Gera_Dmz
Employee

Trying with "slotsPerWorker: 4" & "habana.ai/gaudi: 4" I'm getting:

======================   ALLOCATED NODES   ======================
        mpijob-worker-0.mpijob.<namespace>.svc: flags=0x13 slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
        mpijob-worker-1.mpijob.<namespace>.svc: flags=0x13 slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8
slots that were requested by the application:

  python

Either request fewer slots for your application, or make more slots
available for use.

even though there are 8 cards available across the two nodes. Setting 8 cards has a different effect:

[1,0]<stderr>:[INFO|modeling_utils.py:1622] 2025-02-12 17:46:30,321 >> Instantiating GaudiLlamaForCausalLM model under default dtype torch.bfloat16.
[1,0]<stderr>:[INFO|configuration_utils.py:1099] 2025-02-12 17:46:30,322 >> Generate config GaudiGenerationConfig {
[1,0]<stderr>:  "bos_token_id": 1,
[1,0]<stderr>:  "eos_token_id": 2,
[1,0]<stderr>:  "pad_token_id": 0
[1,0]<stderr>:}
[1,0]<stderr>:
Downloading shards: 100%|██████████| 2/2 [01:30<00:00, 45.40s/it]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[9729,0],0] on node mpijob-launcher
  Remote daemon: [[9729,0],1] on node mpijob-worker-0.mpijob.<namespace>.svc

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

It fails with a daemon-related error.

James_Edwards
Employee

I will open an issue for you with Habana support. Please dump the following information:

The Habana plugins that are running:

kubectl get pods -n habana-system

The version of MPI:

mpirun --version

Also provide the version of the MPI Operator you deployed.
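
If it helps, the operator version can usually be read from its Deployment's image tag, for example (assuming the default mpi-operator Deployment name and namespace from the Kubeflow manifests; adjust to wherever you deployed it):

kubectl -n mpi-operator get deployment mpi-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'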

 

Gera_Dmz
Employee

Sounds good.

 

Habana plugins running:

$ kubectl get pods -n habana-system
NAME                                       READY   STATUS             RESTARTS            AGE
habanalabs-device-plugin-daemonset-txsd4   1/1     Running            0                   104d
habanalabs-device-plugin-daemonset-x2jfj   1/1     Running            14 (96d ago)        119d

MPI version:

mpirun (Open MPI) 4.1.6

MPI Operator version: 0.5.0

 

Let me know if there's any update or more info is needed. Thanks.

James_Edwards
Employee
Gera_Dmz
Employee

Thanks! Since I don't have access to Habana's Jira, please let me know here if there's any update on the issue.

James_Edwards
Employee

I have checked the ticket and no response has been given yet. I have updated the ticket and increased its priority.
