Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

MPI MPMD jobs error

syhong
Beginner
1,949 Views

Hi, 

 

I have an MPMD job:

mpiexec -f machinefile1 -n 2304 -ppn 48 ./apps1 > apps1.log 2>&1:\
        -f machinefile2 -n 900 -ppn -env OMP_NUM_THREADS=4 ./apps2 > apps2.log 2>&1

 

and the loaded modules are:

intel/18.4, impi/2021.2.0

 

+ machinefile1 :

host1001:48

...

host1048:48

 

+ machinefile2 :

host1049:76

...

host1095:76

host1096:28

 

When each application (apps1 and apps2) is run on its own, it runs normally.

However, when I run the command above, the following error occurs:

[mpiexec@...] HYD_arg_set_int (../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:163): duplicate setting: ppn

[mpiexec@...] ppn_fn (../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:580): error setting ppn

...

[mpiexec@...] main (../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1749): error parsing parameters

 

If I run the job without the -ppn option, the following error occurs:

...

Abort(821) on node 3200 (rank 3200 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 821) - process 3200

...

lsb_launch(): Failed while waiting for tasks to finish.

[mpiexec@...] wait_proxies_to_terminate (../../../../src/rm/i_hydra/mpiexec/intel/i_mpiexec.c:538): downstream from host host3386 exited with status 255

 

1. apps1 (MPI) and apps2 (hybrid) need to be submitted as jobs, and cores and threads are placed normally on the compute nodes only when the '-f machinefile' and '-ppn' options are included.

Please let me know how the MPMD run can be performed using these options.

 

2. If the Intel MPI (impi) module version is the problem, please let me know which versions of the Intel MPI Library support MPMD and whether there are any versions that do not.

 

Your help will be appreciated.

13 Replies
SantoshY_Intel
Moderator
1,924 Views

Hi,


Thanks for posting in the Intel forums.


Could you please provide the following details to help us investigate your issue further?

  1. Operating System & its version
  2. Sample reproducer code for apps1 (MPI code) & apps2 (hybrid code) so that we can reproduce the issue on our end.
  3. What is the job scheduler used for launching MPI jobs on your cluster?
  4. Have you tried with both Intel MPI 2018 update 4 & Intel MPI 2021.2?
  5. What is the FI_PROVIDER you are using?


Thanks & Regards,

Santosh


syhong
Beginner
1,911 Views

Hi,

 

1. CentOS Linux 8 (rhel/fedora family)

2. Sample reproducer code cannot be provided due to our internal security policy.

   (This HPC system has 76 cores per node.)

   apps1 is a scalable model with no limit on the number of cores, but it is set to 48 cores * 48 nodes because of memory.

   apps2 is a model that improves scalability by using threads, so at least 2 threads must be used.

3. We are using the LSF scheduler (BSUB jobs).

4. We tried Intel MPI 2021.2 and 2019.9.304, but neither worked.

     We have Intel MPI Library versions 2018.4, 2019.5, 2019.9, 2021.2, 2021.3, and 2022.1.

5. FI_PROVIDER is mlx.

 

Your help will be appreciated.

Thanks,

SantoshY_Intel
Moderator
1,897 Views

Hi,

 

Thanks for providing all the requested details.

 

Could you please run the below commands on your cluster and provide us with the complete debug log?

mpiicc mpi-hello.c -o mpi-hello
mpiicc hybrid.c -o hybrid

I_MPI_DEBUG=10 mpirun -f nodefile1 -n 2 -ppn 1 ./mpi-hello :\
-f nodefile2 -n 2 -ppn 1 -env OMP_NUM_THREADS=4 ./hybrid

 

Please find the attached source code (mpi-hello.c & hybrid.c).
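In case the attachments are not accessible, here is a minimal sketch of equivalent reproducers (assumed stand-ins, not the actual attached files). Note that the hybrid code also needs an OpenMP flag, e.g. -qopenmp with mpiicc, for the pragma to take effect.

/* mpi-hello.c (assumed sketch): print one line per MPI rank with its host name */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("mpi-hello: rank %d of %d on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}

/* hybrid.c (assumed sketch): each MPI rank opens an OpenMP parallel region,
   so the effect of OMP_NUM_THREADS can be checked per rank and per host */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    #pragma omp parallel
    printf("hybrid: rank %d of %d, thread %d of %d, on %s\n",
           rank, size, omp_get_thread_num(), omp_get_num_threads(), host);

    MPI_Finalize();
    return 0;
}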

 

Thanks & Regards,

Santosh

 

syhong
Beginner
1,884 Views

Hi, 

 

I have attached the error log.

And the submit script is:

#!/bin/sh

#BSUB -J test

#BSUB -q normal

#BSUB -n 4 --ptile 1

 

export I_MPI_DEBUG=10

 

mpirun -n 2 -ppn 1 ./mpi-hello:\
       -n 2 -ppn 1 -env OMP_NUM_THREADS=4 ./hybrid


Please check.

 

Thanks,

syhong
Beginner
1,872 Views

Additionally, the job was submitted with some modifications to the submit script.

It actually appeared to run, but the node assignment was not applied in MPMD mode.

 

The submit script is:

#!/bin/sh

#BSUB -J test

#BSUB -q normal

#BSUB -n 4 --ptile 1

#BSUB -W 12:00

 

export I_MPI_DEBUG=10

 

rm -f machinefile*
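# $LSB_DJOB_HOSTFILE lists the hosts LSF allocated to the job (one line per slot); sort -u keeps one line per host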
cat $LSB_DJOB_HOSTFILE | sort -u > machinefile.tmp

####################################################
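# first two unique hosts -> machinefile1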
for node in $(head -n 2 machinefile.tmp | tail -n 2)
do
echo ${node}:1 >> machinefile1 # ptile=1
done

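# next two unique hosts (lines 3-4 of machinefile.tmp) -> machinefile2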
for node in $(head -n 4 machinefile.tmp | tail -n 2)
do
echo ${node}:1 >> machinefile2 # ptile=1
done

####################################################

mpirun -f machinefile1 -n 2 ./mpi-hello:\
       -f machinefile2 -n 2 -env OMP_NUM_THREADS=4 ./hybrid

 

machinefile1 is :

host0001:1

host0002:1

 

machinefile2 is:

host0003:1

host0004:1

 

'top' command result on nodes host0001 & host0002:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3758274 root 20 0 276636 5608 3416 R 0.3 0.0 0:00.04 top

 

'top' command result on nodes host0003 & host0004:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
692423 syhong 20 0 2443120 149884 11596 R 100.0 0.1 62:43.33 mpi-hello
692424 syhong 20 0 2626560 151820 12012 R 100.0 0.1 62:45.40 hybrid

 

I think that MPMD mode is not being applied (with threads),

and the actual job runs only on the host0003 & host0004 nodes from machinefile2.

 

Your help will be appreciated.

Thanks,

SantoshY_Intel
Moderator
1,865 Views

Hi,

 

Thanks for running the sample & sharing the outcomes with us.

 

Now, could you please try running your applications (apps1 & apps2) using the "-hosts host1,host2,..." option instead of the "-f nodefile" option, as shown in the command below?

mpiexec -hosts host1001,...,host1048 -n 2304 -ppn 48 ./apps1 > apps1.log 2>&1:\
        -hosts host1049,...,host1096 -n 900 -ppn -env OMP_NUM_THREADS=4 ./apps2 > apps2.log 2>&1

 

Please let us know if this also causes the same behavior, i.e., launching only one job (apps1) but not the job with threads (apps2).

 

Thanks & Regards,

Santosh

 

syhong
Beginner
1,842 Views

Hi,

 

To use the '-hosts' option, we would need to request the same specific nodes in BSUB as well.

We plan to use more than about 120 nodes in the future, so if another job is using one specific node out of those 120, our job could stay in the 'pending' state indefinitely.

So the above method is not practical for us.

 

Please suggest another way.

 

Also, we tested the '-f machinefile' option with a plain MPI job (not MPMD), and it worked.

In that case, we confirmed that both threads and cores were placed normally.

 

Thanks,

syhong
Beginner
1,840 Views

Hi,

 

Additionally, using the '-hosts' option results in the following error.

mpirun -hosts host0001 host0002 -n 2 ./mpi-hello:/
       -hosts host0003 host0004 -n 2 -env OMP_NUM_THREADS=4 ./hybrid

 

[proxy:0:0@host0003] HYD_spawn (... /src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:145) execvp error on file host0002 (No such file or directory)

[proxy:0:0@host0003] HYD_spawn (... /src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:145) execvp error on file host0004 (No such file or directory)

 

Your help will be appreciated.

Thanks,

SantoshY_Intel
Moderator
1,830 Views

Hi,

 

>>>"[proxy:0:0@host0003] HYD_spawn (... /src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:145) execvp error on file host0002 (No such file or directory)"

We need to keep a comma (,) between the hostnames, as shown in the command below:

mpirun -hosts host0001,host0002 -n 2 ./mpi-hello :\
       -hosts host0003,host0004 -n 2 -env OMP_NUM_THREADS=4 ./hybrid

 

>>>"mpirun -hosts host0001 host0002 -n 2 ./mpi-hello:/"
In your previous post, I can see you are using ":/". Please make sure to keep ":\".

 

 

Thanks & Regards,

Santosh

 

syhong
Beginner
1,816 Views

Hi,

 

":/" was a typo on my part. Sorry about that.

 

The error log file from the test of the proposed '-hosts' option is attached.
I think the '-hosts' option is disabled under our LSF scheduler.

 

The run script is:

mpirun -hosts host0001,host0002 -n 2 -env OMP_NUM_THREADS=1 ./mpi-hello :\
       -hosts host0003,host0004 -n 2 -env OMP_NUM_THREADS=2 ./hybrid

 

Thanks,

SantoshY_Intel
Moderator
1,758 Views

Hi,


We are working on your issue and will get back to you soon.


Thanks & Regards,

Santosh


SantoshY_Intel
Moderator
1,750 Views

Hi,


Thanks for reporting this issue. We were able to reproduce it and we have informed the development team about it.


Thanks & Regards,

Santosh


syhong
Beginner
1,710 Views

Hi,

 

Thank you for your kind answer. We can wait!

 

Thanks & Regards,

S.
