AjayBadita
Beginner
495 Views

Please explain the sequential usage of mpirun. How do I get the names of the nodes involved?

_________________

      mpirun -n 4 python addNumbers.py &

      sleep 10

      mpirun -n 8 python addNumbers.py

_________________

 

I started a second mpirun before the first instance had finished. I am confused about whether the second mpirun uses new compute nodes or the same nodes as the first.

 

How do I get the names of the nodes involved (e.g. s001-n036, s001-n035, etc.)? I want all 12 nodes (4+8) to be different.

 

Also, please note that I intentionally used "&" at the end of the first mpirun.

12 Replies
Aswathy_C_Intel
Employee
229 Views

Thanks for reaching out to us.

 

If you have two commands in a single job file, the second one runs on the same nodes as the first, and only after the first has completed.

 

Please note that the '-n' option of the mpirun command specifies the number of processes, not the number of nodes. Also make sure that your code is written to make use of MPI. Otherwise mpirun simply runs the same code as many times as specified, with no benefit.

To get the node names, you can add the line below to the job file.

cat $PBS_NODEFILE
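
The file behind $PBS_NODEFILE lists one hostname per allocated process slot, so a node requested with ppn=2 shows up twice. Below is a minimal sketch for counting slots per node. It assumes the script runs inside a PBS job (so that PBS_NODEFILE is set), and the helper name slots_per_node is made up for illustration.

```python
import os
from collections import Counter


def slots_per_node(nodefile_text):
    """Count how many slots each host has in PBS nodefile text.

    The nodefile holds one hostname per line, repeated once per slot.
    """
    hosts = [line.strip() for line in nodefile_text.splitlines() if line.strip()]
    return Counter(hosts)


if __name__ == "__main__":
    # PBS_NODEFILE is only set inside a running PBS job.
    path = os.environ.get("PBS_NODEFILE")
    if path:
        with open(path) as f:
            for host, slots in slots_per_node(f.read()).items():
                print(host, slots)
    else:
        print("PBS_NODEFILE is not set; run this inside a PBS job.")
```

The number of distinct keys is the number of nodes, and the sum of the counts is the number of processes that fit in the allocation.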

 

Hope this helps.

Aswathy_C_Intel
Employee
229 Views

Could you please confirm whether the details provided were helpful?

 

Please note that the thread will be closed within 2 business days, on the assumption that the solution provided was helpful.

 

AjayBadita
Beginner
229 Views

I'm busy with something important at the moment, so please don't close this thread. I will get back to you shortly.

Aswathy_C_Intel
Employee
229 Views

Could you please tell us a time frame for checking on this?

If it takes more than a week, we suggest you open a new thread.

Please confirm.

AjayBadita
Beginner
229 Views

I need two days. Please don't close this thread.

Aswathy_C_Intel
Employee
229 Views

Sure. Will wait.

AjayBadita
Beginner
229 Views

--------------------------------------distrJob1--------------------------------------

#PBS -l nodes=2:ppn=2

cd $PBS_O_WORKDIR

mpirun -n 4 python addNumbers.py

cat $PBS_NODEFILE

--------------------------------------distrJob1--------------------------------------

 

 

--------------------------------------distrJob2--------------------------------------

#PBS -l nodes=4:ppn=2

cd $PBS_O_WORKDIR

mpirun -n 8 python addNumbers.py

cat $PBS_NODEFILE

--------------------------------------distrJob2--------------------------------------

 

--------------------------------------scriptForMPI--------------------------------------

# Shell assignments must not have spaces around "="

n_0=4

n_1=$((12 - n_0))

t_1=10

 

qsub distrJob1 &

sleep $t_1

qsub distrJob2

--------------------------------------scriptForMPI--------------------------------------

 

 

I started with the script named scriptForMPI, which submits two jobs, distrJob1 and distrJob2. My objective is to start n_0=4 processes at once (at second zero) using MPI, then wait t_1=10 seconds before starting distrJob2 on another n_1=8 processes. Note that I used "&" at the end of qsub distrJob1 in scriptForMPI so that the script does not wait for the first MPI instance to finish.

 

scriptForMPI created two jobs with IDs distrJob1.356789 and distrJob2.356790. The corresponding output and error files are as follows.

 

------------------------

distrJob1.e356789

[2] DAPL startup: RLIMIT_MEMLOCK too small

[3] DAPL startup: RLIMIT_MEMLOCK too small

------------------------

distrJob1.o356789

 ########################################################################

# Date: Sun Sep 29 04:27:08 PDT 2019

# Job ID: 356789.v-qsvr-1.aidevcloud

# User: uXXXX

# Resources: neednodes=2:ppn=2,nodes=2:ppn=2,walltime=06:00:00

########################################################################

 

s001-n093

s001-n093

s001-n097

s001-n097

 

########################################################################

# End of output for job 356789.v-qsvr-1.aidevcloud

# Date: Sun Sep 29 04:27:15 PDT 2019

########################################################################

------------------------

distrJob2.e356790

------------------------

distrJob2.o356790

########################################################################

# Date: Sun Sep 29 04:27:20 PDT 2019

# Job ID: 356790.v-qsvr-1.aidevcloud

# User: uXXXX

# Resources: neednodes=4:ppn=2,nodes=4:ppn=2,walltime=06:00:00

########################################################################

 

 

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= RANK 0 PID 9074 RUNNING AT s001-n006

= KILLED BY SIGNAL: 11 (Segmentation fault)

===================================================================================

 

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= RANK 1 PID 9075 RUNNING AT s001-n006

= KILLED BY SIGNAL: 9 (Killed)

===================================================================================

 

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= RANK 6 PID 7539 RUNNING AT s001-n008

= KILLED BY SIGNAL: 9 (Killed)

===================================================================================

 

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= RANK 7 PID 7540 RUNNING AT s001-n008

= KILLED BY SIGNAL: 9 (Killed)

===================================================================================

 

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= RANK 4 PID 5617 RUNNING AT s001-n051

= KILLED BY SIGNAL: 9 (Killed)

===================================================================================

 

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= RANK 5 PID 5618 RUNNING AT s001-n051

= KILLED BY SIGNAL: 9 (Killed)

===================================================================================

 

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= RANK 2 PID 25166 RUNNING AT s001-n007

= KILLED BY SIGNAL: 9 (Killed)

===================================================================================

 

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= RANK 3 PID 25167 RUNNING AT s001-n007

= KILLED BY SIGNAL: 9 (Killed)

===================================================================================

s001-n006

s001-n006

s001-n007

s001-n007

s001-n051

s001-n051

s001-n008

s001-n008

 

########################################################################

# End of output for job 356790.v-qsvr-1.aidevcloud

# Date: Sun Sep 29 04:27:33 PDT 2019

########################################################################

------------------------
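
To check whether the two jobs really landed on different nodes, the node lists printed by cat $PBS_NODEFILE can be compared as sets. This is a rough sketch using the hostnames from the two output files above; the function name shared_nodes is just for illustration.

```python
def shared_nodes(nodefile_a, nodefile_b):
    """Return the sorted hostnames that appear in both nodefile texts."""
    hosts_a = {line.strip() for line in nodefile_a.splitlines() if line.strip()}
    hosts_b = {line.strip() for line in nodefile_b.splitlines() if line.strip()}
    return sorted(hosts_a & hosts_b)


# Hostnames copied from distrJob1.o356789 and distrJob2.o356790.
job1 = """s001-n093
s001-n093
s001-n097
s001-n097"""
job2 = """s001-n006
s001-n006
s001-n007
s001-n007
s001-n051
s001-n051
s001-n008
s001-n008"""

overlap = shared_nodes(job1, job2)
print("shared nodes:", overlap if overlap else "none")
```

For these two jobs there is no overlap: distrJob1 got 2 nodes with 2 slots each and distrJob2 got 4 nodes with 2 slots each, which is 6 distinct nodes and 12 process slots in total.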

 

Aswathy_C_Intel
Employee
229 Views

Could you share the workload (if it is not confidential) so that we can try it out on our end?

 

AjayBadita
Beginner
229 Views

addNumbers.py

 

import time

start_time = time.time()

 

#from mpi4py import MPI 

#rank = MPI.COMM_WORLD.Get_rank()

#size = MPI.COMM_WORLD.Get_size()

#name = MPI.Get_processor_name()

#print ("Hello, World! " "I am process {} of {} on {}".format(rank, size, name))

 

# Note: range(1,1,1) is empty, so the body of this outer loop never runs.

for round in range(1,1,1):

  for i in range(1,int(1e4),1):

    num1 = 15

    num2 = 12

    sum = num1 + num2

 

end_time = time.time()

 

import sys

sys.stdout = open('TimingsOfMPI.txt','a')

print("%s" % (end_time - start_time))

sys.stdout.close()

-----------------------------------------------------

 

I am using this simple code for my research. I don't have any real workload.

Aswathy_C_Intel
Employee
229 Views

Hi,

We tried the code on our end and it works fine (no bad termination error).

Could you please try it once again? How did you submit your script file?

I submitted the code as a script file with the below content :

t_1=10

qsub distrJob1 &

sleep $t_1

qsub distrJob2

 

and submitted the script as 'sh scriptForMPI.sh'.

 

Please note that mpirun has no real effect unless your code is written to make use of MPI.
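
As an illustration of what "written to make use of MPI" can mean here: every process started by mpirun executes the whole script, so the work only gets divided if each process picks its own share based on its rank. Below is a minimal sketch in plain Python. The rank and size are passed in as arguments so the snippet runs without MPI; in a real run they would come from MPI.COMM_WORLD.Get_rank() and MPI.COMM_WORLD.Get_size(), as in the commented-out mpi4py lines of addNumbers.py. The function name my_share is hypothetical.

```python
def my_share(n_items, rank, size):
    """Return the range of item indices that process `rank` of `size` handles.

    Items are split into contiguous blocks; the first n_items % size
    ranks get one extra item so that every item is assigned exactly once.
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


# Example: 10 iterations split across 4 processes.
for rank in range(4):
    chunk = my_share(10, rank, 4)
    print(rank, list(chunk))
```

Each process would then loop only over its own range instead of repeating the full loop.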

AjayBadita
Beginner
229 Views

I submitted the code with 'sh scriptForMPI.sh', but I am still getting the same error.
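
In case it helps narrow things down, the BAD TERMINATION blocks in distrJob2.o356790 can be summarized per rank. This is a small sketch that assumes the job output keeps the same "RANK n PID p RUNNING AT host" and "KILLED BY SIGNAL: s (name)" layout shown earlier in this thread; summarize_terminations is a made-up helper name.

```python
import re

# Matches one termination block: the RANK line followed by the SIGNAL line.
BLOCK = re.compile(
    r"RANK (\d+) PID (\d+) RUNNING AT (\S+)\s*"
    r"=\s*KILLED BY SIGNAL: (\d+) \(([^)]*)\)"
)


def summarize_terminations(log_text):
    """Return (rank, host, signal_number, signal_name) tuples sorted by rank."""
    rows = [
        (int(rank), host, int(sig), name)
        for rank, _pid, host, sig, name in BLOCK.findall(log_text)
    ]
    return sorted(rows)


# Two blocks copied from the posted output (blank lines removed).
sample = """= RANK 0 PID 9074 RUNNING AT s001-n006
= KILLED BY SIGNAL: 11 (Segmentation fault)
= RANK 1 PID 9075 RUNNING AT s001-n006
= KILLED BY SIGNAL: 9 (Killed)
"""
for row in summarize_terminations(sample):
    print(row)
```

In the posted output only rank 0 reports signal 11 (segmentation fault), while the other seven ranks report signal 9. That would be consistent with the remaining ranks being killed after the first one crashed, though the log alone does not prove it.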

Aswathy_C_Intel
Employee
229 Views

We will continue this discussion over email.