Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Can't get IntelMPI to run with Altair PBS job manager

smith__richard
Beginner

I run a 16-node, 256-core Dell cluster on Red Hat Enterprise Linux.

Our primary use is running the engineering software LSTC LS-Dyna. With a recent change in LSTC licensing, the newest versions of the software we want to use will only run with Intel MPI (previously we used Platform MPI).

However, I cannot get the PBS job submission script that used to work with Platform MPI to work with Intel MPI.

The submission script reads as follows (the last line is the submission line for the LS-Dyna input testjob.k):

#!/bin/bash
#PBS -l select=8:ncpus=16:mpiprocs=16
#PBS -j oe
cd $PBS_JOBDIR
echo "starting dyna .. "
machines=$(sort -u $PBS_NODEFILE)
ml=""
for m in $machines
do
   nproc=$(grep $m $PBS_NODEFILE | wc -l)
   sm=$(echo $m | cut -d'.' -f1)
   if [ "$ml" == "" ]
   then
      ml=$sm:$nproc
   else
      ml=$ml:$sm:$nproc
   fi
done
echo Machine line: $ml
echo PBS_O_WORKDIR=$PBS_O_WORKDIR
echo "Current directory is:"
pwd
echo "machines"
/opt/intel/impi/2018.4.274/intel64/bin/mpirun -machines $ml /usr/local/ansys/v170/ansys/bin/linx64/ls-dyna_mpp_s_R11_1_0_x64_centos65_ifort160_sse2_intelmpi-2018 i=testjob.k pr=dysmp

When I attempt to run this job via the PBS job manager and look into the standard error file, I see:

[mpiexec@gpunode03.hpc.internal] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@gpunode03.hpc.internal] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:253): unable to write data to proxy
[mpiexec@gpunode03.hpc.internal] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:176): unable to send signal downstream
[mpiexec@gpunode03.hpc.internal] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@gpunode03.hpc.internal] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event
[mpiexec@gpunode03.hpc.internal] main (../../ui/mpich/mpiexec.c:1157): process manager error waiting for completion

I know I can submit a job manually (no PBS involved) and it will run fine on a node of the cluster using Intel MPI.

So I have narrowed the issue down to the -machines $ml part of the submission line, i.e. the node allocation.

For some reason Intel MPI does not accept this syntax, whereas Platform MPI did.
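
For context, with the select=8:ncpus=16:mpiprocs=16 request above, the loop builds a single colon-separated string (node names here are only illustrative) of the form

gpunode01:16:gpunode02:16:gpunode03:16: ... :gpunode08:16

and that whole string is what gets passed to -machines.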

I am quite stumped here and any advice would be greatly appreciated.

Thanks.

Richard.

 

 

3 Replies
Anatoliy_R_Intel
Employee

Hello Richard,

 

Did you try IMPI 2019 Update 5? IMPI 2018 is no longer supported, and any required fixes will only be applied to IMPI 2019.

 

Please print the $ml variable and show $PBS_NODEFILE:

echo $ml

cat $PBS_NODEFILE

 

Please also add the -v option to mpirun:

mpirun -v -machines $ml ...
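
For example, placed in your submission script right before the mpirun line, those checks would look something like this (a sketch reusing the variables and paths from your script):

echo "Machine line: $ml"
echo "Contents of PBS_NODEFILE:"
cat $PBS_NODEFILE
/opt/intel/impi/2018.4.274/intel64/bin/mpirun -v -machines $ml /usr/local/ansys/v170/ansys/bin/linx64/ls-dyna_mpp_s_R11_1_0_x64_centos65_ifort160_sse2_intelmpi-2018 i=testjob.k pr=dysmp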

 

--

Best regards, Anatoliy

smith__richard
Beginner

Hi Anatoliy, thanks so much for your reply.

I've made the small additions you suggested and resubmitted the job (this time I selected the job to run on 4 nodes); the submission script, standard error and standard output files are attached.

It seems like the nodes are being allocated OK, but I'm not sure.

No, I haven't tried IMPI 2019 Update 5. How and where can I access this update?

Thanks again.

Richard.

 

Anatoliy_R_Intel
Employee

Hi,

 

Thank you for the output. In IMPI, the -machines option does not support syntax like host_name:num_ranks. If you want to specify the number of ranks for each host, please use the -machinefile option; for that you will have to create a host file.

 

You can change your script to something like this:

hostfile="hosts"
rm -f $hostfile

# Write one "hostname:num_ranks" line per node
for m in $machines
do
   # count how many ranks PBS assigned to this node
   nproc=$(grep $m $PBS_NODEFILE | wc -l)
   # strip the domain suffix from the node name
   sm=$(echo $m | cut -d'.' -f1)
   echo "$sm:$nproc" >> $hostfile
done

mpirun -machinefile $hostfile ...

rm -f $hostfile
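
Applied to your original submission line, the mpirun call would then look roughly like this (an untested sketch; paths copied from your first post):

/opt/intel/impi/2018.4.274/intel64/bin/mpirun -machinefile $hostfile /usr/local/ansys/v170/ansys/bin/linx64/ls-dyna_mpp_s_R11_1_0_x64_centos65_ifort160_sse2_intelmpi-2018 i=testjob.k pr=dysmp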

 

Another point is that IMPI should be able to read $PBS_NODEFILE on its own. Please try running just `mpirun -n 4 hostname`.
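
A minimal PBS test job for that check might look like this (just a sketch; the select line is an example and the mpirun path is copied from your script):

#!/bin/bash
#PBS -l select=2:ncpus=2:mpiprocs=2
#PBS -j oe
cd $PBS_O_WORKDIR
# No -machines or -machinefile here: mpirun should pick up the node list
# from $PBS_NODEFILE on its own.
/opt/intel/impi/2018.4.274/intel64/bin/mpirun -n 4 hostname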

 

I guess you can download IMPI here https://software.intel.com/en-us/mpi-library/choose-download/linux (click Register & Download there). 

--

Best regards, Anatoliy
