I run a 16-node, 256-core Dell cluster running Red Hat Enterprise Linux.
Our primary use is running the engineering software LSTC LS-Dyna. With a recent change in LSTC licensing, the newest versions of the software we now want to run will only work with IntelMPI (previously we used PlatformMPI).
However, I cannot get the PBS job submission script that used to work with PlatformMPI to work with IntelMPI.
The submission script reads as follows (the last line is the submission line for the LS-Dyna test job testjob.k):
#!/bin/bash
#PBS -l select=8:ncpus=16:mpiprocs=16
#PBS -j oe

cd $PBS_JOBDIR
echo "starting dyna .. "

# Build a PlatformMPI-style machine list of the form host1:n1:host2:n2:...
machines=$(sort -u $PBS_NODEFILE)
ml=""
for m in $machines
do
    # Number of ranks assigned to this host in the PBS node file
    nproc=$(grep $m $PBS_NODEFILE | wc -l)
    # Short hostname (strip the domain)
    sm=$(echo $m | cut -d'.' -f1)
    if [ "$ml" == "" ]
    then
        ml=$sm:$nproc
    else
        ml=$ml:$sm:$nproc
    fi
done

echo Machine line: $ml
echo PBS_O_WORKDIR=$PBS_O_WORKDIR
echo "Current directory is:"
pwd
echo "machines"
/opt/intel/impi/2018.4.274/intel64/bin/mpirun -machines $ml /usr/local/ansys/v170/ansys/bin/linx64/ls-dyna_mpp_s_R11_1_0_x64_centos65_ifort160_sse2_intelmpi-2018 i=testjob.k pr=dysmp
When I attempt to run this job via the PBS job manager and look in the standard error file, I see:
[mpiexec@gpunode03.hpc.internal] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@gpunode03.hpc.internal] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:253): unable to write data to proxy
[mpiexec@gpunode03.hpc.internal] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:176): unable to send signal downstream
[mpiexec@gpunode03.hpc.internal] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@gpunode03.hpc.internal] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:520): error waiting for event
[mpiexec@gpunode03.hpc.internal] main (../../ui/mpich/mpiexec.c:1157): process manager error waiting for completion
I know I can submit a job manually (no PBS involved) and it runs fine on a node of the cluster using IntelMPI.
So I have boiled the issue down to the -machines $ml part of the submission line, i.e. the node allocation.
For some reason IntelMPI does not accept this syntax, whereas PlatformMPI did?
I am quite stumped here and any advice would be greatly appreciated.
Thanks.
Richard.
Hello Richard,
Did you try IMPI 2019 Update 5? IMPI 2018 is no longer supported, and any required fixes will be applied only to IMPI 2019.
Please print the $ml variable and show the contents of $PBS_NODEFILE:
echo $ml
cat $PBS_NODEFILE
Please also add the -v option to mpirun:
mpirun -v -machines $ml ...
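For example, in the existing submission script this could look like the following (a sketch only; the mpirun path, executable, and arguments are the ones already used in the script above):
echo "Machine line: $ml"
echo "Contents of PBS_NODEFILE:"
cat $PBS_NODEFILE
# Same launch line as before, with -v added for verbose Hydra output
/opt/intel/impi/2018.4.274/intel64/bin/mpirun -v -machines $ml /usr/local/ansys/v170/ansys/bin/linx64/ls-dyna_mpp_s_R11_1_0_x64_centos65_ifort160_sse2_intelmpi-2018 i=testjob.k pr=dysmp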
--
Best regards, Anatoliy
Hi Anatoliy, thanks so much for your reply.
I've made the small additions you suggested and resubmitted the job (this time on 4 nodes); the submission script, standard error, and standard output files are attached.
It looks like the nodes are being allocated OK, but I'm not sure.
No, I haven't tried IMPI 2019 Update 5. How and where can I access this update?
Thanks again.
Richard.
Hi,
Thank you for the output. In IMPI, the -machines option does not support the host_name:num_ranks syntax. If you want to specify the number of ranks per host, please use the -machinefile option instead. For that you will have to create a host file.
You can change your script to something like this:
hostfile="hosts"
rm -f $hostfile
# Reuse the $machines list already built from $PBS_NODEFILE earlier in the script
for m in $machines
do
    nproc=$(grep $m $PBS_NODEFILE | wc -l)
    sm=$(echo $m | cut -d'.' -f1)
    # One line per host in the form hostname:num_ranks
    echo "$sm:$nproc" >> $hostfile
done
mpirun -machinefile $hostfile ...
rm -f $hostfile
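Putting this together with the rest of the original script, the final launch line could become something like the following (a sketch only; -np is taken as the total line count of $PBS_NODEFILE, and the mpirun and executable paths are the ones already used above):
# Total number of ranks requested from PBS
np=$(cat $PBS_NODEFILE | wc -l)
/opt/intel/impi/2018.4.274/intel64/bin/mpirun -np $np -machinefile $hostfile /usr/local/ansys/v170/ansys/bin/linx64/ls-dyna_mpp_s_R11_1_0_x64_centos65_ifort160_sse2_intelmpi-2018 i=testjob.k pr=dysmp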
Another point: when running under PBS, IMPI should be able to read $PBS_NODEFILE by itself. Please try running just `mpirun -n 4 hostname` inside a job.
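A minimal test job for that check could look like this (a sketch only; the select line and the 4-rank count are placeholders, and the mpirun path is the same IMPI 2018 install used above):
#!/bin/bash
#PBS -l select=4:ncpus=1:mpiprocs=1
#PBS -j oe
cd $PBS_O_WORKDIR
# No machine list given: Hydra should pick the hosts up from the PBS allocation
/opt/intel/impi/2018.4.274/intel64/bin/mpirun -n 4 hostname
If the four allocated hostnames are printed, the PBS integration is working and you may not need a machine file at all.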
I guess you can download IMPI here https://software.intel.com/en-us/mpi-library/choose-download/linux (click Register & Download there).
--
Best regards, Anatoliy