Dear all,
I am a beginner with Linux and I am using the Quantum ESPRESSO software. The cluster at our university has three partitions: short (jobs up to one hour), long (jobs up to 4 days) and superlong (jobs up to 10 days). Each node has 8 processors. Recently, only the short partition works properly when I run a job, which is not very useful because I need to run longer jobs. When I submit to the other two partitions (long and superlong) I get several errors. Running on more than one node, say 16 processors (2 nodes), produces this error:
veredas60:30606: open_hca: getaddr_netdev ERROR: Connection refused. Is ib0 configured?
veredas60:30606: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
rank 0 in job 1 veredas60_36331 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
and sometimes:
rank 0 in job 1 veredas14_39459 caused collective abort of all ranks
exit status of rank 0: return code 1
Running on one node, the code stops with a similar error:
veredas60:31287: open_hca: getaddr_netdev ERROR: Connection refused. Is ib0 configured?
veredas60:31287: open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
rank 7 in job 1 veredas60_35538 caused collective abort of all ranks
exit status of rank 7: return code 254
We use a SLURM script. The Intel compiler and MPI libraries are set up in the script as follows:
if [ "${MPI}" == "INTEL" ]; then
source /opt/intel/Compiler/11.1/069/bin/iccvars.sh intel64
source /opt/intel/Compiler/11.1/069/bin/ifortvars.sh intel64
source /opt/intel/impi/4.0.0.028/intel64/bin/mpivars.sh
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
export I_MPI_FABRICS=shm:dapl
fi
The part of the script that launches mpiexec is:
if [ "${MPI}" == "INTEL" ]; then
HOSTFILE=/tmp/hosts.$SLURM_JOB_ID
srun hostname -s | sort -u > ${HOSTFILE}
mpdboot -n ${SLURM_NNODES} -f ${HOSTFILE} -r ssh
mpdtrace -l
echo "start executable"
mpiexec -np ${SLURM_NPROCS} ${EXEC_DIR}/${EXEC_BIN} < ${INPUTFILE} > ${OUTPUTFILE}
mpdallexit
fi
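Since the errors ask whether ib0 and ib1 are configured, here is a minimal sketch of a check that could be run on each allocated node before mpdboot. It assumes that the operstate file under /sys/class/net is a reasonable indicator that an interface is up, and that shm:tcp is an acceptable fallback value for I_MPI_FABRICS when no InfiniBand interface is up. The helper names iface_up and choose_fabric are hypothetical and are not part of our cluster script.

```shell
#!/bin/bash
# Diagnostic sketch (assumption: sysfs operstate reflects whether the
# interface is usable). iface_up and choose_fabric are hypothetical helpers.

# Succeed if the named interface reports "up" under the given sysfs directory.
iface_up() {
    local iface="$1"
    local sysdir="${2:-/sys/class/net}"
    [ -r "${sysdir}/${iface}/operstate" ] &&
        [ "$(cat "${sysdir}/${iface}/operstate")" = "up" ]
}

# Print shm:dapl if ib0 or ib1 is up, otherwise print shm:tcp as a fallback.
choose_fabric() {
    local sysdir="${1:-/sys/class/net}"
    if iface_up ib0 "${sysdir}" || iface_up ib1 "${sysdir}"; then
        echo "shm:dapl"
    else
        echo "shm:tcp"
    fi
}

# Report the suggestion for the node this runs on.
echo "$(uname -n): suggested I_MPI_FABRICS=$(choose_fabric)"
```

It could be run once per node with something like `srun --ntasks-per-node=1 bash check_fabric.sh`, where check_fabric.sh is a hypothetical file name. If the nodes in the long and superlong partitions report shm:tcp while the short ones report shm:dapl, that would suggest the InfiniBand interfaces on those nodes are the problem rather than the compilation.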
I am not really sure how to deal with this error. Is it a problem in the script? If so, how do I change it? Or is it a compilation problem? Please help.
Elio
UNIR
Brazil