ELIO_M_

open_hca: getaddr_netdev ERROR: Connection refused. Is ib0 configured?

Dear all,

I am a beginner in Linux and I am using the Quantum ESPRESSO software. The cluster at our university has three partitions: short (jobs up to one hour), long (jobs up to 4 days), and superlong (jobs up to 10 days). Each node has 8 processors. Recently, however, only the short partition works properly, which is not very useful for me, as I need to run longer jobs. When I submit to the other two partitions (long and superlong) I get several errors. Running on more than one node, say 16 processors (2 nodes), produces this error:

"veredas60:30606:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib0 configured?

veredas60:30606:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?

rank 0 in job 1  veredas60_36331   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9"
 
and sometimes: 
rank 0 in job 1  veredas14_39459   caused collective abort of all ranks
  exit status of rank 0: return code 1
 
Running on one node, the code stops with a similar error:
 
"veredas60:31287:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib0 configured?
veredas60:31287:  open_hca: getaddr_netdev ERROR: Connection refused. Is ib1 configured?
rank 7 in job 1  veredas60_35538   caused collective abort of all ranks
  exit status of rank 7: return code 254"
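The errors above suggest that the InfiniBand interfaces ib0/ib1 cannot be reached on those nodes. As a quick first check (just a sketch on my part; the interface names are taken from the error text, and the sysfs paths assume a standard Linux kernel), one could see whether the kernel even exposes those interfaces on the failing nodes:

```shell
#!/bin/sh
# Diagnostic sketch: does this node expose ib0/ib1, and what is their link state?
# Interface names ib0/ib1 come from the error messages above.
for ifc in ib0 ib1; do
    if [ -d "/sys/class/net/${ifc}" ]; then
        echo "${ifc}: present, operstate=$(cat /sys/class/net/${ifc}/operstate)"
    else
        echo "${ifc}: not present on this node"
    fi
done
```

Running this via srun on a long-partition node would show whether its InfiniBand setup differs from the short-partition nodes.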
 
We use a SLURM script. I set up the Intel compilers and Intel MPI in the script as follows:

if [ "${MPI}" == "INTEL" ]; then
    source /opt/intel/Compiler/11.1/069/bin/iccvars.sh intel64
    source /opt/intel/Compiler/11.1/069/bin/ifortvars.sh intel64
    source /opt/intel/impi/4.0.0.028/intel64/bin/mpivars.sh
    export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
    export I_MPI_FABRICS=shm:dapl
fi
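Since I_MPI_FABRICS=shm:dapl forces the DAPL fabric over InfiniBand, I wondered whether temporarily falling back to TCP would tell me if only the InfiniBand path is broken (this is just a test I thought of, not a fix):

```shell
# Test fallback (not a permanent fix): shared memory + TCP instead of DAPL,
# so the job does not depend on ib0/ib1 at all.
export I_MPI_FABRICS=shm:tcp
```

If the job then runs on the long partitions, the problem would be the InfiniBand configuration on those nodes rather than the script.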
The mpiexec launch section looks like this:
 
if [ "${MPI}" == "INTEL" ]; then
    HOSTFILE=/tmp/hosts.$SLURM_JOB_ID
    srun hostname -s | sort -u > ${HOSTFILE}
    mpdboot -n ${SLURM_NNODES} -f ${HOSTFILE} -r ssh
    mpdtrace -l
    echo "start executable"
    mpiexec -np ${SLURM_NPROCS} ${EXEC_DIR}/${EXEC_BIN} < ${INPUTFILE} > ${OUTPUTFILE}
    mpdallexit
fi
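I also wondered whether the mpd ring is needed at all. Since the script already points I_MPI_PMI_LIBRARY at SLURM's libpmi.so, launching the ranks directly with srun might bypass mpdboot entirely (a sketch under that assumption; EXEC_DIR, EXEC_BIN, INPUTFILE, and OUTPUTFILE are the same variables the script defines elsewhere):

```shell
# Hypothetical direct launch: let SLURM's PMI start the MPI ranks,
# skipping the mpdboot/mpdtrace/mpdallexit steps entirely.
srun -n ${SLURM_NPROCS} ${EXEC_DIR}/${EXEC_BIN} < ${INPUTFILE} > ${OUTPUTFILE}
```

I do not know if this is supported with Intel MPI 4.0 on our cluster, so it is only a guess.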
 
I am not really sure how to deal with this error. Is it a problem in the script? If so, how do I change it? Or is it a compilation problem? Please help.
 
Elio
UNIR
Brazil