Intel® MPI Library

Problem using multiple nodes

yyxt11a
Beginner
I am trying to calculate a function's 12-element gradient vector on a PC cluster. It seems to run fine when I request one node with 4 cores using:
#PBS -l select=1:ncpus=4:mem=1200mb
This way, each of the 4 processes calculates 3 elements of the gradient.
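For reference, the whole job script in that case looks roughly like the sketch below (the walltime value and the executable name my_grad are placeholders for my actual values):

#!/bin/bash
#PBS -l select=1:ncpus=4:mem=1200mb
#PBS -l walltime=01:00:00

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# 1 node x 4 cores = 4 MPI ranks
mpirun -np 4 ./my_grad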

Now I want to use 12 processes to calculate all 12 elements of the gradient in one go, so I tried to request 3 nodes with:
#PBS -l select=3:ncpus=4:mem=1200mb
It complained:
>>> rank 7 in job 1 cx1-5-3-2.cx1.hpc.ic.ac.uk_49216 caused collective abort of all ranks exit status of rank 7: return code 29
... etc.

I am new to MPI. Is there anything I should be aware of when requesting multiple nodes? Many thanks for reading my thread.
Gergana_S_Intel
Employee

Hi,

Welcome to the Intel HPC forums!

It seems like your PBS script is fine, but the "caused collective abort of all ranks" error is fairly generic: it usually just means your application failed at runtime. It would be great if you could provide your full PBS script, including your mpirun/mpiexec command line. Also, any info on your cluster would be helpful: OS version, whether you're using InfiniBand or Ethernet, Intel MPI Library version, and math library version (MKL or something else).
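If you're not sure where to find those details, the commands below gather the basics (run them on the cluster; exact output and availability can vary with your distribution and with how Intel MPI was installed):

# OS and kernel version
uname -a
cat /etc/*release

# Intel MPI Library version (mpirun -V prints it)
mpirun -V

# InfiniBand link status, if the InfiniBand diagnostics are installed
ibstat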

Looking forward to hearing back.

Regards,
~Gergana

yyxt11a
Beginner
I have asked the HPC administrator. He said each node has its own hard drive. Because some input data files are required by all MPI ranks, I have to copy those files onto the local hard drive of every node in the job, using the "pbsdsh" command in the job script (sketched below).

This seems to have solved my problem.
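For anyone hitting the same issue, here is a minimal sketch of the staging step, assuming the inputs live in a shared directory (the paths and file names are placeholders for mine):

# Stage the input file from shared storage onto each node's local disk.
# By default pbsdsh runs the command once per allocated CPU slot, so the
# copy may execute several times per node; that is redundant but harmless
# for a simple copy. Check your site's pbsdsh man page for options that
# limit it to one copy per node.
pbsdsh /bin/cp $HOME/inputs/data.in /tmp/

# Launch the MPI job as before; every rank now reads /tmp/data.in locally.
mpirun -np 12 ./my_grad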