Problem with mpirun binding processes in a cluster (URGENT)

Sayan_B_ · ‎08-21-2016

Hi,

I am having a lot of trouble submitting a job remotely with a PBS script on an HPC cluster (the lscpu output from the login node is below).

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Stepping: 7
CPU MHz: 2593.778
BogoMIPS: 5186.81
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15

On our local server (2*8 cores, 2*16 threads with hyper-threading), when I submit the job (mpirun -np 6 -map-by node relion_refine_mpi ... -j 20), it runs fine with 6 processes and 20 threads. On the local server we use Open MPI.

But when I submit the job to the cluster described above with a PBS script (attached), the program "relion_refine_mpi" does not run. I have also attached the output files.

For your information, the program I am trying to run is a scientific program (Relion) that needs to be compiled with Open MPI. But on the cluster, after compiling the software module, I can't use Open MPI to run the job. I have to run the job through Intel mpirun.

According to the forum of the group that develops the program, the cluster is not assigning any processes to the requested nodes, and the mpirun command is not honoring the -np flag. I tried setting the mpirun -np value both manually and from $PBS_NODEFILE. Neither worked.

Please help me write the PBS script properly as soon as you can. It's really important for my research work.

Sayan_B_ · ‎08-21-2016

I just found out something: the number of processes I am specifying is not being allocated to the program I am running. Can you tell me why? How am I supposed to fix this problem? Please help.

James_T_Intel · ‎09-29-2016

The first problem is that the Intel® MPI Library is not binary compatible with OpenMPI*. Any programs compiled with one will not be able to run under the other.
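One way to see which MPI library a binary was built against is to look at its shared-library dependencies with ldd. The sketch below is only a heuristic, not an official tool: it assumes a dynamically linked binary, and it assumes that Open MPI builds pull in libopen-pal or libopen-rte. The function name detect_mpi_flavor is made up for illustration.

```shell
# Hypothetical helper: guess the MPI flavor from ldd output read on stdin.
# Assumption: Open MPI builds depend on libopen-pal / libopen-rte. Any other
# libmpi is reported as "other-mpi" (could be Intel MPI, MPICH, ...).
detect_mpi_flavor() {
    deps=$(cat)   # read the whole ldd listing
    case "$deps" in
        *libopen-pal*|*libopen-rte*) echo "openmpi" ;;
        *libmpi*)                    echo "other-mpi" ;;
        *)                           echo "no-mpi-found" ;;
    esac
}

# Usage on the cluster (requires relion_refine_mpi on PATH):
#   ldd "$(command -v relion_refine_mpi)" | detect_mpi_flavor
```

If this reports openmpi, the binary should be launched with the mpirun from the same Open MPI installation rather than with the Intel one.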

I'm not sure why PBS* wouldn't allocate the specified number of processes. Check that the number you are specifying with NUM_MPI_PROCS is what you expect it to be.
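A quick way to run that check from inside the job script is to compare the requested count with what PBS actually granted. This is a minimal sketch, assuming $PBS_NODEFILE contains one line per granted slot (typical for PBS, but worth confirming on your site). The helper names count_pbs_slots and check_np are hypothetical.

```shell
# Count the slots PBS granted (assumption: one non-empty line per slot).
count_pbs_slots() {
    # grep -c exits non-zero when the count is 0, so "|| true" keeps going
    grep -c . "$1" || true
}

# Fail if more ranks are requested than slots were granted.
check_np() {
    requested=$1
    granted=$(count_pbs_slots "$2")
    if [ "$requested" -gt "$granted" ]; then
        echo "requested $requested ranks, but PBS granted only $granted slots" >&2
        return 1
    fi
    echo "ok: $requested ranks fit in $granted slots"
}

# Runs only inside a PBS job where both variables are set.
if [ -n "${PBS_NODEFILE:-}" ] && [ -n "${NUM_MPI_PROCS:-}" ]; then
    check_np "$NUM_MPI_PROCS" "$PBS_NODEFILE"
fi
```

Placing this just before the mpirun line prints the mismatch into the job's output file, which makes it easier to tell a scheduler allocation problem from a launcher problem.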