Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Asking for suggestions on configuring and running a parallel program on a cluster

Zhanghong_T_
Novice

Dear all,

I have a cluster with two kinds of compute nodes: the first kind has 2 CPUs with 4 cores per CPU and 32 GB of memory per node; the second kind has 4 CPUs with 8 cores per CPU and 256 GB of memory per node. All nodes have Windows Server 2008 HPC installed and are joined to one domain, which is controlled by another node (one that does not take part in the calculation). I launched the job with the following command:

mpiexec -wdir D:\Users\tang\Debug -mapall -hosts 18 n01 2 n02 2 n03 2 n04 2 n05 2 n06 2 n07 2 n08 2 n09 2 n10 2 n11 2 n12 2 n13 2 n14 2 n15 2 n16 2 m01 16 m02 16 D:\Users\tang\Debug\fem

The job failed with an MPI_Init error. However, if I use either of the following commands (n01, etc., are nodes of the first kind and m01, etc., are nodes of the second kind):

mpiexec -wdir D:\Users\tang\Debug -mapall -hosts 16 n01 2 n02 2 n03 2 n04 2 n05 2 n06 2 n07 2 n08 2 n09 2 n10 2 n11 2 n12 2 n13 2 n14 2 n15 2 n16 2 D:\Users\tang\Debug\fem

mpiexec -wdir D:\Users\tang\Debug -mapall -hosts 2 m01 16 m02 16 D:\Users\tang\Debug\fem

the job launches successfully.

Is there anything I missed when using mpiexec to launch this kind of job?

Thanks,

Zhanghong Tang

2 Replies
Zhanghong_T_
Novice

Sorry, a correction: I have shared a folder to all nodes and mapped it to the Z: drive on every node, so the paths above should use Z:\ instead of D:\.


James_T_Intel
Moderator

The MPI_Init error should come with additional information that can help determine the cause of this problem. My immediate suggestion is to verify connectivity between the nXX and mXX nodes.
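For readers hitting a similarly opaque MPI_Init failure, here is a hedged sketch of how to surface that additional information, assuming Intel MPI is the runtime in use (I_MPI_DEBUG is Intel MPI's debug-verbosity environment variable; the node names and Z:\ paths are taken from this thread):

```shell
:: Sketch, not a verified recipe: re-run a minimal mixed-node job with
:: Intel MPI debug output enabled, so MPI_Init reports why it failed.
:: -genv passes an environment variable to all ranks; I_MPI_DEBUG=5
:: prints startup details such as fabric selection and process placement.
mpiexec -wdir Z:\ -mapall -genv I_MPI_DEBUG 5 -hosts 2 n01 2 m01 16 Z:\fem
```

Shrinking the run to one node of each kind, as above, keeps the failing cross-node case while keeping the debug output readable.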
