Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

How to pin Intel MPI processes within Torque cpusets? "set domain" issues

shamov_um
Beginner

Hi,

I think I have a problem with process pinning with an older version of Intel MPI (4.0.1). The version cannot be changed because it is bundled with the user's application (Accelrys Materials Studio) and there are tons of scripts surrounding it. The code works when started interactively, but when it runs under the Torque batch system, the following messages appear:

[6] MPI startup(): set domain {10,11} fails on node XXX.local
[5] MPI startup(): set domain {9} fails on node XXX.local
[7] MPI startup(): set domain {10,11} fails on node XXX.local
[4] MPI startup(): set domain {9} fails on node XXX.local

The code then fails when it is run across the nodes, or runs slowly within a single node. I can run the same code on the same data interactively across the nodes, and I don't see the "set domain" messages.

Our site uses Torque cpusets. So I suspect the difference between running interactively or from a batch script is the cpusets and pinning of the processes.

First question: am I correct? What do these "set domain ... fails" messages really mean? Torque gives the list of CPU cores allocated to the job in its cpuset: /dev/cpuset/torque/JOB_ID/cpus will contain something like "8-11" or "0-7". I have tried to pass it to Intel MPI as follows:

range=`cat /dev/cpuset/torque/$PBS_JOBID/cpus`
export I_MPI_PIN=enable
export I_MPI_PIN_PROCS=$range

[... RunMatServer.sh starts ...]

It seems to pin the processes to something, and I don't get the "set domain" messages anymore. The second question is: is that the right/correct way to interface Torque cpusets to Intel MPI jobs?
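
For completeness, below is a sketch of how the whole fragment sits in my job script. The resource request and the per-node caveat in the comments are my own guesses, and the actual launch happens somewhere inside the vendor's RunMatServer.sh:

#!/bin/bash
#PBS -l nodes=2:ppn=4               # illustrative resource request only

# Core list that Torque placed in this job's cpuset on the mother superior
# node, e.g. "8-11" or "0-7". On a multi-node job the other nodes may have
# been given a different range, so exporting one list to every rank is a
# simplification on my part.
range=`cat /dev/cpuset/torque/$PBS_JOBID/cpus`

export I_MPI_PIN=enable             # make pinning explicit
export I_MPI_PIN_PROCS=$range       # restrict ranks to the cpuset cores

# Hand over to the vendor wrapper, which eventually calls mpirun:
# [... RunMatServer.sh starts ...]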

--

Grigory Shamov

University of Manitoba / Westgrid

3 Replies
shamov_um
Beginner
Update: either with pinning or with I_MPI_PIN=disable, there is the same problem: a job that successfully runs interactively on two nodes fails when run under Torque, on the very same nodes, with the following error message:

Fatal error in PMPI_Bcast: Message truncated, error stack:
PMPI_Bcast(1920)..................: MPI_Bcast(buf=0x7fff86db0ac0, count=1, MPI_LONG, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(1236)..................:
MPIR_Bcast_Shum_ring(1039)........:
MPIDI_CH3U_Receive_data_found(129): Message from rank 3 and tag 2 truncated; 532 bytes received but buffer size is 8
[1:n008] unexpected disconnect completion event from [4:n189]
Assertion failed in file ../../dapl_module_util.c at line 2682: 0

I'm at a loss as to what might cause the difference.

--

Grigory Shamov
James_T_Intel
Moderator
Hi Grigory,

There is currently an incompatibility between the Intel® MPI Library and cpuset. We have some possible workarounds, but they are intended for current versions, and I don't know if they will work on earlier versions.

Are you using Hydra or MPD as your process manager? If you are using Hydra, try setting HYDRA_BINDLIB=none and see if that helps. You could also try to fully subscribe each node, which might help avoid the problem. If you can disable cpuset, that should also help.

If the ISV software dynamically links to the Intel® MPI Library, you might be able to use the current version. If you install the runtime version (go to http://www.intel.com/go/mpi and select the Runtime for Linux* link on the right), you should be able to link to the current version instead of the older one. This will likely not help in this situation, as the cpuset incompatibility is still present, but when a fix is implemented and released, you should be able to use it.

Sincerely,

James Tullos

Technical Consulting Engineer

Intel® Cluster Tools
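
P.S. A minimal sketch of the first workaround as it might look in the Torque job script (the mpirun line is only a placeholder here, since in your case the actual launch happens inside the ISV scripts):

# Before the application is launched:
export HYDRA_BINDLIB=none     # tell the Hydra process manager not to bind processes
# export I_MPI_PIN=disable    # optionally also turn off Intel MPI pinning, as you already tried

# Placeholder launch line; substitute the real command from RunMatServer.sh:
mpirun -n 8 ./your_application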
shamov_um
Beginner
Dear James,

Thank you for the reply! I've tried requesting whole nodes and setting HYDRA_BINDLIB=none. For the shm:dapl fabrics it made a difference in that the job doesn't fail immediately, but rather freezes (the processes are there but no output is produced). I'm not sure whether it is the cpusets or something else here.

--

Grigory Shamov