I have just installed intel-mpi 4.1.2.040 on a cluster...
If I use mpiexec.hydra to start jobs with one process per node, it still spawns processes on all available resources. For example,
mpiexec.hydra -ppn 1 hostname
on two nodes shows me 40 lines, as opposed to the two expected.
I have attached a file with the debug output from running
I_MPI_HYDRA_DEBUG=1 mpiexec.hydra -ppn 1 hostname 2>&1 | tee debug.txt
regards,
Alin
Hi Alin,
Using -ppn will not limit the total number of ranks on a host; it only sets the number of consecutive ranks placed on each host. If you have more ranks than that, the placement will cycle back to the first host and begin again. So if I have a hostfile with two hosts (node0 and node1), here's what I should see:
[plain]$mpirun -n 4 -ppn 2 ./hello
Hello world: rank 0 of 4 running on node0
Hello world: rank 1 of 4 running on node0
Hello world: rank 2 of 4 running on node1
Hello world: rank 3 of 4 running on node1
$mpirun -n 4 -ppn 1 ./hello
Hello world: rank 0 of 4 running on node0
Hello world: rank 1 of 4 running on node1
Hello world: rank 2 of 4 running on node0
Hello world: rank 3 of 4 running on node1[/plain]
In your command line, you didn't specify the number of ranks to run. If you don't specify it, Hydra determines it from your job (or, if that can't be found, from the number of cores available on the host). In this case, your job says to use 40 ranks, so 40 ranks were launched.
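Following James's explanation, the fix on Alin's side would be to pass -n explicitly so Hydra does not fall back to the slot count of the job. A minimal sketch (the node count of 2 is a placeholder for your own allocation; the script only prints the command, since mpiexec.hydra is cluster-specific):

```shell
#!/bin/sh
# Request exactly one rank per node by giving -n explicitly,
# instead of letting Hydra default to all 40 scheduler slots.
NODES=2
CMD="mpiexec.hydra -n $NODES -ppn 1 hostname"
# On the cluster you would run this command directly; echoed here for illustration.
echo "$CMD"
```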
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
Hi James,
Since Alin is not available right now, I'll answer the question.
The working mpiexec.hydra comes from Intel MPI version 4.1.0.024.
More precisely, it says: Intel(R) MPI Library for Linux* OS, Version 4.1.0 Build 20120831
Cheers.
Gilles
Hi,
Any chance of an update on this issue? From the outside it looks like a trivial regression in mpiexec.hydra, yet it is quite annoying from a user's point of view... Am I missing some critical element here?
Although using an old version of it allows us to run, it might have unexpected side effects we don't see. Moreover, since we plan to make intensive use of symmetric MPI mode on Xeon Phi, a clean and up-to-date Intel MPI environment would be highly desirable.
Cheers.
Gilles
Hi Gilles,
I currently do not have any additional information about this issue. Several other customers are reporting it. I can suggest using a machinefile as a workaround, or specifying a different hostfile, rather than allowing Hydra to automatically get the hosts from your job scheduler.
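The machinefile workaround James suggests can be sketched as follows (the file name and host names node0/node1 are placeholders; the mpiexec.hydra invocation is shown as a comment since it only makes sense on the cluster):

```shell
#!/bin/sh
# List the hosts yourself instead of letting Hydra query the job scheduler.
printf 'node0\nnode1\n' > machines.txt
cat machines.txt
# On the cluster, you would then hand the file to Hydra:
#   mpiexec.hydra -machinefile machines.txt -n 2 -ppn 1 hostname
```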
Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools
I'm working on a cluster and learning how to launch different processes. Today I tried to use a script with the command to execute the program. Suddenly, when I run the command top, this appears:
28210 jazmin 25 0 13088 928 712 R 100.2 0.0 383:56.27 mpiexec.hydra
and I cannot kill this process. How can I do it? Thanks in advance.