Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Sun Grid Engine tight integration for Intel MPI

Rene_S_1
Beginner
Hi,

Are there any plans for a version of Intel MPI with tight integration support for the Sun Grid Engine queuing system, much in the same way that Open MPI supports it now?

Thanks
Rene
Andrey_D_Intel
Employee
Hi Rene,

Yes, we are considering including such functionality in our product.

In the meantime, I can provide you with some current recommendations on how to configure SGE to achieve tight integration with the Intel MPI Library. Just let me know if you are interested.

Best regards,
Andrey
Gergana_S_Intel
Employee
Hi Rene,

As Andrey mentioned, we do have a "manual," in a way, for integrating Intel MPI with Sun Grid Engine. The instructions are now available online at:

http://software.intel.com/en-us/articles/integrating-intel-mpi-sge/
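
For orientation, tight integration on the SGE side revolves around a parallel environment with control_slaves enabled, attached to the queue and requested at submission time. The sketch below is illustrative only (the PE name "impi", slot count, and allocation rule are placeholders, not the article's exact settings):

  # Shown in "qconf -sp" format; create with "qconf -ap impi" and add
  # "impi" to the pe_list of the relevant queue.
  pe_name            impi
  slots              999
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /bin/true
  stop_proc_args     /bin/true
  allocation_rule    $round_robin
  control_slaves     TRUE
  job_is_first_task  FALSE

  # A job then requests the PE and a slot count, e.g.:
  #   qsub -pe impi 16 myjob.sh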

Let us know if this helps, or if you have any questions or problems.

Regards,
~Gergana
Rene_S_1
Beginner
Sorry, I was out of town for a few days and am just getting back to this. Thanks, Andrey and Gergana! I will look over the manual's instructions, give it a try, and let you know how it goes.

Rene

Rene_S_1
Beginner
Gergana/Andrey,

We followed the directions on the website and set up SGE as you suggested for tight integration with Intel MPI. One of the reasons we want this is so that SGE can properly clean up the MPD Python daemons that get left running on servers after a job is deleted or killed.

For example, with Open MPI and SGE tight integration, all Open MPI processes get forked as children of the SGE execd daemon. So when a job is deleted or killed, SGE has full control of it and can terminate all of its Open MPI children and clean up.
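
(To make that concrete: with an SGE-aware Open MPI build the job script is essentially just the lines below; "orte" is a placeholder PE name, and mpirun reads the host list and slot count from the SGE environment and starts its daemons via qrsh -inherit, which is why everything stays under execd.)

  #!/bin/sh
  #$ -pe orte 16
  #$ -cwd
  # No machinefile or -np needed: Open MPI picks up the SGE allocation itself.
  mpirun ./my_app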

With Intel MPI, here is what I see when I submit a job:

grdadmin 4788 1 4788 4694 0 Mar30 ? 00:02:00 /hpc/SGE/bin/lx24-amd64/sge_execd
root 4789 4788 4788 4694 0 Mar30 ? 00:04:15 /bin/ksh /usr/local/bin/load.sh
grdadmin 16949 4788 16949 4694 0 09:33 ? 00:00:00 sge_shepherd-1712429 -bg
salmr0 17023 16949 17023 17023 1 09:33 ? 00:00:00 -csh /var/spool/SGE/hpcp7781/job_scripts/1712429
salmr0 17127 17023 17023 17023 0 09:33 ? 00:00:00 /bin/sh /hpc/soft/intel/x86_64/ict-3.1.1/impi/3.1/bin64/mpirun -perhost 1 -env I
salmr0 17174 17127 17023 17023 1 09:33 ? 00:00:00 python /hpc/soft/intel/x86_64/ict-3.1.1/impi/3.1/bin64/mpiexec -perhost 1 -env
salmr0 17175 17174 17023 17023 1 09:33 ? 00:00:00 [sh]

.
.
.
salmr0 17166 1 17165 17165 0 09:33 ? 00:00:00 python /hpc/soft/intel/x86_64/ict-3.1.1/impi/3.1/bin64/mpd.py --ncpus=1 --myhost=hpcp7
salmr0 17176 17166 17176 17165 2 09:33 ? 00:00:00 python /hpc/soft/intel/x86_64/ict-3.1.1/impi/3.1/bin64/mpd.py --ncpus=1 --myhost=hpc
salmr0 17178 17176 17178 17165 87 09:33 ? 00:00:04 /bphpc7/vol0/salmr0/MPI-Bench/bin/x86_64/IMB-MPI1.intelmpi.3.1



As you can see, my MPI job is running as a forked child of sge_execd and is under full SGE control. However, the MPDs that got started are totally independent processes and are not forked children of SGE. The problem comes when I qdel the job, or otherwise delete or kill it while it is running. At that point SGE kills all of its forked children, but it knows nothing about the MPD daemons. As a result, after SGE deletes, kills, and cleans up my job, I still have this running on every node that ran the MPI job:

salmr0 17166 1 17165 17165 0 09:33 ? 00:00:00 python /hpc/soft/intel/x86_64/ict-3.1.1/impi/3.1/bin64/mpd.py --ncpus=1 --myhost=hpcp7

Each time I submit and delete a job, I get another leftover Python process like the one above. Any ideas on how to get the cleanup of the MPDs working properly?
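
For reference, the only cleanup I can see right now is something manual along these lines (sketch only; mpdallexit ships in the same bin64 directory as mpd.py, and the pkill is a crude fallback that assumes I have no other MPD rings on the node):

  # Manual cleanup after a qdel (a workaround, not a fix); assumes the
  # Intel MPI bin64 directory is on PATH.
  mpdallexit                   # asks every mpd in the leftover ring to exit
  pkill -u "$USER" -f mpd.py   # sweep any daemon that ignored the request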

Thanks
Rene
bleedinge
Beginner
Quoting - salmr0
Did you ever come up with a solution for this?
nixter
Beginner
Quoting - bleedinge

Did you ever come up with a solution for this?

I have the same problem. Is there any solution?

Thanks.
Sangamesh_B_
Beginner
I'm curious to know why Intel based its MPI library on MPICH2/MVAPICH2. Why not on Open MPI?

- Sangamesh
TimP
Honored Contributor III
Open MPI was not well developed, and had not yet supplanted LAM/MPI, at the time the decision was made, and it didn't support Windows until recently. Not all subsequent developments were foreseen. Are you suggesting that the cooperative development between Open MPI and SGE should have been foreseen? Do you know the future of SGE?
eev
Beginner
Quoting nixter

I have the same problem, too. What can I do?
Gergana_S_Intel
Employee

Hello everyone,

I'm hoping this reply will reach everyone subscribed to this thread.

As a first order of business, I would suggest you give the new Intel MPI Library 4.0 a try. It came out last month and includes quite a few major changes. If you still have a valid license, you can download it from the Intel Registration Center, or you can grab an eval copy from intel.com/go/mpi.

Secondly, we have plans to improve our tight integration support with SGE and other schedulers in future releases. So stay tuned.

Regards,
~Gergana

reuti_at_intel
Hi,

Please have a look at:

http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html

for a tight integration with correct accounting and control of all slave tasks by SGE. The howto was originally written for MPICH2; as Intel MPI is based on MPICH2, the "mpd startup method" also applies to Intel MPI.
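
In outline (with hypothetical script names and paths; the howto contains the real, complete scripts), the mpd startup method moves the daemons under SGE's control through the parallel environment's start and stop procedures:

  # PE sketch in "qconf -sp" format:
  pe_name            mpich2_mpd
  slots              999
  start_proc_args    /usr/sge/mpi/start_mpd.sh $pe_hostfile
  stop_proc_args     /usr/sge/mpi/stop_mpd.sh
  allocation_rule    $round_robin
  control_slaves     TRUE
  job_is_first_task  FALSE

  # Core idea of the start script: launch one mpd per allocated host
  # through SGE itself, so it becomes a child of that host's execd and
  # is killed (and accounted for) by SGE when the job ends.
  #   # inside start_mpd.sh, with the hostfile passed as $1:
  #   while read host nslots rest; do
  #     qrsh -inherit -V "$host" mpd ...   # "..." = ring entry host/port options
  #   done < "$1"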

-- Reuti
jbp1atdukeedu
Beginner
Reuti -- Looks like a bad link ... maybe the new gridengine.org has it?

John
jbp1atdukeedu
Beginner
Not exactly what you're looking for, but you can hack the Intel "stock" mpirun script to do a better job of tight integration. A version that I hacked together is available at:

As was noted elsewhere, if the process detaches from sge_shepherd then you've lost tight integration. The script above should keep open connections to each child process -- so they all stay attached to sge_shepherd.
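
A quick way to check whether a modified launcher really keeps everything attached is to look at the process tree on an execution host: mpd.py and mpiexec should show up under sge_shepherd rather than with a parent PID of 1, e.g.:

  ps -eo pid,ppid,user,args --forest | grep -B2 -A4 sge_shepherd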