When I use the qdel command with Torque and Intel MPI 4.0, the job disappears from the Torque scheduler, but on the node the process is still running. Only the parent process (mpdboot) disappears, not the executable.
Is this a problem with MPI or with Torque?
Can somebody help me, please?
Unfortunately, the Intel MPI Library does not offer tight integration with the Torque scheduler at this time (but some is coming in the next version). The following discussion seems to relate to your problem, although the customer there is using MPICH.
Starting with Intel MPI Library 4.0, we've added support for the Hydra scheduler (as part of the new MPICH2 Nemesis architecture), which might help in your case. I'd suggest taking a look at the Hydra section of the Reference Manual (located in the
Alternatively, you can take a look at OSC's mpiexec tool. It has support for Intel MPI Library binaries as well.
By the way, two additional questions:
1. Which scheduler is best to use with Intel MPI?
2. Does the next release have a planned date? Is it certain that it will have better integration with Torque?
I'm glad to hear that you're getting really good performance results with the Intel MPI Library, at least :)
1. There's really no best scheduler for the Intel MPI Library. We support all of them fairly consistently. We've had pretty good experiences with some of the more popular ones - PBS Pro, LSF, etc. but, in general, support should be uniform.
2. Tentatively, our next release will be in time for the SC10 conference in New Orleans, LA. So, late October or early November.
In the meantime, I would still suggest you give the new mpiexec.hydra a try. That's included in your current 4.0 version of the library.
Unfortunately, mpiexec.hydra has the same problem with Torque: when a job does not finish cleanly, its processes are left behind as zombies or sleeping processes and keep using CPU.
I will try with mpiexec next week.
Thanks for everything.
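For anyone who wants to check a node for leftovers after a qdel, here is a minimal Python sketch. It assumes that orphaned ranks end up re-parented to PID 1 once their launcher is gone, which is typical on Linux but not guaranteed. The user name `alice`, the executable name `my_mpi_app`, and the sample listing are all made up for illustration; on a real node the text would come from `ps -eo pid,ppid,user,stat,comm`.

```python
# Sample text in the format printed by: ps -eo pid,ppid,user,stat,comm
# (hypothetical data, not taken from a real node)
SAMPLE = """\
  PID  PPID USER     STAT COMMAND
 4101     1 alice    R    my_mpi_app
 4102     1 alice    S    my_mpi_app
 4200  4150 alice    S    bash
 4300     1 root     S    sshd
"""


def parse_ps(text):
    """Turn the ps listing into a list of dicts, skipping the header line."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # ignore malformed lines
        pid, ppid, user, stat, comm = parts
        rows.append({"pid": int(pid), "ppid": int(ppid),
                     "user": user, "stat": stat, "comm": comm})
    return rows


def find_orphans(rows, user, names):
    """Processes owned by `user`, named in `names`, whose parent is PID 1."""
    return [r for r in rows
            if r["user"] == user and r["ppid"] == 1 and r["comm"] in names]


for proc in find_orphans(parse_ps(SAMPLE), "alice", {"my_mpi_app"}):
    print(proc["pid"], proc["stat"], proc["comm"])
```

With the sample listing this prints the two `my_mpi_app` entries (PIDs 4101 and 4102). The sketch only reports candidates; whether to kill them is left to the admin.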
We have the same problems as you. These are common issues, because the Intel MPI layer does not use the TM protocol, which is the native way Torque/PBS starts remote tasks in a cluster. This lack of integration with the batch scheduler has several ramifications. The scheduler cannot track resource usage, so jobs that abuse memory, for instance, cannot be killed automatically by the batch system. A more subtle issue is that a job cannot be preempted/suspended and resumed, because the scheduler does not know which processes to apply job control to.
This is a show stopper for HPC centers that rely on heavy scheduling to make better use of scarce cluster resources.
As Gergana mentioned, you can download the Ohio Supercomputer Center's PBS mpiexec command, which replaces Intel's mpirun. We have used it to successfully launch and track jobs using Intel MPI with Torque/Maui. However, it is external to both MPI and the scheduler, so we are a little reluctant to advertise it widely to our users.
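To make the tracking problem concrete, here is a deliberately simplified Python model. It assumes the batch daemon can only account for processes that descend from the task it started itself; this is an illustration, not Torque's actual bookkeeping. All PIDs and the process layout are hypothetical.

```python
def descendants(parent_of, root):
    """Return the set of PIDs that descend from `root`, given a pid -> ppid map."""
    children = {}
    for pid, ppid in parent_of.items():
        children.setdefault(ppid, []).append(pid)
    seen, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Hypothetical node layout:
#   100 = batch daemon, 200 = job script, 210 = MPI launcher
#   300 = MPI daemon started outside the batch system, 310/311 = MPI ranks
PARENT_OF = {100: 1, 200: 100, 210: 200, 300: 1, 310: 300, 311: 300}

tracked = descendants(PARENT_OF, 100)
untracked = sorted(pid for pid in PARENT_OF
                   if pid not in tracked and pid != 100)

print(sorted(tracked))  # [200, 210]
print(untracked)        # [300, 310, 311]
```

In this model a job deletion reaches the job script and the launcher, while the ranks under the separately started daemon stay out of reach, which matches the symptom described at the top of the thread.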
I hope that we will see a native Torque/Intel-MPI integration and I am sure a long list of centers wish for that. In my opinion, the PMI / python "protocol" should be banned.....