Girish_Nair
Beginner

pmi_proxy stalls the HPC job

Hi HPC enthusiasts,
We have an 8-node Sandy Bridge cluster with the following configuration:

Hardware:
1U rackmount enclosure
Intel S2400SC2 board
2 x Xeon E5-2450 processor
96GB ECC DDR3 RDIMM
Intel True Scale QLE7340-CK HCA
500GB Enterprise SATA
36 port QLogic switch
24-port 1GbE switch

Software:
CentOS 6.2 x64
Intel MPI Library 4.1.1.036
Intel Fortran Composer XE 2013.3.163
NetCDF 4.0
FFTW 3.3.3
Open Grid Engine 2011.11.p1
NFS share
Passphraseless SSH from any machine to any machine (meshed)

Lately, whenever we submit our job (home-grown code), either directly via mpirun or through Grid Engine qsub, it fails to start executing about 90% of the time and simply appears to be stalled. When we inspect the running processes, we find that a few nodes, seemingly at random, show 'pmi_proxy' in state 'D' (uninterruptible sleep).
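The check described above can be sketched as a small shell filter. This is only an illustration, assuming a procps-style `ps` as shipped with CentOS; the function name `find_stalled_proxies` and the host names in the usage comment are hypothetical.

```shell
# Print the lines of "ps -eo pid,stat,wchan:32,comm" output that belong to
# pmi_proxy processes whose state starts with D (uninterruptible sleep).
# The WCHAN column names the kernel function the process is waiting in,
# which can hint at what it is blocked on.
find_stalled_proxies() {
  awk '$NF == "pmi_proxy" && $2 ~ /^D/ { print }'
}

# Usage on one node:
#   ps -eo pid,stat,wchan:32,comm | find_stalled_proxies
# Usage across nodes (node01, node02 are placeholder host names):
#   for h in node01 node02; do
#     ssh "$h" ps -eo pid,stat,wchan:32,comm | find_stalled_proxies | sed "s/^/$h: /"
#   done
```

Running this while the job is stalled shows which nodes are affected and what each stuck proxy is waiting on.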

We have tested IMB (Intel MPI Benchmarks) and the test codes that come with Grid Engine and Intel MPI on the cluster, both via mpirun and through qsub, and they all run fine.

What is the pmi_proxy process, and how can we stop the job from stalling? The job not running is driving me crazy. Please excuse me if this has already been discussed somewhere, or if this is not the correct forum. I am new to HPC.

Any guidance would be appreciated.

Thanks in advance for any early and valuable suggestions.

With regards
Girish Nair
+91 98457 36460
girishnairisonline <at> gmail <dot> com

James_T_Intel
Moderator

Hi Girish,

pmi_proxy is part of Hydra, the process manager in the Intel MPI Library. Since the test codes run as expected, can you provide more details on the program you are attempting to run when it hangs? Also, if you can provide the output with

I_MPI_DEBUG=5
I_MPI_HYDRA_DEBUG=1

when it hangs, that could help.
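One way to collect that output is sketched below. The binary name `./mhd_solver`, the rank count, and the log file name are placeholders that do not come from this thread; only the two debug variables do.

```shell
# Enable the Intel MPI and Hydra debug output requested above.
export I_MPI_DEBUG=5
export I_MPI_HYDRA_DEBUG=1

# Hypothetical binary name and rank count; substitute the real job.
APP=./mhd_solver
NP=16

if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
  # Capture stdout and stderr together so the log shows where startup stops.
  mpirun -n "$NP" "$APP" > hydra_debug.log 2>&1
else
  echo "mpirun or $APP not found; variables are set but nothing was launched" >&2
fi
```

When the job is submitted through Grid Engine, the same two `export` lines would go into the job script before the mpirun call.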

Sincerely,
James Tullos
Technical Consulting Engineer
Intel® Cluster Tools

Girish_Nair
Beginner

Hi James,
Thanks for your support.

The code is a magnetohydrodynamics (MHD) code used to study the Earth's magnetic forces. It reads compressed input data through NetCDF and writes compressed results that are also read through NetCDF. The code is home-grown, written in Fortran 90, and uses libraries such as FFTW and MKL.

In my opinion, the closest open-source equivalent is probably 'Pencil Code' (http://pencil-code.nordita.org/)

As you've suggested, I'll run the code with the following parameters:

I_MPI_DEBUG=5  I_MPI_HYDRA_DEBUG=1

And I'll publish the output.

Additionally, do you think using mpdboot instead of hydra might help?

With regards
Girish Nair
+91 98457 36460
girishnairisonline <at> gmail <dot> com

James_T_Intel
Moderator

If it does run under MPD and not under Hydra, we need to know about that so we can get it corrected.  We are trying to move to Hydra and away from MPD.
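A rough way to make that comparison is sketched here. It assumes the `I_MPI_PROCESS_MANAGER` variable described in the Intel MPI 4.1 reference manual, which selects the process manager used by mpirun (`hydra` or `mpd`). The helper name `run_with_pm`, the binary name, and the log file names are hypothetical.

```shell
# Run a command with a chosen process manager for mpirun.
# Assumption: I_MPI_PROCESS_MANAGER accepts "hydra" (the default) or "mpd".
run_with_pm() {
  pm=$1
  shift
  env I_MPI_PROCESS_MANAGER="$pm" "$@"
}

# Usage (placeholder binary and rank count):
#   run_with_pm hydra mpirun -n 16 ./mhd_solver > hydra.log 2>&1
#   run_with_pm mpd   mpirun -n 16 ./mhd_solver > mpd.log 2>&1
```

If the job starts reliably under one setting and stalls under the other, the two logs are what to report back.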
