Intel® MPI Library

mpd initialization error

Rafał_Błaszczyk

We have observed random mpd initialization errors; when this happens, the MPI job fails.

There is not much in the logs:
[bash]May 20 03:31:42 wn27 mpdman: wn27_mpdman_176 (recv_dict_msg 674):recv_dict_msg raised exception: errmsg=::
  mpdtb:
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdlib.py,  674,  recv_dict_msg
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdman.py,  631,  handle_lhs_input
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdlib.py,  883,  handle_active_streams
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdman.py,  557,  run
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  3285,  launch_mpdman_via_fork
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  3173,  run_one_cli
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  2883,  do_mpdrun
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  2407,  handle_lhs_input
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdlib.py,  883,  handle_active_streams
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  1715,  runmainloop
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  1679,  run
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  3351,  ?
    mpd_cli_app=/home/produsers/routnwp/nwp_7/bin/cm_w_00.0.0.2.sh
    cwd=/home/produsers/routnwp/nwp_7/workdir/2010052000sta_lm140.033058
May 20 03:31:42 wn27 mpdman: wn27_mpdman_176 (send_dict_msg 766):send_dict_msg raised exception: sock=lhs errmsg=:(9, 'Bad file descriptor'):
  mpdtb:
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdlib.py,  766,  send_dict_msg
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdman.py,  990,  handle_rhs_input
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdlib.py,  883,  handle_active_streams
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdman.py,  557,  run
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  3285,  launch_mpdman_via_fork
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  3173,  run_one_cli
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  2883,  do_mpdrun
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  2407,  handle_lhs_input
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdlib.py,  883,  handle_active_streams
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  1715,  runmainloop
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  1679,  run
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py,  3351,  ?
    mpd_cli_app=/home/produsers/routnwp/nwp_7/bin/cm_w_00.0.0.2.sh
    cwd=/home/produsers/routnwp/nwp_7/workdir/2010052000sta_lm140.033058[/bash]

I have also noticed that after using mpirun, mpd daemons are left running on the compute nodes. Is this regular behaviour? How can I avoid it?

Dmitry_K_Intel2
Employee
Hi Rafal,

Could you submit a tracker issue via http://premier.intel.com?
Please provide all the information needed so that we can reproduce the issue. It would be nice if you could send the application you run. All log files can also be useful.

mpirun calls mpdallexit at the end, so all mpds should be killed when mpirun finishes. Leftover mpds are the result of abnormal termination of the mpirun script.
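
If your job script can abort before mpirun returns cleanly, one option is to shut the ring down from an exit trap. This is only a rough sketch of such a wrapper; the hostfile, node count, process count, and application name are placeholders, not anything shipped with the library:

[bash]#!/bin/bash
# Sketch of a wrapper job script: make sure the mpd ring is torn down
# even if mpirun (or the script itself) terminates abnormally.
trap 'mpdallexit 2>/dev/null' EXIT

mpdboot -r ssh -n 32 -f ~/mpd.hosts    # start the mpd ring (placeholder hostfile)
mpirun -n 256 ./your_mpi_app           # placeholder application[/bash]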

Rafal, are you sure that your network is stable? Could you check your mpd ring? Create a ring with 'mpdboot -r ssh -n #', and after that run 'mpdringtest 1000'.
You can also use the mpdcheck utility, as sketched below.
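
For a basic connectivity check between two suspect nodes, mpdcheck can, as far as I remember, be run in a server/client pair. A rough sketch (the node name and port below are placeholders; check the output of the server side for the real values):

[bash]# On one node: start mpdcheck as a server; it reports the host name and
# the port it is listening on.
$ mpdcheck -s

# On another node: connect back to that host and port (placeholder values).
$ mpdcheck -c wn27 54321[/bash]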

Best wishes,
Dmitry
Rafał_Błaszczyk
Hi Dmitry,
I've submitted this issue as you requested (IN590360).

What's the correct way to clean up all user mpd processes: mpdallexit or mpdcleanup? I believe that old mpds could affect new jobs, right?

As far as I know, our network is pretty stable; here are the results of the tests:

[bash]$ mpdboot -r ssh -n 32
$ mpdringtest 1000
time for 1000 loops = 7.05830812454 seconds
$ mpdringtest 10000
time for 10000 loops = 69.5739929676 seconds[/bash]

Dmitry_K_Intel2
Employee
Rafal,

Theoretically, mpds left over from a previous run may affect a new job, but they shouldn't, because they listen on different ports.
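
If you want to see what is still running before you start a new ring, mpdtrace should show it; if I recall correctly, the -l option also prints the port each daemon is listening on:

[bash]# List the mpds currently reachable in the ring; with -l each entry
# should include the host name and the listening port (output omitted).
$ mpdtrace -l[/bash]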

A good way to kill all mpds is 'mpdcleanup -a'; you can add it at the end of your script, as in the sketch below.
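
For example, the tail of the job script could look like this (only a sketch; the process count and application name are placeholders):

[bash]mpirun -n 256 ./your_mpi_app     # placeholder application
rc=$?

# Remove any mpds that a failed run may have left behind on the nodes.
mpdcleanup -a

exit $rc[/bash]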


See also the communication in the tracker.

Regards!
Dmitry