We have observed random mpd initialization errors; when this happens, the MPI job fails.
There is not much in the logs:
[bash]
May 20 03:31:42 wn27 mpdman: wn27_mpdman_176 (recv_dict_msg 674): recv_dict_msg raised exception: errmsg=::
    mpdtb:
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdlib.py, 674, recv_dict_msg
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdman.py, 631, handle_lhs_input
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdlib.py, 883, handle_active_streams
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdman.py, 557, run
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 3285, launch_mpdman_via_fork
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 3173, run_one_cli
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 2883, do_mpdrun
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 2407, handle_lhs_input
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdlib.py, 883, handle_active_streams
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 1715, runmainloop
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 1679, run
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 3351, ?
    mpd_cli_app=/home/produsers/routnwp/nwp_7/bin/cm_w_00.0.0.2.sh
    cwd=/home/produsers/routnwp/nwp_7/workdir/2010052000sta_lm140.033058
May 20 03:31:42 wn27 mpdman: wn27_mpdman_176 (send_dict_msg 766): send_dict_msg raised exception: sock=lhs errmsg=:(9, 'Bad file descriptor'):
    mpdtb:
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdlib.py, 766, send_dict_msg
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdman.py, 990, handle_rhs_input
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdlib.py, 883, handle_active_streams
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdman.py, 557, run
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 3285, launch_mpdman_via_fork
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 3173, run_one_cli
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 2883, do_mpdrun
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 2407, handle_lhs_input
    /opt/intel/impi/4.0.0.025/intel64/bin/mpdlib.py, 883, handle_active_streams
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 1715, runmainloop
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 1679, run
    /opt/intel/impi/4.0.0.025/intel64/bin/mpd.py, 3351, ?
    mpd_cli_app=/home/produsers/routnwp/nwp_7/bin/cm_w_00.0.0.2.sh
    cwd=/home/produsers/routnwp/nwp_7/workdir/2010052000sta_lm140.033058
[/bash]
I have also noticed that after using mpirun, mpd daemons are left running on the compute nodes. Is this normal behaviour? How can I avoid it?
Hi Rafal,
Could you submit a tracker via http://premier.intel.com?
Please provide all the information needed so that we can reproduce the issue. It would be helpful if you could send the application you run; all log files can also be useful.
mpirun calls mpdallexit at the end, so all mpds should be killed when mpirun finishes. Leftover mpds are the result of abnormal termination of the mpirun script.
Rafal, are you sure that your network is stable? Could you check your mpd ring? Create a ring with 'mpdboot -r ssh -n #' (where # is the number of hosts), then run 'mpdringtest 1000'.
You can also use the mpdcheck utility. A sketch of the whole check is shown below.
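As a rough illustration (the host count, hosts file name, and node names below are placeholders for your own cluster, not values from this thread), the ring check and a pairwise mpdcheck test could look like this:
[bash]
# Bring up an mpd ring over ssh on 32 nodes listed in mpd.hosts (example values):
mpdboot -r ssh -n 32 -f mpd.hosts

# List the nodes that actually joined the ring:
mpdtrace

# Time message circulation around the ring:
mpdringtest 1000

# Tear the ring down when finished:
mpdallexit

# Pairwise connectivity test between two suspect nodes:
#   on the first node:   mpdcheck -s            # prints a hostname and port
#   on the second node:  mpdcheck -c <host> <port>
[/bash]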
Best wishes,
Dmitry
Hi Dmitry,
I've submitted this issue as you requested (IN590360).
What's the correct way to clean up all of a user's mpd processes: mpdallexit, mpdcleanup? I believe that old mpds could affect new jobs, right?
As far as I know our network is quite stable; here are the results of the tests:
[bash]
$ mpdboot -r ssh -n 32
$ mpdringtest 1000
time for 1000 loops = 7.05830812454 seconds
$ mpdringtest 10000
time for 10000 loops = 69.5739929676 seconds
[/bash]
Rafal,
Theoretically, mpds left over from a previous run could affect a new job, but they shouldn't, because they listen on different ports.
A good way to kill all mpds is 'mpdcleanup -a'; you can add it at the end of your script.
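For example, a minimal job-script sketch (the mpirun command, process count, and application name are illustrative placeholders, not taken from this thread) could append the cleanup like this:
[bash]
#!/bin/bash
# Illustrative job wrapper that cleans up leftover mpds after the run.
mpirun -n 64 ./my_mpi_app     # placeholder for your actual MPI command
rc=$?

# Remove any mpd daemons left behind on the nodes, even after an abnormal exit:
mpdcleanup -a

exit $rc
[/bash]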
See also the communication in the tracker.
Regards!
Dmitry