I'm testing out the Cluster Toolkit for Linux on EM64T, and the ScaLAPACK Cholesky factorization routine (pdpotrf_) very predictably causes a Machine Check Exception and kernel panic. No other ScaLAPACK/PBLAS routine has caused any problems so far.
Additionally, each time I start an mpi job using 'mpiexec', the mpd daemon outputs the following:
unable to parse pmi message from the process :cmd=put kvsname=kvs_nerf_4268_1_0 key=DAPL_PROVIDER value=
It may not matter, but the only way I've found to get the mpd ring running is:
[host]$ mpd --ifhn=10.0.0.1 -l 4268 &
[node]$ mpd -h nerf -p 4268 &
Any help clearing these up would be greatly appreciated, as I won't be purchasing the software otherwise. MCEs are unacceptable.
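For context, the failing routine is the standard ScaLAPACK distributed Cholesky driver; below is a minimal sketch of that kind of call. This is not the actual test code: the 2x2 process grid, block size 64, and matrix order 1000 are illustrative placeholders, and it assumes linking against the toolkit's ScaLAPACK/BLACS and MPI libraries (run with at least 4 MPI ranks).

/* Minimal sketch of a distributed Cholesky call via pdpotrf_.
 * Grid shape, block size, and matrix order are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* C interface to BLACS and Fortran entry points used below */
extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ctxt, int what, int *val);
extern void Cblacs_gridinit(int *ctxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int ctxt, int *nprow, int *npcol, int *myrow, int *mycol);
extern void Cblacs_gridexit(int ctxt);
extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb, int *irsrc,
                      int *icsrc, int *ictxt, int *lld, int *info);
extern void pdelset_(double *a, int *ia, int *ja, int *desca, double *alpha);
extern void pdpotrf_(char *uplo, int *n, double *a, int *ia, int *ja,
                     int *desca, int *info);

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int myid, nprocs, ctxt, nprow = 2, npcol = 2, myrow, mycol;
    Cblacs_pinfo(&myid, &nprocs);
    if (nprocs < nprow * npcol) {
        if (myid == 0) fprintf(stderr, "run with at least %d ranks\n", nprow * npcol);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    Cblacs_get(-1, 0, &ctxt);
    Cblacs_gridinit(&ctxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);

    /* Global matrix of order n, distributed in nb x nb blocks */
    int n = 1000, nb = 64, izero = 0, ione = 1, info;
    int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
    int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    int lld  = (mloc > 1) ? mloc : 1;
    int desca[9];
    descinit_(desca, &n, &n, &nb, &nb, &izero, &izero, &ctxt, &lld, &info);

    double *a = calloc((size_t)mloc * nloc, sizeof(double));
    for (int j = 1; j <= n; ++j)
        for (int i = 1; i <= n; ++i) {
            double v = (i == j) ? 1.0 + n : 1.0;   /* ones(n,n) + n*I is SPD */
            pdelset_(a, &i, &j, desca, &v);        /* stores A(i,j) on its owner */
        }

    pdpotrf_("L", &n, a, &ione, &ione, desca, &info);  /* distributed Cholesky */
    if (myid == 0) printf("pdpotrf_ returned info = %d\n", info);

    free(a);
    Cblacs_gridexit(ctxt);
    MPI_Finalize();
    return 0;
}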
11 Replies
The forum pulled out the last part of the error message:
"value="
In brackets, "NULL string"
Hi,
Could you clarify which version of the Intel MPI Library you are using? Please check the package ID information in the mpisupport.txt file.
By the way, you can file a bug report at https://primer.intel.com to get technical assistance.
Best regards,
Andrey
Package ID: l_mpi_p_3.1.026
If I had to guess, I would say it's in pdsyrk. I heavily performance tested pdtrsm and dpotrf before trying pdpotrf.
If I had to guess, I would say it's in pdsyrk. I heavily performance tested pdtrsm and dpotrf before trying pdpotrf.
That link doesn't work.
Could you give more details on the cluster configuration? I'd like to understand why you were not able to use mpdboot to launch the MPD ring.
The test cluster consists of two 4-processor machines behind a firewall. The head node, nerf, has two Ethernet ports: one connected to the firewall and one to the compute node, ball. All IPs are on the 10.0.0.0 network.
When I try:
mpdboot --totalnum=2 --file=./mpd.hosts --rsh=ssh
the output is:
mpdboot_nerf (handle_mpd_output 681): failed to ping mpd on ball; received output={}
Also, the premier support link won't let me in, as I'm only evaluating the software right now.
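For what it's worth, the mpd.hosts in this kind of setup is presumably just a list of hostnames, one per line; since mpdboot starts the local mpd on nerf itself, a single line naming the compute node should be enough:

ball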
- Is it allowed to establish connections from the compute nodes to the head node? mpdboot always starts the mpd daemon on the local node first; after that, the remote mpd daemons attempt to connect to it.
- Are you able to start mpd manually?
- Run the mpd -e -d command on the head node. The port number will be printed on stdout.
- Run the mpd -h head_node -p <port> -d command on the other nodes to establish the MPD ring, using the port number printed at the previous step.
- Check whether the ring was established successfully by running the mpdtrace command (a concrete sketch for this cluster follows below).
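Applied to this cluster, that sequence would look roughly like the following, where <port> stands for whatever port the first mpd prints (hostnames taken from earlier in the thread):

[nerf]$ mpd -e -d
[ball]$ mpd -h nerf -p <port> -d
[nerf]$ mpdtrace

If the ring is up, mpdtrace should list both nerf and ball.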
Oops! I see that you can start the ring manually. Could you share the content of your mpd.hosts file? Could you also share the output of the mpdboot -d -v ... command? Is there any useful information in the /tmp/mpd2.logfile_ log files?
After reconfiguring the network settings several times and reorganizing all of my environment variables (I had several MPI implementations installed), the problem went away, and I could boot up the MPD daemons via:
mpdboot --file= --rsh=ssh
I wish I could explain more specifically, but I changed far too many things in the process of compiling ScaLAPACK from scratch for several MPI implementations.