poulson_jack
Beginner
75 Views

impi/mkl_scalapack causing kernel panic

I'm testing out the Cluster Toolkit for Linux on EM64T, and the ScaLAPACK Cholesky factorization routine (pdpotrf_) causes a Machine Check Exception and kernel panic very predictably. No other ScaLAPACK/PBLAS routine has caused any problems so far.
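For what it's worth, after the reboot an MCE record usually survives in the kernel log, which can help establish whether the fault is hardware or driver-triggered. A minimal sketch, assuming a syslog-style log file (the paths are assumptions and vary by distro, not something from this thread):

```shell
# Grep a kernel-log file for Machine Check records (case-insensitive).
# `|| true` keeps the exit status clean when nothing is found.
find_mce() {
  grep -i 'machine check' "$1" || true
}

# Typical usage (log paths are assumptions, not from this thread):
#   find_mce /var/log/messages
#   dmesg | grep -i 'machine check'
```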

Additionally, each time I start an mpi job using 'mpiexec', the mpd daemon outputs the following:
unable to parse pmi message from the process :cmd=put kvsname=kvs_nerf_4268_1_0 key=DAPL_PROVIDER value=

It may not matter, but the only way I've found to get the mpd ring running is:
[host]$ mpd --ifhn=10.0.0.1 -l 4268 &
[node]$ mpd -h nerf -p 4268 &

Any help clearing these up would be greatly appreciated, as I won't be purchasing the software otherwise. MCEs are unacceptable.
11 Replies
poulson_jack
Beginner

The forum pulled out the last part of the error message: "value="
poulson_jack
Beginner

In brackets, "NULL string"
Andrey_D_Intel
Employee

Hi,

Could you clarify which Intel MPI Library version you are using? Please check the package ID information in the mpisupport.txt file.

By the way, you can file a bug report at https://primer.intel.com to get technical assistance.

Best regards,

Andrey

poulson_jack
Beginner

Package ID: l_mpi_p_3.1.026

If I had to guess, I would say it's in pdsyrk. I heavily performance-tested pdtrsm and dpotrf before trying pdpotrf.
poulson_jack
Beginner

That link doesn't work.
Andrey_D_Intel
Employee

Could you give more details on the cluster configuration? I'd like to understand why you were not able to use mpdboot to launch the MPD ring.
Andrey_D_Intel
Employee

I made a typo, sorry. The correct link is https://premier.intel.com
poulson_jack
Beginner

The test cluster consists of two 4-processor machines behind a firewall. The head node, nerf, has two Ethernet ports: one connected to the firewall and one to the compute node, ball. All IPs are on the 10.0.0.0 network.

When I try:
mpdboot --totalnum=2 --file=./mpd.hosts --rsh=ssh

the output is:
mpdboot_nerf (handle_mpd_output 681): failed to ping mpd on ball; received output={}

Also, the premier support link won't let me in, as I'm only evaluating the software right now.
Andrey_D_Intel
Employee

  1. Are connections from the compute nodes to the head node allowed? mpdboot always starts the mpd daemon on the local node first; after that, the remote mpd daemons attempt to connect to it.
  2. Are you able to start mpd manually?
    • Run the mpd -e -d command on the head node. The port number will be printed on stdout.
    • Run the mpd -h head_node -p <port> -d command on a compute node to establish the MPD ring, using the port number printed in the previous step.
    • Check whether the ring was established successfully by running the mpdtrace command.
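The steps above can be sketched as a small script. A hedged sketch: it assumes `mpd -e -d` prints the listen port as the first number on stdout (the exact output format may differ between Intel MPI versions), and `head_node` is a placeholder hostname.

```shell
# Helper: pull the first number out of mpd's startup output.
# Assumption: `mpd -e -d` prints its listen port on stdout.
parse_port() {
  echo "$1" | grep -o '[0-9][0-9]*' | head -n 1
}

# On the head node (commented out here; requires Intel MPI installed):
#   head_out=$(mpd -e -d 2>&1)
#   port=$(parse_port "$head_out")
# On each compute node:
#   mpd -h head_node -p "$port" -d
# Then verify the ring from any node:
#   mpdtrace
```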

Andrey_D_Intel
Employee

Oops! I see that you can start the ring manually. Could you share the contents of your mpd.hosts file? Could you share the output of the mpdboot -d -v... command? Is there any useful information in /tmp/mpd2.logfile_?
poulson_jack
Beginner

After reconfiguring the network settings several times, and reorganizing all of my environment variables (I had several MPI implementations installed), the problem went away, and I could boot up the MPD daemons via:
mpdboot --file= --rsh=ssh

I wish I could explain more specifically, but I changed far too many things in the process of compiling ScaLAPACK from scratch for several MPI implementations.
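For anyone hitting similar mixed-environment trouble: a quick way to spot competing MPI installations on PATH. This is a hypothetical helper for illustration, not a tool shipped with Intel MPI.

```shell
# List every mpiexec/mpirun/mpd reachable through the current PATH,
# in lookup order. More than one hit per tool usually means a
# conflicting-install problem like the one described above.
list_mpi_tools() {
  echo "$PATH" | tr ':' '\n' | while read -r d; do
    for exe in mpiexec mpirun mpd; do
      if [ -x "$d/$exe" ]; then
        echo "$d/$exe"
      fi
    done
  done
}

list_mpi_tools
```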