Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

impi/mkl_scalapack causing kernel panic

poulson_jack
Beginner
I'm testing out the Cluster Toolkit for Linux on EM64T, and the ScaLAPACK Cholesky factorization routine (pdpotrf_) causes a Machine Check Exception and kernel panic very predictably. No other ScaLAPACK/PBLAS routine has caused any problems so far.
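In case it helps, what I'm doing is essentially the standard distributed Cholesky on a block-cyclic matrix. Below is a stripped-down sketch of that kind of test; it is not my exact benchmark, and the grid size, problem size, block size, and matrix fill are just placeholders, but it shows the call sequence (compiled against Intel MPI plus MKL ScaLAPACK/BLACS and run with something like mpiexec -n 4 ./a.out):

/* Minimal pdpotrf_ sketch on a 2x2 process grid (placeholder sizes and values,
 * not my actual benchmark). */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Fortran-interface prototypes from BLACS/ScaLAPACK */
extern void Cblacs_pinfo(int* mypnum, int* nprocs);
extern void Cblacs_get(int context, int request, int* value);
extern void Cblacs_gridinit(int* context, char* order, int nprow, int npcol);
extern void Cblacs_gridinfo(int context, int* nprow, int* npcol, int* myrow, int* mycol);
extern void Cblacs_gridexit(int context);
extern void Cblacs_exit(int notdone);
extern int  numroc_(int* n, int* nb, int* iproc, int* isrcproc, int* nprocs);
extern void descinit_(int* desc, int* m, int* n, int* mb, int* nb,
                      int* irsrc, int* icsrc, int* ictxt, int* lld, int* info);
extern void pdpotrf_(char* uplo, int* n, double* a, int* ia, int* ja,
                     int* desca, int* info);

int main(int argc, char** argv)
{
    int mypnum, nprocs, ctxt, myrow, mycol;
    int nprow = 2, npcol = 2;              /* placeholder 2x2 process grid */
    int n = 2000, nb = 64;                 /* placeholder matrix and block size */
    int izero = 0, ione = 1, info;
    int locr, locc, lld, desca[9];
    int il, jl, ig, jg;
    double* A;

    MPI_Init(&argc, &argv);
    Cblacs_pinfo(&mypnum, &nprocs);
    Cblacs_get(-1, 0, &ctxt);
    Cblacs_gridinit(&ctxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);

    /* Local dimensions of the block-cyclically distributed n x n matrix */
    locr = numroc_(&n, &nb, &myrow, &izero, &nprow);
    locc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    lld  = (locr > 1) ? locr : 1;
    descinit_(desca, &n, &n, &nb, &nb, &izero, &izero, &ctxt, &lld, &info);

    A = malloc((size_t)lld * locc * sizeof(double));

    /* Fill a symmetric positive definite test matrix: n on the diagonal, 1 elsewhere */
    for (jl = 0; jl < locc; ++jl) {
        jg = (jl / nb) * npcol * nb + mycol * nb + jl % nb;     /* local -> global column */
        for (il = 0; il < locr; ++il) {
            ig = (il / nb) * nprow * nb + myrow * nb + il % nb; /* local -> global row */
            A[(size_t)jl * lld + il] = (ig == jg) ? (double)n : 1.0;
        }
    }

    /* This is the call that reliably triggers the MCE on my machines */
    pdpotrf_("L", &n, A, &ione, &ione, desca, &info);
    if (mypnum == 0)
        printf("pdpotrf_ returned info = %d\n", info);

    free(A);
    Cblacs_gridexit(ctxt);
    Cblacs_exit(1);   /* keep MPI alive; we finalize it ourselves */
    MPI_Finalize();
    return 0;
}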

Additionally, each time I start an MPI job with mpiexec, the mpd daemon outputs the following:
unable to parse pmi message from the process :cmd=put kvsname=kvs_nerf_4268_1_0 key=DAPL_PROVIDER value=

It may not matter, but the only way I've found to get the mpd ring running is:
[host]$ mpd --ifhn=10.0.0.1 -l 4268 &
[node]$ mpd -h nerf -p 4268 &

Any help clearing these up would be greatly appreciated, as I won't be purchasing the software otherwise. MCEs are unacceptable.
poulson_jack
Beginner
The forum stripped out the last part of the error message, after "value=".
poulson_jack
Beginner
In brackets, "NULL string"
Andrey_D_Intel
Employee

Hi,

Could you clarify which version of the Intel MPI Library you are using? Please check the package ID information in the mpisupport.txt file.

By the way, you can file a bug report at https://primer.intel.com to get technical assistance.

Best regards,

Andrey

poulson_jack
Beginner
Package ID: l_mpi_p_3.1.026

If I had to guess, I would say the problem is in pdsyrk. I heavily performance-tested pdtrsm and dpotrf before trying pdpotrf.
poulson_jack
Beginner
That link doesn't work.
Andrey_D_Intel
Employee
Could you give more details on the cluster configuration? I'd like to understand why you were not able to use mpdboot to launch the MPD ring.
Andrey_D_Intel
Employee
I made a typo, sorry. The right link is https://premier.intel.com
poulson_jack
Beginner
The test cluster consists of two 4-processor machines behind a firewall. The head node, nerf, has two Ethernet ports: one connected to the firewall and one to the compute node, ball. All IPs are on the 10.0.0.0 network.

When I try:
mpdboot --totalnum=2 --file=./mpd.hosts --rsh=ssh

the output is:
mpdboot_nerf (handle_mpd_output 681): failed to ping mpd on ball; received output={}

Also, the premier support link won't let me in, as I'm only evaluating the software right now.
Andrey_D_Intel
Employee
  1. Are connections from the compute nodes to the head node allowed? mpdboot always starts the mpd daemon on the local node first; after that, the remote mpd daemons attempt to connect back to it.
  2. Are you able to start mpd manually?
    • Run the mpd -e -d command on the head node. The port number will be printed on stdout.
    • Run the mpd -h <head_node> -p <port> -d command on each compute node to establish the MPD ring, using the port number printed in the previous step.
    • Check whether the ring was established successfully by running the mpdtrace command (see the concrete example below).
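For example, with your host names that would look roughly like this (the exact port is whatever mpd -e prints on nerf):

[nerf]$ mpd -e -d
[ball]$ mpd -h nerf -p <port printed on nerf> -d
[nerf]$ mpdtrace

mpdtrace should list both nerf and ball if the ring is up.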

Andrey_D_Intel
Employee
Oops! I see that you can start the ring manually. Could you share the contents of your mpd.hosts file? Could you share the output of the mpdboot -d -v ... command? Is there any useful information in /tmp/mpd2.logfile_ ?
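As a reference point, for a setup like yours mpd.hosts would normally just list the remote node, one host name per line, for example:

ball

(mpdboot starts the mpd on the head node itself first, so the head node normally does not need to appear in the file.)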
poulson_jack
Beginner
After reconfiguring the network settings several times and reorganizing all of my environment variables (I had several MPI implementations installed), the problem went away, and I could boot up the MPD daemons via:
mpdboot --file= --rsh=ssh

I wish I could explain more specifically, but I changed far too many things in the process of compiling ScaLAPACK from scratch for several MPI implementations.
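In case it helps anyone else with multiple MPI installs around: one thing worth checking is that only the Intel MPI environment is picked up, e.g. by sourcing the mpivars.sh that ships with the library before anything else sets PATH/LD_LIBRARY_PATH (the exact path depends on where it is installed; mine was along the lines of):

source /opt/intel/impi/3.1/bin64/mpivars.sh

and then verifying with 'which mpd' and 'which mpiexec' that both resolve into the Intel MPI install.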