Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

impi/mkl_scalapack causing kernel panic

poulson_jack
Beginner
I'm testing out the Cluster Toolkit for Linux on EM64T, and the ScaLAPACK Cholesky factorization routine (pdpotrf_) causes a Machine Check Exception and kernel panic very predictably. No other ScaLAPACK/PBLAS routine so far has caused any problems.

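For reference, the call sequence involved is roughly the following (a minimal sketch, not my actual test code; the grid shape, block size, and matrix order are placeholder values, and it assumes linking with mpicc against the MKL ScaLAPACK/BLACS libraries for Intel MPI and running with at least nprow*npcol ranks):

/* Minimal pdpotrf_ driver sketch: set up a BLACS grid, distribute a
 * simple SPD matrix block-cyclically, and factorize it. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* C BLACS interface and Fortran ScaLAPACK entry points (LP64) */
void Cblacs_pinfo(int *mypnum, int *nprocs);
void Cblacs_get(int ictxt, int what, int *val);
void Cblacs_gridinit(int *ictxt, char *order, int nprow, int npcol);
void Cblacs_gridinfo(int ictxt, int *nprow, int *npcol, int *myrow, int *mycol);
void Cblacs_gridexit(int ictxt);
int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
void descinit_(int *desc, int *m, int *n, int *mb, int *nb, int *irsrc,
               int *icsrc, int *ictxt, int *lld, int *info);
void pdelset_(double *a, int *ia, int *ja, int *desca, double *alpha);
void pdpotrf_(char *uplo, int *n, double *a, int *ia, int *ja,
              int *desca, int *info);

int main(int argc, char **argv)
{
    int n = 2000, nb = 64;              /* global order and block size (placeholders) */
    int nprow = 2, npcol = 2;           /* 2x2 grid: run with at least 4 ranks */
    int iam, nprocs, ictxt, myrow, mycol;
    int izero = 0, ione = 1, info, lld, locrows, loccols, i;
    int desca[9];
    double *a, diag = (double)n;

    MPI_Init(&argc, &argv);
    Cblacs_pinfo(&iam, &nprocs);
    Cblacs_get(-1, 0, &ictxt);
    Cblacs_gridinit(&ictxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);
    if (myrow < 0 || mycol < 0) {       /* this rank is not in the process grid */
        MPI_Finalize();
        return 0;
    }

    /* local storage for the block-cyclically distributed matrix */
    locrows = numroc_(&n, &nb, &myrow, &izero, &nprow);
    loccols = numroc_(&n, &nb, &mycol, &izero, &npcol);
    lld = locrows > 1 ? locrows : 1;
    descinit_(desca, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);

    /* A = n * I is symmetric positive definite; off-diagonals stay zero */
    a = calloc((size_t)locrows * loccols, sizeof(double));
    for (i = 1; i <= n; i++)
        pdelset_(a, &i, &i, desca, &diag);

    pdpotrf_("L", &n, a, &ione, &ione, desca, &info);
    if (iam == 0)
        printf("pdpotrf_ returned info = %d\n", info);

    free(a);
    Cblacs_gridexit(ictxt);
    MPI_Finalize();
    return 0;
}
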
Additionally, each time I start an mpi job using 'mpiexec', the mpd daemon outputs the following:
unable to parse pmi message from the process :cmd=put kvsname=kvs_nerf_4268_1_0 key=DAPL_PROVIDER value=

It may not matter, but the only way I've found to get the mpd ring running is:
[host]$ mpd --ifhn=10.0.0.1 -l 4268 &
[node]$ mpd -h nerf -p 4268 &

Any help clearing these up would be greatly appreciated, as I won't be purchasing the software otherwise. MCEs are unacceptable.
11 Replies
poulson_jack
Beginner
The forum pulled out the last part of the error message: "value="
poulson_jack
Beginner
In brackets, "NULL string"
Andrey_D_Intel
Employee

Hi,

Could you clarify which Intel MPI Library version you are using? Please check the package ID information in the mpisupport.txt file.

By the way, you can file a bug report at https://primer.intel.com to get technical assistance.

Best regards,

Andrey

poulson_jack
Beginner
Package ID: l_mpi_p_3.1.026

If I had to guess, I would say it's in pdsyrk. I heavily performance tested pdtrsm and dpotrf before trying pdpotrf.
poulson_jack
Beginner
That link doesn't work.
Andrey_D_Intel
Employee
Could you give more details on the cluster configuration? I'd like to understand why you were not able to use mpdboot to launch the MPD ring.
Andrey_D_Intel
Employee
Sorry, that was a misprint on my part. The right link is https://premier.intel.com
poulson_jack
Beginner
The test cluster consists of two 4-processor machines behind a firewall. The head node, nerf, has two Ethernet ports: one connected to the firewall and one to the compute node, ball. All IPs are on the 10.0.0.0 network.

When I try:
mpdboot --totalnum=2 --file=./mpd.hosts --rsh=ssh

the output is:
mpdboot_nerf (handle_mpd_output 681): failed to ping mpd on ball; received output={}

Also, the premier support link won't let me in, as I'm only evaluating the software right now.
Andrey_D_Intel
Employee
  1. Are connections from the compute nodes to the head node allowed? mpdboot always starts the mpd daemon on the local node first; after that, the remote mpd daemons attempt to connect to it.
  2. Are you able to start mpd manually? (See the sequence sketched below.)
    • Run the mpd -e -d command on the head node. The port number will be printed on stdout.
    • Run the mpd -h head_node -p <port> -d command on the compute node to establish the MPD ring, using the port number printed at the previous step.
    • Check whether the ring was established successfully by running the mpdtrace command.
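With your hostnames, the manual sequence would look roughly like this (the port is just whatever mpd -e -d prints; 4268 here is only an example taken from your earlier post):
[nerf]$ mpd -e -d
[ball]$ mpd -h nerf -p 4268 -d
[nerf]$ mpdtrace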

Andrey_D_Intel
Employee
Oops! I see that you can start the ring manually. Could you share the content of your mpd.hosts file? Could you share the output of the mpdboot -d -v... command? Is there any useful information in /tmp/mpd2.logfile_?
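For reference, with a two-node setup like yours mpd.hosts usually only needs the remote host, one hostname per line, since mpdboot always starts the mpd on the local node first, e.g.:
ball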
poulson_jack
Beginner
After reconfiguring the network settings several times, and reorganizing all of my environment variables (I had several MPI implementations installed), the problem went away, and I could boot up the MPD daemons via:
mpdboot --file= --rsh=ssh

I wish I could explain more specifically, but I changed far too many things in the process of compiling ScaLAPACK from scratch for several MPI implementations.