I'm testing out the Cluster Toolkit for Linux on EM64T, and the ScaLAPACK Cholesky factorization routine (pdpotrf_) very predictably causes a Machine Check Exception and kernel panic. No other ScaLAPACK/PBLAS routine has caused any problems so far.
Additionally, each time I start an mpi job using 'mpiexec', the mpd daemon outputs the following:
unable to parse pmi message from the process :cmd=put kvsname=kvs_nerf_4268_1_0 key=DAPL_PROVIDER value=
It may not matter, but the only way I've found to get the mpd ring running is:
[host]$ mpd --ifhn=10.0.0.1 -l 4268 &
[node]$ mpd -h nerf -p 4268 &
Any help clearing these up would be greatly appreciated, as I won't be purchasing the software otherwise. MCEs are unacceptable.
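For context, the failing routine is the standard ScaLAPACK distributed Cholesky driver; below is a minimal sketch of that kind of call. This is not the actual test code: the 2x2 process grid, block size 64, and matrix order 1000 are illustrative placeholders, and it assumes linking against the toolkit's ScaLAPACK/BLACS and MPI libraries (run with at least 4 MPI ranks).

/* Minimal sketch of a distributed Cholesky call via pdpotrf_.
 * Grid shape, block size, and matrix order are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* C interface to BLACS and Fortran entry points used below */
extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ctxt, int what, int *val);
extern void Cblacs_gridinit(int *ctxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int ctxt, int *nprow, int *npcol, int *myrow, int *mycol);
extern void Cblacs_gridexit(int ctxt);
extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb, int *irsrc,
                      int *icsrc, int *ictxt, int *lld, int *info);
extern void pdelset_(double *a, int *ia, int *ja, int *desca, double *alpha);
extern void pdpotrf_(char *uplo, int *n, double *a, int *ia, int *ja,
                     int *desca, int *info);

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int myid, nprocs, ctxt, nprow = 2, npcol = 2, myrow, mycol;
    Cblacs_pinfo(&myid, &nprocs);
    if (nprocs < nprow * npcol) {
        if (myid == 0) fprintf(stderr, "run with at least %d ranks\n", nprow * npcol);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    Cblacs_get(-1, 0, &ctxt);
    Cblacs_gridinit(&ctxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);

    /* Global matrix of order n, distributed in nb x nb blocks */
    int n = 1000, nb = 64, izero = 0, ione = 1, info;
    int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
    int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    int lld  = (mloc > 1) ? mloc : 1;
    int desca[9];
    descinit_(desca, &n, &n, &nb, &nb, &izero, &izero, &ctxt, &lld, &info);

    double *a = calloc((size_t)mloc * nloc, sizeof(double));
    for (int j = 1; j <= n; ++j)
        for (int i = 1; i <= n; ++i) {
            double v = (i == j) ? 1.0 + n : 1.0;   /* ones(n,n) + n*I is SPD */
            pdelset_(a, &i, &j, desca, &v);        /* stores A(i,j) on its owner */
        }

    pdpotrf_("L", &n, a, &ione, &ione, desca, &info);  /* distributed Cholesky */
    if (myid == 0) printf("pdpotrf_ returned info = %d\n", info);

    free(a);
    Cblacs_gridexit(ctxt);
    MPI_Finalize();
    return 0;
}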
11 Replies
The forum pulled out the last part of the error message:
"value="
In brackets, "NULL string"
Hi,
Could you clarify which version of the Intel MPI Library you are using? Please check the package ID information in the mpisupport.txt file.
By the way, you can file a bug report at https://primer.intel.com to get technical assistance.
Best regards,
Andrey
Package ID: l_mpi_p_3.1.026
If I had to guess, I would say it's in pdsyrk. I heavily performance tested pdtrsm and dpotrf before trying pdpotrf.
If I had to guess, I would say it's in pdsyrk. I heavily performance tested pdtrsm and dpotrf before trying pdpotrf.
That link doesn't work.
Could you give more details on the cluster configuration? I'd like to understand why you were not able to use mpdboot to launch the MPD ring.
The test cluster consists of two 4-processor machines behind a firewall. The head node, nerf, has two Ethernet ports: one connected to the firewall and one to the compute node, ball. All IPs are on the 10.0.0.0 network.
When I try:
mpdboot --totalnum=2 --file=./mpd.hosts --rsh=ssh
the output is:
mpdboot_nerf (handle_mpd_output 681): failed to ping mpd on ball; received output={}
Also, the premier support link won't let me in, as I'm only evaluating the software right now.
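For what it's worth, the mpd.hosts in this kind of setup is presumably just a list of hostnames, one per line; since mpdboot starts the local mpd on nerf itself, a single line naming the compute node should be enough:

ball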
- Is it allowed to establish connections from the compute nodes to the head node? mpdboot always starts the mpd daemon on the local node first; after that, the remote mpd daemons attempt to connect to it.
- Are you able to start mpd manually?
- Run the mpd -e -d command on the head node. The port number will be printed on stdout.
- Run the mpd -h head_node -p <port> -d command on the other nodes to establish the MPD ring, using the port number printed at the previous step.
- Check whether the ring was established successfully by running the mpdtrace command (a concrete sketch for this cluster follows below).
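Applied to this cluster, that sequence would look roughly like the following, where <port> stands for whatever port the first mpd prints (hostnames taken from earlier in the thread):

[nerf]$ mpd -e -d
[ball]$ mpd -h nerf -p <port> -d
[nerf]$ mpdtrace

If the ring is up, mpdtrace should list both nerf and ball.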
Oops! I see that you can start the ring manually. Could you share the content of your mpd.hosts file? Could you also share the output of the mpdboot -d -v ... command? Is there any useful information in the /tmp/mpd2.logfile_ log files?
After reconfiguring the network settings several times and reorganizing all of my environment variables (I had several MPI implementations installed), the problem went away, and I could boot up the MPD daemons via:
mpdboot --file= --rsh=ssh
I wish I could explain more specifically, but I changed far too many things in the process of compiling ScaLAPACK from scratch for several MPI implementations.