Intel® oneAPI Math Kernel Library

Segmentation fault with ScaLAPACK pcheevx example - please help!

aleksey239
Beginner
Hi,

I have recently installed Linux (Ubuntu 10.04), the latest Intel Fortran Pro (11.1/072) and Open MPI.

I can tell more details (in addition to what is below), but do not know what details to post and how to get them from the system.

I am trying to run the SCALAPACK examples.

The simplest example, example1.f from http://www.netlib.org/scalapack/examples/, works fine.

I then tried to run the example
sample_pcheevx_call.f

on the same system. It does NOT work and gives me the following segmentation fault error:


Compilation:

ifort -I/opt/intel/Compiler/11.1/072/mkl/include -c sample_pcheevx_call.f
mpif90 -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group sample_pcheevx_call.o -o a.out

Running:

mpirun -H aa-laptop,aa-laptop ./a.out
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libmkl_scalapack_ 00007F1845CBE549 Unknown Unknown Unknown
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 2455 on
node aa-laptop exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------


My system is:

uname -a
Linux aa-laptop 2.6.32-23-generic #37-Ubuntu SMP Fri Jun 11 08:03:28 UTC 2010 x86_64 GNU/Linux


Please let me know what further information I need to post and how to get it from the system.
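
For a report like this one, the details that usually matter are the compiler version, the MPI version, and the libraries the executable actually loads at run time. A minimal sketch of collecting them follows. The `report` helper is a hypothetical name introduced here, `mpif90 --showme` is specific to the Open MPI wrapper (MPICH-based wrappers use `-show` instead), and the exact output depends on the installed versions.

```shell
#!/bin/sh
# Print a labelled section for each tool; note the ones that are not installed.
# "report" is a helper made up for this sketch, not a standard command.
report() {
    label=$1
    shift
    echo "== $label =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(exit status $?)"
    else
        echo "(not found: $1)"
    fi
}

report "kernel"           uname -a
report "Fortran compiler" ifort --version
report "MPI launcher"     mpirun --version
# Open MPI only: show the real compile/link line behind the wrapper.
report "MPI wrapper"      mpif90 --showme
# Show which shared libraries (MKL, MPI) the executable resolves to.
if [ -x ./a.out ]; then
    report "linked libraries" ldd ./a.out
fi
```

The `ldd` section is the most useful one when several MPI installations are present, because it shows whether `a.out` picks up the same MPI library that `mpirun` belongs to.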

Many thanks,

Aleksey





Gennady_F_Intel
Moderator
Which Open MPI version are you using?
--Gennady

aleksey239
Beginner

Hi Gennady,

I am using openmpi-1.4.2.

Aleksey


Gennady_F_Intel
Moderator
Aleksey, please check the Release Notes to see which Open MPI versions have been validated with the version of MKL you are using. If I am not mistaken, it is 1.2.*.
That may be the cause of this problem.
--Gennady

aleksey239
Beginner

Gennady,

Many thanks for the suggestion. Yes, you are right: the installed MKL 10.2 Update 5 for Linux* OS was only tested with Open MPI 1.2.x.

I therefore downloaded and installed Open MPI 1.2.9.

The output for the same compilation, but with Open MPI 1.2.9, is the following:

----------------

mpirun -H aa-laptop,aa-laptop ./a.out
[aa-laptop:08655] *** Process received signal ***
[aa-laptop:08655] Signal: Segmentation fault (11)
[aa-laptop:08655] Signal code: Address not mapped (1)
[aa-laptop:08655] Failing at address: 0x1
[aa-laptop:08655] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f660844a8f0]
[aa-laptop:08655] [ 1] /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_scalapack_lp64.so(pclaprnt_print9999_+0x9) [0x7f660aea4549]
[aa-laptop:08655] *** End of error message ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libopen-pal.so.0 00007F094CFF5E52 Unknown Unknown Unknown
libmpi.so.0 00007F094D4DE7B5 Unknown Unknown Unknown
mca_coll_tuned.so 00007F09458F2975 Unknown Unknown Unknown
mca_coll_tuned.so 00007F09458F724F Unknown Unknown Unknown
libmpi.so.0 00007F094D4F3487 Unknown Unknown Unknown
a.out 0000000000418465 Unknown Unknown Unknown
libmkl_scalapack_ 00007F094EFA2CD5 Unknown Unknown Unknown
mpirun noticed that job rank 0 with PID 8655 on node aa-laptop exited on signal 11 (Segmentation fault).

--------

I would really appreciate it if you could give me any further suggestions.

Many thanks,

Aleksey


aleksey239
Beginner

Gennady,

I have now tried to run the same example on our Cambridge cluster (www.hpc.cam.ac.uk). I decided to use Intel MPI to avoid possible problems with Open MPI.

Below is what I did and the result.

First I loaded the required modules. I post the list below to show which versions of ifort, MKL and Intel MPI I am using:

bindloe02 pcheevx]$ module list
Currently Loaded Modulefiles:
1) dot 4) gold/2.1.6.0 7) mpiexec/0.82 10) intel/fce/11.0.081 13) intel/impi/4.0.0.028
2) torque 5) java/jdk1.6.0_20 8) global 11) intel/mkl/10.2.2.025
3) moab 6) infinipath/mpi/2.3.1 9) intel/cce/11.0.081 12) default-infinipath

Then I compiled the same sample program:

@bindloe02 pcheevx]$ make a.out
ifort -c sample_pcheevx_call.f
mpiifort -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group sample_pcheevx_call.o -o a.out

I then tried to run:

@bindloe02 pcheevx]$ mpirun -np 2 ./a.out
WARNING: Unable to read mpd.hosts or list of hosts isn't provided. MPI job will be run on the current machine only.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
rank 0 in job 1 bindloe02_40727 caused collective abort of all ranks
exit status of rank 0: return code 174

I would really appreciate it if you could help me find out what is wrong.

Meanwhile I will try to install MPICH2 version 1.0.x on my machine and see whether that changes the results compared to Open MPI 1.2.9.

Many thanks,

Aleksey


aleksey239
Beginner
Hi,

I have now installed MPICH2 and tried the same program with it. It does not work either: again I get a segmentation fault. Here are the details:

pcheevx$ make a.out
ifort -I/opt/intel/Compiler/11.1/072/mkl/include -C -O0 -c sample_pcheevx_call.f
mpif90 -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group sample_pcheevx_call.o -o a.out

pcheevx$ mpiexec -np 2 ./a.out
/opt/mpich2-1.0.8/bin/mpdlib.py:8: DeprecationWarning: The popen2 module is deprecated. Use the subprocess module.
import sys, os, signal, popen2, socket, select, inspect
/opt/mpich2-1.0.8/bin/mpdlib.py:15: DeprecationWarning: the md5 module is deprecated; use hashlib instead
from md5 import new as md5new
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libmkl_scalapack_ 00007F89E60B2549 Unknown Unknown Unknown
rank 0 in job 2 aa-laptop_47044 caused collective abort of all ranks
exit status of rank 0: return code 174

I really hope that it is possible to find a solution! All I want is to be able to run pcheevx under at least one MPI implementation.

Meanwhile my next steps are to learn how to debug the example in parallel, then to copy the example into another program, comment everything out, and add it back bit by bit to see where the segmentation fault actually appears. I have read somewhere that the problem may not be the call to PCHEEVX but the call to something like BLACS_GRIDINIT.
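
One possible way to do the parallel debugging step is sketched below, assuming Open MPI, gdb, and an X display for xterm are available. The `run` and `debug_session` helpers and the `DRY_RUN` switch are hypothetical names introduced for this sketch; the install path and library list are copied from the commands earlier in this thread. By default the script only prints the commands, so they can be checked before anything is executed.

```shell
#!/bin/sh
# Sketch: rebuild the sample with debug info and start each rank under gdb.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN=${DRY_RUN:-1}
MKL=/opt/intel/Compiler/11.1/072/mkl   # install path used in the commands above
LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core"

# Echo a command, and execute it only when DRY_RUN is 0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = 0 ]; then "$@"; fi
}

debug_session() {
    # -g keeps symbols for gdb; -O0 avoids optimisation that obscures the source order.
    run ifort -g -O0 -traceback -I"$MKL/include" -c sample_pcheevx_call.f &&
    run mpif90 -g -L"$MKL/lib/em64t" -Wl,--start-group $LIBS -Wl,--end-group \
        sample_pcheevx_call.o -o a.out &&
    # One xterm per rank, each running gdb on the executable.
    run mpirun -np 2 xterm -e gdb ./a.out
}

debug_session
```

Inside each gdb window, typing `run` starts that rank and `bt` after the crash prints the call chain, which should show which call in the sample leads into the failing library routine.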

Many thanks,

Aleksey




aleksey239
Beginner
Hi,

I have now tried to use compiler options to understand why the PCHEEVX example produces a segmentation fault.

I used Open MPI and got the following:

pcheevx$ make a.out
ifort -check bounds -traceback -warn -O0 -c sample_pcheevx_call.f
mpif90 -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group sample_pcheevx_call.o -o a.out

pcheevx$ mpirun -H aa-laptop,aa-laptop ./a.out
[aa-laptop:03683] *** Process received signal ***
[aa-laptop:03683] Signal: Segmentation fault (11)
[aa-laptop:03683] Signal code: Address not mapped (1)
[aa-laptop:03683] Failing at address: 0x1
[aa-laptop:03683] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f6aa5bd98f0]
[aa-laptop:03683] [ 1] /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_scalapack_lp64.so(pclaprnt_print9999_+0x9) [0x7f6aa8633549]
[aa-laptop:03683] *** End of error message ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
mca_btl_sm.so 00007F45A9B11C17 Unknown Unknown Unknown
mca_bml_r2.so 00007F45A9F1E1DA Unknown Unknown Unknown
libopen-pal.so.0 00007F45B07E6E8A Unknown Unknown Unknown
libmpi.so.0 00007F45B0CCF7B5 Unknown Unknown Unknown
mca_coll_tuned.so 00007F45A90E3975 Unknown Unknown Unknown
mca_coll_tuned.so 00007F45A90E824F Unknown Unknown Unknown
libmpi.so.0 00007F45B0CE4487 Unknown Unknown Unknown
a.out 0000000000418465 Unknown Unknown Unknown
libmkl_scalapack_ 00007F45B2793CD5 Unknown Unknown Unknown
mpirun noticed that job rank 0 with PID 3683 on node aa-laptop exited on signal 11 (Segmentation fault).


The above gives more information, but I do not know how to use it. Can anyone help?
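
One way to read the trace above: the Open MPI frames of the form `library(symbol+0xoffset)` name the routine that was executing when the signal arrived. A minimal sketch of pulling those names out of a saved log is shown here. `resolved_frames` is a hypothetical helper written for this sketch, and the two sample lines are copied from the output above.

```shell
#!/bin/sh
# Print "library symbol" for every frame of an Open MPI backtrace (read from
# standard input) that carries a symbol name. Frames look like:
#   [host:pid] [ 1] /path/lib.so(symbol_+0x9) [0xaddress]
# Frames without a symbol, such as "lib.so(+0xf8f0)", are skipped.
resolved_frames() {
    sed -n 's|^.*\] \(/[^ (]*\)(\([A-Za-z_][A-Za-z0-9_]*\)+0x[0-9a-f]*).*$|\1 \2|p'
}

resolved_frames <<'EOF'
[aa-laptop:03683] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f6aa5bd98f0]
[aa-laptop:03683] [ 1] /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_scalapack_lp64.so(pclaprnt_print9999_+0x9) [0x7f6aa8633549]
EOF
# prints:
# /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_scalapack_lp64.so pclaprnt_print9999_
```

The only named frame is `pclaprnt_print9999_` inside the MKL ScaLAPACK library, so a reasonable next check is `grep -n -i pclaprnt sample_pcheevx_call.f` to see whether and where the sample calls that routine, and with which arguments.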

Many thanks,

Aleksey