aleksey239
Beginner
101 Views

Segmentation fault with ScaLAPACK pcheevx example - please help!

Hi,

I have recently installed Linux (Ubuntu 10.04), the latest Intel Fortran Pro (11.1/072) and Open MPI.

I can provide more details (in addition to what is below), but I do not know which details to post or how to get them from the system.

I am trying to run the ScaLAPACK examples.

The simplest example, example1.f from http://www.netlib.org/scalapack/examples/, works fine.

I then tried to run the example
sample_pcheevx_call.f

on the same system. It does NOT work and gives me the following segmentation fault error:


Compilation:

ifort -I/opt/intel/Compiler/11.1/072/mkl/include -c sample_pcheevx_call.f
mpif90 -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group sample_pcheevx_call.o -o a.out

Running:

mpirun -H aa-laptop,aa-laptop ./a.out
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libmkl_scalapack_ 00007F1845CBE549 Unknown Unknown Unknown
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 2455 on
node aa-laptop exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------


My system is:

uname -a
Linux aa-laptop 2.6.32-23-generic #37-Ubuntu SMP Fri Jun 11 08:03:28 UTC 2010 x86_64 GNU/Linux


Please let me know what further information I need to post and how to get it from the system.

Many thanks,

Aleksey
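
For readers wondering which details are worth posting for a crash like this, the script below is a minimal sketch of how the usual ones can be collected. The helper names report and collect_info are hypothetical. The mpif90 --showme option is specific to the Open MPI compiler wrappers (MPICH-style wrappers use -show instead), and ./a.out is the executable name used in this thread. Commands that are not installed are reported as missing rather than treated as errors.

```shell
#!/bin/sh
# Print a labelled section with the output of one command.
# A missing or failing command is noted in the output instead of aborting.
report() {
    label=$1
    shift
    printf '== %s ==\n' "$label"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || printf '(%s exited with a non-zero status)\n' "$1"
    else
        printf '%s: not found in PATH\n' "$1"
    fi
}

# Gather the details most often requested for an MPI/MKL crash report.
collect_info() {
    report "kernel" uname -a
    report "Fortran compiler" ifort -V
    report "MPI launcher" mpirun --version
    report "MPI wrapper link line (Open MPI only)" mpif90 --showme
    report "libraries loaded by the executable" ldd ./a.out
}

collect_info
```

The ldd section is often the most useful one, because it shows which MPI and MKL shared libraries the executable actually loads at run time, which may differ from the ones intended at link time.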





7 Replies
Gennady_F_Intel
Moderator

Which Open MPI version are you using?
--Gennady
aleksey239
Beginner

Hi Gennady,

I am using openmpi-1.4.2.

Aleksey

Gennady_F_Intel
Moderator

Aleksey, please check the Release Notes to see which Open MPI versions have been validated with the version of MKL you are using. If I am not mistaken, it is 1.2.*.
That may be the cause of this problem.
--Gennady
aleksey239
Beginner

Gennady,

Many thanks for the suggestion. Yes, you are right: the installed version, MKL 10.2 Update 5 for Linux* OS, was only tested with Open MPI 1.2.x.

I therefore downloaded and installed Open MPI 1.2.9.

The output for the same compilation, but with Open MPI 1.2.9, is the following:

----------------

mpirun -H aa-laptop,aa-laptop ./a.out
[aa-laptop:08655] *** Process received signal ***
[aa-laptop:08655] Signal: Segmentation fault (11)
[aa-laptop:08655] Signal code: Address not mapped (1)
[aa-laptop:08655] Failing at address: 0x1
[aa-laptop:08655] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f660844a8f0]
[aa-laptop:08655] [ 1] /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_scalapack_lp64.so(pclaprnt_print9999_+0x9) [0x7f660aea4549]
[aa-laptop:08655] *** End of error message ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libopen-pal.so.0 00007F094CFF5E52 Unknown Unknown Unknown
libmpi.so.0 00007F094D4DE7B5 Unknown Unknown Unknown
mca_coll_tuned.so 00007F09458F2975 Unknown Unknown Unknown
mca_coll_tuned.so 00007F09458F724F Unknown Unknown Unknown
libmpi.so.0 00007F094D4F3487 Unknown Unknown Unknown
a.out 0000000000418465 Unknown Unknown Unknown
libmkl_scalapack_ 00007F094EFA2CD5 Unknown Unknown Unknown
mpirun noticed that job rank 0 with PID 8655 on node aa-laptop exited on signal 11 (Segmentation fault).

--------

I would really appreciate it if you could give me any further suggestions.

Many thanks,

Aleksey

aleksey239
Beginner

Gennady,

I have now tried to run the same example on our Cambridge cluster (www.hpc.cam.ac.uk). I decided to use Intel MPI to avoid possible problems with Open MPI.

Below is what I did and the result.

First I loaded the required modules. I post the list below to show which versions of ifort, MKL and Intel MPI I am using:

bindloe02 pcheevx]$ module list
Currently Loaded Modulefiles:
 1) dot       4) gold/2.1.6.0           7) mpiexec/0.82         10) intel/fce/11.0.081     13) intel/impi/4.0.0.028
 2) torque    5) java/jdk1.6.0_20       8) global               11) intel/mkl/10.2.2.025
 3) moab      6) infinipath/mpi/2.3.1   9) intel/cce/11.0.081   12) default-infinipath

Then I compiled the same sample program:

@bindloe02 pcheevx]$ make a.out
ifort -c sample_pcheevx_call.f
mpiifort -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group sample_pcheevx_call.o -o a.out

I then tried to run:

@bindloe02 pcheevx]$ mpirun -np 2 ./a.out
WARNING: Unable to read mpd.hosts or list of hosts isn't provided. MPI job will be run on the current machine only.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
rank 0 in job 1 bindloe02_40727 caused collective abort of all ranks
exit status of rank 0: return code 174

I would really appreciate it if you could help me find out what is wrong.

Meanwhile I will try to install MPICH2 version 1.0.x on my machine and see whether that changes the results compared to Open MPI 1.2.9.

Many thanks,

Aleksey

aleksey239
Beginner

Hi,

I have now installed MPICH2 and tried the same program with it. It does not work either: again, I get a segmentation fault. Here are the details:

pcheevx$ make a.out
ifort -I/opt/intel/Compiler/11.1/072/mkl/include -C -O0 -c sample_pcheevx_call.f
mpif90 -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group sample_pcheevx_call.o -o a.out

pcheevx$ mpiexec -np 2 ./a.out
/opt/mpich2-1.0.8/bin/mpdlib.py:8: DeprecationWarning: The popen2 module is deprecated. Use the subprocess module.
import sys, os, signal, popen2, socket, select, inspect
/opt/mpich2-1.0.8/bin/mpdlib.py:15: DeprecationWarning: the md5 module is deprecated; use hashlib instead
from md5 import new as md5new
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libmkl_scalapack_ 00007F89E60B2549 Unknown Unknown Unknown
rank 0 in job 2 aa-laptop_47044 caused collective abort of all ranks
exit status of rank 0: return code 174

I really hope that a solution can be found! All I want is to be able to run pcheevx under at least one MPI implementation.

Meanwhile, my next steps are to learn how to debug the example in parallel, and then to copy the example into another program, comment out everything, and add the code back bit by bit to see where the segmentation fault actually appears. I have read somewhere that the problem may not be the call to PCHEEVX but a call to something like BLACS_GRIDINIT.

Many thanks,

Aleksey
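
One common way to debug a small MPI job in parallel is to start every rank under gdb in its own xterm window. The script below is a rough sketch of that approach, not a tested recipe for this system: it assumes gdb, xterm and an X display are available, and the helper names run and debug_ranks are hypothetical. The compile and link lines are the ones used elsewhere in this thread with -g added. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

debug_ranks() {
    # Rebuild the sample with debug symbols and without optimisation.
    run ifort -g -O0 -traceback -c sample_pcheevx_call.f
    # Relink with the same libraries as before (Open MPI variant of the link line).
    run mpif90 -g -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t \
        -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 \
        -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group \
        sample_pcheevx_call.o -o a.out
    # Start two ranks, each under gdb in a separate xterm window.
    run mpirun -np 2 xterm -e gdb ./a.out
}

debug_ranks
```

In each gdb window, typing run starts that rank, and bt after the crash prints the call stack. Frames inside the MKL libraries will probably still lack line information, but the frames from sample_pcheevx_call.f should show which call in the sample leads to the fault.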



aleksey239
Beginner

Hi,

I have now tried to use compiler options to understand why the PCHEEVX example produces a segmentation fault.

I used Open MPI and got the following:

pcheevx$ make a.out
ifort -check bounds -traceback -warn -O0 -c sample_pcheevx_call.f
mpif90 -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group sample_pcheevx_call.o -o a.out

pcheevx$ mpirun -H aa-laptop,aa-laptop ./a.out
[aa-laptop:03683] *** Process received signal ***
[aa-laptop:03683] Signal: Segmentation fault (11)
[aa-laptop:03683] Signal code: Address not mapped (1)
[aa-laptop:03683] Failing at address: 0x1
[aa-laptop:03683] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f6aa5bd98f0]
[aa-laptop:03683] [ 1] /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_scalapack_lp64.so(pclaprnt_print9999_+0x9) [0x7f6aa8633549]
[aa-laptop:03683] *** End of error message ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
mca_btl_sm.so 00007F45A9B11C17 Unknown Unknown Unknown
mca_bml_r2.so 00007F45A9F1E1DA Unknown Unknown Unknown
libopen-pal.so.0 00007F45B07E6E8A Unknown Unknown Unknown
libmpi.so.0 00007F45B0CCF7B5 Unknown Unknown Unknown
mca_coll_tuned.so 00007F45A90E3975 Unknown Unknown Unknown
mca_coll_tuned.so 00007F45A90E824F Unknown Unknown Unknown
libmpi.so.0 00007F45B0CE4487 Unknown Unknown Unknown
a.out 0000000000418465 Unknown Unknown Unknown
libmkl_scalapack_ 00007F45B2793CD5 Unknown Unknown Unknown
mpirun noticed that job rank 0 with PID 3683 on node aa-laptop exited on signal 11 (Segmentation fault).


The above gives more information, but I do not know how to use it. Can anyone help?

Many thanks,

Aleksey
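
One way to use the Open MPI backtrace above is to pull the library and symbol name out of each numbered frame. The filter below is a minimal sketch: the function name frames is hypothetical, and it assumes frame lines of the form path(symbol+0xOFFSET) [0xADDRESS] as printed in this thread. Frames that carry no symbol name, such as the libpthread one, are skipped.

```shell
#!/bin/sh
# Read backtrace lines on stdin and print "library symbol" for every
# frame that names a symbol; all other lines are dropped.
frames() {
    sed -n 's|^.*/\([^/(]*\)(\([A-Za-z_][A-Za-z0-9_]*\)+0x[0-9a-f]*).*$|\1 \2|p'
}

# The two frames reported in the thread.
frames <<'EOF'
[aa-laptop:08655] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f660844a8f0]
[aa-laptop:08655] [ 1] /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_scalapack_lp64.so(pclaprnt_print9999_+0x9) [0x7f660aea4549]
EOF
```

For the frames above this prints a single line, "libmkl_scalapack_lp64.so pclaprnt_print9999_". The symbol name begins with pclaprnt, which suggests that the fault occurs inside PCLAPRNT, the ScaLAPACK helper that prints a distributed complex matrix, rather than inside PCHEEVX itself. This is an inference from the symbol name only, not a confirmed diagnosis. If the sample does call PCLAPRNT, checking the arguments of those calls or temporarily commenting them out would be a reasonable first test.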