I have recently installed Linux (Ubuntu 10.04), the latest Intel Fortran Pro (11.1/072) and Open MPI.
I can give more details (in addition to what is below), but I do not know which details to post or how to get them from the system.
I am trying to run the ScaLAPACK examples.
The simplest example, example1.f from http://www.netlib.org/scalapack/examples/, works fine.
I then tried to run the example
sample_pcheevx_call.f
on the same system. It does NOT work and gives me the following segmentation fault error:
Compilation:
ifort -I/opt/intel/Compiler/11.1/072/mkl/include -c sample_pcheevx_call.f
mpif90 -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group sample_pcheevx_call.o -o a.out
Running:
mpirun -H aa-laptop,aa-laptop ./a.out
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libmkl_scalapack_ 00007F1845CBE549 Unknown Unknown Unknown
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 2455 on
node aa-laptop exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
My system is:
uname -a
Linux aa-laptop 2.6.32-23-generic #37-Ubuntu SMP Fri Jun 11 08:03:28 UTC 2010 x86_64 GNU/Linux
Please let me know what further information I need to post and how to get it from the system.
Many thanks,
Aleksey
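One way to gather the kind of details asked about above is a small shell script that prints the versions of the tools involved. This is only a sketch: the helper name `report` is made up for illustration, `./a.out` is assumed to be the executable built above, and `mpif90 --showme` is assumed to apply only when the wrapper comes from Open MPI.

```shell
#!/bin/sh
# Print a labelled section for one command. A missing tool or a failing
# command is noted in the output instead of aborting the whole report.
report() {
    echo "== $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(exit status $?)"
    else
        echo "(command not found: $1)"
    fi
}

report uname -a            # kernel and architecture
report ifort -V            # Intel Fortran compiler version
report mpirun --version    # MPI launcher version
report mpif90 --showme     # Open MPI wrappers: underlying compile/link line
report ldd ./a.out         # shared libraries the executable resolves to
```

The `ldd` section is probably the most useful one to post, since it shows which MPI and MKL libraries the program actually loads at run time.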
Hi Gennady,
I am using openmpi-1.4.2.
Aleksey
Gennady,
Many thanks for the suggestion. Yes, you are right: the installed version of MKL (10.2 Update 5 for Linux* OS) was only tested with Open MPI 1.2.x.
I therefore downloaded and installed Open MPI 1.2.9.
The output for the same compilation, but with Open MPI 1.2.9, is the following:
----------------
mpirun -H aa-laptop,aa-laptop ./a.out
[aa-laptop:08655] *** Process received signal ***
[aa-laptop:08655] Signal: Segmentation fault (11)
[aa-laptop:08655] Signal code: Address not mapped (1)
[aa-laptop:08655] Failing at address: 0x1
[aa-laptop:08655] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f660844a8f0]
[aa-laptop:08655] [ 1] /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_scalapack_lp64.so(pclaprnt_print9999_+0x9) [0x7f660aea4549]
[aa-laptop:08655] *** End of error message ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libopen-pal.so.0 00007F094CFF5E52 Unknown Unknown Unknown
libmpi.so.0 00007F094D4DE7B5 Unknown Unknown Unknown
mca_coll_tuned.so 00007F09458F2975 Unknown Unknown Unknown
mca_coll_tuned.so 00007F09458F724F Unknown Unknown Unknown
libmpi.so.0 00007F094D4F3487 Unknown Unknown Unknown
a.out 0000000000418465 Unknown Unknown Unknown
libmkl_scalapack_ 00007F094EFA2CD5 Unknown Unknown Unknown
mpirun noticed that job rank 0 with PID 8655 on node aa-laptop exited on signal 11 (Segmentation fault).
--------
I would really appreciate it if you could give me any further suggestions.
Many thanks,
Aleksey
Gennady,
I have now tried to run the same example on our Cambridge cluster (www.hpc.cam.ac.uk). I decided to use Intel MPI to avoid possible problems with Open MPI.
Below is what I did and the result.
First I loaded the required modules. I list them below to show which versions of ifort, MKL and Intel MPI I am using:
bindloe02 pcheevx]$ module list
Currently Loaded Modulefiles:
1) dot 4) gold/2.1.6.0 7) mpiexec/0.82 10) intel/fce/11.0.081 13) intel/impi/4.0.0.028
2) torque 5) java/jdk1.6.0_20 8) global 11) intel/mkl/10.2.2.025
3) moab 6) infinipath/mpi/2.3.1 9) intel/cce/11.0.081 12) default-infinipath
Then I compiled the same sample program:
@bindloe02 pcheevx]$ make a.out
ifort -c sample_pcheevx_call.f
mpiifort -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group sample_pcheevx_call.o -o a.out
I then tried to run:
@bindloe02 pcheevx]$ mpirun -np 2 ./a.out
WARNING: Unable to read mpd.hosts or list of hosts isn't provided. MPI job will be run on the current machine only.
forrtl: severe (174): SIGSEGV, segmentation fault occurred
rank 0 in job 1 bindloe02_40727 caused collective abort of all ranks
exit status of rank 0: return code 174
I would really appreciate it if you could help me find out what is wrong.
Meanwhile I will try to install MPICH2 version 1.0.x on my machine and see whether that changes the results compared to Open MPI 1.2.9.
Many thanks,
Aleksey
I have now installed MPICH2 and tried the same program with it. It does not work: again, I get a segmentation fault. Here are the details:
pcheevx$ make a.out
ifort -I/opt/intel/Compiler/11.1/072/mkl/include -C -O0 -c sample_pcheevx_call.f
mpif90 -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group sample_pcheevx_call.o -o a.out
pcheevx$ mpiexec -np 2 ./a.out
/opt/mpich2-1.0.8/bin/mpdlib.py:8: DeprecationWarning: The popen2 module is deprecated. Use the subprocess module.
import sys, os, signal, popen2, socket, select, inspect
/opt/mpich2-1.0.8/bin/mpdlib.py:15: DeprecationWarning: the md5 module is deprecated; use hashlib instead
from md5 import new as md5new
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libmkl_scalapack_ 00007F89E60B2549 Unknown Unknown Unknown
rank 0 in job 2 aa-laptop_47044 caused collective abort of all ranks
exit status of rank 0: return code 174
I really hope that it is possible to find a solution! All I want is to be able to run PCHEEVX under at least one MPI implementation.
Meanwhile my next steps are to learn how to debug the example in parallel, and then to copy the example into another program, comment everything out and add it back bit by bit to see where the segmentation fault actually appears. I have read somewhere that the problem may not be the call to PCHEEVX itself but a call to something like BLACS_GRIDINIT.
Many thanks,
Aleksey
I have now tried to use compiler options to understand why the PCHEEVX example produces a segmentation fault.
I used Open MPI and got the following:
pcheevx$ make a.out
ifort -check bounds -traceback -warn -O0 -c sample_pcheevx_call.f
mpif90 -L/opt/intel/Compiler/11.1/072/mkl/lib/em64t -Wl,--start-group -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group sample_pcheevx_call.o -o a.out
pcheevx$ mpirun -H aa-laptop,aa-laptop ./a.out
[aa-laptop:03683] *** Process received signal ***
[aa-laptop:03683] Signal: Segmentation fault (11)
[aa-laptop:03683] Signal code: Address not mapped (1)
[aa-laptop:03683] Failing at address: 0x1
[aa-laptop:03683] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f6aa5bd98f0]
[aa-laptop:03683] [ 1] /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_scalapack_lp64.so(pclaprnt_print9999_+0x9) [0x7f6aa8633549]
[aa-laptop:03683] *** End of error message ***
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
mca_btl_sm.so 00007F45A9B11C17 Unknown Unknown Unknown
mca_bml_r2.so 00007F45A9F1E1DA Unknown Unknown Unknown
libopen-pal.so.0 00007F45B07E6E8A Unknown Unknown Unknown
libmpi.so.0 00007F45B0CCF7B5 Unknown Unknown Unknown
mca_coll_tuned.so 00007F45A90E3975 Unknown Unknown Unknown
mca_coll_tuned.so 00007F45A90E824F Unknown Unknown Unknown
libmpi.so.0 00007F45B0CE4487 Unknown Unknown Unknown
a.out 0000000000418465 Unknown Unknown Unknown
libmkl_scalapack_ 00007F45B2793CD5 Unknown Unknown Unknown
mpirun noticed that job rank 0 with PID 3683 on node aa-laptop exited on signal 11 (Segmentation fault).
The above gives more information, but I do not know how to interpret it. Can anyone help?
Many thanks,
Aleksey
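One way to start reading such a backtrace is to pull out the frames that carry a symbol name. The sketch below does that with `sed`; the function name `extract_symbols` is made up for illustration, and the pattern assumes the `path/lib.so(symbol+0xOFFSET) [address]` frame layout seen in the Open MPI output above.

```shell
#!/bin/sh
# Read a backtrace on stdin and print "library symbol" for every frame
# that names a symbol. Frames with only an offset, such as
# "libpthread.so.0(+0xf8f0)", are skipped.
extract_symbols() {
    sed -n 's|.*/\([^/(]*\)(\([A-Za-z_][A-Za-z0-9_]*\)+0x[0-9a-f]*).*|\1 \2|p'
}

# Two frames copied from the log above.
extract_symbols <<'EOF'
[aa-laptop:03683] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f6aa5bd98f0]
[aa-laptop:03683] [ 1] /opt/intel/Compiler/11.1/072/mkl/lib/em64t/libmkl_scalapack_lp64.so(pclaprnt_print9999_+0x9) [0x7f6aa8633549]
EOF
# prints: libmkl_scalapack_lp64.so pclaprnt_print9999_
```

The only named frame is `pclaprnt_print9999_`, which appears to belong to PCLAPRNT, the ScaLAPACK helper that prints a distributed complex matrix, rather than to PCHEEVX itself. That is a reading of the symbol name, not a confirmed diagnosis, but it suggests checking the PCLAPRNT call in sample_pcheevx_call.f (for example by commenting it out temporarily) before suspecting the eigensolver.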