daren__wall
Beginner
81 Views

Problem when solving large system using Scalapack PDGESV

A parallel Fortran code that solves a system of linear equations Ax = b using the ScaLAPACK routine PDGESV fails (exiting with a segmentation fault) when the number of equations, N, becomes large. I have not identified the exact value of N at which problems arise, but the code works for every value I have tested up to N = 50000 and fails at N = 94423.

In particular, the failure appears to occur during the call to the ScaLAPACK routine (i.e. not when allocating or deallocating memory): execution enters PDGESV but never returns from it.

I have prepared a small Fortran example (see attachment below) that exhibits this problem. The code simply 1) allocates space for the matrix A and vector b, 2) fills them with random entries, 3) calls PDGESV, and 4) deallocates the memory. It has been tested on a variety of matrix sizes (N x N) and with various BLACS process grids without any errors until N becomes large.

The problem does not seem to be a lack of memory: the machine I run the code on has 192 GB available, whereas the code only uses 65 GB when N = 94423. I have tried the 'ulimit -s unlimited' command, but this did not resolve the problem. My feeling is that I may instead be exceeding some default limit on the memory available to a single MPI process, i.e. perhaps I am simply missing some appropriate flags at compile time or run time?
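As a rough cross-check on the memory figures above, the storage needed for a dense N x N double-precision matrix can be estimated directly. This is a back-of-the-envelope sketch: the helper name matrix_gib is made up for illustration, and it counts only the matrix itself at 8 bytes per entry, ignoring the right-hand side and any workspace.

```shell
# Size of a dense n x n double-precision matrix, in whole GiB (rounded down).
# matrix_gib is a hypothetical helper, not part of the attached program.
matrix_gib() {
    n=$1
    echo $(( n * n * 8 / 1073741824 ))
}

matrix_gib 94423   # total across all processes; prints 66
```

That total is in line with the usage reported above and well under 192 GB, which supports the view that the crash is not a plain out-of-memory condition.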

I am running the program on a linux cluster using  Red Hat Enterprise Linux Server release 7.3 (Maipo)

I compiled the following code with:

mpiifort -mcmodel=medium    -m64  -mkl=cluster  -o para.exe  solve_by_lu_parallelmpi_simple_light2.for

 

and run it using (for example when N= 9445)

mpiexec.hydra  -n 4 ./para.exe  9445 2 2 32

The command-line arguments here select N = 9445 and a 2x2 BLACS process grid with block size 32.

For this smaller matrix size the program runs without any problems, producing the output

WE ARE SOLVING A SYSTEM OF         9445  LINEAR EQUATIONS
 PROC:            0           0 HAS  MLOC, NLOC =        4736        4736
 PROC:            0           0  ALLOCATING SPACE ...
 PROC:            0           0  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            0           1 HAS  MLOC, NLOC =        4736        4709
 PROC:            0           1  ALLOCATING SPACE ...
 PROC:            0           1  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            1           0 HAS  MLOC, NLOC =        4709        4736
 PROC:            1           0  ALLOCATING SPACE ...
 PROC:            1           0  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            1           1 HAS  MLOC, NLOC =        4709        4709
 PROC:            1           1  ALLOCATING SPACE ...
 PROC:            1           1  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            1           1
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            1           0
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            0           1
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            0           0
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 
 INFO code returned by PDGESV =            0
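The MLOC, NLOC values in the output above follow from the block-cyclic distribution. The sketch below mirrors the arithmetic of ScaLAPACK's NUMROC routine in shell; the function is an illustrative stand-in (not the library call), and it assumes the distribution starts on process coordinate 0.

```shell
# Rows (or columns) of a block-cyclically distributed dimension owned by one
# process. n = global size, nb = block size, iproc = this process's grid
# coordinate, nprocs = processes along that grid dimension.
# Assumption: the source process coordinate is 0.
numroc() {
    n=$1; nb=$2; iproc=$3; nprocs=$4
    nblocks=$(( n / nb ))                 # number of full blocks
    count=$(( nblocks / nprocs * nb ))    # full rounds every process receives
    extra=$(( nblocks % nprocs ))         # leftover full blocks
    if [ "$iproc" -lt "$extra" ]; then
        count=$(( count + nb ))           # one extra full block
    elif [ "$iproc" -eq "$extra" ]; then
        count=$(( count + n % nb ))       # the trailing partial block
    fi
    echo "$count"
}

numroc 9445 32 0 2   # prints 4736
numroc 9445 32 1 2   # prints 4709
```

The same function with n = 94423 gives 47223 and 47200, matching the local sizes printed in the failing run.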

So far so good. But when I try to solve a larger system (N = 94423) using

mpiexec.hydra -n $NUM_PROCS ./para.exe  94423 2 2 32

the program crashes during the call to PDGESV with the output

WE ARE SOLVING A SYSTEM OF        94423  LINEAR EQUATIONS
 PROC:            0           0 HAS  MLOC, NLOC =       47223       47223
 PROC:            0           0  ALLOCATING SPACE ...
 PROC:            0           0  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            0           1 HAS  MLOC, NLOC =       47223       47200
 PROC:            0           1  ALLOCATING SPACE ...
 PROC:            0           1  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            1           0 HAS  MLOC, NLOC =       47200       47223
 PROC:            1           0  ALLOCATING SPACE ...
 PROC:            1           1 HAS  MLOC, NLOC =       47200       47200
 PROC:            1           1  ALLOCATING SPACE ...
 PROC:            1           0  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            1           1  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            0           1
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            0           0
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            1           1
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            1           0
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..


forrtl: 致命的なエラー (154): 配列インデックスが境界外です。
Image              PC                Routine            Line        Source             
libifcore.so.5     00002B0D716C19AF  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B0D712335D0  Unknown               Unknown  Unknown
libmkl_avx512.so   00002B11A45E5A47  mkl_blas_avx512_x     Unknown  Unknown
libmkl_intel_lp64  00002B0D68E8BB55  dger_                 Unknown  Unknown
libmkl_scalapack_  00002B0D69F972AE  pdger_                Unknown  Unknown
libmkl_scalapack_  00002B0D69E53541  pdgetf3_              Unknown  Unknown
libmkl_scalapack_  00002B0D69E53688  pdgetf3_              Unknown  Unknown
libmkl_scalapack_  00002B0D69C2E13B  pdgetf2_              Unknown  Unknown
libmkl_scalapack_  00002B0D69C2E836  pdgetrf2_             Unknown  Unknown
libmkl_scalapack_  00002B0D6A014F6E  pdgetrf_              Unknown  Unknown
libmkl_scalapack_  00002B0D69C29C7D  pdgesv_               Unknown  Unknown
para.exe           0000000000401F8C  Unknown               Unknown  Unknown
para.exe           00000000004011BE  Unknown               Unknown  Unknown
libc-2.17.so       00002B0D73DFC3D5  __libc_start_main     Unknown  Unknown
para.exe           00000000004010C9  Unknown               Unknown  Unknown

the first error line beginning forrtl: can be translated as

forrtl: Fatal error (154): Array index out of bounds.

The problem seems to be occurring somewhere in the ScaLAPACK routines.

Does anyone have any recommendations / possible solutions ?

 Any advice or pointers will be gratefully received,

     Many thanks,

             Dan.

 

 

4 Replies
Gennady_F_Intel
Moderator
81 Views

Please try linking with the ILP64 API and recheck the behavior on your side.

daren__wall
Beginner
81 Views

Hi there,

I have now compiled instead with
 mpiifort -mcmodel=medium    -m64 -ilp64   -mkl=cluster  -o para.exe  solve_by_lu_parallelmpi_simple_light2.for

but unfortunately I seem to get a similar error (again, the error occurs somewhere within the call to PDGESV):


WE ARE SOLVING A SYSTEM OF        94423  LINEAR EQUATIONS
 PROC:            0           0 HAS  MLOC, NLOC =       47223       47223
 PROC:            0           0  ALLOCATING SPACE ...
 PROC:            0           1 HAS  MLOC, NLOC =       47223       47200
 PROC:            0           1  ALLOCATING SPACE ...
 PROC:            1           0 HAS  MLOC, NLOC =       47200       47223
 PROC:            1           0  ALLOCATING SPACE ...
 PROC:            1           1 HAS  MLOC, NLOC =       47200       47200
 PROC:            1           1  ALLOCATING SPACE ...
 PROC:            1           0  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            1           1  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            0           0  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            0           1  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            0           0
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            0           1
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            1           1
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            1           0
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
forrtl: 致命的なエラー (154): 配列インデックスが境界外です。
Image              PC                Routine            Line        Source             
libifcore.so.5     00002B009E4AC9AF  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B009E01E5D0  Unknown               Unknown  Unknown
libmkl_avx512.so   00002B04D38E0A47  mkl_blas_avx512_x     Unknown  Unknown
libmkl_intel_lp64  00002B0095A41B55  dger_                 Unknown  Unknown
libmkl_scalapack_  00002B0096B4D2AE  pdger_                Unknown  Unknown
libmkl_scalapack_  00002B0096A09541  pdgetf3_              Unknown  Unknown
libmkl_scalapack_  00002B0096A09688  pdgetf3_              Unknown  Unknown
libmkl_scalapack_  00002B00967E413B  pdgetf2_              Unknown  Unknown
libmkl_scalapack_  00002B00967E4836  pdgetrf2_             Unknown  Unknown
libmkl_scalapack_  00002B0096BCAF6E  pdgetrf_              Unknown  Unknown
libmkl_scalapack_  00002B00967DFC7D  pdgesv_               Unknown  Unknown
para.exe           0000000000401F9C  Unknown               Unknown  Unknown
para.exe           00000000004011CE  Unknown               Unknown  Unknown
libc-2.17.so       00002B00A0BE73D5  __libc_start_main     Unknown  Unknown
para.exe           00000000004010D9  Unknown               Unknown  Unknown
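Both tracebacks above still list libmkl_intel_lp64, which suggests that the MKL interface layer did not change between the two builds. For a dynamically linked executable, one way to check which interface layer is picked up is to inspect the `ldd` output. The helper below is a hypothetical sketch (the name mkl_interface is made up here); it only classifies the text it is given and will report "none" for a statically linked binary.

```shell
# Report which MKL interface layer appears in `ldd` output read from stdin.
# mkl_interface is an illustrative helper, not an Intel tool.
mkl_interface() {
    libs=$(cat)
    case "$libs" in
        *libmkl_intel_ilp64*) echo ilp64 ;;
        *libmkl_intel_lp64*)  echo lp64 ;;
        *)                    echo none ;;
    esac
}

# Usage (assumes the executable is in the current directory):
#   ldd ./para.exe | mkl_interface
```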

 

Gennady_F_Intel
Moderator
81 Views

This is not exactly what I meant when I asked you to check whether the problem persists with the ILP64 API. Please take a look at what the MKL Link Line Advisor suggests for linking properly in the ILP64 case.

daren__wall
Beginner
81 Views

Many thanks for your suggestion regarding the MKL Link Line Advisor.

There were several possible ways to link the code. I suspected that dynamic linking with OpenMP threading would be the best option, but I compiled and ran a number of variants (10 in all).

The good news is that all ten variants listed below ran successfully; problem solved!

I give the actual compilation commands below, along with the execution wall-clock times, in case they are of interest to other programmers. The conclusions are that OpenMP (unsurprisingly) offers a significant speedup over sequential MKL, and that dynamically linked code slightly outperforms statically linked code, all other options being the same.

For those intending to call PDGESV from their own codes, I believe the Fortran program attached above makes a good, compact, scalable test program; please use it freely.

Many thanks once again for your assistance; it is much appreciated. Perhaps you could offer a concise sentence to explain why the linking options used below, as suggested by the link advisor, resolved the problem. Is it fair to say it was a large-integer problem?
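One plausible reading of that question, not confirmed anywhere in this thread: with the LP64 interface, integers are 32-bit, so an index into a local array with more than 2147483647 entries cannot be represented. The sketch below simply multiplies the MLOC and NLOC values printed in the logs and compares the product with that limit. The function name fits_int32 is hypothetical, and the check says nothing about where inside MKL an overflow would actually occur.

```shell
# Largest value of a signed 32-bit integer (the LP64 integer size).
INT32_MAX=2147483647

# Does a local mloc x nloc block have more elements than a 32-bit index can address?
# fits_int32 is an illustrative helper, not part of MKL or the attached program.
fits_int32() {
    mloc=$1; nloc=$2
    elems=$(( mloc * nloc ))
    if [ "$elems" -le "$INT32_MAX" ]; then
        echo "$elems fits"
    else
        echo "$elems overflows"
    fi
}

fits_int32 4736 4736      # N = 9445 on a 2x2 grid
fits_int32 47223 47223    # N = 94423 on a 2x2 grid
```

On a 2x2 grid the local block passes the limit once its side exceeds roughly 46340, which would be consistent with N = 50000 working and N = 94423 failing.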

---------------------------------------------------------------------------------

Compilation commands and execution times (using Intel compiler version 18.3):

[1] (-mcmodel=medium added to the link advisor suggestion; dynamic linking, OpenMP, explicit linking to MKL)

Execution wall clock time: 18 mins  5 secs
 mpiifort -mcmodel=medium      -i8 -I${MKLROOT}/include  -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl  -o para01.exe  solve_by_lu_parallelmpi_simple_light2.for     -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl

 

[2] (-mcmodel=medium and -m64 added; dynamic linking, OpenMP, explicit linking to MKL)

Execution wall clock time: 18 mins 2 secs

mpiifort -mcmodel=medium    -m64   -i8 -I${MKLROOT}/include  -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl  -o para02.exe  solve_by_lu_parallelmpi_simple_light2.for     -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl

 

[3] (only -mcmodel=medium added; static linking, OpenMP, explicit linking to the MKL libraries)

Execution wall clock time: 18 mins 33 secs

 mpiifort  -mcmodel=medium   -i8 -I${MKLROOT}/include  ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl  -o para03.exe  solve_by_lu_parallelmpi_simple_light2.for  ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl

[4] (only -mcmodel=medium and -m64 added; static linking, OpenMP, explicit linking to the MKL libraries)

Execution wall clock time: 18 mins 33 secs
 mpiifort  -mcmodel=medium -m64    -i8 -I${MKLROOT}/include  ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl  -o para04.exe  solve_by_lu_parallelmpi_simple_light2.for  ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl  

[5] (only -mcmodel=medium added; dynamic linking, sequential (no OpenMP))

Execution wall clock time: 56 mins 15 secs

mpiifort   -mcmodel=medium   -i8 -I${MKLROOT}/include  -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lpthread -lm -ldl -o para05.exe  solve_by_lu_parallelmpi_simple_light2.for -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lpthread -lm -ldl


[6] (only -mcmodel=medium and -m64 added; dynamic linking, sequential (no OpenMP))

Execution wall clock time: 56 mins 10 secs

mpiifort   -mcmodel=medium  -m64  -i8 -I${MKLROOT}/include  -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lpthread -lm -ldl -o para06.exe  solve_by_lu_parallelmpi_simple_light2.for -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lpthread -lm -ldl


[7] (only -mcmodel=medium added; static linking, sequential (no OpenMP))

Execution wall clock time:  1 hour 5 mins

 mpiifort    -mcmodel=medium   -i8 -I${MKLROOT}/include  ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -lpthread -lm -ldl -o para07.exe  solve_by_lu_parallelmpi_simple_light2.for   ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -lpthread -lm -ldl

[8] (only -mcmodel=medium and -m64 added; static linking, sequential (no OpenMP))

Execution wall clock time: 1 hour 5 mins 

mpiifort    -mcmodel=medium -m64   -i8 -I${MKLROOT}/include  ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -lpthread -lm -ldl -o para08.exe  solve_by_lu_parallelmpi_simple_light2.for   ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -lpthread -lm -ldl

[9] (no additions by me, so no -mcmodel; dynamic linking, OpenMP)

Execution wall clock time: 18 mins 3 secs


 mpiifort    -i8 -I${MKLROOT}/include  -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl  -o para09.exe  solve_by_lu_parallelmpi_simple_light2.for     -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl

[10] (no additions by me, so no -mcmodel; static linking, OpenMP)

Execution wall clock time: 18 min 30 secs

mpiifort    -i8 -I${MKLROOT}/include  ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl  -o para10.exe  solve_by_lu_parallelmpi_simple_light2.for  ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
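To put numbers on the comparison, the wall-clock times listed above can be turned into speedup ratios. This is only arithmetic on the reported timings; the helper name speedup is made up for illustration, and the number of OpenMP threads used in these runs is not stated in the thread.

```shell
# Ratio of two wall-clock times (in seconds), printed to two decimals.
# speedup is a hypothetical helper used only for this comparison.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

speedup $(( 56*60 + 15 )) $(( 18*60 + 5 ))    # [5] sequential vs [1] OpenMP, dynamic
speedup $(( 65*60 ))      $(( 18*60 + 33 ))   # [7] sequential vs [3] OpenMP, static
speedup $(( 18*60 + 33 )) $(( 18*60 + 5 ))    # [3] static vs [1] dynamic, both OpenMP
```

The first two ratios come out a little above 3, and the static versus dynamic difference is a few percent, in line with the conclusions stated earlier in this post.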

 

 
