Problem when solving large system using Scalapack PDGESV

daren__wall · ‎05-30-2019

A parallel Fortran code that solves a set of simultaneous linear equations Ax = b using the ScaLAPACK routine PDGESV fails (exiting with a segmentation fault) when the number of equations, N, becomes large. I have not identified the exact value of N at which problems arise, but, for example, the code works for all the values I have tested up to N = 50000 and fails at N = 94423.

In particular, the failure appears to occur during the call to the ScaLAPACK routine (i.e. not when allocating / deallocating memory); it enters routine PDGESV, but does not leave this routine.

I have prepared a small Fortran example code (see attachment below) that exhibits this problem. This code simply 1) allocates space for the matrix A and vector b, 2) fills them with random entries, 3) calls PDGESV and then 4) deallocates the memory. The code has been tested on a variety of different matrix sizes (NxN) and with various BLACS process arrays without any errors until N becomes large.

The problem does not seem to be a lack of memory: the machine on which I execute the code has 192 GB available, whereas the code only uses 65 GB when N = 94423. I have tried using the 'ulimit -s unlimited' command, but this did not resolve the problem. My feeling is that there may instead be a problem with exceeding some default limit on the memory available to a single process in MPI, i.e. perhaps I am simply missing some appropriate flags at compilation / run time?
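
As a quick sanity check on the memory figure, here is a minimal Python sketch that estimates the storage for a dense N x N double-precision matrix. It assumes 8 bytes per entry and ignores the right-hand side, pivot vector and any library workspace; the 2x2 process grid is the one used in the runs in this post.

```python
# Rough memory estimate for a dense N x N double-precision matrix.
# Assumption: 8 bytes per entry; RHS vector, pivots and workspace are ignored.
BYTES_PER_DOUBLE = 8


def dense_matrix_bytes(n: int) -> int:
    """Total bytes needed to store an n x n matrix of doubles."""
    return n * n * BYTES_PER_DOUBLE


n = 94423
total = dense_matrix_bytes(n)
print(f"total: {total / 1e9:.1f} GB ({total / 2**30:.1f} GiB)")
# On a 2x2 process grid each process holds roughly a quarter of the matrix.
print(f"per process on a 2x2 grid: about {total / 4 / 2**30:.1f} GiB")
```

The total comes out in the mid 60s of GiB, which is in line with the roughly 65 GB observed and well below the 192 GB available.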

I am running the program on a Linux cluster using Red Hat Enterprise Linux Server release 7.3 (Maipo).

I compiled the following code with:

mpiifort -mcmodel=medium -m64 -mkl=cluster -o para.exe solve_by_lu_parallelmpi_simple_light2.for

and run it using (for example, when N = 9445)

mpiexec.hydra -n 4 ./para.exe 9445 2 2 32

The command line arguments here select N = 9445 and a 2x2 BLACS process array with block size 32.

For this smaller matrix size the program runs without any problems, producing the output

WE ARE SOLVING A SYSTEM OF         9445 LINEAR EQUATIONS
PROC:            0           0 HAS MLOC, NLOC =        4736        4736
PROC:            0           0 ALLOCATING SPACE ...
PROC:            0           0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC:            0           1 HAS MLOC, NLOC =        4736        4709
PROC:            0           1 ALLOCATING SPACE ...
PROC:            0           1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC:            1           0 HAS MLOC, NLOC =        4709        4736
PROC:            1           0 ALLOCATING SPACE ...
PROC:            1           0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC:            1           1 HAS MLOC, NLOC =        4709        4709
PROC:            1           1 ALLOCATING SPACE ...
PROC:            1           1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC:            1           1
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC:            1           0
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC:            0           1
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC:            0           0
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..

INFO code returned by PDGESV =            0
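
The MLOC / NLOC values printed above can be reproduced with a Python transcription of ScaLAPACK's NUMROC tool routine. This is only a sketch: it assumes the attached program computes its local dimensions with NUMROC and a source process of 0, which is not shown in this post.

```python
def numroc(n: int, nb: int, iproc: int, isrcproc: int, nprocs: int) -> int:
    """Python transcription of ScaLAPACK's NUMROC: number of rows or columns
    of a block-cyclically distributed dimension owned by process iproc."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of whole blocks
    count = (nblocks // nprocs) * nb       # blocks every process receives
    extrablks = nblocks % nprocs           # leftover whole blocks
    if mydist < extrablks:
        count += nb                        # this process gets one extra block
    elif mydist == extrablks:
        count += n % nb                    # this process gets the partial block
    return count


# Values taken from the run above: N = 9445, block size 32, 2x2 grid.
# Assumption: source process row/column is 0.
n, nb, nprow, npcol = 9445, 32, 2, 2
for myrow in range(nprow):
    for mycol in range(npcol):
        mloc = numroc(n, nb, myrow, 0, nprow)
        nloc = numroc(n, nb, mycol, 0, npcol)
        print(f"PROC: {myrow} {mycol} HAS MLOC, NLOC = {mloc} {nloc}")
```

With N = 94423 in place of 9445 the same function gives 47223 and 47200, matching the larger run below.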

So far so good. But when I try to solve a larger system using

mpiexec.hydra -n $NUM_PROCS ./para.exe 94423 2 2 32

the program crashes during the call to PDGESV with the output

WE ARE SOLVING A SYSTEM OF        94423 LINEAR EQUATIONS
PROC:            0           0 HAS MLOC, NLOC =       47223       47223
PROC:            0           0 ALLOCATING SPACE ...
PROC:            0           0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC:            0           1 HAS MLOC, NLOC =       47223       47200
PROC:            0           1 ALLOCATING SPACE ...
PROC:            0           1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC:            1           0 HAS MLOC, NLOC =       47200       47223
PROC:            1           0 ALLOCATING SPACE ...
PROC:            1           1 HAS MLOC, NLOC =       47200       47200
PROC:            1           1 ALLOCATING SPACE ...
PROC:            1           0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC:            1           1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC:            0           1
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC:            0           0
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC:            1           1
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC:            1           0
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..

forrtl: 致命的なエラー (154): 配列インデックスが境界外です。
Image              PC                Routine            Line        Source
libifcore.so.5     00002B0D716C19AF for__signal_handl     Unknown Unknown
libpthread-2.17.s 00002B0D712335D0 Unknown               Unknown Unknown
libmkl_avx512.so   00002B11A45E5A47 mkl_blas_avx512_x     Unknown Unknown
libmkl_intel_lp64 00002B0D68E8BB55 dger_                 Unknown Unknown
libmkl_scalapack_ 00002B0D69F972AE pdger_                Unknown Unknown
libmkl_scalapack_ 00002B0D69E53541 pdgetf3_              Unknown Unknown
libmkl_scalapack_ 00002B0D69E53688 pdgetf3_              Unknown Unknown
libmkl_scalapack_ 00002B0D69C2E13B pdgetf2_              Unknown Unknown
libmkl_scalapack_ 00002B0D69C2E836 pdgetrf2_             Unknown Unknown
libmkl_scalapack_ 00002B0D6A014F6E pdgetrf_              Unknown Unknown
libmkl_scalapack_ 00002B0D69C29C7D pdgesv_               Unknown Unknown
para.exe           0000000000401F8C Unknown               Unknown Unknown
para.exe           00000000004011BE Unknown               Unknown Unknown
libc-2.17.so       00002B0D73DFC3D5 __libc_start_main     Unknown Unknown
para.exe           00000000004010C9 Unknown               Unknown Unknown

The first error line, beginning "forrtl:", can be translated as

forrtl: Fatal error (154): Array index out of bounds.

The problem seems to be occurring somewhere in the ScaLAPACK routines.

Does anyone have any recommendations / possible solutions?

Any advice or pointers will be gratefully received,

Many thanks,

Dan.

Gennady_F_Intel · ‎05-31-2019

Please try linking with the ILP64 API and recheck the behavior on your side.

daren__wall · ‎06-02-2019

Hi there,

I have now compiled instead with
mpiifort -mcmodel=medium -m64 -ilp64 -mkl=cluster -o para.exe solve_by_lu_parallelmpi_simple_light2.for

but unfortunately seem to get a similar error (again the error occurs somewhere within the call to PDGESV):

WE ARE SOLVING A SYSTEM OF        94423 LINEAR EQUATIONS
PROC:            0           0 HAS MLOC, NLOC =       47223       47223
PROC:            0           0 ALLOCATING SPACE ...
PROC:            0           1 HAS MLOC, NLOC =       47223       47200
PROC:            0           1 ALLOCATING SPACE ...
PROC:            1           0 HAS MLOC, NLOC =       47200       47223
PROC:            1           0 ALLOCATING SPACE ...
PROC:            1           1 HAS MLOC, NLOC =       47200       47200
PROC:            1           1 ALLOCATING SPACE ...
PROC:            1           0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC:            1           1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC:            0           0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC:            0           1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC:            0           0
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC:            0           1
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC:            1           1
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC:            1           0
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
forrtl: 致命的なエラー (154): 配列インデックスが境界外です。
Image              PC                Routine            Line        Source
libifcore.so.5     00002B009E4AC9AF for__signal_handl     Unknown Unknown
libpthread-2.17.s 00002B009E01E5D0 Unknown               Unknown Unknown
libmkl_avx512.so   00002B04D38E0A47 mkl_blas_avx512_x     Unknown Unknown
libmkl_intel_lp64 00002B0095A41B55 dger_                 Unknown Unknown
libmkl_scalapack_ 00002B0096B4D2AE pdger_                Unknown Unknown
libmkl_scalapack_ 00002B0096A09541 pdgetf3_              Unknown Unknown
libmkl_scalapack_ 00002B0096A09688 pdgetf3_              Unknown Unknown
libmkl_scalapack_ 00002B00967E413B pdgetf2_              Unknown Unknown
libmkl_scalapack_ 00002B00967E4836 pdgetrf2_             Unknown Unknown
libmkl_scalapack_ 00002B0096BCAF6E pdgetrf_              Unknown Unknown
libmkl_scalapack_ 00002B00967DFC7D pdgesv_               Unknown Unknown
para.exe           0000000000401F9C Unknown               Unknown Unknown
para.exe           00000000004011CE Unknown               Unknown Unknown
libc-2.17.so       00002B00A0BE73D5 __libc_start_main     Unknown Unknown
para.exe           00000000004010D9 Unknown               Unknown Unknown
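
One thing worth noting in this second traceback is that the Image column still lists libmkl_intel_lp64, so the executable appears to be using the LP64 interface despite the -ilp64 flag. A small Python sketch of that check follows; the helper name is hypothetical, and it relies on the observation that image names in this trace appear truncated to 17 characters (so an ILP64 interface library would show up as libmkl_intel_ilp6).

```python
# Image column entries copied from the second traceback above.
# Names in this trace appear truncated to 17 characters.
trace_images = [
    "libmkl_avx512.so",
    "libmkl_intel_lp64",
    "libmkl_scalapack_",
]


def interface_layer(images):
    """Hypothetical helper: return 'ilp64', 'lp64', or None depending on
    which MKL interface library name appears in the traceback."""
    for name in images:
        if name.startswith("libmkl_intel_ilp6"):
            return "ilp64"
        if name.startswith("libmkl_intel_lp64"):
            return "lp64"
    return None


print(interface_layer(trace_images))
```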

Gennady_F_Intel · ‎06-02-2019

This is not exactly what I meant when I asked you to check whether the problem exists with the ILP64 API. Please take a look at what the MKL Link Line Advisor suggests for linking properly in the ILP64 case.

daren__wall · ‎06-04-2019

Many thanks for your suggestion re: the mkl link adviser.

There were a few possible choices of how to link the code; I had an idea that dynamic linking with OpenMP threading might be the best option, but I compiled and executed a number of possible options (10 in all).

The good news is that all ten choices listed below led to successful execution; problem solved!

I give the actual compilation commands below, along with the execution wall times, in case they are of interest to other programmers. The conclusions are that OpenMP (unsurprisingly) offers a significant speedup over sequential, and that a dynamically linked code slightly outperforms a statically linked code, all other options being the same.

For those intending to call PDGESV from their codes, I believe the Fortran program attached above makes a good, compact, scalable test program; please use it freely.

Many thanks once again for your assistance; it is much appreciated. Perhaps you could offer a concise sentence to explain why the different linking options used below, as suggested by the link adviser, resolved the problem: is it fair to say it was a large integer problem?
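
On the "large integer" question, the arithmetic at least points that way. The Python sketch below compares the number of elements in each process's local block (MLOC x NLOC, as printed in the outputs above) with the largest value of a 4-byte integer, which is what the LP64 interface uses. It does not show what happens inside MKL; it only checks whether a local element count or offset could still be held in a 32-bit integer.

```python
import math

INT32_MAX = 2**31 - 1  # largest value of a 4-byte (default) Fortran INTEGER


def local_elements(mloc: int, nloc: int) -> int:
    """Number of matrix entries stored locally by one process."""
    return mloc * nloc


# Local block sizes taken from the N = 9445 and N = 94423 outputs above.
for mloc, nloc in [(4736, 4736), (47223, 47223), (47200, 47200)]:
    count = local_elements(mloc, nloc)
    status = "exceeds" if count > INT32_MAX else "fits in"
    print(f"{mloc} x {nloc} = {count} elements, {status} a 32-bit integer")

# Largest square local block whose element count still fits in 32 bits.
largest_safe_dim = math.isqrt(INT32_MAX)
print(f"largest safe square local dimension: {largest_safe_dim}")
```

On a 2x2 grid this puts the threshold at a local dimension of about 46340, i.e. N of roughly 92700, which is consistent with N = 50000 working and N = 94423 failing.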

---------------------------------------------------------------------------------

Compilation commands and execution times (using Intel compiler version 18.3):

[1] (we add -mcmodel=medium to the link adviser suggestion; dynamic linking, OpenMP and explicit linking to MKL):

Execution wall clock time: 18 mins 5 secs
mpiifort -mcmodel=medium -i8 -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl -o para01.exe solve_by_lu_parallelmpi_simple_light2.for -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl

[2] (we add mcmodel and m64; dynamic linking, OpenMP and explicit linking to MKL):

Execution wall clock time: 18 mins 2 secs

mpiifort -mcmodel=medium -m64 -i8 -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl -o para02.exe solve_by_lu_parallelmpi_simple_light2.for -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl

[3] (we only add mcmodel; static linking, OpenMP and explicit linking to the MKL libraries):

Execution wall clock time: 18 mins 33 secs

mpiifort -mcmodel=medium -i8 -I${MKLROOT}/include ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl -o para03.exe solve_by_lu_parallelmpi_simple_light2.for ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl

[4] (we only add mcmodel and m64; static linking, OpenMP and explicit linking to the MKL libraries):

Execution wall clock time: 18 mins 33 secs
mpiifort -mcmodel=medium -m64 -i8 -I${MKLROOT}/include ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl -o para04.exe solve_by_lu_parallelmpi_simple_light2.for ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl

[5] (just mcmodel added; dynamic linking, sequential (no OpenMP)):

Execution wall clock time: 56 mins 15 secs

mpiifort -mcmodel=medium -i8 -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lpthread -lm -ldl -o para05.exe solve_by_lu_parallelmpi_simple_light2.for -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lpthread -lm -ldl

[6] (we only add mcmodel and m64; dynamic linking, sequential (no OpenMP)):

Execution wall clock time: 56 mins 10 secs

mpiifort -mcmodel=medium -m64 -i8 -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lpthread -lm -ldl -o para06.exe solve_by_lu_parallelmpi_simple_light2.for -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lpthread -lm -ldl

[7] (just mcmodel added; static linking, sequential (no OpenMP)):

Execution wall clock time: 1 hour 5 mins

mpiifort -mcmodel=medium -i8 -I${MKLROOT}/include ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -lpthread -lm -ldl -o para07.exe solve_by_lu_parallelmpi_simple_light2.for ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -lpthread -lm -ldl

[8] (just mcmodel and m64 added; static linking, sequential (no OpenMP)):

Execution wall clock time: 1 hour 5 mins

mpiifort -mcmodel=medium -m64 -i8 -I${MKLROOT}/include ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -lpthread -lm -ldl -o para08.exe solve_by_lu_parallelmpi_simple_light2.for ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_sequential.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -lpthread -lm -ldl

[9] (no additions by me; dynamic linking, OpenMP, no mcmodel):

Execution wall clock time: 18 mins 3 secs

mpiifort -i8 -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl -o para09.exe solve_by_lu_parallelmpi_simple_light2.for -L${MKLROOT}/lib/intel64 -lmkl_scalapack_ilp64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl

[10] (no additions by me; static linking, OpenMP, no mcmodel):

Execution wall clock time: 18 mins 30 secs

mpiifort -i8 -I${MKLROOT}/include ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl -o para10.exe solve_by_lu_parallelmpi_simple_light2.for ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
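
To put numbers on the comparison, here is a short Python sketch that averages the wall clock times reported for runs [1] to [10] by linking mode and threading layer. The times are transcribed from the list above; the number of OpenMP threads per MPI process is not recorded here, so the speedup figures only describe these particular runs.

```python
# Wall clock times in seconds, transcribed from runs [1]..[10] above.
RUNS = {
    1: ("dynamic", "openmp", 18 * 60 + 5),
    2: ("dynamic", "openmp", 18 * 60 + 2),
    3: ("static", "openmp", 18 * 60 + 33),
    4: ("static", "openmp", 18 * 60 + 33),
    5: ("dynamic", "sequential", 56 * 60 + 15),
    6: ("dynamic", "sequential", 56 * 60 + 10),
    7: ("static", "sequential", 65 * 60),
    8: ("static", "sequential", 65 * 60),
    9: ("dynamic", "openmp", 18 * 60 + 3),
    10: ("static", "openmp", 18 * 60 + 30),
}


def mean_seconds(linking: str, threading: str) -> float:
    """Average wall time over all runs with the given linking/threading mode."""
    times = [t for (link, thr, t) in RUNS.values()
             if link == linking and thr == threading]
    return sum(times) / len(times)


for linking in ("dynamic", "static"):
    omp = mean_seconds(linking, "openmp")
    seq = mean_seconds(linking, "sequential")
    print(f"{linking}: OpenMP {omp:.0f} s, sequential {seq:.0f} s, "
          f"speedup {seq / omp:.2f}x")
```

For these runs the OpenMP builds come out a little over three times faster than the sequential ones, and the dynamically linked builds are a few percent faster than the static ones.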