Intel® oneAPI Math Kernel Library

Segfault in scalapack pdgetrf_

fpinria
Beginner

Hi,


In order to compare several parallel dgetrf implementations, I'm trying to use the oneAPI 2025 ScaLAPACK pdgetrf_ function.
The problem is that for large enough matrices (say N > 120000), I get a segfault in the call to pdgetrf_:

/beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_avx512.so.2(mkl_blas_avx512_xidamax+0x67) [0x2b8c4bb6f817]
/beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_gnu_thread.so.2(mkl_blas_idamax+0x34b) [0x2b875380725b]
/beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_intel_lp64.so.2(idamax+0x3d) [0x2b8752a7827d]
/beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_scalapack_lp64.so.2(pdamax_+0x3d5) [0x2b87525a92e5]
/beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_scalapack_lp64.so.2(pdgetf3_+0x221) [0x2b875249dff1]
/beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_scalapack_lp64.so.2(pdgetf3_+0x44c) [0x2b875249e21c]
/beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_scalapack_lp64.so.2(pdgetf2_+0x18f) [0x2b8752266b2f]
/beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_scalapack_lp64.so.2(pdgetrf2_+0x6ea) [0x2b875226723a]
/beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_scalapack_lp64.so.2(pdgetrf_+0x97b) [0x2b875263d7fb]
/home/pruvost/git/dplasma/tools/cscalapack/tpdgetrf() [0x4025ce]

It smells like an int overflow somewhere in the ScaLAPACK code.
I also checked with the reference ScaLAPACK v2.2.2, and there is no segfault there.
I then tried linking with the ilp64 interface (64-bit integers); this fixed the segfault, but execution is slower than with the lp64 interface.
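
For context, the core of the test looks roughly like this (a simplified sketch of the int32/lp64 build, not the exact driver I use; the BLACS/ScaLAPACK prototypes are declared by hand):

/* Simplified sketch of the pdgetrf_ test (int32/lp64 build), not the exact driver */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void Cblacs_get( int icontxt, int what, int *val );
void Cblacs_gridinit( int *icontxt, char *order, int nprow, int npcol );
void Cblacs_gridinfo( int icontxt, int *nprow, int *npcol, int *myrow, int *mycol );
void Cblacs_gridexit( int icontxt );
int  numroc_( int *n, int *nb, int *iproc, int *isrcproc, int *nprocs );
void descinit_( int *desc, int *m, int *n, int *mb, int *nb, int *irsrc,
                int *icsrc, int *ictxt, int *lld, int *info );
void pdgetrf_( int *m, int *n, double *a, int *ia, int *ja, int *desca,
               int *ipiv, int *info );

int main( int argc, char **argv )
{
    int N = 144000, NB = 576, P = 2, Q = 4;       /* parameters of the failing run */
    int izero = 0, ione = 1, ictxt, myrow, mycol, info;

    MPI_Init( &argc, &argv );
    Cblacs_get( 0, 0, &ictxt );                   /* default system context */
    Cblacs_gridinit( &ictxt, "Row", P, Q );       /* P x Q process grid */
    Cblacs_gridinfo( ictxt, &P, &Q, &myrow, &mycol );

    /* local sizes for the 2D block-cyclic distribution */
    int mloc = numroc_( &N, &NB, &myrow, &izero, &P );
    int nloc = numroc_( &N, &NB, &mycol, &izero, &Q );
    int lld  = mloc > 1 ? mloc : 1;

    int descA[9];
    descinit_( descA, &N, &N, &NB, &NB, &izero, &izero, &ictxt, &lld, &info );

    double *A    = malloc( sizeof(double) * mloc * nloc );     /* local part of A    */
    int    *ipiv = malloc( sizeof(int) * (mloc + NB) );        /* pivots: LOCr(M)+MB */

    /* simple random fill of the local tiles */
    for ( long long k = 0; k < (long long)mloc * (long long)nloc; k++ )
        A[k] = (double)rand() / RAND_MAX;

    pdgetrf_( &N, &N, A, &ione, &ione, descA, ipiv, &info );   /* crashes here for N=144000 */

    if ( myrow == 0 && mycol == 0 )
        printf( "pdgetrf info = %d\n", info );

    free( A ); free( ipiv );
    Cblacs_gridexit( ictxt );
    MPI_Finalize();
    return 0;
}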

My questions:
1) Is this bug already known to the Intel developers?
2) Is it normal to get a segfault with the standard lp64 interface, given that the integer parameters passed to pdgetrf_ are much smaller than 2^31-1? If the problem is internal to oneAPI ScaLAPACK, it should be fixable, right (especially since it works with the reference ScaLAPACK)?
3) Is it normal for pdgetrf_ to run slower with the ilp64 interface than with lp64?

Thanks.

Some details about building and running:

# build int32
rm common.o pdgetrf.o
mpicc -m64 -I"${MKLROOT}/include" -O0 -g -o common.o -c common.c
mpicc -m64 -I"${MKLROOT}/include" -O0 -g -o pdgetrf.o -c pdgetrf.c
mpif77 -o tpdgetrf_int32 pdgetrf.o common.o -m64 -L${MKLROOT}/lib -lmkl_scalapack_lp64 -Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_openmpi_lp64 -lgomp -lpthread -lm -ldl

# build int64
rm common.o pdgetrf.o
mpicc  -DMKL_ILP64 -m64 -I"${MKLROOT}/include" -o common.o -c common.c
mpicc  -DMKL_ILP64 -m64 -I"${MKLROOT}/include" -o pdgetrf.o -c pdgetrf.c
mpif77 -o tpdgetrf_int64 pdgetrf.o common.o -m64 -L${MKLROOT}/lib -lmkl_scalapack_ilp64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_openmpi_ilp64 -lgomp -lpthread -lm -ldl

## Executed on 4 nodes (8 MPI processes, 1 per socket); each node has two 18-core Intel Xeon Gold 6240 (Cascade Lake) processors @ 2.6 GHz.

# run int32
mpiexec --map-by socket ./tpdgetrf_int32 -N 57600 -t 576 -nruns 1 -P 2 -Q 4
[****] TIMEHL(s) 36.26635 : dgetrf PxQ= 2 4 NB= 576 N= 57600 : 3512.907704 gflops
mpiexec --map-by socket ./tpdgetrf_int32 -N 115200 -t 576 -nruns 1 -P 2 -Q 4
[****] TIMEHL(s) 291.38776 : dgetrf PxQ= 2 4 NB= 576 N= 115200 : 3497.776452 gflops
mpiexec --map-by socket ./tpdgetrf_int32 -N 144000 -t 576 -nruns 1 -P 2 -Q 4
==== backtrace (tid: 99392) ====
0 /gnu/store/ifh6wl2mkx3bnx3v1zpy6qyv53lmaanl-ucx-1.17.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x2b87664ce004]
1 /gnu/store/ifh6wl2mkx3bnx3v1zpy6qyv53lmaanl-ucx-1.17.0/lib/libucs.so.0(+0x331bf) [0x2b87664ce1bf]
2 /gnu/store/ifh6wl2mkx3bnx3v1zpy6qyv53lmaanl-ucx-1.17.0/lib/libucs.so.0(+0x33486) [0x2b87664ce486]
3 /beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_avx512.so.2(mkl_blas_avx512_xidamax+0x67) [0x2b8c4bb6f817]
4 /beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_gnu_thread.so.2(+0x19ee5d) [0x2b8753806e5d]
5 /gnu/store/id7x9f869b94q9fnszmiy6dnvslpciyc-gfortran-11.4.0-lib/lib/libgomp.so.1(GOMP_parallel+0x46) [0x2b8758f34c26]
6 /beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_gnu_thread.so.2(mkl_blas_idamax+0x34b) [0x2b875380725b]
7 /beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_intel_lp64.so.2(idamax+0x3d) [0x2b8752a7827d]
8 /beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_scalapack_lp64.so.2(pdamax_+0x3d5) [0x2b87525a92e5]
9 /beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_scalapack_lp64.so.2(pdgetf3_+0x221) [0x2b875249dff1]
10 /beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_scalapack_lp64.so.2(pdgetf3_+0x44c) [0x2b875249e21c]
11 /beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_scalapack_lp64.so.2(pdgetf2_+0x18f) [0x2b8752266b2f]
12 /beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_scalapack_lp64.so.2(pdgetrf2_+0x6ea) [0x2b875226723a]
13 /beegfs/pruvost/intel/oneapi/mkl/2025.0/lib/libmkl_scalapack_lp64.so.2(pdgetrf_+0x97b) [0x2b875263d7fb]
14 /home/pruvost/git/dplasma/tools/cscalapack/tpdgetrf() [0x4025ce]
15 /gnu/store/hw6g2kjayxnqi8rwpnmpraalxi0djkxc-glibc-2.39/lib/libc.so.6(+0x29bf7) [0x2b875957cbf7]
16 /gnu/store/hw6g2kjayxnqi8rwpnmpraalxi0djkxc-glibc-2.39/lib/libc.so.6(__libc_start_main+0x7c) [0x2b875957ccac]
17 /home/pruvost/git/dplasma/tools/cscalapack/tpdgetrf() [0x402231]
=================================

# run int64
mpiexec --map-by socket ./tpdgetrf_int64 -N 57600 -t 576 -nruns 1 -P 2 -Q 4
[****] TIMEHL(s) 53.08035 : dgetrf PxQ= 2 4 NB= 576 N= 57600 : 2400.141236 gflops
mpiexec --map-by socket ./tpdgetrf_int64 -N 115200 -t 576 -nruns 1 -P 2 -Q 4
[****] TIMEHL(s) 298.32326 : dgetrf PxQ= 2 4 NB= 576 N= 115200 : 3416.459206 gflops
mpiexec --map-by socket ./tpdgetrf_int64 -N 144000 -t 576 -nruns 1 -P 2 -Q 4
[****] TIMEHL(s) 486.57299 : dgetrf PxQ= 2 4 NB= 576 N= 144000 : 4091.155198 gflops

Aleksandra_K
Moderator

Hi,

I wanted to let you know that I'm working on the issue and I'll come back to you with my progress.


Regards,

Aleksandra


Aleksandra_K
Moderator

Hi,

What is in the file flops.h?


fpinria
Beginner

Hi,

Thanks for your answer.

The given code is actually taken from dplasma, and flops.h is https://github.com/ICLDisco/dplasma/blob/master/src/flops.h

Aleksandra_K
Moderator

Hi, 

In pdgetrf.c, starting at line #59, there is

"A = malloc( sizeof(double)*mloc*nloc );"

which allocates mloc*nloc doubles on each process, i.e. N*N/np elements per rank.

Since you were running with np = 8 ranks:

- for N=115200, N*N/np = 115200*115200/8 = 1,658,880,000, which is within the range of a 32-bit int (-2,147,483,648 to 2,147,483,647);

- for N=144000, N*N/np = 144000*144000/8 = 2,592,000,000, which is greater than the highest possible 32-bit int value.
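
To illustrate the arithmetic, here is a small standalone check (hypothetical code, not taken from your driver):

/* Hypothetical check of the per-process element count N*N/np against INT_MAX */
#include <stdio.h>
#include <limits.h>

int main( void )
{
    long long np = 8;                         /* number of MPI ranks          */
    long long sizes[] = { 115200, 144000 };   /* the two matrix orders tested */

    for ( int i = 0; i < 2; i++ ) {
        long long n        = sizes[i];
        long long elements = n * n / np;      /* local element count per rank */
        printf( "N=%lld: N*N/np = %lld -> %s a 32-bit int (INT_MAX = %d)\n",
                n, elements,
                elements > INT_MAX ? "does not fit in" : "fits in", INT_MAX );
    }
    return 0;
}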

 

As for the reference ScaLAPACK, did you use any integer-related flags when building it?


fpinria
Beginner

Hi,

 

I've added a printf("allocate %ld double, address is %p\n", sizeof(double)*mloc*nloc, A); just after the malloc, and this is what I get:

 

mpiexec --map-by socket ./tpdgetrf_int32 -N 144000 -t 576 -nruns 1 -P 2 -Q 4

allocate 20901888000 double, address is 0x2b61b1758010
allocate 20901888000 double, address is 0x2b36362f5010
allocate 20570112000 double, address is 0x2b6f80778010
allocate 20901888000 double, address is 0x2b7aecf56010
allocate 20570112000 double, address is 0x2ac3e8423010
allocate 20901888000 double, address is 0x2b7dd5b56010
allocate 20570112000 double, address is 0x2b3b90f56010
allocate 20570112000 double, address is 0x2affac000010

 

Hence the matrix A is correctly allocated. This works because, in the expression sizeof(double)*mloc*nloc, evaluation proceeds left to right: sizeof yields a size_t, so mloc and nloc are converted to size_t before being multiplied, and the product does not overflow.

So my point is that I pass a correctly allocated matrix A, and the integer parameters m and n are also within range.

The bug is really inside MKL ScaLAPACK, maybe in an index calculation or in pointer arithmetic, I don't know.
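
To make the point explicit, here is a small standalone snippet (not the dplasma code; the local sizes 72000 x 36288 correspond to the 20901888000-byte allocations above):

/* Standalone illustration: the malloc size is computed in size_t and is fine,
 * while the same element count does not fit in a 32-bit int. */
#include <stdio.h>
#include <limits.h>

int main( void )
{
    int mloc = 72000, nloc = 36288;   /* local sizes matching the allocations above */

    /* sizeof(double) has type size_t, so mloc and nloc are converted to size_t
     * before the multiplications: no overflow, malloc receives the right size */
    size_t bytes = sizeof(double) * mloc * nloc;

    long long elements = (long long)mloc * nloc;   /* exact element count                */
    int       wrapped  = (int)elements;            /* what a 32-bit index would wrap to  */
                                                   /* on a typical target                */
    printf( "bytes for malloc   : %zu\n", bytes );
    printf( "mloc*nloc exact    : %lld (INT_MAX is %d)\n", elements, INT_MAX );
    printf( "mloc*nloc as int32 : %d\n", wrapped );
    return 0;
}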

Aleksandra_K
Moderator

Did you have a chance to check whether you used any integer-related flags when building the reference ScaLAPACK?


fpinria
Beginner

The reference ScaLAPACK was compiled with gcc 11.4.0 using these CMake flags: -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_VERBOSE_MAKEFILE=ON -DBUILD_SHARED_LIBS:BOOL=YES, so I don't see anything integer-related there.

If I pick one compilation line for instance:

/gnu/store/piai5ba5fx901hr7pwnz2b9vysrrjmlh-gfortran-11.4.0/bin/gfortran -DAdd_ -Dscalapack_EXPORTS -I/gnu/store/gzs3ykx44mfvigbrzjq8vxmpjvmwbdaw-openmpi-4.1.6/include -O2 -g -DNDEBUG -fPIC -c /tmp/guix-build-scalapack-2.2.2.drv-0/source/SRC/pzgetf2.f -o CMakeFiles/scalapack.dir/SRC/pzgetf2.f.o
[ 62%] Building Fortran object CMakeFiles/scalapack.dir/SRC/pzgetrf.f.o

Aleksandra_K
Moderator

You are absolutely right that the matrices are successfully allocated. However, for N >= 131072 the number of elements in the local matrix (N*N/8 with 8 processes) exceeds the maximum value of a 32-bit integer. The lp64 interface uses 32-bit integers for indexing arrays (in contrast to ilp64, which uses 64-bit integers), so such a matrix has too many entries to be indexed, which is most likely what causes the failure.
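
If you want one source that builds correctly against both interfaces, a common approach is to declare every integer passed to ScaLAPACK as MKL_INT (a sketch with a hand-written prototype and a hypothetical factor() helper; mkl_types.h defines MKL_INT, and mkl_scalapack.h carries the official declarations):

/* Sketch: use MKL_INT for all ScaLAPACK integer arguments.  mkl_types.h makes
 * MKL_INT a 64-bit integer when MKL_ILP64 is defined and a 32-bit int otherwise. */
#include <stdlib.h>
#include <mkl_types.h>

/* hand-written prototype; mkl_scalapack.h provides the official one */
void pdgetrf_( MKL_INT *m, MKL_INT *n, double *a, MKL_INT *ia, MKL_INT *ja,
               MKL_INT *desca, MKL_INT *ipiv, MKL_INT *info );

/* hypothetical helper: factor the local part of an N x N block-cyclic matrix */
MKL_INT factor( MKL_INT N, double *A, MKL_INT *descA, MKL_INT mloc, MKL_INT NB )
{
    MKL_INT ione = 1, info = 0;
    MKL_INT *ipiv = malloc( sizeof(MKL_INT) * (mloc + NB) );   /* LOCr(M) + MB pivots */

    pdgetrf_( &N, &N, A, &ione, &ione, descA, ipiv, &info );

    free( ipiv );
    return info;   /* 0 on success */
}

Built with -DMKL_ILP64 and linked against the *_ilp64 libraries, this uses 64-bit integers throughout; built without the define against the *_lp64 libraries, MKL_INT stays 32-bit, so the N=144000 case still needs ilp64.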


fpinria
Beginner

Ok, this is clear. Thanks for your time.
