Intel® oneAPI Math Kernel Library

calling PARDISO in FORTRAN gives segmentation fault

pwoitke
Novice
1,426 Views

Dear Developers,

When I call PARDISO_64 from a small F90 test program (see below), everything works as expected. But when I try to solve a larger system A x = b (number of equations = 3500, number of matrix entries = 837969), I get a segmentation fault.

To verify my problem, please first compile and run the attached test.f90 as is, which should work. Then comment out the part that sets up the small test matrix and uncomment the part below it, which reads in a big matrix from a file called "fort.93" that can be found here: http://www-star.st-and.ac.uk/~pw31/fort.93

What am I doing wrong?

Thank you,
     Peter

 

! This is test.f90
!
! compile with: ifort -g -mkl=sequential test.f90

INCLUDE 'mkl_pardiso.f90' ! Include the standard pardiso header file

PROGRAM PardisoTest
use MKL_PARDISO
implicit none
TYPE(MKL_PARDISO_HANDLE),pointer :: PT(:) => null()
integer*8,allocatable,dimension(:) :: idum,ia,ja
real*8,allocatable,dimension(:) :: ddum,a,b,x
integer*8 :: Neq, Nmax
integer*8 :: maxfct, mnum, mtype, phase, nrhs, error, msglvl
integer*8 :: iparm(64)
integer*8 :: i,i1,i2,j,n,k
real*8 :: test

Neq = 8
Nmax = 19
allocate(a(Nmax),ia(Neq+1),ja(Nmax),b(Neq),x(Neq))
ia = (/1,5,8,10,12,15,17,18,20/)
ja = (/1,3,6,7,2,3,5,3,8,4,7,2,5,7,1,6,7,3,8/)
a = (/ 1.d0, 2.d0, 3.d0, 4.d0, 5.d0, 6.d0, 7.d0, 8.d0, 9.d0, &
10.d0,11.d0,12.d0,13.d0,14.d0,15.d0,16.d0,17.d0,18.d0,19.d0/)
b(1:Neq) = 1.d0
!( 1  .  2  .  .  3  4  . )
!( .  5  6  .  7  .  .  . )
!( .  .  8  .  .  .  .  9 )
!( .  .  . 10  .  . 11  . )
!( . 12  .  . 13  . 14  . )
!( 15 .  .  .  . 16  .  . )
!( .  .  .  .  .  . 17  . )
!( .  . 18  .  .  .  . 19 )

!print*,"reading sparse matrix from file ..."
!open(unit=93,file='fort.93',status='old')
!read(93,*) Neq,Nmax
!allocate(a(Nmax),ia(Neq+1),ja(Nmax),b(Neq),x(Neq))
!do k=1,Neq+1
! read(93,*) ia(k)
!enddo
!do k=1,Nmax
! read(93,*) ja(k),a(k)
!enddo
!do k=1,Neq
! read(93,*) b(k)
!enddo
!close(93)
!print*,b

allocate(PT(64),idum(0),ddum(0))
PT(1:64)%DUMMY = 0
iparm(1:64) = 0
iparm(1) = 0 ! solver default
iparm(8) = 2 ! max numbers of iterative refinement steps
error = 0    ! initialize error flag
msglvl = 1   ! print statistical information
mtype = 11   ! real unsymmetric sparse matrix
nrhs = 1
maxfct = 1
mnum = 1

phase = 11 ! reordering and symbolic factorization
CALL PARDISO_64(PT, maxfct, mnum, mtype, phase, Neq, &
     a, ia, ja, idum, nrhs, iparm, msglvl, ddum, ddum, error)
print*,"PHASE 11: error=",error

phase = 22 ! factorization
CALL PARDISO_64(pt, maxfct, mnum, mtype, phase, Neq, &
     a, ia, ja, idum, nrhs, iparm, msglvl, ddum, ddum, error)
print*,"PHASE 22: error=",error

phase = 33 ! back substitution and iterative refinement
CALL PARDISO_64(pt, maxfct, mnum, mtype, phase, Neq, &
     a, ia, ja, idum, nrhs, iparm, msglvl, b, x, error)
print*,"PHASE 33: error=",error

print*," ia=",ia
print*," ja=",ja
print*," a=",a
print*," b=",b
print*," x=",x

do n=1,Neq
  i1 = ia(n)
  i2 = ia(n+1)-1
  test = 0.d0
  do i=i1,i2
    j = ja(i)
    test = test + a(i)*x(j)
    print*,a(i),j
  enddo
  print*,test,b(n),test/b(n)-1.d0
enddo

end

 

9 Replies
Kirill_V_Intel
Employee
1,414 Views

Hi Peter,

Thanks for reporting the issue! I don't see anything wrong in your setup, I've reproduced the failure and we're going to identify the exact root cause. Currently I only see that it happens on your matrix when matching is ON (iparm(13) = 1).

As a quick workaround, you can try switching away from iparm(1)=0 (in fact, I strongly recommend against using it in general) and for now just set iparm(1) = 1 (no solver defaults). It worked for me, so let me know if this does not work for you.

Best,
Kirill

pwoitke
Novice
1,398 Views

Hello Kirill,

thank you for this comment. I tried iparm(1)=1, iparm(11)=1 and iparm(13)=1, but it made no difference.

I think I understand now what is wrong. It seems that the ja(:) entries between ia(k) and ia(k+1)-1 must, for each k, be sorted from low to high, which is currently not the case in my large matrix, whereas they are correctly sorted in my small example. If you run my example with iparm(27)=1, it reports the issue.

That is unfortunately not so easy to change for me. The way the matrix is collected (ray-based radiative transfer), the information is produced in an unsorted order, and I need to think about an efficient way to sort it.
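For illustration, the sorting can be done row by row with a small insertion sort that keeps ja(:) and a(:) in step. The following is only a sketch (it reuses the variable names and INTEGER*8 kinds of test.f90 above; it is not the code actually used in this thread):

subroutine sort_csr_rows(Neq, ia, ja, a)
  ! Sort the column indices within each CSR row in ascending order,
  ! permuting the matrix values accordingly.
  implicit none
  integer*8, intent(in)    :: Neq
  integer*8, intent(in)    :: ia(Neq+1)
  integer*8, intent(inout) :: ja(*)
  real*8,    intent(inout) :: a(*)
  integer*8 :: row, i, k, jtmp
  real*8    :: atmp
  do row = 1, Neq
    do i = ia(row)+1, ia(row+1)-1        ! insertion sort of one row
      jtmp = ja(i)
      atmp = a(i)
      k = i - 1
      do while (k >= ia(row))
        if (ja(k) <= jtmp) exit
        ja(k+1) = ja(k)
        a(k+1)  = a(k)
        k = k - 1
      enddo
      ja(k+1) = jtmp
      a(k+1)  = atmp
    enddo
  enddo
end subroutine sort_csr_rows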

Thanks, Peter

mecej4
Honored Contributor III
1,393 Views

Given that your CSR entries are presently not sorted by column within each row, please also check whether you could have multiple entries for some columns, as can happen when a global stiffness matrix is assembled from element matrices in an FEA program.
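For reference, once the column indices within each row are sorted, duplicates can be detected by comparing neighbouring entries. A sketch reusing the array names of test.f90 (not code from this thread):

! report duplicate column indices within each (already sorted) CSR row
do n = 1, Neq
  do i = ia(n)+1, ia(n+1)-1
    if (ja(i) == ja(i-1)) then
      print*, "duplicate entry: row", n, " column", ja(i)
    endif
  enddo
enddo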

Kirill_V_Intel
Employee
1,378 Views

Hi Peter,

Yes, this is most likely the root cause. As an alternative option to changing your matrix generation, you can have a look at mkl_sparse_order https://software.intel.com/content/www/us/en/develop/documentation/mkl-developer-reference-c/top/blas-and-sparse-blas-routines/inspector-executor-sparse-blas-routines/matrix-manipulation-routines/mkl-sparse-order.html. In order to use this routine, you can put your CSR data into a sparse matrix handle for IE SpBLAS via mkl_sparse_?_create_csr and then call mkl_sparse_order to sort the arrays.

As @mecej4 said, please also check for the duplicates. Currently, IE SpBLAS routines don't support duplicate entries for CSR matrices.

UPD: I forgot that you're using INTEGER*8. You cannot call IE SpBLAS with 64-bit integers while linking against the LP64 interface library; essentially, you need mkl_sparse_order from ILP64. So, to go this route and use IE SpBLAS, you'll need to switch to the ILP64 interface. This would affect other places in your code where you might be using MKL, so be careful about it. Let me know if you need more details about how to do that.
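A sketch of how that route could look, assuming the ILP64 interface (and compilation with -i8, so that the INTEGER*8 arrays of test.f90 match the integer kind expected by the mkl_spblas module). This is illustrative only, not tested code from the thread:

subroutine sort_csr_with_spblas(Neq, Nmax, ia, ja, a)
  use mkl_spblas
  implicit none
  integer*8, intent(in)    :: Neq, Nmax
  integer*8, intent(inout) :: ia(Neq+1), ja(Nmax)
  real*8,    intent(inout) :: a(Nmax)
  type(SPARSE_MATRIX_T) :: Ahandle
  integer :: stat
  ! wrap the existing CSR arrays in an IE SpBLAS handle (create_csr does not copy)
  stat = mkl_sparse_d_create_csr(Ahandle, SPARSE_INDEX_BASE_ONE, Neq, Neq, &
                                 ia(1:Neq), ia(2:Neq+1), ja, a)
  if (stat /= SPARSE_STATUS_SUCCESS) stop "mkl_sparse_d_create_csr failed"
  ! sort the column indices (and values) within each row; since the handle
  ! points at the user arrays, ja/a should come back sorted in place
  stat = mkl_sparse_order(Ahandle)
  if (stat /= SPARSE_STATUS_SUCCESS) stop "mkl_sparse_order failed"
  stat = mkl_sparse_destroy(Ahandle)
end subroutine sort_csr_with_spblas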

Best,
Kirill

pwoitke
Novice
1,371 Views

Thank you all for your comments.

I think the issue is solved. I have modified the sparse-matrix collection program, and it seems to work fine now. I have renewed http://www-star.st-and.ac.uk/~pw31/fort.93, and have optimised the iparm array a bit, and I get reasonable performance now, see test.f90 attached. There was no issue with duplicate entries.

Minor questions:

* Using iparm(2)=3 results in roughly a factor-10 performance increase, although the output still says "< Parallel Direct Factorization with number of processors: > 1". Is there a way to use more OpenMP processes to speed it up even further? The thing is that the program in which I want to use PARDISO needs to be compiled with -mkl=sequential. Does that disable using more OpenMP processes? Is there a way to override this option in just a certain part of my code?

* Do I understand correctly that, when the structure of the sparse matrix stays the same, I can skip phase 1 after the first call and start with phase 2? The thing is that, yes, the structure stays the same, but the matrix entries might change by orders of magnitude between calls.

* Are there any other tips and tricks for (i) increasing performance and (ii) improving accuracy beyond my settings of iparm(:) in the attached file?

Thank you for your assistance,
     Peter

Kirill_V_Intel
Employee
1,366 Views

Hi,

I'm glad you got the issue solved. Answering your questions:

1) Your statement is surprising to me; I would not expect any difference with iparm(2)=3 since you're using a sequential version of MKL. I'd suspect that something else was different in your runs.
Could you give a bit more detail about what you see: do you see the 10x speed-up for phase 1 separately? Are you using the same matrix? Actually, a 3500x3500 matrix is small, so it might even be faster to solve it as a dense matrix with LAPACK. If you increased the size of the matrix, how large was it (basically, could you share the output with msglvl=1)?
An even more important question is why you need to link with the sequential version of MKL. If you want to use OpenMP, it would make more sense to link with mkl_intel_thread, e.g., and set the number of threads to 1 via the MKL service routines, see e.g. https://software.intel.com/content/www/us/en/develop/documentation/mkl-developer-reference-c/top/support-functions/threading-control.html.
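For example (a sketch of that idea; the compile line mirrors the one at the top of this thread, and mkl_set_num_threads is the MKL service routine documented at the link above):

! link against the threaded MKL instead of the sequential one, e.g.
!   ifort -g -qopenmp -mkl=parallel test.f90
! then restrict MKL threading at run time wherever sequential behaviour is needed:
call mkl_set_num_threads(1)   ! all subsequent MKL calls use a single thread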

2) Not exactly. Since you use matching & scaling during phase 1, the result of phase 1 depends on the values of the matrix (these two features use the matrix values). So, the short answer would be no.

However, if you are fine with disabling matching and scaling, there are several high-level features of PARDISO which can be used when there are multiple systems with the same nonzero pattern.
First, you can store multiple factorizations inside a single handle, so you can do the reordering once and then the factorization (and solve) for each of the matrices via the maxfct and mnum parameters.
Second, if the values change only in a known small subset of the matrix, you can use the so-called "low rank" feature, see iparm(39). This allows re-using even the part of the factorization which remains unchanged. See also the example pardiso_lowrank_c.c
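As an illustration of the simplest form of such reuse (a sketch only, with matching and scaling disabled so that phase 11 does not depend on the matrix values; nsteps and step are hypothetical, the other names follow test.f90):

phase = 11              ! reordering + symbolic factorization, done once
CALL PARDISO_64(PT, maxfct, mnum, mtype, phase, Neq, &
     a, ia, ja, idum, nrhs, iparm, msglvl, ddum, ddum, error)

do step = 1, nsteps
  ! ... refill a(:) and b(:) with new values; ia/ja (the pattern) stay fixed ...
  phase = 23            ! numerical factorization + solve + iterative refinement
  CALL PARDISO_64(PT, maxfct, mnum, mtype, phase, Neq, &
       a, ia, ja, idum, nrhs, iparm, msglvl, b, x, error)
enddo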

3)
Performance: one of the main strengths of MKL PARDISO is the efficient multi-threaded implementation, so I'd definitely use that if possible. If we talk about sequential code only, I'd think about the options listed above which allow re-using some of the work when multiple systems are solved. If reordering is the bottleneck, one can try the VBSR format (see iparm[36] in zero-based C notation, i.e. iparm(37) in Fortran).
Also, if you can turn off matching & scaling, you may want to try the two-level factorization (iparm(24)=1 or 10). In most cases the two-level factorization performs better. When matching & scaling are ON, the classic factorization corresponding to iparm(24)=0 will always be used.

Accuracy: why do you want to increase it, do you have ill-conditioned systems and do you actually see a large error? An easy option would be to increase the number of iterative refinement steps. Typically, scaling and matching (and pivoting for symmetric matrices) help with accuracy issues, but you already have them turned on. That said, matching (and scaling, but less noticeably) can reduce performance, as it interferes with the final result of the reordering and often increases the fill-in.

Best,
Kirill

pwoitke
Novice
1,351 Views

Hello Kirill,

many thanks for all those remarks.  A lot of things to explain, so let's do it step by step:

1) just try it yourself. Use my last test.f90 and my last fort.93 and run it with iparm(2)=0; then it takes about 64 sec (about 60 sec in phase 1), but when I use iparm(2)=3, it's only 3.8 sec, which is a factor close to 20! When I link with -mkl=parallel, the computation time goes down further, which is great!

Still 1) Why use -mkl=sequential? OK, maybe you have given me an important hint there already; just to explain once more what I am after: large parts of my computational project are OpenMP-parallel, doing radiative transfer and astrochemistry. In these parallel parts there are many calls to MKL LAPACK routines where I need sequential computations, because the OpenMP threads all do their independent work. However, the part where I want to use PARDISO is in a non-parallel part of the program, so in principle there are many idle processes that could share the work. Now, what I do not know is how to tell MKL about these circumstances. Are there Fortran commands by which I can switch the MKL mode? Can you point me to an example F90 file where this is done? I will still check the link that you have given already. If I just use -mkl=parallel, then I am assuming that the above LAPACK calls would interfere with each other and fail, or is that not the case? I think I have actually never tried that, assuming that I absolutely need sequential LAPACK in all cases ...

2) I am not 100% sure, but I think that matching and scaling are important, because the physical equations to be solved need to simultaneously cover a wide range of conditions where densities and temperatures both vary by orders of magnitude, so I will probably not look into these extra options for the moment, thanks.

3) thanks for these ideas, will try them one by one.

Still 3) accuracy: yes, I do have other cases where accuracy is a major problem. I could generate another fort.93 to demonstrate such a case. So it may be a case of "ill-conditioned" as you say, but because of the physics, as described in (2), I don't know how to condition them better. The equations to be solved are just quite challenging. I will try more iterative refinement steps. What is a useful large number?

Thanks, Peter

Kirill_V_Intel
Employee
1,341 Views

Hi Peter,

1) I now understand. When you were running with iparm(1)=0, the nested dissection algorithm from METIS was used. Now that you have switched off the "default" mode and set iparm(1)=1, since you had iparm(2)=0, a minimum degree algorithm was used. The default, though, is iparm(2)=2, non-parallel METIS.
So, the speedup was not coming from using OpenMP but from using a better algorithm (nested dissection vs. the minimum degree algorithm).
I know that all these explanations look a bit messy, but historically we happened to have these options listed and used as they are now.
So, I strongly recommend always using either iparm(2)=2 or iparm(2)=3.

As for -mkl=sequential, I now understand your use case. I believe you can definitely use multi-threaded MKL for PARDISO and single-threaded LAPACK. But to do this, you need to link with parallel mkl and restrict the number of threads for LAPACK rather than link with sequential MKL.

Have a look at https://software.intel.com/content/www/us/en/develop/documentation/mkl-macos-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/calling-intel-mkl-functions-from-multi-threaded-applications.html.

Briefly, if you call an MKL function from a parallel region, MKL can only be parallel if OMP_NESTED=true. If disabling nested OpenMP parallelism is not an option, you can use mkl_domain_set_num_threads and make all LAPACK calls sequential.

The number of threads used by MKL functions can also additionally be reduced to one, if needed, by calling mkl_set_num_threads(1) or mkl_domain_set_num_threads(1, <domain name, e.g. MKL_DOMAIN_LAPACK>).

So, one option would be to disable nested parallelism before calling LAPACK and enable it again for calling PARDISO (the latter is needed only if you call PARDISO from a parallel region). You can check, by setting MKL_VERBOSE=1, that a sequential LAPACK is called once you make the changes.
The other option is to use mkl_domain_set_num_threads.
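A minimal sketch of the first option (nested-parallelism control), assuming the application is linked with the threaded MKL; the routine names are the standard OpenMP and MKL service calls mentioned above, everything else is placeholder:

program thread_control_sketch
  use omp_lib
  implicit none
  external mkl_set_num_threads

  ! OpenMP-parallel part of the application: with nesting disabled, the
  ! threaded LAPACK called from inside the region effectively runs on 1 thread.
  call omp_set_nested(.false.)
!$omp parallel
  ! ... independent per-thread work, each thread calling LAPACK ...
!$omp end parallel

  ! Serial part of the application: PARDISO is called outside any parallel
  ! region, so nesting does not matter here and it can use the full thread team.
  call mkl_set_num_threads(omp_get_max_threads())
  ! ... PARDISO phases 11/22/33 as in test.f90 ...
end program thread_control_sketch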

2) 3) You're right that typically large changes in the matrix values can bring problems, and scaling & matching should help. So it makes sense to use these options, but I'd probably first try without them to see; maybe the accuracy won't be bad. In general, if you have an example of a matrix which shows poor accuracy, I'd be interested to have a look and use it as one of our tests when we actively work on improving matching & scaling.

I can't suggest any good number of iterative refinement steps. Also, keep in mind that it might not help if the matrix is ill-conditioned. By the way, we have an option for iparm(8) to be negative, which uses higher precision for the residual computation and might help the iterative refinement.
Another option is to play with pivoting, iparm(10). You might want to check the number of perturbed pivots, though, before doing it (check iparm(14)).
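Concretely, in the notation of test.f90, these knobs would look something like this (illustrative values only):

iparm(8)  = -2     ! up to 2 refinement steps; the negative sign requests the
                   ! extended-precision residual accumulation mentioned above
iparm(10) = 13     ! pivoting perturbation threshold 10**(-13), the usual
                   ! value for unsymmetric matrices
! ... run phases 11/22/33 ...
print*, "number of perturbed pivots: iparm(14) =", iparm(14)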

Best,
Kirill

pwoitke
Novice
1,304 Views

Kirill,

many thanks for your comments. Setting OMP_NESTED=0 seems to work fine when I link with -mkl=parallel. I am also trying to scale both the solution vector and the r.h.s. with physically reasonable numbers to make things unit-free, so avoiding scaling and matching in PARDISO might be an option.

I have created a somewhat more challenging example, with the dimension of the linear equation system = 87477 and the number of entries in the sparse matrix = 16293396, which is 0.21% fullness. The matrix entries range from -83.6 to +1.0, and the right-hand-side entries are very small, up to about +/- 0.01. Please use the attached test.f90 and download (and gunzip) this file: http://www-star.st-and.ac.uk/~pw31/sparse.dat.gz. When you run it, you should get the output copied below. Questions:

  1.  I am trying to make PARDISO use 6 OpenMP processes, but it only uses 4 - why is that?
  2.  The worrying bit is that the quality check of the solution, see the last bits of my Fortran program, reports infinity. The solution still seems OK, at least it does help the main application to get something else converging, but this is a bit worrying. Is there a better way to check the "quality of the solution" (see the sketch after this list)?
  3. If you have any ideas to speed things up, or to make things more robust, please let me know.
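For reference, one check that avoids dividing by (possibly zero) right-hand-side entries, which is what produces the "Infinity" in the output below, is a relative residual norm. A sketch in the array names of test.f90 (rnorm, bnorm, Ax are new helper variables):

real*8 :: rnorm, bnorm, Ax
rnorm = 0.d0
bnorm = 0.d0
do n = 1, Neq
  Ax = 0.d0
  do i = ia(n), ia(n+1)-1
    Ax = Ax + a(i)*x(ja(i))    ! row n of A times the solution vector
  enddo
  rnorm = rnorm + (Ax - b(n))**2
  bnorm = bnorm + b(n)**2
enddo
print*, "relative residual ||Ax-b|| / ||b|| =", sqrt(rnorm)/sqrt(bnorm)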

Thanks, Peter

OMP_NESTED = 0
MKL_PARDISO_OOC_MAX_CORE_SIZE = 5000
reading sparse matrix from file ... done.
Hi from OpenMP thread 1
Hi from OpenMP thread 4
Hi from OpenMP thread 3
Hi from OpenMP thread 5
Hi from OpenMP thread 6
Hi from OpenMP thread 2
number of available OpenMP processes: 6
ooc_max_core_size got by Env=5000
The file ./pardiso_ooc.cfg was not opened
=== PARDISO is running in In-Core mode, because iparam(60)=1 and there is enough RAM for In-Core ===

Percentage of computed non-zeros for LL^T factorization
1 % 2 % 3 % 4 % 5 % 6 % 7 % 8 % 9 % 10 % 11 % 12 % 13 % 14 % 15 % 16 % 17 % 18 % 19 % 20 % 21 % 22 % 23 % 24 % 25 % 26 % 27 % 28 % 29 % 30 % 31 % 32 % 33 % 34 % 35 % 36 % 37 % 38 % 39 % 40 % 41 % 42 % 43 % 44 % 45 % 46 % 47 % 48 % 49 % 50 % 51 % 52 % 53 % 54 % 56 % 59 % 61 % 62 % 63 % 64 % 65 % 66 % 67 % 68 % 69 % 70 % 72 % 73 % 74 % 76 % 77 % 79 % 81 % 84 % 85 % 86 % 89 % 90 % 92 % 94 % 95 % 96 % 97 % 98 % 99 % 100 %

=== PARDISO: solving a real nonsymmetric system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON
Single-level factorization algorithm is turned ON


Summary: ( starting phase is reordering, ending phase is solution )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.359208 s
Time spent in reordering of the initial matrix (reorder) : 3.023916 s
Time spent in symbolic factorization (symbfct) : 1.810453 s
Time spent in data preparations for factorization (parlist) : 0.036601 s
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 38.837004 s
Time spent in direct solver at solve step (solve) : 0.161599 s
Time spent in allocation of internal data structures (malloc) : 1.288380 s
Time spent in additional calculations : 1.799395 s
Total time spent : 47.316556 s

Statistics:
===========
Parallel Direct Factorization is running on 4 OpenMP

< Linear system Ax = b >
number of equations: 87477
number of non-zeros in A: 16293396
number of non-zeros in A (%): 0.212924

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 96
number of independent subgraphs: 0
number of supernodes: 61228
size of largest supernode: 3628
number of non-zeros in L: 75399728
number of non-zeros in U: 73824437
number of non-zeros in L+U: 149224165
gflop for the numerical factorization: 490.937856
gflop/s for the numerical factorization: 12.640982

PHASE 13: error= 0
qual= Infinity

I am now trying to do the scaling myself.