mkl_pardiso time consumption during solution phase

Hazra__Dhiraj_Kumar · ‎03-05-2015

Hello,

I am working with a sparse matrix A and trying to solve Ax = b, where b is a double-precision array of 1 million elements. The matrix is very sparse (number of non-zeros in A (%): 0.000694). I am using mkl_pardiso.f90 with 4 OpenMP threads. The factorization takes about 16 minutes (954 s in the log below), and I expected the solution phase to be much shorter, but it has been running for more than one hour and has not finished. Is this normal? The output so far (up to the start of the solution phase) is given below. Can anybody share ideas for improving this? Any help will be much appreciated.

=== PARDISO: solving a real nonsymmetric system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON


Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.081618 s
Time spent in reordering of the initial matrix (reorder)         : 1.912543 s
Time spent in symbolic factorization (symbfct)                   : 1.559847 s
Time spent in data preparations for factorization (parlist)      : 0.075198 s
Time spent in allocation of internal data structures (malloc)    : 0.340503 s
Time spent in additional calculations                            : 0.220093 s
Total time spent                                                 : 4.189802 s

Statistics:
===========
Parallel Direct Factorization is running on 4 OpenMP

< Linear system Ax = b >
             number of equations:           1000000
             number of non-zeros in A:      6940000
             number of non-zeros in A (%): 0.000694

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 96
             number of independent subgraphs:  0
             number of supernodes:                    654600
             size of largest supernode:               11181
             number of non-zeros in L:                782333465
             number of non-zeros in U:                766664027
             number of non-zeros in L+U:              1548997492
 Reordering completed ... 
 Number of nonzeros in factors =  1548997492
 Number of factorization MFLOPS =  10596105
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===
Percentage of computed non-zeros for LL^T factorization
 0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100 

=== PARDISO: solving a real nonsymmetric system ===
Single-level factorization algorithm is turned ON


Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 953.925513 s
Time spent in allocation of internal data structures (malloc)    : 0.170726 s
Time spent in additional calculations                            : 0.089973 s
Total time spent                                                 : 954.186212 s

Statistics:
===========
Parallel Direct Factorization is running on 4 OpenMP

< Linear system Ax = b >
             number of equations:           1000000
             number of non-zeros in A:      6940000
             number of non-zeros in A (%): 0.000694

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 96
             number of independent subgraphs:  0
             number of supernodes:                    654600
             size of largest supernode:               11181
             number of non-zeros in L:                782333465
             number of non-zeros in U:                766664027
             number of non-zeros in L+U:              1548997492
             gflop   for the numerical factorization: 10596.105218

             gflop/s for the numerical factorization: 11.107896

 Factorization completed ...

Thanks,

Dhiraj

mecej4 · ‎03-05-2015

The number of non-zeros in the L and U factors is over 200 times the number of non-zeros in the original matrix. That is a lot of fill-in, and it is probably responsible for the solution phase taking an unexpectedly long time.
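
The fill-in figure can be reproduced from the numbers in the log above. The short sketch below only does arithmetic on values copied from that log; the variable names are made up for illustration.

```python
# Figures copied from the PARDISO log in the first post.
n = 1_000_000          # number of equations
nnz_a = 6_940_000      # number of non-zeros in A
nnz_l = 782_333_465    # number of non-zeros in L
nnz_u = 766_664_027    # number of non-zeros in U

nnz_lu = nnz_l + nnz_u           # total non-zeros stored in the factors
fill_ratio = nnz_lu / nnz_a      # how many times larger the factors are than A

print(f"nnz(L+U)               = {nnz_lu}")
print(f"fill-in ratio          = {fill_ratio:.1f}")
print(f"avg non-zeros per row A   = {nnz_a / n:.2f}")
print(f"avg non-zeros per row L+U = {nnz_lu / n:.0f}")
```

The sum of the L and U counts matches the "number of non-zeros in L+U" line that PARDISO prints, so the ratio is computed from the same quantity the solver reports.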

Hazra__Dhiraj_Kumar · ‎03-05-2015

Hello,

Thanks a lot for your explanation. My situation is that I have to do the factorization only once, and then the solution phase has to be repeated. Is there a way to compute the fill-in once and for all, so that the factors stay in memory (I guess this happens automatically unless phase=-1 is called)? Specifically, I want something between phases 22 and 33, i.e. the fill-in part done at the initialization stage, since only the array b is going to change in my program. Or does the fill-in computation need b as well?

Thanks again for your help,

Dhiraj

mecej4 · ‎03-06-2015

You can make one set of calls (or a single call) to complete the analysis and factorization phases (PHASE=11 and PHASE=22, or PHASE=12 for both) at the beginning of your run. Then you can write a loop in which you assign new values to the r.h.s. vector b (as in A.x = b), obtain the solution with PHASE=33, do what you want with the solution x, and go on to the next r.h.s.

However, given the large fill-in that occurs during factorization of your atypical matrix, it can very well happen that the solution phase consumes CPU time that is not negligible compared to the factorization phase, especially if multiple solutions are asked for.
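
The factor-once, solve-many pattern described above can be sketched in a few lines. This is only an illustration and not PARDISO: it uses a tiny dense LU written in plain Python as a stand-in, and lu_factor and lu_solve are hypothetical helpers invented for this sketch. The single lu_factor call plays the role of the analysis and factorization phases, and each lu_solve call in the loop plays the role of PHASE=33.

```python
def lu_factor(a):
    """Return (lu, perm): packed LU factors of a, with partial pivoting."""
    n = len(a)
    lu = [row[:] for row in a]      # work on a copy, leave a untouched
    perm = list(range(n))           # perm[i] = original row now at position i
    for k in range(n):
        # Choose the largest entry in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]    # multiplier, stored in the L part
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A x = b from stored factors: forward, then backward substitution."""
    n = len(lu)
    x = [b[perm[i]] for i in range(n)]   # apply the row permutation to b
    for i in range(n):                   # forward substitution with unit L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):         # backward substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


a = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]

lu, perm = lu_factor(a)                  # done once, outside the loop

solutions = []
for b in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [5.0, 6.0, 5.0]):
    solutions.append(lu_solve(lu, perm, b))   # only the r.h.s. changes

for x in solutions:
    print(x)
```

The point of the structure is that nothing inside the loop depends on A again; only the stored factors and the new b are used.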

Hazra__Dhiraj_Kumar · ‎03-06-2015

Hello,

Thank you for your answer. I am already calling phases 11 and 22 at the initialization stage, as my matrix A is not going to change. I call only phase 33 in a loop in which the vector b changes and I get a different x. The solution phase is taking significantly longer, seemingly about 10 times longer than the factorization phase. Now I have one query, and I would be thankful for your help.

The solution step is going to take time anyway, as you said; I guess the forward and backward substitution over these elements is what takes time. In this case, can the cluster sparse solver help? I do not yet know its detailed structure, but it seems that the more nodes I use, the faster the program will run. Am I right? Or is the solution phase expected to take about the same amount of time as it does now?

Thanks again,

Dhiraj

mecej4 · ‎03-06-2015

It may be worthwhile for you to attempt to renumber the variables/reorder the equations so as to reduce the fill-in. If that fails, you can use the somewhat brute force approach of using a cluster solver, but I do not have any experience with clusters. Pardiso gives you some choice as to the algorithms available for reordering.

Alexander_K_Intel2 · ‎03-09-2015

Hi Dhiraj,

Could you provide the full log, including the time of the solve step? The solve step is not expected to take longer than the factorization unless the number of right-hand sides is huge.

Thanks,

Alex
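
A back-of-the-envelope estimate supports this expectation. The sketch below assumes that one forward and one backward substitution cost roughly 2 floating-point operations per stored non-zero of L+U (one multiply and one add); that is a common rule of thumb and not a figure reported by PARDISO. All other numbers are copied from the log in the first post.

```python
# From the log in the first post (4 OpenMP threads).
nnz_lu = 1_548_997_492        # number of non-zeros in L+U
factor_gflop = 10596.105218   # gflop for the numerical factorization
factor_rate = 11.107896       # gflop/s for the numerical factorization

# Assumption: about 2 flops per stored factor entry for one right-hand side.
solve_gflop = 2 * nnz_lu / 1e9

# How much more work the factorization is than a single solve.
ratio = factor_gflop / solve_gflop

# Solve time if it ran at the factorization rate. This is optimistic, since
# triangular solves are usually limited by memory bandwidth, not by flops.
solve_seconds = solve_gflop / factor_rate

print(f"estimated solve work : {solve_gflop:.2f} gflop")
print(f"factorization / solve: {ratio:.0f}x")
print(f"solve time at {factor_rate:.1f} gflop/s: {solve_seconds:.2f} s")
```

Under these assumptions a single solve is thousands of times cheaper than the factorization, so a solve phase that runs for an hour points to something other than arithmetic cost.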

Hazra__Dhiraj_Kumar · ‎03-09-2015

Hi Alexander,

Last time I stopped the program after 1 hour. Today I ran it again and found that after nearly one hour the program was killed automatically. It seems that the desktop RAM (8 GB) was exhausted during the solution phase. I then ran it on our cluster, which has 32 GB of RAM, and the program ran completely fine; the output is provided below. Hence it seems to be purely a RAM problem. I still do not understand why the desktop took so long before the program was killed, when on the cluster the solution phase took only 0.85 seconds. For a smaller problem (where the size of b is 400000), the computation times on the cluster and on my desktop were similar. Anyway, my main problem is solved. Thanks for all your input.

Cheers,

Dhiraj

=== PARDISO: solving a real nonsymmetric system ===
The local (internal) PARDISO version is                          : 103911000
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON


Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.101121 s
Time spent in reordering of the initial matrix (reorder)         : 1.829722 s
Time spent in symbolic factorization (symbfct)                   : 3.486545 s
Time spent in data preparations for factorization (parlist)      : 0.096258 s
Time spent in allocation of internal data structures (malloc)    : 0.447319 s
Time spent in additional calculations                            : 0.271942 s
Total time spent                                                 : 6.232907 s

Statistics:
===========
< Parallel Direct Factorization with number of processors: > 20
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b >
             number of equations:           1000000
             number of non-zeros in A:      6940000
             number of non-zeros in A (%): 0.000694

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
             number of supernodes:                    655255
             size of largest supernode:               11186
             number of non-zeros in L:                781479562
             number of non-zeros in U:                768317946
             number of non-zeros in L+U:              1549797508
 Reordering completed ... 
 Number of nonzeros in factors =  1549797508
 Number of factorization MFLOPS =  10707689
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===
Percentage of computed non-zeros for LL^T factorization
 0 %  1 %  2 %  3 %  4 %  5 %  6 %  7 %  8 %  9 %  10 %  11 %  12 %  13 %  14 %  15 %  16 %  17 %  18 %  19 %  20 %  21 %  22 %  23 %  24 %  25 %  26 %  27 %  28 %  29 %  30 %  31 %  32 %  33 %  35 %  36 %  37 %  39 %  40 %  41 %  42 %  43 %  44 %  45 %  47 %  48 %  49 %  51 %  52 %  54 %  55 %  57 %  59 %  60 %  62 %  63 %  65 %  67 %  69 %  71 %  73 %  75 %  77 %  79 %  82 %  84 %  86 %  89 %  92 %  93 %  95 %  96 %  98 %  99 %  100 % 

=== PARDISO: solving a real nonsymmetric system ===
Single-level factorization algorithm is turned ON


Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000001 s
Time spent in factorization step (numfct)                        : 140.436790 s
Time spent in allocation of internal data structures (malloc)    : 0.001189 s
Time spent in additional calculations                            : 0.000002 s
Total time spent                                                 : 140.437982 s

Statistics:
===========
< Parallel Direct Factorization with number of processors: > 20
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b >
             number of equations:           1000000
             number of non-zeros in A:      6940000
             number of non-zeros in A (%): 0.000694

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
             number of supernodes:                    655255
             size of largest supernode:               11186
             number of non-zeros in L:                781479562
             number of non-zeros in U:                768317946
             number of non-zeros in L+U:              1549797508
             gflop   for the numerical factorization: 10707.689303

             gflop/s for the numerical factorization: 76.245614
 Factorization completed ... 


=== PARDISO: solving a real nonsymmetric system ===


Summary: ( solution phase )
================

Times:
======
Time spent in direct solver at solve step (solve)                : 0.850383 s
Time spent in additional calculations                            : 1.698694 s
Total time spent                                                 : 2.549077 s

Statistics:
===========
< Parallel Direct Factorization with number of processors: > 20
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b >
             number of equations:           1000000
             number of non-zeros in A:      6940000
             number of non-zeros in A (%): 0.000694

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
             number of supernodes:                    655255
             size of largest supernode:               11186
             number of non-zeros in L:                781479562
             number of non-zeros in U:                768317946
             number of non-zeros in L+U:              1549797508
             gflop   for the numerical factorization: 10707.689303

             gflop/s for the numerical factorization: 76.245614
 Solve completed ...
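
A rough check of the RAM explanation, using only figures from the log above and the RAM sizes quoted in the post. It counts 8 bytes per double-precision factor entry and nothing else, so it is a lower bound: index arrays, the matrix A itself, and work buffers are ignored. The variable names are made up for illustration.

```python
BYTES_PER_DOUBLE = 8

nnz_lu = 1_549_797_508    # number of non-zeros in L+U, from the cluster log
desktop_ram_gb = 8        # desktop RAM quoted in the post
cluster_ram_gb = 32       # cluster RAM quoted in the post

# Storage for the numerical values of the factors only (lower bound).
factor_bytes = nnz_lu * BYTES_PER_DOUBLE
factor_gb = factor_bytes / 1e9

fits_desktop = factor_gb < desktop_ram_gb
fits_cluster = factor_gb < cluster_ram_gb

print(f"factor values alone: {factor_gb:.1f} GB")
print(f"fits in {desktop_ram_gb} GB desktop RAM: {fits_desktop}")
print(f"fits in {cluster_ram_gb} GB cluster RAM: {fits_cluster}")
```

If the factor values alone exceed the desktop's physical memory, the operating system would have to page them to disk, which is one plausible reason the desktop run was so slow before it was killed while the cluster finished the solve in under a second.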