Intel® oneAPI Math Kernel Library

Pardiso Threadripper 2990wx versus Ryzen 1700

Makhija__David
Beginner

I have the same multi-physics finite element code generating the same matrix on two machines, and the old machine with a Ryzen 1700 (8 cores) is faster than the new one with a Threadripper 2990WX (32 cores). Both run Windows 10, intel64, linking mkl_rt.lib; the MKL versions are 2018.1.156 on the Ryzen 1700 and 2019.0.117 on the Threadripper. I can provide an example matrix if it helps. Here are the options, which are the same in both builds:

#include <algorithm> // std::fill

struct pardiso_struct
{
    void *pt[64];
    int maxfct{ 1 };
    int mnum{ 1 };
    int mtype{ 11 };
    int n{ 0 };
    int idum{ 0 };      // dummy; not used by PARDISO when iparm[4] != 1
    int nrhs{ 1 };
    int iparm[64];
    int msglvl{ 1 };
    double ddum{ 0. };
    int error{ 0 };

    pardiso_struct()
    {
        std::fill(pt, pt + 64, nullptr); // fill(pt, pt + 64, void(0)) does not compile
        std::fill(iparm, iparm + 64, 0);
        iparm[0] = 1;   // 0 for all defaults, != 0 for any custom values
        iparm[1] = 3;   // 0 minimum degree, 2 METIS, 3 parallel (OpenMP) METIS
        // iparm[2] reserved
        iparm[3] = 0;   // preconditioned CGS/CG iterations (0 = off)
        iparm[4] = 0;   // user fill-in reducing permutation (0 = off)
        iparm[5] = 0;   // 0 - solution written to x, 1 - solution overwrites b
        // iparm[6] output: number of iterative refinement steps performed
        iparm[7] = 0;   // maximum number of iterative refinement steps
        // iparm[8] reserved
        iparm[9] = 13;  // pivoting perturbation: 13 for nonsymmetric, 8 for symmetric
        iparm[10] = 1;  // 0 no scaling, 1 scaling (1 is the default for nonsymmetric)
        iparm[12] = 1;  // weighted matching: 0 disabled, 1 enabled (default for nonsymmetric)
        // iparm[13]-iparm[19] outputs
        // iparm[20] pivoting for symmetric indefinite matrices
        // iparm[21] output: number of positive eigenvalues
        // iparm[22] output: number of negative eigenvalues
        iparm[23] = 1;  // 0 classic algorithm, 1 two-level algorithm (scales better above 8 threads)
        iparm[24] = 0;  // 0 parallel solve, 1 sequential solve
        // iparm[25] reserved
        iparm[26] = 0;  // 0 do not check the sparse matrix, 1 check it
        iparm[27] = 0;  // 0 double precision, 1 single precision
        // iparm[28] reserved
        // iparm[29] output: zero or negative pivots for symmetric matrices
        // iparm[30] partial solve for selected components
        // iparm[31], iparm[32] reserved
        // iparm[33] conditional-numerical-reproducibility options
        iparm[34] = 1;  // 0 one-based indexing, 1 zero-based indexing
        // iparm[35] Schur complement options
        iparm[36] = 0;  // 0 CSR, > 0 BSR, < 0 convert to BSR
        // iparm[59] out-of-core options
    }
};
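
For completeness, this is roughly how the struct drives the PARDISO phases (a minimal sketch, assuming LP64 so that int matches MKL_INT, 0-based CSR arrays ia/ja/a per iparm[34] = 1, and a right-hand side b; the wrapper function itself is hypothetical):

#include <mkl_pardiso.h>

void pardiso_solve(pardiso_struct &p, int n, int *ia, int *ja, double *a,
                   double *b, double *x)
{
    p.n = n;
    int phase = 11; // analysis / reordering
    pardiso(p.pt, &p.maxfct, &p.mnum, &p.mtype, &phase, &p.n,
            a, ia, ja, &p.idum, &p.nrhs, p.iparm, &p.msglvl,
            &p.ddum, &p.ddum, &p.error);

    phase = 22;     // numerical factorization (the slow step discussed below)
    pardiso(p.pt, &p.maxfct, &p.mnum, &p.mtype, &phase, &p.n,
            a, ia, ja, &p.idum, &p.nrhs, p.iparm, &p.msglvl,
            &p.ddum, &p.ddum, &p.error);

    phase = 33;     // forward/backward solve
    pardiso(p.pt, &p.maxfct, &p.mnum, &p.mtype, &phase, &p.n,
            a, ia, ja, &p.idum, &p.nrhs, p.iparm, &p.msglvl,
            b, x, &p.error);

    phase = -1;     // release all internal memory
    pardiso(p.pt, &p.maxfct, &p.mnum, &p.mtype, &phase, &p.n,
            &p.ddum, ia, ja, &p.idum, &p.nrhs, p.iparm, &p.msglvl,
            &p.ddum, &p.ddum, &p.error);
}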

The results of the reorder and factorization phases are below. The solve phase (omitted here) is also slower on the 2990WX, but the main concern is the numerical factorization time.

*************** Ryzen 7 1700 **********************

=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON


Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.847928 s
Time spent in reordering of the initial matrix (reorder)         : 7.678907 s
Time spent in symbolic factorization (symbfct)                   : 2.075314 s
Time spent in data preparations for factorization (parlist)      : 0.098494 s
Time spent in allocation of internal data structures (malloc)    : 4.281882 s
Time spent in additional calculations                            : 3.785140 s
Total time spent                                                 : 18.767665 s

Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP

< Linear system Ax = b >
             number of equations:           1928754
             number of non-zeros in A:      46843184
             number of non-zeros in A (%): 0.001259

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
             number of supernodes:                    795666
             size of largest supernode:               9159
             number of non-zeros in L:                673935341
             number of non-zeros in U:                631031607
             number of non-zeros in L+U:              1304966948

=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON


Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 53.846398 s
Time spent in allocation of internal data structures (malloc)    : 0.000878 s
Time spent in additional calculations                            : 0.000001 s
Total time spent                                                 : 53.847277 s

Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP

< Linear system Ax = b >
             number of equations:           1928754
             number of non-zeros in A:      46843184
             number of non-zeros in A (%): 0.001259

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
             number of supernodes:                    795666
             size of largest supernode:               9159
             number of non-zeros in L:                673935341
             number of non-zeros in U:                631031607
             number of non-zeros in L+U:              1304966948
             gflop   for the numerical factorization: 2903.934836

             gflop/s for the numerical factorization: 53.929973

****************** Threadripper 2990wx *********************************


=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON


Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.919861 s
Time spent in reordering of the initial matrix (reorder)         : 10.085178 s
Time spent in symbolic factorization (symbfct)                   : 2.207123 s
Time spent in data preparations for factorization (parlist)      : 0.101967 s
Time spent in allocation of internal data structures (malloc)    : 3.143640 s
Time spent in additional calculations                            : 3.677500 s
Total time spent                                                 : 20.135269 s

Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP

< Linear system Ax = b >
             number of equations:           1928754
             number of non-zeros in A:      46843184
             number of non-zeros in A (%): 0.001259

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
             number of supernodes:                    794723
             size of largest supernode:               7005
             number of non-zeros in L:                683894639
             number of non-zeros in U:                640539323
             number of non-zeros in L+U:              1324433962


=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON


Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 61.520888 s
Time spent in allocation of internal data structures (malloc)    : 0.001112 s
Time spent in additional calculations                            : 0.000002 s
Total time spent                                                 : 61.522003 s

Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP

< Linear system Ax = b >
             number of equations:           1928754
             number of non-zeros in A:      46843184
             number of non-zeros in A (%): 0.001259

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
             number of supernodes:                    794723
             size of largest supernode:               7005
             number of non-zeros in L:                683894639
             number of non-zeros in U:                640539323
             number of non-zeros in L+U:              1324433962
             gflop   for the numerical factorization: 2879.931235

             gflop/s for the numerical factorization: 46.812250

Nearly 2 million unknowns should provide enough work for each core. Manually capping the run at 16 threads gives a modest speedup (53 seconds for numerical factorization), which suggests to me that this is a PARDISO scaling issue rather than a hardware issue, although it may also stem from the memory architecture of the 2990WX.
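
For reference, the 16-thread cap can be set either in the environment (set MKL_NUM_THREADS=16) or in code; a minimal sketch using MKL's service functions:

#include <cstdio>
#include <mkl_service.h>

int main()
{
    // Cap MKL at 16 threads before any PARDISO call; same effect as
    // setting MKL_NUM_THREADS=16 in the environment.
    mkl_set_num_threads(16);
    std::printf("MKL max threads: %d\n", mkl_get_max_threads());
    return 0;
}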

Any suggestions?

Gennady_F_Intel
Moderator
Summarizing:
number of equations:           1928754
number of non-zeros in A:      46843184
Ryzen 1700: 8 threads, total factorization time: 53.9 s
Threadripper 2990WX: 32 threads, total factorization time: 61.5 s

is that correct?

This could be some problem within MKL PARDISO.

Could you take some BLAS function (dgemm, for example), run the test on both of these systems with MKL_VERBOSE=1 set, and share the output?

The output will show which branch of the MKL code has been called.
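
Something along these lines would do (a minimal sketch; the sizes here are arbitrary):

#include <vector>
#include <mkl.h>

int main()
{
    // Run with MKL_VERBOSE=1 set; the verbose line printed for this call
    // shows the dispatched code branch and the number of threads used.
    const MKL_INT m = 1000, n = 2000, k = 200;
    std::vector<double> a(m * k, 1.0), b(k * n, 1.0), c(m * n, 0.0);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, a.data(), m, b.data(), k,
                0.0, c.data(), m);
    return 0;
}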

Makhija__David
Beginner

Gennady,

Is this what you were looking for? I added some timing/scaling tests as well. The matrix may be slightly different from the one in the original post.

********* Ryzen 7 (8 core) ******************

MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Architecture processors, Win 3.28GHz cdecl intel_thread
MKL_VERBOSE DGEMM(N,N,1000,2000,200,0000001B6551F8C0,00000149531FA080,1000,0000014952ED6080,200,0000001B6551F8E8,000001495339D080,1000) 105.55ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:8
 
Number of equations is 1930194
 
Scaling for factorization
Cores    Time (s)     Theoretical speedup    Observed speedup
  2      179.462              2                  2.16092
  4      101.963              4                  3.80336
  6       82.1742             6                  4.71928
  8       73.6192             8                  5.26769

************ 2990wx (32 core) ****************

MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Architecture processors, Win 3.68GHz cdecl intel_thread
MKL_VERBOSE DGEMM(N,N,1000,2000,200,0000008F20D7F6B0,00000270E5FD2080,1000,00000270E5CBC080,200,0000008F20D7F6D8,00000270E616E080,1000) 116.24ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:32
 
Number of equations is 1930194
 
Scaling for factorization
Cores    Time (s)     Theoretical speedup    Observed speedup
  2      179.644              2                  2.1443
  4      103.245              4                  3.73103
  6       76.1586             6                  5.05802
  8       63.5531             8                  6.06125
 10       57.3066            10                  6.72193
 12       57.6335            12                  6.68381
 14       57.4599            14                  6.704
 16       58.8203            16                  6.54895
 18       62.5855            18                  6.15496
 20       63.6602            20                  6.05106
 22       64.9800            22                  5.92815
 24       67.3924            24                  5.71594
 26       68.8016            26                  5.59887
 28       68.8782            28                  5.59264
 30       65.4657            30                  5.88417
 32       70.5859            32                  5.45734

Makhija__David
Beginner

Gennady,

I've tried a handful of different parameter combinations, including setting iparm[10] and iparm[12] to 0 with iparm[23] = 10. This should enable the two-level factorization algorithm for better scalability, trading away some reliability by disabling scaling and weighted matching. The factorization is a bit faster, but it is far from scalable.
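
Concretely, relative to the constructor defaults shown earlier (a sketch):

pardiso_struct p;
p.iparm[10] = 0;  // disable scaling
p.iparm[12] = 0;  // disable weighted matching
p.iparm[23] = 10; // improved two-level factorization algorithm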

I think I will leave this alone at this point. I have to imagine scalability was considered for the cluster version of PARDISO; is there anything from it that could be ported to the shared-memory version?

Thanks for your help.
