Intel® oneAPI Math Kernel Library

## Pardiso Threadripper 2990wx versus Ryzen 1700

Beginner
807 Views

I run the same multi-physics finite element code, generating the same matrix, on two machines. The older machine, a Ryzen 1700 (8 cores), is faster than a Threadripper 2990WX (32 cores). Both use Windows 10, intel64, and mkl_rt.lib; the MKL versions are 2018.1.156 on the Ryzen 1700 and 2019.0.117 on the Threadripper. I can provide an example matrix if it helps. Here are the options, which are the same on both builds:

struct pardiso_struct
{
void *pt[64];
int maxfct{ 1 };
int mnum{ 1 };
int mtype{ 11 };
int n{ 0 };
int idum{ 0 }; //dummy not used by PARDISO when iparm(5-1) != 1
int nrhs{ 1 };
int iparm[64];
int msglvl{ 1 };
double ddum{ 0. };
int error{ 0 };

pardiso_struct()
{
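// fill(pt, pt + 64, void(0)); does not work: void(0) is not a null pointer value
for (int i = 0; i < 64; ++i)
pt[i] = 0; // clear each handle entry (assigning to the array name itself does not compile)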
std::fill(iparm, iparm + 64, 0);
iparm[0] = 1; // 0 for all default, !=0 for any custom
iparm[1] = 3; // 0 minimum degree alg, 2 METIS nested dissection, 3 OpenMP parallel METIS
//iparm[2] // reserved
iparm[3] = 0; // For iterative methods
iparm[4] = 0; // user fill-in reducing permutation
iparm[5] = 0; // 0 - solution written on x, 1 - solution on b
//iparm[6] output of number of iterative refinement steps
iparm[7] = 0; // iterative refinement steps
//iparm[8] reserved
iparm[9] = 13; // pivoting, 13 for nonsymmetric, 8 for sym
iparm[10] = 1; // 0 no scaling, 1 scaling (1 Default for nonsym)

iparm[12] = 1; // 0 to disable weighted matching? 1 default for non-sym
//iparm[13]-iparm[19] outputs
//iparm[20] = special pivoting for symmetric but indefinite
//iparm[21] output for number of pos eigs
//iparm[22] output for number of neg eigs
iparm[23] = 1; // 0 for classic alg, 1 for openMP scalable > 8 procs
iparm[24] = 0; // 0 for parallel solve, 1 for sequential solve
//iparm[25] // reserved
iparm[26] = 0; // 0 Do not check sparse mat, 1 check sparse mat
iparm[27] = 0; // 0 double precision, 1 single precision
//iparm[28]  reserved;
//iparm[29] output zero or neg pivots in sym
//iparm[30] only solve for certain components...?
//iparm[31][32] reserved
//iparm[33] some reproducibility stuff
iparm[34] = 1; //0 one based indexing, 1 zero based indexing
//iparm[35] something with schur complements
iparm[36] = 0; //0 CSR, >0 BSR, <0 convert to BSR
//iparm[59] ooc options

}
};
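
The struct above notes that filling `pt` with `void(0)` does not work. That is expected, since `void(0)` is an expression of type `void` rather than a null pointer. A minimal sketch of two alternatives, assuming C++11 or later; `handle_sketch` and its member functions are hypothetical names used only for illustration and are not part of MKL:

```cpp
#include <algorithm>  // std::fill, std::all_of
#include <iterator>   // std::begin, std::end

// Hypothetical stand-in for the handle and parameter arrays of the struct above.
struct handle_sketch
{
    void *pt[64]{};   // value-initialization sets every pointer to nullptr
    int iparm[64]{};  // likewise sets every entry to 0

    // Equivalent explicit reset using std::fill with nullptr.
    void reset()
    {
        std::fill(std::begin(pt), std::end(pt), nullptr);
        std::fill(std::begin(iparm), std::end(iparm), 0);
    }

    // True when every entry of pt is a null pointer.
    bool handle_is_clear() const
    {
        return std::all_of(std::begin(pt), std::end(pt),
                           [](const void *p) { return p == nullptr; });
    }
};
```

Either form leaves `pt` zeroed before the first PARDISO call, which is what the loop in the constructor is meant to achieve.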

The output of the reordering and factorization phases is below. The solve phase (omitted here) is also slower on the 2990WX, but the main concern is the numerical factorization time.

*************** Ryzen 7 1700 **********************

=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.847928 s
Time spent in reordering of the initial matrix (reorder)         : 7.678907 s
Time spent in symbolic factorization (symbfct)                   : 2.075314 s
Time spent in data preparations for factorization (parlist)      : 0.098494 s
Time spent in allocation of internal data structures (malloc)    : 4.281882 s
Time spent in additional calculations                            : 3.785140 s
Total time spent                                                 : 18.767665 s

Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP

< Linear system Ax = b >
number of equations:           1928754
number of non-zeros in A:      46843184
number of non-zeros in A (%): 0.001259

number of right-hand sides:    1

< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs:  0
number of supernodes:                    795666
size of largest supernode:               9159
number of non-zeros in L:                673935341
number of non-zeros in U:                631031607
number of non-zeros in L+U:              1304966948

=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON

Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 53.846398 s
Time spent in allocation of internal data structures (malloc)    : 0.000878 s
Time spent in additional calculations                            : 0.000001 s
Total time spent                                                 : 53.847277 s

Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP

< Linear system Ax = b >
number of equations:           1928754
number of non-zeros in A:      46843184
number of non-zeros in A (%): 0.001259

number of right-hand sides:    1

< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs:  0
number of supernodes:                    795666
size of largest supernode:               9159
number of non-zeros in L:                673935341
number of non-zeros in U:                631031607
number of non-zeros in L+U:              1304966948
gflop   for the numerical factorization: 2903.934836

gflop/s for the numerical factorization: 53.929973

*************** Threadripper 2990wx **********************

=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.919861 s
Time spent in reordering of the initial matrix (reorder)         : 10.085178 s
Time spent in symbolic factorization (symbfct)                   : 2.207123 s
Time spent in data preparations for factorization (parlist)      : 0.101967 s
Time spent in allocation of internal data structures (malloc)    : 3.143640 s
Time spent in additional calculations                            : 3.677500 s
Total time spent                                                 : 20.135269 s

Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP

< Linear system Ax = b >
number of equations:           1928754
number of non-zeros in A:      46843184
number of non-zeros in A (%): 0.001259

number of right-hand sides:    1

< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs:  0
number of supernodes:                    794723
size of largest supernode:               7005
number of non-zeros in L:                683894639
number of non-zeros in U:                640539323
number of non-zeros in L+U:              1324433962

=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON

Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 61.520888 s
Time spent in allocation of internal data structures (malloc)    : 0.001112 s
Time spent in additional calculations                            : 0.000002 s
Total time spent                                                 : 61.522003 s

Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP

< Linear system Ax = b >
number of equations:           1928754
number of non-zeros in A:      46843184
number of non-zeros in A (%): 0.001259

number of right-hand sides:    1

< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs:  0
number of supernodes:                    794723
size of largest supernode:               7005
number of non-zeros in L:                683894639
number of non-zeros in U:                640539323
number of non-zeros in L+U:              1324433962
gflop   for the numerical factorization: 2879.931235

gflop/s for the numerical factorization: 46.812250

With nearly 2 million unknowns, there should be enough work for every core. Manually capping the run at 16 threads gives a modest speedup (53 seconds for numerical factorization), which suggests to me that this is a PARDISO scaling issue rather than a hardware issue, although it may also be due to the memory architecture of the 2990WX.

Any suggestions?

3 Replies
Moderator
Summarizing:
number of equations:           1928754
number of non-zeros in A:      46843184
Ryzen: 8 threads, total time: 53.9 s

Is that correct?

This could be a problem within MKL PARDISO.

Could you take a BLAS function (dgemm, for example), run it on both systems with MKL_VERBOSE=1 set, and share the output?

The output will show which MKL code branch was called.

Beginner

Is this what you were looking for? I added some timing/scaling tests as well. The matrix might be a bit different than the one in the original post.

********* Ryzen 7 (8 core) ******************

MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Architecture processors, Win 3.28GHz cdecl intel_thread
MKL_VERBOSE DGEMM(N,N,1000,2000,200,0000001B6551F8C0,00000149531FA080,1000,0000014952ED6080,200,0000001B6551F8E8,000001495339D080,1000) 105.55ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:8

Number of equations is 1930194

Scaling for factorization

| Core count | Time (s) | Theoretical speedup | Observed speedup |
|---|---|---|---|
| 2 | 179.462 | 2 | 2.16092 |
| 4 | 101.963 | 4 | 3.80336 |
| 6 | 82.1742 | 6 | 4.71928 |
| 8 | 73.6192 | 8 | 5.26769 |

************ 2990wx (32 core) ****************

MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Architecture processors, Win 3.68GHz cdecl intel_thread
MKL_VERBOSE DGEMM(N,N,1000,2000,200,0000008F20D7F6B0,00000270E5FD2080,1000,00000270E5CBC080,200,0000008F20D7F6D8,00000270E616E080,1000) 116.24ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:32

Number of equations is 1930194

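Scaling for factorization

| Core count | Time (s) | Theoretical speedup | Observed speedup |
|---|---|---|---|
| 2 | 179.644 | 2 | 2.1443 |
| 4 | 103.245 | 4 | 3.73103 |
| 6 | 76.1586 | 6 | 5.05802 |
| 8 | 63.5531 | 8 | 6.06125 |
| 10 | 57.3066 | 10 | 6.72193 |
| 12 | 57.6335 | 12 | 6.68381 |
| 14 | 57.4599 | 14 | 6.704 |
| 16 | 58.8203 | 16 | 6.54895 |
| 18 | 62.5855 | 18 | 6.15496 |
| 20 | 63.6602 | 20 | 6.05106 |
| 22 | 64.9800 | 22 | 5.92815 |
| 24 | 67.3924 | 24 | 5.71594 |
| 26 | 68.8016 | 26 | 5.59887 |
| 28 | 68.8782 | 28 | 5.59264 |
| 30 | 65.4657 | 30 | 5.88417 |
| 32 | 70.5859 | 32 | 5.45734 |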

Beginner