I have the same multi-physics finite element code generating a matrix on two machines, and the older machine with a Ryzen 1700 (8 cores) is faster than a Threadripper 2990WX (32 cores). Both run Windows 10 with intel64 builds linked against mkl_rt.lib; the MKL versions are 2018.1.156 on the Ryzen 1700 and 2019.0.117 on the Threadripper. I can provide an example matrix if that helps. Here are the options, which are the same in both builds:
The reordering and factorization results are below. Solve (omitted here) is also slower on the 2990WX, but my main concern is the numerical factorization time.
*************** Ryzen 7 1700 **********************
=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON
Summary: ( reordering phase )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.847928 s
Time spent in reordering of the initial matrix (reorder) : 7.678907 s
Time spent in symbolic factorization (symbfct) : 2.075314 s
Time spent in data preparations for factorization (parlist) : 0.098494 s
Time spent in allocation of internal data structures (malloc) : 4.281882 s
Time spent in additional calculations : 3.785140 s
Total time spent : 18.767665 s
Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 795666
size of largest supernode: 9159
number of non-zeros in L: 673935341
number of non-zeros in U: 631031607
number of non-zeros in L+U: 1304966948
=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON
Summary: ( factorization phase )
================
Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 53.846398 s
Time spent in allocation of internal data structures (malloc) : 0.000878 s
Time spent in additional calculations : 0.000001 s
Total time spent : 53.847277 s
Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 795666
size of largest supernode: 9159
number of non-zeros in L: 673935341
number of non-zeros in U: 631031607
number of non-zeros in L+U: 1304966948
gflop for the numerical factorization: 2903.934836
gflop/s for the numerical factorization: 53.929973
****************** Threadripper 2990wx *********************************
=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON
Summary: ( reordering phase )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.919861 s
Time spent in reordering of the initial matrix (reorder) : 10.085178 s
Time spent in symbolic factorization (symbfct) : 2.207123 s
Time spent in data preparations for factorization (parlist) : 0.101967 s
Time spent in allocation of internal data structures (malloc) : 3.143640 s
Time spent in additional calculations : 3.677500 s
Total time spent : 20.135269 s
Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 794723
size of largest supernode: 7005
number of non-zeros in L: 683894639
number of non-zeros in U: 640539323
number of non-zeros in L+U: 1324433962
=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON
Summary: ( factorization phase )
================
Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 61.520888 s
Time spent in allocation of internal data structures (malloc) : 0.001112 s
Time spent in additional calculations : 0.000002 s
Total time spent : 61.522003 s
Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 794723
size of largest supernode: 7005
number of non-zeros in L: 683894639
number of non-zeros in U: 640539323
number of non-zeros in L+U: 1324433962
gflop for the numerical factorization: 2879.931235
gflop/s for the numerical factorization: 46.812250
Nearly 2 million unknowns should provide enough work for each core. Manually capping MKL at 16 threads gives a modest speedup (about 53 seconds for numerical factorization), which suggests to me that this is a PARDISO scaling issue rather than a hardware one, although it may also be related to the memory architecture of the 2990WX (only two of its four dies have direct memory-channel access).
Any suggestions?
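For reference, the 16-thread run mentioned above can be reproduced by capping MKL's thread pool before the MKL-linked binary initializes. A minimal sketch, assuming an MKL-linked process and the standard MKL/OpenMP environment variables (a C code could equivalently call mkl_set_num_threads(16)):

```python
import os

# Cap the MKL/OpenMP thread count before any MKL-linked library is loaded.
# MKL_NUM_THREADS takes precedence over OMP_NUM_THREADS for MKL itself.
os.environ["MKL_NUM_THREADS"] = "16"
os.environ["OMP_NUM_THREADS"] = "16"

print(os.environ["MKL_NUM_THREADS"])  # → 16
```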
Do I read that correctly? This could be a problem within MKL PARDISO. Could you take some BLAS function (dgemm, for example), run the test on both systems with MKL_VERBOSE=1 set, and share the output? The output will show which MKL code branch has been called.
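A minimal version of that check, sketched in Python under the assumption that NumPy is linked against MKL (if NumPy uses OpenBLAS instead, no MKL_VERBOSE lines will appear; the variable must be set before MKL is loaded):

```python
import os
os.environ["MKL_VERBOSE"] = "1"  # must be set before MKL is loaded

import time
import numpy as np

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a @ b  # dispatches to dgemm in the underlying BLAS
dt = time.perf_counter() - t0

# With MKL_VERBOSE=1, MKL prints the dispatched code branch
# (e.g. AVX2) and thread count to stdout alongside this timing.
print(f"dgemm {n}x{n}: {dt:.3f} s")
```

The verbose line is the interesting part on the 2990WX: MKL 2019 may select a different code path on AMD hardware than MKL 2018 did.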
Gennady,
Is this what you were looking for? I added some timing/scaling tests as well. The matrix may differ slightly from the one in the original post.
********* Ryzen 7 (8 core) ******************
************ 2990wx (32 core) ****************
Gennady,
I've tried a handful of different parameters, including setting iparm[10] and iparm[12] to 0 together with iparm[23]=10. This enables the two-level factorization algorithm for better scalability, at the cost of some reliability from removing the scaling and weighted matching. The factorization is a bit faster, but it is far from scalable.
I think I will leave this alone at this point. I have to imagine scalability was considered for the cluster version of PARDISO; is there anything from it that could be ported to the shared-memory version?
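For anyone reproducing this, the parameter combination described above looks roughly like the following. This is a hedged sketch using 0-based C-style iparm indexing (matching the "0-based array is turned ON" line in the logs); exact semantics should be verified against the MKL PARDISO iparm documentation:

```python
# Sketch of the iparm combination described above (0-based indexing).
iparm = [0] * 64
iparm[0]  = 1   # do not use PARDISO defaults; honor the entries below
iparm[10] = 0   # disable scaling
iparm[12] = 0   # disable weighted matching
iparm[23] = 10  # select the improved two-level parallel factorization

print(iparm[10], iparm[12], iparm[23])  # → 0 0 10
```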
Thanks for your help.
