I have the same multi-physics finite element code generating a matrix on two machines. An old machine with a Ryzen 7 1700 (8 cores) is faster than a Threadripper 2990WX (32 cores). Both run Windows 10, intel64, linked against mkl_rt.lib; the MKL versions are 2018.1.156 on the Ryzen 1700 and 2019.0.117 on the Threadripper. I can provide an example matrix if it helps. The PARDISO options are identical in both builds.
The reordering and factorization results are below. Solve (omitted here) is also slower on the 2990WX, but the main concern is numerical factorization time.
*************** Ryzen 7 1700 **********************
=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON
Summary: ( reordering phase )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.847928 s
Time spent in reordering of the initial matrix (reorder) : 7.678907 s
Time spent in symbolic factorization (symbfct) : 2.075314 s
Time spent in data preparations for factorization (parlist) : 0.098494 s
Time spent in allocation of internal data structures (malloc) : 4.281882 s
Time spent in additional calculations : 3.785140 s
Total time spent : 18.767665 s
Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 795666
size of largest supernode: 9159
number of non-zeros in L: 673935341
number of non-zeros in U: 631031607
number of non-zeros in L+U: 1304966948
=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON
Summary: ( factorization phase )
================
Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 53.846398 s
Time spent in allocation of internal data structures (malloc) : 0.000878 s
Time spent in additional calculations : 0.000001 s
Total time spent : 53.847277 s
Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 795666
size of largest supernode: 9159
number of non-zeros in L: 673935341
number of non-zeros in U: 631031607
number of non-zeros in L+U: 1304966948
gflop for the numerical factorization: 2903.934836
gflop/s for the numerical factorization: 53.929973
****************** Threadripper 2990wx *********************************
=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON
Summary: ( reordering phase )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.919861 s
Time spent in reordering of the initial matrix (reorder) : 10.085178 s
Time spent in symbolic factorization (symbfct) : 2.207123 s
Time spent in data preparations for factorization (parlist) : 0.101967 s
Time spent in allocation of internal data structures (malloc) : 3.143640 s
Time spent in additional calculations : 3.677500 s
Total time spent : 20.135269 s
Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 794723
size of largest supernode: 7005
number of non-zeros in L: 683894639
number of non-zeros in U: 640539323
number of non-zeros in L+U: 1324433962
=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON
Summary: ( factorization phase )
================
Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 61.520888 s
Time spent in allocation of internal data structures (malloc) : 0.001112 s
Time spent in additional calculations : 0.000002 s
Total time spent : 61.522003 s
Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 794723
size of largest supernode: 7005
number of non-zeros in L: 683894639
number of non-zeros in U: 640539323
number of non-zeros in L+U: 1324433962
gflop for the numerical factorization: 2879.931235
gflop/s for the numerical factorization: 46.812250
Nearly 2 million unknowns should provide enough work for each core. Manually capping MKL at 16 threads gives a modest speedup (53 seconds for numerical factorization), which suggests to me that this is a PARDISO scaling issue rather than a hardware issue. That said, it may also be due to the memory architecture of the 2990WX.
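For anyone wanting to reproduce the 16-thread test without a rebuild: MKL's thread count can be capped from the environment. (The 2990WX has four dies but only two with directly attached memory controllers, which is one plausible reason 16 threads behaves better than 32.) A minimal Windows sketch; `myapp.exe` is a placeholder for your solver binary:

```
rem Cap MKL at 16 threads for this run only
set MKL_NUM_THREADS=16
myapp.exe
```

The same can be done in code with `mkl_set_num_threads(16)` before the first PARDISO call.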
Any suggestions?
Is that correct?
This could be a problem within MKL PARDISO.
Could you take some BLAS function (dgemm, for example), run the test on both of these systems with MKL_VERBOSE=1 set, and share the output?
The output will show which MKL code branch has been called.
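One quick way to drive such a test without writing a dedicated benchmark is a small NumPy script, assuming the NumPy build is linked against MKL (as Anaconda's builds are); with a non-MKL BLAS the multiply still runs but no MKL_VERBOSE lines appear:

```python
import os
# Must be set before the first BLAS call is made
os.environ["MKL_VERBOSE"] = "1"

import numpy as np

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)

# Dispatches to dgemm; with MKL_VERBOSE=1, MKL prints one line per call
# identifying the dispatched code branch (e.g. AVX2) and the call timing.
c = a @ b

print(c.shape)
```

Comparing the branch reported on the Ryzen 1700 against the one reported on the 2990WX would show whether the two MKL versions dispatch to different kernels.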
Gennady,
Is this what you were looking for? I added some timing/scaling tests as well. Note that the matrix may differ slightly from the one in the original post.
********* Ryzen 7 (8 core) ******************
************ 2990wx (32 core) ****************
Gennady,
I've tried a handful of different parameters, including setting iparm[10] (scaling) and iparm[12] (weighted matching) to 0 with iparm[23] = 10. This enables the improved two-level factorization algorithm, trading some numerical robustness for scalability. The factorization is a bit faster, but it is far from scalable.
I think I will leave this alone at this point. I have to imagine scalability was considered for the cluster version of PARDISO; is there anything from it that could be ported to the shared-memory version?
Thanks for your help.