I have the same multi-physics finite element code generating a matrix on two machines, and the older machine with a Ryzen 1700 (8 cores) is faster than a Threadripper 2990WX (32 cores). Both run Windows 10 with intel64 builds linked against mkl_rt.lib; the MKL versions are 2018.1.156 on the Ryzen 1700 and 2019.0.117 on the Threadripper. I can provide an example matrix if that helps. Here are the options, which are the same in both builds:
The reordering and factorization results are below. Solve (omitted here) is also slower on the 2990WX, but my main concern is the numerical factorization time.
*************** Ryzen 7 1700 **********************
=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON
Summary: ( reordering phase )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.847928 s
Time spent in reordering of the initial matrix (reorder) : 7.678907 s
Time spent in symbolic factorization (symbfct) : 2.075314 s
Time spent in data preparations for factorization (parlist) : 0.098494 s
Time spent in allocation of internal data structures (malloc) : 4.281882 s
Time spent in additional calculations : 3.785140 s
Total time spent : 18.767665 s
Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 795666
size of largest supernode: 9159
number of non-zeros in L: 673935341
number of non-zeros in U: 631031607
number of non-zeros in L+U: 1304966948
=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON
Summary: ( factorization phase )
================
Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 53.846398 s
Time spent in allocation of internal data structures (malloc) : 0.000878 s
Time spent in additional calculations : 0.000001 s
Total time spent : 53.847277 s
Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 795666
size of largest supernode: 9159
number of non-zeros in L: 673935341
number of non-zeros in U: 631031607
number of non-zeros in L+U: 1304966948
gflop for the numerical factorization: 2903.934836
gflop/s for the numerical factorization: 53.929973
****************** Threadripper 2990wx *********************************
=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON
Summary: ( reordering phase )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.919861 s
Time spent in reordering of the initial matrix (reorder) : 10.085178 s
Time spent in symbolic factorization (symbfct) : 2.207123 s
Time spent in data preparations for factorization (parlist) : 0.101967 s
Time spent in allocation of internal data structures (malloc) : 3.143640 s
Time spent in additional calculations : 3.677500 s
Total time spent : 20.135269 s
Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 794723
size of largest supernode: 7005
number of non-zeros in L: 683894639
number of non-zeros in U: 640539323
number of non-zeros in L+U: 1324433962
=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON
Summary: ( factorization phase )
================
Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 61.520888 s
Time spent in allocation of internal data structures (malloc) : 0.001112 s
Time spent in additional calculations : 0.000002 s
Total time spent : 61.522003 s
Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 794723
size of largest supernode: 7005
number of non-zeros in L: 683894639
number of non-zeros in U: 640539323
number of non-zeros in L+U: 1324433962
gflop for the numerical factorization: 2879.931235
gflop/s for the numerical factorization: 46.812250
Nearly 2 million unknowns should provide enough work for each core. Manually capping MKL at 16 threads gives a modest speedup (about 53 seconds for numerical factorization), which suggests to me that this is a PARDISO scaling issue rather than a hardware one, although it may also be related to the memory architecture of the 2990WX (only two of its four dies have direct memory-channel access).
Any suggestions?
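For reference, the 16-thread run mentioned above can be reproduced by capping MKL's thread pool before the MKL-linked binary initializes. A minimal sketch, assuming an MKL-linked process and the standard MKL/OpenMP environment variables (a C code could equivalently call mkl_set_num_threads(16)):

```python
import os

# Cap the MKL/OpenMP thread count before any MKL-linked library is loaded.
# MKL_NUM_THREADS takes precedence over OMP_NUM_THREADS for MKL itself.
os.environ["MKL_NUM_THREADS"] = "16"
os.environ["OMP_NUM_THREADS"] = "16"

print(os.environ["MKL_NUM_THREADS"])  # → 16
```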
Do I read that correctly? This could be a problem within MKL PARDISO. Could you take some BLAS function (dgemm, for example), run the test on both systems with MKL_VERBOSE=1 set, and share the output? The output will show which MKL code branch has been called.
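A minimal version of that check, sketched in Python under the assumption that NumPy is linked against MKL (if NumPy uses OpenBLAS instead, no MKL_VERBOSE lines will appear; the variable must be set before MKL is loaded):

```python
import os
os.environ["MKL_VERBOSE"] = "1"  # must be set before MKL is loaded

import time
import numpy as np

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c = a @ b  # dispatches to dgemm in the underlying BLAS
dt = time.perf_counter() - t0

# With MKL_VERBOSE=1, MKL prints the dispatched code branch
# (e.g. AVX2) and thread count to stdout alongside this timing.
print(f"dgemm {n}x{n}: {dt:.3f} s")
```

The verbose line is the interesting part on the 2990WX: MKL 2019 may select a different code path on AMD hardware than MKL 2018 did.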
Gennady,
Is this what you were looking for? I added some timing/scaling tests as well. The matrix may differ slightly from the one in the original post.
********* Ryzen 7 (8 core) ******************
************ 2990wx (32 core) ****************
Gennady,
I've tried a handful of different parameters, including setting iparm[10] and iparm[12] to 0 together with iparm[23]=10. This enables the two-level factorization algorithm for better scalability, at the cost of some reliability from removing the scaling and weighted matching. The factorization is a bit faster, but it is far from scalable.
I think I will leave this alone at this point. I have to imagine scalability was considered for the cluster version of PARDISO; is there anything from it that could be ported to the shared-memory version?
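For anyone reproducing this, the parameter combination described above looks roughly like the following. This is a hedged sketch using 0-based C-style iparm indexing (matching the "0-based array is turned ON" line in the logs); exact semantics should be verified against the MKL PARDISO iparm documentation:

```python
# Sketch of the iparm combination described above (0-based indexing).
iparm = [0] * 64
iparm[0]  = 1   # do not use PARDISO defaults; honor the entries below
iparm[10] = 0   # disable scaling
iparm[12] = 0   # disable weighted matching
iparm[23] = 10  # select the improved two-level parallel factorization

print(iparm[10], iparm[12], iparm[23])  # → 0 0 10
```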
Thanks for your help.
