Intel MKL Pardiso: Multithreaded Performance Lags Behind Sequential

CodingInDelphiIn2023 · ‎09-06-2023

Hello,

We've integrated Intel's MKL_RT.dll into a Delphi Pascal application, specifically to call the Pardiso function. Apart from setting the thread count to 1 for the sequential runs, all other MKL settings are left at their defaults. Our benchmark factorizes and solves approximately 20,000 systems of identical size whose values differ slightly.

Benchmark results:

With our current implementation: 2min 5.01 seconds
With Single Threaded Intel MKL Pardiso (MKL_SET_NUM_THREADS(1)): 2min 48.85 seconds
With default MKL settings (20 threads): 4min 2.83 seconds

Note: These times cover only the factorization and solve phases. We had expected Pardiso to perform better, especially since our in-house solver runs sequentially. We wonder whether our problems are simply too small.
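For scale, the totals above can be turned into an average cost per system. The following is a minimal sketch in C, assuming exactly 20,000 systems (the post only says "approximately"); per_system_ms is a hypothetical helper name.

```c
/* Average wall time per system in milliseconds, from a total given as
 * minutes + seconds. The system count is an assumption: the post says
 * "approximately 20,000". */
double per_system_ms(double minutes, double seconds, long systems)
{
    double total_s = minutes * 60.0 + seconds;
    return total_s * 1000.0 / (double)systems;
}
```

Under that assumption, per_system_ms(2, 5.01, 20000) is about 6.25 ms for the in-house solver, per_system_ms(2, 48.85, 20000) about 8.44 ms for single-threaded Pardiso, and per_system_ms(4, 2.83, 20000) about 12.14 ms with default threading, so each factorize-and-solve call lasts only a few milliseconds.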

General Information

These are our general settings:
mtype = 11

nrhs = 1

maxfct = 1

mnum = 1

msglvl = 0

System Details:

MKL version: Latest
Hardware: Intel i7-12700h with 16GB RAM

Parameter settings:

Single-threaded:

MKL_SET_NUM_THREADS(1)
iparm[0]:=1; // No solver defaults
iparm[1]:=0; // 0 = minimum degree ordering, 2 = fill-in reordering from METIS
iparm[7]:=1; // Max number of iterative refinement steps
iparm[9]:=13; // Perturb the pivot elements with 1E-13
iparm[10]:=1; // Use nonsymmetric permutation and scaling MPS
iparm[24]:=0; // Meant to select the factorization algorithm; note that with zero-based indexing the two-level factorization switch is iparm[23], while iparm[24] controls the solve step

Multithreaded:
iparm[0]:=1; // No solver defaults
iparm[1]:=2; // 0 = minimum degree ordering, 2 = fill-in reordering from METIS
iparm[7]:=1; // Max number of iterative refinement steps
iparm[9]:=13; // Perturb the pivot elements with 1E-13
iparm[10]:=1; // Use nonsymmetric permutation and scaling MPS
iparm[24]:=10; // Meant to enable the improved two-level factorization; note that with zero-based indexing that switch is iparm[23]

We also captured a snapshot of matrix statistics and timings during one of our test runs (using single-threaded mode). The metrics for other matrices are similar.

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.000071 s
Time spent in reordering of the initial matrix (reorder) : 0.000127 s
Time spent in symbolic factorization (symbfct) : 0.001805 s
Time spent in data preparations for factorization (parlist) : 0.000004 s
Time spent in allocation of internal data structures (malloc) : 0.001410 s
Time spent in matching/scaling : 0.000005 s
Time spent in additional calculations : 0.000170 s
Total time spent : 0.003592 s

Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
number of equations: 1350
number of non-zeros in A: 8108
number of non-zeros in A (%): 0.444883

number of right-hand sides: 1

< Factors L and U >
< Preprocessing with multiple minimum degree, tree height >
< Reduction for efficient parallel factorization >
number of columns for each panel: 128
number of independent subgraphs: 0
number of supernodes: 674
size of largest supernode: 4
number of non-zeros in L: 5464
number of non-zeros in U: 2756
number of non-zeros in L+U: 8220

=== PARDISO: solving a real nonsymmetric system ===
Single-level factorization algorithm is turned ON

Summary: ( starting phase is factorization, ending phase is solution )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 0.009434 s
Time spent in direct solver at solve step (solve) : 0.000126 s
Time spent in allocation of internal data structures (malloc) : 0.000451 s
Time spent in additional calculations : 0.000014 s
Total time spent : 0.010027 s

Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
number of equations: 1350
number of non-zeros in A: 8108
number of non-zeros in A (%): 0.444883

number of right-hand sides: 1

< Factors L and U >
< Preprocessing with multiple minimum degree, tree height >
< Reduction for efficient parallel factorization >
number of columns for each panel: 128
number of independent subgraphs: 0
number of supernodes: 674
size of largest supernode: 4
number of non-zeros in L: 5464
number of non-zeros in U: 2756
number of non-zeros in L+U: 8220
gflop for the numerical factorization: 0.000025

gflop/s for the numerical factorization: 0.002681
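The density and throughput figures in the log can be reproduced from the other numbers it prints. Here is a small sketch in C; the helper names are illustrative and not part of MKL.

```c
/* Percentage of non-zero entries in an n-by-n sparse matrix. */
double density_percent(long n, long nnz)
{
    return 100.0 * (double)nnz / ((double)n * (double)n);
}

/* Throughput in Gflop/s from a Gflop count and a wall time in seconds. */
double gflops_per_second(double gflop, double seconds)
{
    return gflop / seconds;
}
```

density_percent(1350, 8108) gives roughly 0.4449, matching the log. gflops_per_second(0.000025, 0.009434) gives roughly 0.00265, close to the reported 0.002681; the gap is consistent with the Gflop count being printed with only two significant digits. At about 25,000 floating-point operations per factorization, there is very little arithmetic in each call, which may mean that fixed per-call costs matter more than raw compute here.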

We're seeking insights from those experienced with Intel MKL Pardiso. Specifically, we're puzzled as to why Pardiso appears optimal when restricted to just one core. If further testing or additional information is required, I'm more than willing to provide it. Thanks for your assistance.

ShanmukhS_Intel · ‎09-08-2023

Hi Mitch,

Thanks for posting in Intel Communities and elaborating on your issue.

Could you please share a sample reproducer so that we can investigate the issue in our environment and assist you accordingly?

Best Regards,

Shanmukh.SS

CodingInDelphiIn2023 · ‎09-08-2023

Hey Shanmukh,

Thank you for your reply. I hope to deliver a reproducer early next week.

Have a nice weekend,

Mitch

CodingInDelphiIn2023 · ‎09-11-2023

Hey @ShanmukhS_Intel hope you had a good weekend,

I'm working on the reproducer right now. To do this easily without pulling in large parts of our application's code, I'm dumping IA/JA/A to .txt files so that I can load them in a small console test application. We're essentially computing a simulation that changes slightly over time, where each step depends on the results of the previous calculation. (This is why we can't run it in parallel at a higher level.)
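Because the reproducer depends on the dumped IA/JA arrays being loaded back correctly, a structural check on them can be cheap insurance. This is a minimal sketch in C, assuming the dump uses Pardiso's default one-based indexing (the thread does not say which convention the files use); csr_one_based_is_valid is a hypothetical helper, not an MKL function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if ia/ja describe a structurally valid n-by-n matrix in
 * one-based CSR form: ia[0] == 1, row pointers non-decreasing, and column
 * indices within 1..n and strictly increasing inside each row.
 * ia must hold n + 1 entries and ja must hold ia[n] - 1 entries. */
bool csr_one_based_is_valid(int n, const int *ia, const int *ja)
{
    if (n <= 0 || ia == NULL || ja == NULL || ia[0] != 1)
        return false;

    for (int row = 0; row < n; ++row) {
        if (ia[row + 1] < ia[row])
            return false;
        for (int k = ia[row] - 1; k < ia[row + 1] - 1; ++k) {
            if (ja[k] < 1 || ja[k] > n)
                return false;
            if (k > ia[row] - 1 && ja[k] <= ja[k - 1])
                return false;
        }
    }
    return true;
}
```

Running such a check once per loaded file, outside the timed region, would help rule out a damaged dump as the cause of odd timings.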

My question for you: I currently have 29,704 files (each containing IA, JA and A), 14.5 GB in total. I'm compressing them now and expect the archive to end up around 3 GB. I can imagine this is too large for you. Could you let me know how far you want me to cut down one of our calculations?

Regards,

Mitch

ShanmukhS_Intel · ‎09-17-2023

Hi Mitch,

Generally, an isolated C or Fortran reproducer that lets us reproduce the issue is sufficient. It would help if you could compress the data and keep the shared package as small as possible. Please make sure the C or Fortran code meets the MKL requirements listed below.

https://www.intel.com/content/www/us/en/developer/articles/system-requirements/oneapi-math-kernel-library-system-requirements.html

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎09-22-2023

Hi,

A gentle reminder:

Could you please get back to us with the earlier requested details so that we could look into your issue further?

Best Regards,

Shanmukh.SS

CodingInDelphiIn2023 · ‎09-25-2023

Apologies, we had some critical issues that took priority this past week. The reproducer is basically finished; I just need to slim it down. I hope to deliver it tomorrow or Wednesday at the latest. Thank you for your patience.

ShanmukhS_Intel · ‎09-26-2023

Hi,

As mentioned earlier, we need a C or Fortran reproducer that builds in one of the development environments listed below.

https://www.intel.com/content/www/us/en/developer/articles/system-requirements/oneapi-math-kernel-library-system-requirements.html

Best Regards,

Shanmukh.SS

CodingInDelphiIn2023 · ‎09-27-2023

- I removed this post since I misconfigured this benchmark.

CodingInDelphiIn2023 · ‎10-05-2023

Hey ShanmukhS,

Is everything in order with the reproducer / do you have any news?

CodingInDelphiIn2023 · ‎09-28-2023

I managed to recreate the issue.

Shared settings:

iparm[0] = 1;
iparm[1] = 0;
iparm[7] = 1;
iparm[9] = 13;
iparm[10] = 1;
iparm[17] = -1;
iparm[18] = -1;
iparm[23] = 0;
iparm[24] = 1;
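Because iparm[0] = 1 tells Pardiso not to fill in defaults, every other entry has to be set by the caller. One way to make that explicit is sketched below in C. It assumes the LP64 interface, where MKL_INT is a plain int; set_shared_iparm and IPARM_LEN are hypothetical names. The comments paraphrase the oneMKL iparm documentation and should be checked against it.

```c
#include <string.h>

#define IPARM_LEN 64  /* Pardiso's iparm array has 64 entries */

/* Zero every entry, then apply the shared settings listed above.
 * Assumes MKL_INT is int (LP64 interface). */
void set_shared_iparm(int iparm[IPARM_LEN])
{
    memset(iparm, 0, IPARM_LEN * sizeof iparm[0]);

    iparm[0]  = 1;   /* do not use solver defaults */
    iparm[1]  = 0;   /* minimum degree ordering */
    iparm[7]  = 1;   /* at most one iterative refinement step */
    iparm[9]  = 13;  /* perturb pivots with 1E-13 */
    iparm[10] = 1;   /* enable scaling */
    iparm[17] = -1;  /* report number of non-zeros in the factors */
    iparm[18] = -1;  /* report floating-point operation count */
    iparm[23] = 0;   /* classic (single-level) factorization */
    iparm[24] = 1;   /* sequential forward/backward solve */
}
```

Calling this once before the first Pardiso call keeps the single-threaded and multithreaded runs on identical solver settings, so the thread count is the only variable.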

Single-threaded: MKL_Set_Num_Threads(1)

Multithreaded: MKL_Set_Num_Threads(14), the number of physical cores I have (6 performance cores with 2 threads each, plus 8 efficiency cores).

Setting it to 14 gives better results than 20, which I also remember reading somewhere in the documentation.

Delphi Pascal settings:
In Delphi I'm also explicitly calling

MKL_SET_INTERFACE_LAYER(0);
MKL_SET_THREADING_LAYER(0);
MKL_SET_NUM_THREADS(14);
MKL_SET_MPI(2);
MKL_SET_DYNAMIC(1);

The benchmark results with the settings as written above for 1000 input files:
C:

Total single-threaded time (static MKL): 2.399109 seconds

Total multi-threaded time (static MKL): 4.694244 seconds

Total single-threaded time (dynamic MKL_RT.2.DLL): 2.175781 seconds

Total multi-threaded time (dynamic MKL_RT.2.DLL): 5.743867 seconds

Delphi Pascal:

Total single-threaded time (dynamic MKL_RT.2.DLL): 2.3117693 seconds

Total multi-threaded time (dynamic MKL_RT.2.DLL): 4.2820304 seconds

As you can see, these findings are much more in line with each other. Attached are the C source and 100 of our input files, which were dumped to .txt from Delphi. The C program looks for the input files at \MKL_Input\{d}.txt. The results of running this benchmark on my system are:

Total single-threaded time (static MKL): 0.424015 seconds
Total multi-threaded time (static MKL): 0.828858 seconds
Total single-threaded time (dynamic MKL_RT.2.DLL): 0.227240 seconds
Total multi-threaded time (dynamic MKL_RT.2.DLL): 0.607243 seconds

ShanmukhS_Intel · ‎10-09-2023

Hi Mitch,

Thanks for the details. Could you please confirm the OS and IDE being used, along with any steps needed to reproduce the issue?

Best Regards,

Shanmukh.SS

CodingInDelphiIn2023 · ‎10-10-2023

Software:
IDE: Visual Studio 2022

Compiler: Intel C++ Compiler 2023

OS: Windows 11 Pro

Hardware:

Dell Precision 5570 laptop

CPU: Intel i7-12700h

RAM: 16 GB

Some further IDE settings. Please feel free to let me know if additional information would be helpful.

I also attached the entire project in case it's helpful.

CodingInDelphiIn2023 · ‎10-11-2023

MKL 2023.2.1 (latest) by the way

ShanmukhS_Intel · ‎10-17-2023

Hi Mitch,

Thanks for sharing the project file. We compiled and ran it; the results are below for your reference. We are looking into your issue. Could you please tell us what results you expected?

Failed to open file: MKL_INPUT\101.txt

Total single-threaded time (static MKL): 1.754569 seconds

Total multi-threaded time (static MKL): 4.244060 seconds

Failed to open file: MKL_INPUT\101.txt

Total single-threaded time (dynamic MKL_RT.2.DLL): 0.881172 seconds

Total multi-threaded time (dynamic MKL_RT.2.DLL): 0.748737 seconds

Best Regards,

Shanmukh.SS

CodingInDelphiIn2023 · ‎10-19-2023

Hey Shanmukh,

System Differences

It appears that we're running these benchmarks on different systems. I'm using a 12700H, which has 14 physical cores (6 performance cores that support two threads each, plus 8 efficiency cores). In comparison, your single-threaded runs are roughly 4 times slower than mine, and your multithreaded run on static MKL is around 5 times slower. Intriguingly, you see a performance boost with the multithreaded version of the dynamic MKL_RT.2.DLL, while mine slows down by a factor of about 2.7.

Here's a breakdown of the times I recorded:

Total single-threaded time (static MKL): 0.424015 seconds
Total multi-threaded time (static MKL): 0.828858 seconds
Total single-threaded time (dynamic MKL_RT.2.DLL): 0.227240 seconds
Total multi-threaded time (dynamic MKL_RT.2.DLL): 0.607243 seconds
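The factors quoted in this comparison follow directly from the two sets of timings. A short sketch in C, with illustrative helper names:

```c
/* How many times slower the multithreaded run is than the single-threaded
 * run on the same machine. Values below 1 mean threading helped. */
double threading_slowdown(double single_s, double multi_s)
{
    return multi_s / single_s;
}

/* How many times slower another machine is for the same configuration. */
double cross_system_ratio(double other_s, double mine_s)
{
    return other_s / mine_s;
}
```

On my numbers, threading_slowdown(0.424015, 0.828858) is about 1.95 for static MKL and threading_slowdown(0.227240, 0.607243) is about 2.67 for the dynamic DLL. On the Intel run, threading_slowdown(0.881172, 0.748737) is about 0.85, a modest speedup. Across systems, cross_system_ratio(1.754569, 0.424015) is about 4.14 for static single-threaded and cross_system_ratio(4.244060, 0.828858) is about 5.12 for static multithreaded.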

Expected Results & Comparisons

For some context, our in-house solver is written in Delphi Pascal, which typically runs about 2x-2.5x slower than equivalent C++/inline assembly. Nonetheless, it currently solves our problems faster than the MKL library does in single-threaded mode. Our interest in MKL is due to its multithreading support. Ideally, multithreading would make Pardiso at least 40% faster than its single-threaded run (a factor of only 1.4, which is modest for parallelization). At that point it would surpass our in-house solver, making MKL a candidate for integration into our systems. Any improvement beyond 1.4x would be pure gain.
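To put the 1.4x target in perspective, Amdahl's law gives the share of the runtime that would have to parallelize perfectly to reach it. Amdahl's law is brought in here as an outside model and is not something the thread itself uses; it also ignores threading overhead, so the result is an optimistic lower bound. required_parallel_fraction is a hypothetical helper name.

```c
/* Parallel fraction p needed to reach target_speedup on `threads` threads,
 * solved from Amdahl's law: S = 1 / ((1 - p) + p / n).
 * Returns -1.0 for invalid input. A result above 1.0 means the target is
 * unreachable with that many threads. */
double required_parallel_fraction(double target_speedup, int threads)
{
    if (threads <= 1 || target_speedup <= 0.0)
        return -1.0;
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / (double)threads);
}
```

required_parallel_fraction(1.4, 14) evaluates to 4/13, or about 0.31: under this idealized model, roughly 31% of the single-threaded Pardiso time would need to scale perfectly across 14 threads.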

Seeking Explanation

Any insights into why the multithreaded performance lags behind the single-threaded version would be greatly appreciated. We're keen to understand if the problem size is the primary factor or if there are other underlying reasons.

ShanmukhS_Intel · ‎10-23-2023

Hi Mitch,

It seems the problem being solved is not big enough to take advantage of multiple threads in the MKL Pardiso routine, so multithreading brings no noticeable improvement at this size.
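One way an application could act on this is to choose the thread count per problem before calling the solver. The sketch below is a hypothetical policy in C, not something MKL provides: pick_thread_count and its threshold parameter are made up for illustration, and the threshold would have to be found by measurement.

```c
/* Hypothetical policy: run sequentially unless the factors are large enough.
 * min_nnz_for_threads is a tuning knob to be measured on the target machine,
 * not an MKL constant. */
int pick_thread_count(long nnz_in_factors, long min_nnz_for_threads, int max_threads)
{
    if (max_threads < 1)
        return 1;
    return (nnz_in_factors >= min_nnz_for_threads) ? max_threads : 1;
}
```

The returned value could be passed to mkl_set_num_threads before the Pardiso call. For the matrices in this thread (8220 non-zeros in L+U), any reasonably large threshold would select a single thread.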

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎11-14-2023

Hi,

A gentle reminder:

Could you please let us know if there are any updates on your issue?

Best Regards,

Shanmukh.SS

CodingInDelphiIn2023 · ‎11-14-2023

Hey,

Not really. I still wouldn't expect the multithreaded run with dynamic threading to be so much worse than explicitly restricting MKL to a single core. The problem size might be small, but surely the overhead of dynamic parallelism shouldn't lead to a 2x-3x slowdown?

ShanmukhS_Intel · ‎12-04-2023

Hi Mitch,

Yes, multithreaded mode is not recommended for small workloads, so we are unable to treat this case as a genuine issue. Could you please let us know if you have any other questions?

Best Regards,

Shanmukh.SS

ShanmukhS_Intel · ‎12-26-2023

Hi Mitch,

A gentle reminder:

We haven't heard back from you. If you need further assistance, you could post a new community thread.

Best Regards,

Shanmukh.SS