Intel® oneAPI Math Kernel Library

Intel MKL Pardiso: Multithreaded Performance Lags Behind Sequential

CodingInDelphiIn2023

Hello,

We've integrated Intel MKL (mkl_rt.dll) into a Delphi Pascal application, primarily calling the Pardiso function. Apart from setting the number of threads for the single-threaded runs, all MKL settings are left at their defaults. Our benchmark factorizes and solves approximately 20,000 systems of identical size, but with slight variations in the values.

Benchmark results:

  • With our current implementation: 2min 5.01 seconds
  • With Single Threaded Intel MKL Pardiso (MKL_SET_NUM_THREADS(1)): 2min 48.85 seconds
  • With default MKL settings (20 threads): 4min 2.83 seconds

Note: These times reflect only the factorization and solve phases. We had expected Pardiso to perform better, especially since our in-house solver runs sequentially. We wonder whether our problem size is simply too small.

 

General Information

These are our general settings:
mtype = 11

nrhs = 1

maxfct = 1

mnum = 1

msglvl = 0
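
Put together, one factorize-and-solve cycle with these settings looks roughly like the following C sketch (the function and variable names are placeholders, not our actual Delphi code; the CSR arrays ia/ja/a use 1-based indices, Pardiso's default):

/* Minimal sketch of one factorize-and-solve cycle with the settings above.
   The CSR arrays (ia/ja/a) and the right-hand side b are assumed to be
   filled in already; all names here are placeholders. */
#include <mkl.h>

void factor_and_solve(MKL_INT n, MKL_INT *ia, MKL_INT *ja, double *a,
                      double *b, double *x)
{
    void    *pt[64]    = { 0 };  /* internal solver handle; must start zeroed */
    MKL_INT  iparm[64] = { 0 };
    MKL_INT  maxfct = 1, mnum = 1, mtype = 11, nrhs = 1, msglvl = 0;
    MKL_INT  phase, error = 0, idum = 0;
    double   ddum = 0.0;         /* dummy for unused array arguments */

    iparm[0]  = 1;   /* no solver defaults; iparm is set explicitly */
    iparm[1]  = 2;   /* METIS fill-in reordering */
    iparm[7]  = 1;   /* max number of iterative refinement steps */
    iparm[9]  = 13;  /* perturb pivots with 1E-13 */
    iparm[10] = 1;   /* nonsymmetric permutation and scaling */

    phase = 11;      /* analysis / reordering */
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
            &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

    phase = 23;      /* numerical factorization + solve */
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
            &idum, &nrhs, iparm, &msglvl, b, x, &error);

    phase = -1;      /* release internal memory */
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, &ddum, ia, ja,
            &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);
}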

 

System Details:

  • MKL version: Latest
  • Hardware: Intel Core i7-12700H with 16 GB RAM

 

Parameter settings:

Single-threaded:

MKL_SET_NUM_THREADS(1)
iparm[0]:=1;  // No solver defaults
iparm[1]:=0;  // 0 = minimum degree ordering, 2 = fill-in reordering from METIS
iparm[7]:=1;  // Max number of iterative refinement steps
iparm[9]:=13; // Perturb the pivot elements with 1E-13
iparm[10]:=1; // Use nonsymmetric permutation and scaling (MPS)
iparm[24]:=0; // Two-level factorization algorithm selection

 

Multithreaded:
iparm[0]:=1;  // No solver defaults
iparm[1]:=2;  // 0 = minimum degree ordering, 2 = fill-in reordering from METIS
iparm[7]:=1;  // Max number of iterative refinement steps
iparm[9]:=13; // Perturb the pivot elements with 1E-13
iparm[10]:=1; // Use nonsymmetric permutation and scaling (MPS)
iparm[24]:=10; // Two-level factorization algorithm selection

 

We also captured a snapshot of matrix statistics and timings during one of our test runs (single-threaded mode, with msglvl set to 1 so Pardiso prints its reports). The metrics for the other matrices are similar.

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.000071 s
Time spent in reordering of the initial matrix (reorder) : 0.000127 s
Time spent in symbolic factorization (symbfct) : 0.001805 s
Time spent in data preparations for factorization (parlist) : 0.000004 s
Time spent in allocation of internal data structures (malloc) : 0.001410 s
Time spent in matching/scaling : 0.000005 s
Time spent in additional calculations : 0.000170 s
Total time spent : 0.003592 s

Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
number of equations: 1350
number of non-zeros in A: 8108
number of non-zeros in A (%): 0.444883

number of right-hand sides: 1

< Factors L and U >
< Preprocessing with multiple minimum degree, tree height >
< Reduction for efficient parallel factorization >
number of columns for each panel: 128
number of independent subgraphs: 0
number of supernodes: 674
size of largest supernode: 4
number of non-zeros in L: 5464
number of non-zeros in U: 2756
number of non-zeros in L+U: 8220

 

=== PARDISO: solving a real nonsymmetric system ===
Single-level factorization algorithm is turned ON


Summary: ( starting phase is factorization, ending phase is solution )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 0.009434 s
Time spent in direct solver at solve step (solve) : 0.000126 s
Time spent in allocation of internal data structures (malloc) : 0.000451 s
Time spent in additional calculations : 0.000014 s
Total time spent : 0.010027 s

Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
number of equations: 1350
number of non-zeros in A: 8108
number of non-zeros in A (%): 0.444883

number of right-hand sides: 1

< Factors L and U >
< Preprocessing with multiple minimum degree, tree height >
< Reduction for efficient parallel factorization >
number of columns for each panel: 128
number of independent subgraphs: 0
number of supernodes: 674
size of largest supernode: 4
number of non-zeros in L: 5464
number of non-zeros in U: 2756
number of non-zeros in L+U: 8220
gflop for the numerical factorization: 0.000025

gflop/s for the numerical factorization: 0.002681


We're seeking insights from anyone experienced with Intel MKL Pardiso. Specifically, we're puzzled as to why Pardiso performs best when restricted to a single core. If further testing or additional information is required, I'm more than willing to provide it. Thanks for your assistance.

ShanmukhS_Intel
Moderator

Hi Mitch,


Thanks for posting in Intel Communities and elaborating on your issue.


Could you please share a sample reproducer with us, so that we can investigate the issue in our environment and assist you accordingly?


Best Regards,

Shanmukh.SS


CodingInDelphiIn2023

Hey Shanmukh,

 

Thank you for your reply. I hope to deliver a reproducer early next week.

 

Have a nice weekend,

Mitch

CodingInDelphiIn2023

Hey @ShanmukhS_Intel, hope you had a good weekend!

 

I'm working on the reproducer right now. To build it without pulling in large parts of our application's code, I'm dumping IA/JA/A to .txt files so that I can load them from a console test application. We're essentially running a simulation that changes slightly over time, where each calculation is affected by the results of the previous one. (This is why we can't parallelize at a higher level.)
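
For reference, the loader on the console-application side is roughly the following (a hypothetical sketch; the actual text layout of our dump files may differ, here assumed to be n and nnz on the first line followed by the IA, JA, and A arrays):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical loader: reads one dumped system in CSR form. */
int load_csr(const char *path, int *n, int **ia, int **ja, double **a)
{
    FILE *f = fopen(path, "r");
    int nnz, i;
    *ia = NULL; *ja = NULL; *a = NULL;
    if (!f) return -1;
    if (fscanf(f, "%d %d", n, &nnz) != 2) goto fail;
    *ia = malloc((*n + 1) * sizeof(int));
    *ja = malloc(nnz * sizeof(int));
    *a  = malloc(nnz * sizeof(double));
    if (!*ia || !*ja || !*a) goto fail;
    for (i = 0; i <= *n; i++)
        if (fscanf(f, "%d", &(*ia)[i]) != 1) goto fail;
    for (i = 0; i < nnz; i++)
        if (fscanf(f, "%d", &(*ja)[i]) != 1) goto fail;
    for (i = 0; i < nnz; i++)
        if (fscanf(f, "%lf", &(*a)[i]) != 1) goto fail;
    fclose(f);
    return 0;
fail:
    free(*ia); free(*ja); free(*a);
    fclose(f);
    return -1;
}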

 

Regardless, my question to you: I currently have 29,704 files (each containing IA, JA, and A) totaling 14.5 GB. I'm compressing them now and expect the archive to end up around 3 GB. I can imagine this is not acceptable to you. Could you let me know how far you want me to trim down one of our calculations?

 

Regards,

Mitch

 

ShanmukhS_Intel
Moderator

Hi Mitch,

 

Generally, an isolated C or Fortran reproducer that lets us reproduce the issue is fine for us. It would be helpful if you could compress it and keep the size as small as possible. Please make sure the C or Fortran code meets the MKL requirements mentioned below.

 

https://www.intel.com/content/www/us/en/developer/articles/system-requirements/oneapi-math-kernel-library-system-requirements.html

 

Best Regards,

Shanmukh.SS

 

ShanmukhS_Intel
Moderator

Hi,


A gentle reminder:

Could you please get back to us with the earlier requested details so that we could look into your issue further?


Best Regards,

Shanmukh.SS


CodingInDelphiIn2023

Apologies, we had some critical issues that took priority this past week. The reproducer is basically finished; I just need to slim it down. I hope to deliver it tomorrow, or Wednesday at the latest. Thank you for your patience.

ShanmukhS_Intel
Moderator

Hi,

 

As mentioned earlier, we would require a C or Fortran reproducer, built against one of the supported development environments listed below, that lets us reproduce the issue.

 

https://www.intel.com/content/www/us/en/developer/articles/system-requirements/oneapi-math-kernel-library-system-requirements.html

 

Best Regards,

Shanmukh.SS

 

CodingInDelphiIn2023

- I removed this post since I had misconfigured the benchmark.

CodingInDelphiIn2023

Hey ShanmukhS,

 

Is everything in order with the reproducer / do you have any news?

CodingInDelphiIn2023

I managed to recreate the issue.


Shared settings:

iparm[0] = 1;    // No solver defaults
iparm[1] = 0;    // Minimum degree ordering
iparm[7] = 1;    // Max number of iterative refinement steps
iparm[9] = 13;   // Perturb the pivot elements with 1E-13
iparm[10] = 1;   // Use nonsymmetric permutation and scaling (MPS)
iparm[17] = -1;  // Output: number of non-zeros in the factor LU
iparm[18] = -1;  // Output: Mflops for LU factorization
iparm[23] = 0;   // Parallel factorization control (classic algorithm)
iparm[24] = 1;   // Parallel forward/backward solve control

 

On single-threaded: MKL_Set_Num_Threads(1)

On multithreaded: MKL_Set_Num_Threads(14) (the number of physical cores I have: 6 performance cores with two threads each, plus 8 efficiency cores)

Setting it to 14 leads to better results than 20, which I also remember reading about somewhere in the documentation.

 

Delphi Pascal settings:
In Delphi I'm also explicitly calling:

MKL_SET_INTERFACE_LAYER(0);  // 0 = LP64 interface
MKL_SET_THREADING_LAYER(0);  // 0 = Intel OpenMP threading layer
MKL_SET_NUM_THREADS(14);
MKL_SET_MPI(2);
MKL_SET_DYNAMIC(1);          // let MKL adjust the thread count dynamically
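
For clarity, the benchmark harness toggles the thread count around the solve loop roughly like this (a sketch; solve_all() is a hypothetical placeholder for the factorize-and-solve loop over the input files, not the actual reproducer code):

#include <mkl.h>

/* Hypothetical placeholder for the loop over all input systems. */
extern void solve_all(void);

double bench(int nthreads)
{
    mkl_set_dynamic(0);            /* pin the thread count for the measurement */
    mkl_set_num_threads(nthreads);
    double t0 = dsecnd();          /* MKL's wall-clock timer */
    solve_all();
    return dsecnd() - t0;
}

/* e.g. printf("ST: %f s, MT: %f s\n", bench(1), bench(14)); */

Note that pinning the count with mkl_set_dynamic(0) is a measurement choice; with MKL_SET_DYNAMIC(1), as in the Delphi code above, MKL is free to use fewer threads than requested.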

 

The benchmark results with the settings as written above, for 1000 input files:

C:

  • Total single-threaded time (static MKL): 2.399109 seconds
  • Total multi-threaded time (static MKL): 4.694244 seconds
  • Total single-threaded time (dynamic MKL_RT.2.DLL): 2.175781 seconds
  • Total multi-threaded time (dynamic MKL_RT.2.DLL): 5.743867 seconds

 

Delphi Pascal:

  • Total single-threaded time (dynamic MKL_RT.2.DLL): 2.3117693 seconds
  • Total multi-threaded time (dynamic MKL_RT.2.DLL): 4.2820304 seconds

 

One can see that these findings are much more in line with each other. Attached are 100 of our input files (dumped to .txt from Delphi) and the C source. The C program looks for the input files at \MKL_Input\{d}.txt. The results of this benchmark run on my system are:

  • Total single-threaded time (static MKL): 0.424015 seconds
  • Total multi-threaded time (static MKL): 0.828858 seconds
  • Total single-threaded time (dynamic MKL_RT.2.DLL): 0.227240 seconds
  • Total multi-threaded time (dynamic MKL_RT.2.DLL): 0.607243 seconds

 

ShanmukhS_Intel
Moderator

Hi Mitch,


Thanks for the details. Could you please confirm the OS and IDE being used, and share any steps needed to reproduce the issue?


Best Regards,

Shanmukh.SS


CodingInDelphiIn2023

Software:
IDE: Visual Studio 2022

Compiler: Intel C++ Compiler 2023

OS: Windows 11 Pro


Hardware:

Dell Precision 5570 laptop

CPU: Intel Core i7-12700H

RAM: 16 GB

 

Some further IDE settings. Please feel free to let me know if additional information would be helpful.

I also attached the entire project in case it's helpful.

[Attached: screenshots of the Visual Studio project and Intel compiler settings.]

 

CodingInDelphiIn2023

By the way, this is with MKL 2023.2.1 (the latest).

ShanmukhS_Intel
Moderator

Hi Mitch,

 

Thanks for sharing the project file. We tried compiling and running it; below are the results for your reference. We are looking into your issue. Could you please get back to us with the expected results?

 

Failed to open file: MKL_INPUT\101.txt

Total single-threaded time (static MKL): 1.754569 seconds

Total multi-threaded time (static MKL): 4.244060 seconds

Failed to open file: MKL_INPUT\101.txt

Total single-threaded time (dynamic MKL_RT.2.DLL): 0.881172 seconds

Total multi-threaded time (dynamic MKL_RT.2.DLL): 0.748737 seconds

 

Best Regards,

Shanmukh.SS

 

CodingInDelphiIn2023

Hey Shanmukh,

 

System Differences

It appears that we're running these benchmarks on different systems. I'm using a 12700H, which has 14 physical cores: 6 performance cores that support two threads each, plus 8 efficiency cores. In comparison, your single-threaded performance is roughly 4 times slower, and your multithreaded performance on static MKL is around 5 times slower. Intriguingly, you see a performance boost with the multithreaded version of the dynamic MKL_RT.2.DLL, while mine slows down by a factor of almost 3.

Here's a breakdown of the times I recorded:

  • Total single-threaded time (static MKL): 0.424015 seconds
  • Total multi-threaded time (static MKL): 0.828858 seconds
  • Total single-threaded time (dynamic MKL_RT.2.DLL): 0.227240 seconds
  • Total multi-threaded time (dynamic MKL_RT.2.DLL): 0.607243 seconds

Expected Results & Comparisons

For some context, our in-house solver, written in Delphi Pascal, typically runs about 2x-2.5x slower than equivalent C++/inline assembly. Nonetheless, it currently solves these problems faster than MKL does in single-threaded mode. Our interest in MKL comes from its multithreading support. Ideally, we'd want multithreaded Pardiso to beat its single-threaded time by at least 40% (a speedup of only 1.4x, which is modest for parallelization); going by the full benchmark, 168.85 s ÷ 1.4 ≈ 120.6 s, just under our in-house solver's 125.0 s. That would make MKL a viable integration for our systems, and any improvement beyond the 1.4x would be pure gain.

Seeking Explanation

Any insights into why the multithreaded performance lags behind the single-threaded version would be greatly appreciated. We're keen to understand if the problem size is the primary factor or if there are other underlying reasons.

ShanmukhS_Intel
Moderator

Hi Mitch,

 

It seems the problem being solved is not big enough to take full advantage of the performance benefits of running the MKL Pardiso subroutine on multiple threads. Because each system is relatively small, the threading overhead outweighs the gain, so multithreading yields no noticeable improvement.

 

Best Regards,

Shanmukh.SS

 

ShanmukhS_Intel
Moderator

Hi,


A gentle reminder:

Could you please let us know if there are any updates on your issue?


Best Regards,

Shanmukh.SS


CodingInDelphiIn2023

Hey,

 

Not really. I still wouldn't expect the multithreaded configuration to be this detrimental compared to explicitly restricting MKL to a single core. The problem size may be small, but surely the overhead of dynamic parallelism shouldn't cause a 2x-3x slowdown?

ShanmukhS_Intel
Moderator

Hi Mitch,


We are looking into your issue internally. We will get back to you soon with an update.


Best Regards,

Shanmukh.SS

