Hello,
We've integrated Intel's MKL_RT.dll into a Delphi Pascal application, primarily to call the Pardiso solver. Apart from setting the number of threads for sequential execution, all other MKL settings are at their defaults. Our benchmark factorizes and solves approximately 20,000 systems of identical size but with slightly varying values.
Benchmark results:
- With our current implementation: 2min 5.01 seconds
- With Single Threaded Intel MKL Pardiso (MKL_SET_NUM_THREADS(1)): 2min 48.85 seconds
- With default MKL settings (20 threads): 4min 2.83 seconds
Note: These times reflect only the factorization and solving portions. We had expected Pardiso to perform better, especially since our in-house solver also runs sequentially. We wonder whether our problem size is simply too small.
General Information
These are our general settings:
mtype = 11
nrhs = 1
maxfct = 1
mnum = 1
msglvl = 0
System Details:
- MKL version: Latest
- Hardware: Intel i7-12700h with 16GB RAM
Parameter settings:
Single-threaded:
MKL_SET_NUM_THREADS(1)
iparm[0]:=1;  // No solver default
iparm[1]:=0;  // 0=Minimum degree ordering, 2=Fill-in reordering from METIS
iparm[7]:=1;  // Max number of iterative refinement steps
iparm[9]:=13; // Perturb the pivot elements with 1E-13
iparm[10]:=1; // Use nonsymmetric permutation and scaling MPS
iparm[24]:=0; // Uses new two-level factorization algorithm
Multithreaded:
iparm[0]:=1;  // No solver default
iparm[1]:=2;  // 0=Minimum degree ordering, 2=Fill-in reordering from METIS
iparm[7]:=1;  // Max number of iterative refinement steps
iparm[9]:=13; // Perturb the pivot elements with 1E-13
iparm[10]:=1; // Use nonsymmetric permutation and scaling MPS
iparm[24]:=10; // Uses new two-level factorization algorithm
We also captured a snapshot of matrix statistics and timings during one of our test runs (using single-threaded mode). The metrics for other matrices are similar.
Summary: ( reordering phase )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.000071 s
Time spent in reordering of the initial matrix (reorder) : 0.000127 s
Time spent in symbolic factorization (symbfct) : 0.001805 s
Time spent in data preparations for factorization (parlist) : 0.000004 s
Time spent in allocation of internal data structures (malloc) : 0.001410 s
Time spent in matching/scaling : 0.000005 s
Time spent in additional calculations : 0.000170 s
Total time spent : 0.003592 s
Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP
< Linear system Ax = b >
number of equations: 1350
number of non-zeros in A: 8108
number of non-zeros in A (%): 0.444883
number of right-hand sides: 1
< Factors L and U >
< Preprocessing with multiple minimum degree, tree height >
< Reduction for efficient parallel factorization >
number of columns for each panel: 128
number of independent subgraphs: 0
number of supernodes: 674
size of largest supernode: 4
number of non-zeros in L: 5464
number of non-zeros in U: 2756
number of non-zeros in L+U: 8220
=== PARDISO: solving a real nonsymmetric system ===
Single-level factorization algorithm is turned ON
Summary: ( starting phase is factorization, ending phase is solution )
================
Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 0.009434 s
Time spent in direct solver at solve step (solve) : 0.000126 s
Time spent in allocation of internal data structures (malloc) : 0.000451 s
Time spent in additional calculations : 0.000014 s
Total time spent : 0.010027 s
Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP
< Linear system Ax = b >
number of equations: 1350
number of non-zeros in A: 8108
number of non-zeros in A (%): 0.444883
number of right-hand sides: 1
< Factors L and U >
< Preprocessing with multiple minimum degree, tree height >
< Reduction for efficient parallel factorization >
number of columns for each panel: 128
number of independent subgraphs: 0
number of supernodes: 674
size of largest supernode: 4
number of non-zeros in L: 5464
number of non-zeros in U: 2756
number of non-zeros in L+U: 8220
gflop for the numerical factorization: 0.000025
gflop/s for the numerical factorization: 0.002681
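For reference, the density figure in the log can be reproduced directly from the reported n and nnz; a quick check in C (numbers taken from the log above):

```c
#include <assert.h>
#include <math.h>

/* Density (%) of an n-by-n sparse matrix with nnz stored entries,
   matching the "number of non-zeros in A (%)" line in the PARDISO log. */
static double density_percent(int n, int nnz)
{
    return 100.0 * (double)nnz / ((double)n * (double)n);
}
```

With only about 2.5e4 floating-point operations per factorization (per the gflop line in the log), each individual call does very little work.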
We're seeking insights from anyone experienced with Intel MKL Pardiso. Specifically, we're puzzled that Pardiso performs best when restricted to a single core. If further testing or additional information is required, I'm more than willing to provide it. Thanks for your assistance.
- Tags:
- pardiso
- performance
Hi Mitch,
Thanks for posting in Intel Communities and elaborating on your issue.
Could you please share a sample reproducer with us so that we can investigate your issue in our environment and assist you accordingly?
Best Regards,
Shanmukh.SS
Hey Shanmukh,
Thank you for your reply. I hope to deliver a reproducer early next week.
Have a nice weekend,
Mitch
Hey @ShanmukhS_Intel hope you had a good weekend,
I'm working on the reproducer right now. To do this without extracting large parts of our application's code, I'm dumping IA/JA/A to .txt files so that I can load them via a console test application. We're essentially running a simulation that changes slightly over time, where each step depends on the results of the previous calculation. (This is why we can't parallelize at a higher level.)
Regardless, my question is this: I currently have 29,704 files (each containing IA, JA, and A) totaling 14.5 GB. I'm compressing them now and expect the archive to end up around 3 GB. I imagine that's not practical for you, so could you let me know how far you'd like me to trim one of our calculations?
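As an aside, the dump/reload round trip itself is simple; here is a minimal sketch of writing and reading a CSR triplet through a text file. The exact format I'm using in the reproducer may differ — this is just illustrative:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write a CSR triplet (ia/ja/a) for an n-by-n matrix to a text file:
   header "n nnz", then ia, then ja, then a, one value per line. */
static int dump_csr(const char *path, int n,
                    const int *ia, const int *ja, const double *a)
{
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    int nnz = ia[n] - ia[0];   /* works for both 0-based and 1-based ia */
    fprintf(f, "%d %d\n", n, nnz);
    for (int i = 0; i <= n; ++i) fprintf(f, "%d\n", ia[i]);
    for (int k = 0; k < nnz; ++k) fprintf(f, "%d\n", ja[k]);
    for (int k = 0; k < nnz; ++k) fprintf(f, "%.17g\n", a[k]);
    fclose(f);
    return 0;
}

/* Read the same format back; caller frees the three arrays. */
static int load_csr(const char *path, int *n,
                    int **ia, int **ja, double **a)
{
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    int nnz;
    if (fscanf(f, "%d %d", n, &nnz) != 2) { fclose(f); return -1; }
    *ia = malloc((size_t)(*n + 1) * sizeof **ia);
    *ja = malloc((size_t)nnz * sizeof **ja);
    *a  = malloc((size_t)nnz * sizeof **a);
    for (int i = 0; i <= *n; ++i) fscanf(f, "%d", *ia + i);
    for (int k = 0; k < nnz; ++k) fscanf(f, "%d", *ja + k);
    for (int k = 0; k < nnz; ++k) fscanf(f, "%lf", *a + k);
    fclose(f);
    return 0;
}
```

The `%.17g` format preserves doubles exactly across the round trip, so the reproducer factorizes bit-identical matrices to the original run.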
Regards,
Mitch
Hi Mitch,
Generally, an isolated C or Fortran reproducer that lets us reproduce the issue is fine for us. It would be helpful if you could compress it and keep it as small as possible. Please make sure to share C or Fortran code as per the MKL requirements mentioned below.
Best Regards,
Shanmukh.SS
Hi,
A gentle reminder:
Could you please get back to us with the earlier requested details so that we could look into your issue further?
Best Regards,
Shanmukh.SS
Apologies, we had some critical issues that took priority this past week. The reproducer is basically finished; I just need to slim it down. I hope to deliver it tomorrow, or Wednesday at the latest. Thank you for your patience.
Hi,
As mentioned earlier, we would require a C or Fortran reproducer that could help us in reproducing the issue as per the development environments mentioned below.
Best Regards,
Shanmukh.SS
- I removed this post since I misconfigured this benchmark.
Hey ShanmukhS,
Is everything in order with the reproducer / do you have any news?
I managed to recreate the issue.
Shared settings:
iparm[0] = 1;
iparm[1] = 0;
iparm[7] = 1;
iparm[9] = 13;
iparm[10] = 1;
iparm[17] = -1;
iparm[18] = -1;
iparm[23] = 0;
iparm[24] = 1;
On single-threaded: MKL_Set_Num_Threads(1)
On multithreaded: MKL_Set_Num_Threads(14), the number of physical cores on my CPU (6 performance cores with two threads each, plus 8 efficiency cores).
Setting it to 14 gives better results than 20, which matches something I remember reading in the documentation.
Delphi Pascal settings:
In Delphi I'm also explicitly calling
MKL_SET_INTERFACE_LAYER(0);
MKL_SET_THREADING_LAYER(0);
MKL_SET_NUM_THREADS(14);
MKL_SET_MPI(2);
MKL_SET_DYNAMIC(1);
The benchmark results with the settings as written above for 1000 input files:
C:
Total single-threaded time (static MKL): 2.399109 seconds
Total multi-threaded time (static MKL): 4.694244 seconds
Total single-threaded time (dynamic MKL_RT.2.DLL): 2.175781 seconds
Total multi-threaded time (dynamic MKL_RT.2.DLL): 5.743867 seconds
Delphi Pascal:
Total single-threaded time (dynamic MKL_RT.2.DLL): 2.3117693 seconds
Total multi-threaded time (dynamic MKL_RT.2.DLL): 4.2820304 seconds
One can see that these findings are much more in line with each other. Attached are 100 of our input files (dumped to .txt from Delphi) and the C source. The C program looks for the input files at \MKL_Input\{d}.txt. Running the benchmark on my system with these 100 files gives:
Total single-threaded time (static MKL): 0.424015 seconds
Total multi-threaded time (static MKL): 0.828858 seconds
Total single-threaded time (dynamic MKL_RT.2.DLL): 0.227240 seconds
Total multi-threaded time (dynamic MKL_RT.2.DLL): 0.607243 seconds
Hi Mitch,
Thanks for the details. Could you please confirm the OS environment details/ IDE being used and any steps to reproduce the issue?
Best Regards,
Shanmukh.SS
Software:
IDE: Visual Studio 2022
Compiler: Intel C++ Compiler 2023
OS: Windows 11 Pro
Hardware:
Dell Precision 5570 laptop
CPU: Intel i7-12700h
RAM: 16 GB
Some further IDE settings. Please feel free to let me know if additional information would be helpful.
I also attached the entire project in case it's helpful.
By the way, this is with MKL 2023.2.1 (the latest).
Hi Mitch,
Thanks for sharing the project file. We have tried compiling and running the same. Below are the results for your reference. We are looking into your issue. Could you please get back to us with the expected results?
Failed to open file: MKL_INPUT\101.txt
Total single-threaded time (static MKL): 1.754569 seconds
Total multi-threaded time (static MKL): 4.244060 seconds
Failed to open file: MKL_INPUT\101.txt
Total single-threaded time (dynamic MKL_RT.2.DLL): 0.881172 seconds
Total multi-threaded time (dynamic MKL_RT.2.DLL): 0.748737 seconds
Best Regards,
Shanmukh.SS
Hey Shanmukh,
System Differences
It appears that we're running these benchmarks on different systems. I'm using an i7-12700H, which has 14 physical cores: 6 performance cores with two threads each, plus 8 efficiency cores (20 logical processors in total). In comparison, your single-threaded run is approximately 3.5 times slower, and your multithreaded run on static MKL is around 5 times slower. Intriguingly, you see a performance boost with the multithreaded dynamic MKL_RT.2.DLL, while mine slows down by roughly a factor of 3.
Here's a breakdown of the times I recorded:
- Total single-threaded time (static MKL): 0.424015 seconds
- Total multi-threaded time (static MKL): 0.828858 seconds
- Total single-threaded time (dynamic MKL_RT.2.DLL): 0.227240 seconds
- Total multi-threaded time (dynamic MKL_RT.2.DLL): 0.607243 seconds
Expected Results & Comparisons
For some context: our in-house solver, written in Delphi Pascal, typically runs about 2x-2.5x slower than comparable C++/inline-assembly code. Nonetheless, it currently solves these problems faster than the MKL library in single-threaded mode. Our interest in MKL is due to its multithreading support. Ideally, we'd want multithreaded Pardiso to be at least 40% faster than single-threaded (a multiplier of only 1.4x, which is modest for parallelization). If achieved, it would surpass our in-house solver, making MKL a candidate for integration into our systems. Any improvement beyond that 1.4x would be pure gain.
Seeking Explanation
Any insights into why the multithreaded performance lags behind the single-threaded version would be greatly appreciated. We're keen to understand whether the problem size is the primary factor or whether there are other underlying reasons.
Hi Mitch,
It seems the problem being solved is not big enough to take full advantage of the performance benefits offered by using multiple threads with the MKL Pardiso subroutine. The use of multithreading is not resulting in a noticeable improvement due to the relatively small size of the problem.
Best Regards,
Shanmukh.SS
Hi,
A gentle reminder:
Could you please let us know if there are any updates on your issue?
Best Regards,
Shanmukh.SS
Hey,
Not really. I still wouldn't have expected the multithreaded configuration to be so detrimental compared to explicitly restricting MKL to a single core. The problem size might be small, but surely the overhead of dynamic parallelism shouldn't cause a 2x-3x slowdown?
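To make the suspicion concrete, here is a toy cost model (all numbers below are illustrative assumptions, not measurements): if each factorization does w seconds of work, of which a fraction p parallelizes over t threads, and every call pays a fixed threading overhead o, then once o exceeds the time saved by parallelizing, more threads make each call slower. With a whole factorization taking only ~10 ms (as in the log earlier in this thread), that crossover is easy to hit:

```c
#include <assert.h>

/* Toy model: time per call with t threads, given total work w (seconds),
   parallelizable fraction p, and a fixed per-call threading overhead o.
   All parameters are illustrative assumptions, not measured values. */
static double model_time(double w, double p, int t, double o)
{
    double serial   = w * (1.0 - p);          /* part that can't parallelize */
    double parallel = w * p / (double)t;      /* ideally-scaled parallel part */
    return serial + parallel + (t > 1 ? o : 0.0);
}
```

For w = 10 ms and p = 0.8, the maximum possible saving from 14 threads is about 7.4 ms per call, so any fixed per-call overhead beyond that turns multithreading into a net loss.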
Hi Mitch,
We are looking into your issue internally. We will get back to you soon with an update.
Best Regards,
Shanmukh.SS
