
Big Performance Problem with PARDISO 2018 Update 3

Göttinger__Michael — Thu, 23 Aug 2018 18:19:29 GMT

Hello folks,

I have a strange performance problem with PARDISO on Windows. Before I open a support call, I hope to get some feedback in this forum.

I'm using Intel® Parallel Studio XE 2018 Update 3 Composer Edition for Fortran Windows*, Version 18.0.0040.

I have noticed that parallel processing in PARDISO in MKL version 2018.0.3 gives no speedup at all, and processing with only one thread is significantly slower than in version 2016.

I have attached a small C++ test program and sample data that solve a small system multiple times.

When I run the program using the MKL DLLs from version 2018.0.3, I get the following result:

>Release\pardiso.exe _data\mat.mm _data\b.mm
Intel(R) Math Kernel Library Version 2018.0.3 Product Build 20180406 for 32-bit applications
Solving matrix file _data\mat.mm with vector data _data\b.mm.
Data: rows=445, cols=445, values=1339
MKL threads: 6
Performance: Loops=10000, Time=2.785514 sec

And now the funny stuff starts. The same program executed with the MKL DLLs from version 2016 (11.3.3) produces the following result:

>Release\pardiso.exe _data\mat.mm _data\b.mm
Intel(R) Math Kernel Library Version 11.3.3 Product Build 20160413 for 32-bit applications
Solving matrix file _data\mat.mm with vector data _data\b.mm.
Data: rows=445, cols=445, values=1339
MKL threads: 6
Performance: Loops=10000, Time=1.171534 sec

And it gets worse. The new PARDISO version 2018.0.3 uses a large amount of CPU time across multiple threads, yet it is slower than execution with a single thread!

As far as I understand, I have configured everything correctly. And as can be seen, it works fine with the old MKL from 2016.

Göttinger__Michael — Thu, 23 Aug 2018 18:31:00 GMT

For better understanding I have attached log files containing PARDISO diagnostic data. They show results from single-core and multicore runs. They also make it clear that 6 threads are really used and that, at the same time, the factorization rate (gflop/s) decreases.

This is the result from the 6-core parallel calculation:

Statistics:
===========
Parallel Direct Factorization is running on 6 OpenMP

< Linear system Ax = b >
             number of equations:           445
             number of non-zeros in A:      1339
             number of non-zeros in A (%): 0.676177

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 128
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    427
             size of largest supernode:               2
             number of non-zeros in L:                1153
             number of non-zeros in U:                672
             number of non-zeros in L+U:              1825
             gflop   for the numerical factorization: 0.000015

             gflop/s for the numerical factorization: 0.000479

Matrix Performance: Loops=1 Time=0.182937 sec

Here is the single-core result. It has a better gflop/s rate than the run with 6 cores:

Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP

< Linear system Ax = b >
             number of equations:           445
             number of non-zeros in A:      1339
             number of non-zeros in A (%): 0.676177

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 128
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    427
             size of largest supernode:               2
             number of non-zeros in L:                1153
             number of non-zeros in U:                672
             number of non-zeros in L+U:              1825
             gflop   for the numerical factorization: 0.000015

             gflop/s for the numerical factorization: 0.000532

Matrix Performance: Loops=1 Time=0.164001 sec
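
Both logs report the same amount of factorization work (0.000015 gflop) at different rates, so dividing work by rate gives a rough implied time for the numerical factorization alone. The following is a minimal sketch of that arithmetic; the helper name is hypothetical, and because the logged values are rounded as printed, the results are only approximate.

```cpp
#include <cassert>

// Hypothetical helper (not part of MKL): time implied by a work estimate
// in gflop and a rate in gflop/s, both taken from the PARDISO statistics.
double implied_factorization_seconds(double gflop, double gflop_per_sec) {
    assert(gflop_per_sec > 0.0);  // a zero rate would make the quotient meaningless
    return gflop / gflop_per_sec;
}
```

With the logged values this gives roughly 0.031 s for the 6-thread run (0.000015 / 0.000479) and roughly 0.028 s for the single-thread run (0.000015 / 0.000532). Both are only a fraction of the total times of 0.183 s and 0.164 s reported for one loop.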


Gennady_F_Intel — Fri, 24 Aug 2018 04:27:52 GMT

The problem size is too small. Do you see a similar performance regression with a bigger problem size too?


Göttinger__Michael — Fri, 24 Aug 2018 06:36:20 GMT

Gennady F. (Intel) wrote:

The problem size is too small. Do you see a similar performance regression with a bigger problem size too?

In my real application I see the same performance problem with larger systems too.

Anyway, I'll verify it in the small test program too. Please feel free to use my attached sample with any MM data file to verify it with a larger data set.

The main problem for me is that it seems to be 3 times slower in MKL 2018 than it was in MKL 2016. I'm happy to get feedback about compiler options and other settings that can be changed to get better PARDISO performance in MKL 2018 (or at least the same as it was in the past).


LRaim — Fri, 24 Aug 2018 08:54:40 GMT

Gennady F. (Intel) wrote:

The problem size is too small. Do you see a similar performance regression with a bigger problem size too?

I am interested in the performance of PARDISO for systems with around 500 equations.
So possible solutions are: a) do not use PARDISO for sparse systems with N<nnn. b) use PARDISO but set the maximum number of cores to 1. c) ......

Best regards
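
Option b) above can be sketched as a small policy function. This is only an illustration under stated assumptions: the function name and the size threshold are hypothetical, and a useful threshold would have to be measured for the matrices in question, since nothing in this thread establishes one.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical policy (not an MKL function): run small systems on one
// thread and larger ones on all available threads.
// `serial_below` is a tuning parameter that must be found by measurement.
int choose_solver_threads(std::size_t n_equations,
                          std::size_t serial_below,
                          int max_threads) {
    assert(max_threads >= 1);
    return n_equations < serial_below ? 1 : max_threads;
}

// In a program linked against MKL, the result could be applied before the
// PARDISO call with, for example:
//   mkl_domain_set_num_threads(nt, MKL_DOMAIN_PARDISO);
```

For example, with a placeholder threshold of 1000 equations, the 445-equation system from the first post would run on one thread, while a much larger system would keep all 6 threads.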


Göttinger__Michael — Fri, 24 Aug 2018 09:08:06 GMT

As mentioned in my previous post, I've run a test with a somewhat larger matrix. The matrix is now 131458x131458 with 712722 non-zero values. This is the typical size for our application.

The same performance problem shows up with MKL 2018 here too:

>Release\pardiso.exe _data\mat2.mm _data\b2.mm
Intel(R) Math Kernel Library Version 2018.0.3 Product Build 20180406 for 32-bit applications
Solving matrix file _data\mat2.mm with vector data _data\b2.mm.
Data: rows=131458, cols=131458, values=712722
MKL threads: 6
Performance: Loops=100, Time=8.250383 sec

Same system solved with MKL 2016:

>Release\pardiso.exe _data\mat2.mm _data\b2.mm
Intel(R) Math Kernel Library Version 11.3.3 Product Build 20160413 for 32-bit applications
Solving matrix file _data\mat2.mm with vector data _data\b2.mm.
Data: rows=131458, cols=131458, values=712722
MKL threads: 6
Performance: Loops=100, Time=4.823882 sec

As you can clearly see, the new MKL 2018 takes about 70% longer than the older version (8.25 sec versus 4.82 sec for 100 loops).
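
The slowdown can be read directly off the logged timings. The sketch below is a minimal illustration, assuming that each loop of the test program performs one complete solve; the `Timing` struct and both function names are hypothetical.

```cpp
#include <cassert>

// One "Performance: Loops=..., Time=... sec" line from the test program.
struct Timing {
    int loops;       // number of solves (assumed: one solve per loop)
    double seconds;  // total wall time for all loops
};

// Average time of a single solve.
double seconds_per_solve(const Timing& t) {
    assert(t.loops > 0);
    return t.seconds / t.loops;
}

// How many times longer `newer` takes than `older` per solve.
double slowdown(const Timing& newer, const Timing& older) {
    return seconds_per_solve(newer) / seconds_per_solve(older);
}
```

For the 131458x131458 system, `slowdown({100, 8.250383}, {100, 4.823882})` is about 1.71, or roughly 82.5 ms versus 48.2 ms per solve. For the 445x445 system from the first post, `slowdown({10000, 2.785514}, {10000, 1.171534})` is about 2.38.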


Andrew_Smith — Fri, 24 Aug 2018 17:19:14 GMT

I recently did the same version upgrade. I see a similar degradation in run times, BUT it now gives better accuracy on my ill-conditioned matrices and now matches IMSL and SuperLU in this respect. It was quite poor before, and accuracy is as important to me as speed.

I cannot say anything about multi-core, as I gave up on that aspect of PARDISO long ago. But it might be worth another look now.

My problem sizes are between 500 and 20000 degrees of freedom.


Gennady_F_Intel — Mon, 27 Aug 2018 10:21:28 GMT

Thanks Andrew and Michael. I managed to reproduce the issue on our side and the case has been escalated. We will keep you updated on the status.


Beccaria__Massimilia — Wed, 17 Jul 2019 17:35:59 GMT

Dears,

Has this issue in PARDISO been fixed in any of the more recent releases of MKL?

Thanks and kind regards


Beccaria__Massimilia — Tue, 23 Jul 2019 17:22:41 GMT

Dears,

Is it possible to have an update on this?

Gennady F. (Blackbelt) wrote:
Thanks Andrew and Michael. I managed to reproduce the issue on our side and the case has been escalated. We will keep you updated on the status.