Intel® oneAPI Math Kernel Library

Version 19.1 running slower than 19.0 in Pardiso solver

Ian_K_
New Contributor I

I am running a simulation model (using finite elements) that requires the solution of multiple sets of linear equations.  There can be more than 500,000 sparse equations.  I use OMP to parallelise the equation building and PARDISO to reorder and solve the equations.  Usually I can reorder once and reuse that ordering repeatedly.  My problem is that there has been a significant CPU slowdown in the solver section as I move from version 19.0 to version 19.1.

This is an extract from my time reporting for version 19.0:

Coef + Assemble time =     0.49
      Solve time     =     1.53
  TIME IN FRONT        2.04 
Coef + Assemble time =     0.50
      Solve time     =     1.52
  TIME IN FRONT        2.04 

and this is for version 19.1 using identical compiler settings:
Coef + Assemble time =     0.44
      Solve time     =     1.84
  TIME IN FRONT        2.30 
      Setup time     =     0.00
Coef + Assemble time =     0.44
      Solve time     =     1.79
  TIME IN FRONT        2.25 
I am running a Windows 10 Dell Inspiron laptop with a Core i7-8550U CPU @ 1.80 GHz and 16 GB of RAM.  I have set the system to maximum speed in the power settings.

When my colleague runs the same model on an HP desktop with a Core i7-4770 @ 3.40 GHz and 16 GB, the CPU times are faster (by better than a factor of 2) but very similar for each compiler version.

So, does anybody have an idea why?  I have tried disabling OMP, but the difference in the solver remains.  I am buffering IO and have set the model to skip most output, but the differences remain.  I have tested on problems of other sizes (350,000 and 180,000 equations) and the differences still appear.

Any suggestions or ideas to test would be welcome.

 

 

jimdempseyatthecove
Honored Contributor III

Please list any and all "unknowns".

For example, compiler optimization switches would be a good start.

Jim Dempsey

JohnNichols
Valued Contributor III

There is a difference in the performance of the chips; the HP is about 15% faster.

SSDs make a huge difference - sometimes.  So, is there a difference here?

Pardiso is still better than any other method you will find.

Ian_K_
New Contributor I

Thanks for the comments

 

The command line options used are as follows.

Compiling

/nologo /MP /O2 /QaxCORE-AVX2 /Qunroll:3 /assume:buffered_io /Qopenmp /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc150.pdb" /libs:qwin /Qmkl:parallel /c

Linking

/OUT:"x64\Release\rma2v90-64-1910-avx2-jan15.exe" /NOLOGO /MANIFEST /MANIFESTFILE:"x64\Release\rma2v90-64-1910-avx2-jan15.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:WINDOWS /IMPLIB:"C:\Users\RMADELL\source\repos\rma2v90-64\rma2v90-64\x64\Release\rma2v90-64-1910-avx2-jan15.lib"

I have tried with and without the /QaxCORE-AVX2 option; the differences remain and the speeds are very similar.  The laptop has a 512 GB drive with over 100 GB unused.  I also have a 32-bit version and the same differences occur.  The compiled executable is of the order of 100 MB.

The fact is that run times are very insensitive to optimization options, largely because most of the time is spent in PARDISO and in the tight loops forming the equations, which I have parallelized with OMP.

My main concern is why the slowdown occurs with the same options on my laptop.  The matching speeds on the other processor (which also occurred on an older Core i7) have just confused me.

I note and agree that PARDISO is one of the best.  It has served me well, but any improvement adds up for me as I often need to solve these equations more than 10,000 times per simulation.

 

 

mecej4
Honored Contributor III

If your coefficient matrix is constant, you can have Pardiso perform the L-U factorization just once, and use the factors as many times as needed with different RHS vectors. If you are not already taking advantage of this feature, you will find that with just a couple of changes to the code you will be able to obtain significant reductions in run times.
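
In outline, the call sequence is roughly as follows (just a sketch with the standard MKL PARDISO arguments, not Ian's actual code; mtype and the iparm settings would of course have to match the real matrix):

! Sketch only: factor the constant matrix once, then solve for many RHS vectors.
! CSR storage (a, ia, ja) with 1-based indexing, as used by the Fortran PARDISO interface.
subroutine factor_once_solve_many(n, ia, ja, a, nsteps)
  implicit none
  integer, intent(in) :: n, ia(*), ja(*), nsteps
  real(8), intent(in) :: a(*)

  integer(8) :: pt(64)                   ! PARDISO internal handle (INTEGER*8 on 64-bit); zero before first use
  integer    :: iparm(64), perm(1)
  integer    :: maxfct, mnum, mtype, phase, nrhs, msglvl, error, istep
  real(8)    :: b(n), x(n), ddum

  pt = 0; iparm = 0
  iparm(1) = 0                           ! take the default iparm settings
  mtype  = 11                            ! real unsymmetric (use 2 or -2 for symmetric matrices)
  maxfct = 1; mnum = 1; nrhs = 1; msglvl = 0

  phase = 12                             ! analysis + numerical factorization, done once
  call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
               nrhs, iparm, msglvl, ddum, ddum, error)

  do istep = 1, nsteps
     b = 1.0d0                           ! ... build the new RHS for this step here ...
     phase = 33                          ! solve (back-substitution) only, reusing the factors
     call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
                  nrhs, iparm, msglvl, b, x, error)
     ! ... use the solution x here ...
  end do

  phase = -1                             ! release PARDISO's internal memory
  call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
               nrhs, iparm, msglvl, ddum, ddum, error)
end subroutine factor_once_solve_many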

Ian_K_
New Contributor I

Thank you for the thought.  In this case the equations have the same structure but the coefficients change, so I can only take advantage of the previous reordering and of knowing the precise locations in the compact equation storage.  In fact, that reduces the time per step by about a factor of 3.
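
Schematically, the pattern is roughly as follows (a sketch only, not my actual code; declarations as in the snippet above, only the phase values differ):

phase = 11                             ! reordering + symbolic factorization, done once
call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
             nrhs, iparm, msglvl, ddum, ddum, error)

do istep = 1, nsteps
   ! ... rebuild the coefficients a(:) and the RHS b(:) (same sparsity pattern) ...
   phase = 23                          ! numerical factorization + solve on every step
   call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
                nrhs, iparm, msglvl, b, x, error)
end do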

JohnNichols
Valued Contributor III

I note and agree that PARDISO is one of the best.  It has served me well, but any improvement adds up for me as I often need to solve these equations more than 10,000 times per simulation.

This is the real problem with some numerical methods where you want the Monte Carlo results.  I suggest you make a small sample and put it up on the forum - this lot are really good problem solvers.

 

Ian_K_
New Contributor I

Thank you, John, for the suggestion; however, the purpose of my post was to try to understand why my particular laptop runs slower in PARDISO with version 19.1.  The main point is that this problem appears specific to my laptop, or possibly to the Core i7 processor.  The only plausible idea I have at the moment is that in version 19.1 PARDISO is not seeing as many cores/CPUs available as in version 19.0.

So my question becomes: is there any way for PARDISO to report the number of CPUs it sees from release mode?

JohnNichols
Valued Contributor III

So my question becomes: is there any way for PARDISO to report the number of CPUs it sees from release mode?

I suggest you go to the PARDISO web site and look at the documentation. 

PS - If you try the MATH.NET equivalent, you will find out what 1000 times slower looks like.

 

Gennady_F_Intel
Moderator

>> is there any way for PARDISO to report the number of CPUs it sees from release mode?

Setting msglvl = 1, you can ask PARDISO to print statistical information.

E.g., you will see in the output something like this: Parallel Direct Factorization is running on 16 OpenMP
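
In the call itself msglvl is just the 13th argument, e.g. (sketch, with the usual argument names):

msglvl = 1        ! 1 = print PARDISO statistics to standard output, 0 = silent
call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
             nrhs, iparm, msglvl, b, x, error)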

Ian_K_
New Contributor I

When I got back to this issue, I was initially unable to see the messages on the screen; then, when I ran the model in batch form and directed output to a log file, there were the messages.  Unfortunately, all I gleaned was that 4 CPUs were active and that the flop rate dropped off for version 19.1 in the solution step.  As expected, the same number of megaflops was reported (this at least means that the same logic paths were followed).  So, I am no closer to resolving the issue.  I have reported this to Intel support but that has led to no help (not even a response in the last week or so).

Kirill_V_Intel
Employee

Hello Ian!

First, could you copy here the exact output from PARDISO with msglvl=1 (for one of the runs, if you have many) for both 2019.0 and 2019.1? If you see something like "Parallel Direct Factorization is running on 4 OpenMP", OK, we know that PARDISO did go into the multi-threaded version.

Second, performance depends on a lot of factors. To exclude the machine as a factor, we'd rather run with your matrix on one of ours. Could you share with us the matrix data and the PARDISO settings you have used? If we observe the same slowdown in our runs, it's most likely a regression on the MKL side, which we will investigate. If we don't see a slowdown, we can talk more about the machine settings you're using.
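
If it helps, a plain-text dump of the CSR arrays and one RHS vector taken just before the PARDISO call would be enough for us to try to reproduce it, e.g. something along these lines (a sketch only; the variable names and the file name are placeholders):

! Write the CSR matrix (1-based ia/ja) and one RHS vector to a text file
open(newunit=iu, file='matrix_dump.txt', status='replace')
write(iu,*) n, ia(n+1) - 1              ! order and number of non-zeros
write(iu,*) (ia(i), i = 1, n+1)
write(iu,*) (ja(i), i = 1, ia(n+1)-1)
write(iu,*) (a(i),  i = 1, ia(n+1)-1)
write(iu,*) (b(i),  i = 1, n)
close(iu)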

Best,
Kirill

Gennady_F_Intel
Moderator

Ian, regarding "I have reported this to Intel support but that has led to no help (not even a response in the last week or so)": could you give us the number of this request?

Ian_K_
New Contributor I

Gennady, the case number is 04475324.

Gennady_F_Intel
Moderator

OK, I see ~20 files attached to 04475324. Which files could we try to use to get the input matrix and the reproducer?

Ian_K_
New Contributor I

Gennady,

Working forwards in time through my submittals:

Testcase.zip contains the test case with results showing the time differences.  RM2 is the executable file.

Srcrma2.zip is the source code that I compiled.  BLK1A.FOR and CULVERT.FOR are redundant, and MKL_solver.for has two versions, one for 32-bit and one for 64-bit.

Buildlog-1905.htm is a sample log from building the model.

Rma2v90-64-1905.zip and Rma2v90-64-1910.zip are executables from the two versions of the compiler.

V1910.zip is a revised test case that is shorter.  Rma2v90-64.vfproj is a sample of how I built it.

For me the main issue is why the model runs slower with version 19.1 on my machine but not on other computers.  I have tested on cases of varying complexity and the difference always occurs on my Dell laptop.

Thanks for your consideration

Ian
