Intel® oneAPI Math Kernel Library

Version 19.1 running slower than 19.0 in Pardiso solver

Ian_K_
New Contributor I

I am running a simulation model (using finite elements) that requires the solution of multiple sets of linear equations.  There can be more than 500,000 sparse equations.  I use OMP to parallelise the equation building and PARDISO to reorder and solve the equations.  Usually I can reorder once and reuse that ordering repeatedly.  My problem is that there has been a significant CPU slowdown in the solver section as I move from version 19.0 to version 19.1.

This is an extract from my time reporting for version 19.0:

Coef + Assemble time =     0.49
      Solve time     =     1.53
  TIME IN FRONT        2.04 
Coef + Assemble time =     0.50
      Solve time     =     1.52
  TIME IN FRONT        2.04 

and this is for version 19.1 using identical compiler settings:
Coef + Assemble time =     0.44
      Solve time     =     1.84
  TIME IN FRONT        2.30 
      Setup time     =     0.00
Coef + Assemble time =     0.44
      Solve time     =     1.79
  TIME IN FRONT        2.25 
I am running a Windows 10 Dell Inspiron laptop with a Core i7-8550U CPU @ 1.80 GHz and 16 GB of RAM.  I have set the system to maximum speed in the power settings.

When my colleague runs the same model on an HP desktop with a Core i7-4770 @ 3.40 GHz and 16 GB, the CPU times are faster (by better than a factor of 2) but very similar for each compiler version.

So, does anybody have an idea why?  I have tried disabling OMP, but the difference in the solver remains.  I am buffering IO and have set the model to skip most output, but the differences remain.  I have tested on problems of other sizes (350,000 and 180,000 equations) and the differences still appear.

Any suggestions or ideas to test would be welcome.

 

 

jimdempseyatthecove
Honored Contributor III

Please list any and all "unknowns".

For example, compiler optimization switches would be a good start.

Jim Dempsey

JohnNichols
Valued Contributor III

There is a difference in the performance of the chips; the HP is about 15% faster.

SSDs make a huge difference - sometimes.  So, is there a difference here?

Pardiso is still better than any other method you will find.

Ian_K_
New Contributor I

Thanks for the comments

 

The command line options used are as follows.

Compiling

/nologo /MP /O2 /QaxCORE-AVX2 /Qunroll:3 /assume:buffered_io /Qopenmp /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc150.pdb" /libs:qwin /Qmkl:parallel /c

Linking

/OUT:"x64\Release\rma2v90-64-1910-avx2-jan15.exe" /NOLOGO /MANIFEST /MANIFESTFILE:"x64\Release\rma2v90-64-1910-avx2-jan15.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:WINDOWS /IMPLIB:"C:\Users\RMADELL\source\repos\rma2v90-64\rma2v90-64\x64\Release\rma2v90-64-1910-avx2-jan15.lib"

I have tried with and without the /QaxCORE-AVX2 option; the differences remain and the speeds are very similar.  The laptop has a 512 GB drive with over 100 GB unused.  I also have a 32-bit version and the same differences occur.  The compiled executable is of the order of 100 MB.

The fact is that run times are very insensitive to optimization options, largely because most of the time is spent in PARDISO and in the tight loops forming the equations, which I have parallelized with OMP.

My main concern is why the slowdown occurs with the same options on my laptop.  The matching speeds on the other processor (which also occurred on an older Core i7) have just confused me.

I note and agree that PARDISO is one of the best.  It has served me well, but any improvement adds up for me as I often need to solve these equations more than 10,000 times per simulation.

 

 

mecej4
Honored Contributor III

If your coefficient matrix is constant, you can have Pardiso perform the L-U factorization just once, and use the factors as many times as needed with different RHS vectors. If you are not already taking advantage of this feature, you will find that with just a couple of changes to the code you will be able to obtain significant reductions in run times.
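
In outline, the call sequence is roughly as follows (just a sketch with the standard MKL PARDISO arguments, not Ian's actual code; mtype and the iparm settings would of course have to match the real matrix):

! Sketch only: factor the constant matrix once, then solve for many RHS vectors.
! CSR storage (a, ia, ja) with 1-based indexing, as used by the Fortran PARDISO interface.
subroutine factor_once_solve_many(n, ia, ja, a, nsteps)
  implicit none
  integer, intent(in) :: n, ia(*), ja(*), nsteps
  real(8), intent(in) :: a(*)

  integer(8) :: pt(64)                   ! PARDISO internal handle (INTEGER*8 on 64-bit); zero before first use
  integer    :: iparm(64), perm(1)
  integer    :: maxfct, mnum, mtype, phase, nrhs, msglvl, error, istep
  real(8)    :: b(n), x(n), ddum

  pt = 0; iparm = 0
  iparm(1) = 0                           ! take the default iparm settings
  mtype  = 11                            ! real unsymmetric (use 2 or -2 for symmetric matrices)
  maxfct = 1; mnum = 1; nrhs = 1; msglvl = 0

  phase = 12                             ! analysis + numerical factorization, done once
  call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
               nrhs, iparm, msglvl, ddum, ddum, error)

  do istep = 1, nsteps
     b = 1.0d0                           ! ... build the new RHS for this step here ...
     phase = 33                          ! solve (back-substitution) only, reusing the factors
     call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
                  nrhs, iparm, msglvl, b, x, error)
     ! ... use the solution x here ...
  end do

  phase = -1                             ! release PARDISO's internal memory
  call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
               nrhs, iparm, msglvl, ddum, ddum, error)
end subroutine factor_once_solve_many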

Ian_K_
New Contributor I

Thank you for the thought.  In this case the equations have the same structure but the coefficients change, so I can only take advantage of the previous reordering and of knowing the precise locations in the compact equation storage.  In fact, that reduces the time per step by about a factor of 3.
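
Schematically, the pattern is roughly as follows (a sketch only, not my actual code; declarations as in the snippet above, only the phase values differ):

phase = 11                             ! reordering + symbolic factorization, done once
call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
             nrhs, iparm, msglvl, ddum, ddum, error)

do istep = 1, nsteps
   ! ... rebuild the coefficients a(:) and the RHS b(:) (same sparsity pattern) ...
   phase = 23                          ! numerical factorization + solve on every step
   call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
                nrhs, iparm, msglvl, b, x, error)
end do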

JohnNichols
Valued Contributor III

I note and agree that PARDISO is one of the best.  It has served me well, but any improvement adds up for me as I often need to solve these equations more than 10,000 times per simulation.

This is the real problem with some numerical methods where you want the Monte Carlo results.  I suggest you make a small sample and put it up on the forum - this lot are really good problem solvers.

 

Ian_K_
New Contributor I

Thank you, John, for the suggestion; however, the purpose of my post was to try to understand why my particular laptop runs slower in PARDISO with version 19.1.  The main point is that this problem appears specific to my laptop, or possibly to the Core i7 processor.  The only plausible idea I have at the moment is that in version 19.1 PARDISO is not seeing as many cores/CPUs available as in version 19.0.

So my question becomes: is there any way for PARDISO to report the number of CPUs it sees from release mode?

JohnNichols
Valued Contributor III

So my question becomes: is there any way for PARDISO to report the number of CPUs it sees from release mode?

I suggest you go to the PARDISO web site and look at the documentation. 

PS - If you try the MATH.NET equivalent, you will find out what 1000 times slower looks like.

 

Gennady_F_Intel
Moderator

>> is there any way for PARDISO to report the number of CPUs it sees from release mode?

Setting msglvl = 1, you can ask PARDISO to print statistical information.

E.g., you will see in the output something like this: Parallel Direct Factorization is running on 16 OpenMP
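
In the call itself msglvl is just the 13th argument, e.g. (sketch, with the usual argument names):

msglvl = 1        ! 1 = print PARDISO statistics to standard output, 0 = silent
call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, perm, &
             nrhs, iparm, msglvl, b, x, error)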

Ian_K_
New Contributor I

When I got back to this issue, I was initially unable to see the messages on the screen; then, when I ran the model in batch form and directed output to a log file, there were the messages.  Unfortunately, all I gleaned was that 4 CPUs were active and that the flop rate dropped off for version 19.1 in the solution step.  As expected, the same number of megaflops was reported (this at least means that the same logic paths were followed).  So, I am no closer to resolving the issue.  I have reported this to Intel support but that has led to no help (not even a response in the last week or so).

Kirill_V_Intel
Employee

Hello Ian!

First, could you copy here the exact output from PARDISO with msglvl=1 (for one of the runs, if you have many) for both 2019.0 and 2019.1? If you see something like "Parallel Direct Factorization is running on 4 OpenMP", OK, we know that PARDISO did go into the multi-threaded version.

Second, performance depends on a lot of factors. To exclude the machine as a factor, we'd rather run with your matrix on one of ours. Could you share with us the matrix data and the PARDISO settings you have used? If we observe the same slowdown in our runs, it's most likely a regression on the MKL side, which we will investigate. If we don't see a slowdown, we can talk more about the machine settings you're using.
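
If it helps, a plain-text dump of the CSR arrays and one RHS vector taken just before the PARDISO call would be enough for us to try to reproduce it, e.g. something along these lines (a sketch only; the variable names and the file name are placeholders):

! Write the CSR matrix (1-based ia/ja) and one RHS vector to a text file
open(newunit=iu, file='matrix_dump.txt', status='replace')
write(iu,*) n, ia(n+1) - 1              ! order and number of non-zeros
write(iu,*) (ia(i), i = 1, n+1)
write(iu,*) (ja(i), i = 1, ia(n+1)-1)
write(iu,*) (a(i),  i = 1, ia(n+1)-1)
write(iu,*) (b(i),  i = 1, n)
close(iu)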

Best,
Kirill

Gennady_F_Intel
Moderator

Ian, regarding "I have reported this to Intel support but that has led to no help (not even a response in the last week or so)": could you give us the number of this request?

Ian_K_
New Contributor I

Gennady, the case number is 04475324.

Gennady_F_Intel
Moderator

OK, I see ~20 files attached to 04475324. Which files could we try to use to get the input matrix and the reproducer?

Ian_K_
New Contributor I

Gennady,

Working forwards in time through my submittals:

Testcase.zip contains the test case with results showing the time differences.  RM2 is the executable file.

Srcrma2.zip is the source code that I compiled.  BLK1A.FOR and CULVERT.FOR are redundant, and MKL_solver.for has two versions, one for 32-bit and one for 64-bit.

Buildlog-1905.htm is a sample log from building the model.

Rma2v90-64-1905.zip and Rma2v90-64-1910.zip are executables from the two versions of the compiler.

V1910.zip is a revised test case that is shorter.  Rma2v90-64.vfproj is a sample of how I built it.

For me the main issue is why the model runs slower with version 19.1 on my machine but not on other computers.  I have tested on cases of varying complexity and the difference always occurs on my Dell laptop.

Thanks for your consideration

Ian
