Intel® oneAPI Math Kernel Library

Pardiso sometimes slow reordering with weighted matching

Essex_E_
Beginner

I have two matrices with the same sparsity structure but different values that take significantly different amounts of time (7 s vs. 37 s) during the reordering phase. The diagnostics report that all of the time difference is inside "Time spent in allocation of internal data structures (malloc)". I find the time difference surprising, and I find it surprising that it shows up in 'malloc'.

Is this all expected? Is there something I can do to increase performance here?

If I turn off weighted matching, the difference goes away and reordering is faster in both cases. I would simply turn weighted matching off, but I have found it's necessary to get accurate solutions on many of the other problems our users construct.

The matrices should be quite similar. They are nearby timesteps of a dynamic non-linear FEM simulation of an elastic cloth hanging under gravity. 

If I run the same physical scenario on a low resolution cloth (20'000 triangles instead of 180'000 triangles), the slowdown doesn't appear.

My iparm settings are all zero, except for these:

iparm[0] = 1; // use my settings
iparm[1] = 2; // METIS
iparm[9] = 8; // pivot perturbation of 1.0E-8
iparm[10] = 1; // scaling
iparm[12] = 1; // weighted matching
iparm[17] = -1; // report number of non-zeros in the factors
iparm[20] = 1;  // Bunch-Kaufman pivoting
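
For context, a stripped-down sketch of the analysis/reordering-only call (phase 11) looks roughly like this; the CSR arrays ia/ja/a and the dimension n are placeholders rather than my real data, and msglvl = 1 is what produces the statistics shown below:

#include <stdio.h>
#include "mkl_pardiso.h"
#include "mkl_types.h"

/* Analysis/reordering only (phase 11). ia/ja/a hold the upper triangle of the
   matrix in 1-based CSR format; n is the number of equations (placeholders). */
void reorder_only(MKL_INT n, MKL_INT *ia, MKL_INT *ja, double *a)
{
    void    *pt[64] = { 0 };      /* PARDISO internal handle, must start zeroed */
    MKL_INT iparm[64] = { 0 };
    MKL_INT maxfct = 1, mnum = 1, nrhs = 1, error = 0;
    MKL_INT mtype = -2;           /* real symmetric indefinite                  */
    MKL_INT phase = 11;           /* analysis / reordering only                 */
    MKL_INT msglvl = 1;           /* print the statistical report               */
    MKL_INT idum = 0;             /* dummy permutation vector                   */
    double  ddum = 0.0;           /* dummy right-hand side / solution           */

    iparm[0]  = 1;   /* use user-supplied iparm values                */
    iparm[1]  = 2;   /* METIS nested-dissection reordering            */
    iparm[9]  = 8;   /* pivot perturbation of 1.0E-8                  */
    iparm[10] = 1;   /* scaling                                       */
    iparm[12] = 1;   /* weighted matching                             */
    iparm[17] = -1;  /* report number of non-zeros in the factors     */
    iparm[20] = 1;   /* Bunch-Kaufman pivoting                        */

    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n,
            a, ia, ja, &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);
    if (error != 0)
        printf("PARDISO reordering failed with error %d\n", (int)error);
}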

Reordering phase for the "fast" matrix:

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.051754 s
Time spent in reordering of the initial matrix (reorder)         : 0.002212 s
Time spent in symbolic factorization (symbfct)                   : 0.527827 s
Time spent in data preparations for factorization (parlist)      : 0.034113 s
Time spent in allocation of internal data structures (malloc)    : 6.139631 s
Time spent in additional calculations                            : 0.451844 s
Total time spent                                                 : 7.207380 s


Reordering phase for the "slow" matrix:

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.051663 s
Time spent in reordering of the initial matrix (reorder)         : 0.002180 s
Time spent in symbolic factorization (symbfct)                   : 0.529333 s
Time spent in data preparations for factorization (parlist)      : 0.033952 s
Time spent in allocation of internal data structures (malloc)    : 35.833652 s
Time spent in additional calculations                            : 0.451910 s
Total time spent                                                 : 36.902690 s

 

The matrix statistics after the reordering phase are exactly the same for both matrices. The factorization and solution phases also have almost exactly the same runtimes for both matrices.

Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP

< Linear system Ax = b >
             number of equations:           1351803
             number of non-zeros in A:      9191733
             number of non-zeros in A (%): 0.000503

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 128
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    1115919
             size of largest supernode:               2555
             number of non-zeros in L:                93923643
             number of non-zeros in U:                1
             number of non-zeros in L+U:              93923644
Summary: ( factorization phase )
================
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 1.511711 s
Time spent in allocation of internal data structures (malloc)    : 0.000519 s
Time spent in additional calculations                            : 0.000004 s
Total time spent                                                 : 1.512234 s

Summary: ( solution phase )
================
Time spent in direct solver at solve step (solve)                : 0.229972 s
Time spent in additional calculations                            : 0.499647 s
Total time spent                                                 : 0.729619 s

I'm running Windows 10, compiling and linking with MSVC using toolset Visual Studio 2015 (v140). I'm not sure how to double-check which MKL version I'm using. MKL is in a folder named "compilers_and_libraries_2016.4.246". I have "Intel Parallel Studio 2016 Update 4 Professional Edition for Windows" installed. I have "Intel VTune Amplifier 2017 for Windows Update 4" installed. The only download that Intel Software Manager recommends to me is an update for Parallel Studio 2015, which I ignore because I think it's old.
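
If I understand the MKL service functions correctly, the version the program actually links against can also be printed at runtime with mkl_get_version_string; something like this throwaway check should do it (the buffer size and program scaffolding are just placeholders):

#include <stdio.h>
#include "mkl.h"   /* declares mkl_get_version_string() */

int main(void)
{
    char buf[256];
    mkl_get_version_string(buf, (int)sizeof(buf));
    printf("%s\n", buf);   /* e.g. "Intel(R) Math Kernel Library Version 2017.0.3 ..." */
    return 0;
}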

My compile line looks like:

/Yu"stdafx.h" /MP /GS /W3 /Gy /Zc:wchar_t /I"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\tbb\include" /I"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\mkl\include" /Zi /Gm- /O2 /sdl /Fd"x64\Develop\vc140.pdb" /Zc:inline /fp:precise ... more defines ... /D "NDEBUG" /D "_CONSOLE" /D "WIN32_LEAN_AND_MEAN" /D "NOMINMAX" /D "_USE_MATH_DEFINES" /D "_SCL_SECURE_NO_WARNINGS" /D "_CRT_SECURE_NO_WARNINGS" /errorReport:prompt /WX- /Zc:forScope /Gd /Oi /MD /openmp- /Fa"x64\Develop\" /EHsc /nologo /Fo"x64\Develop\" /Fp"x64\Develop\solveLinearSystem.pch" 

My link line looks like:

/OUT:"x64\Develop\solveLinearSystem.exe" /MANIFEST /NXCOMPAT /PDB:"x64\Develop\solveLinearSystem.pdb" /DYNAMICBASE ... some libs ... "tbb.lib" ... some libs ... "mkl_core.lib" "mkl_tbb_thread.lib" "mkl_intel_lp64.lib" ... some libs ... "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "comdlg32.lib" "advapi32.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "odbc32.lib" "odbccp32.lib" /DEBUG /MACHINE:X64 /OPT:NOREF /INCREMENTAL /PGD:"x64\Develop\solveLinearSystem.pgd" /SUBSYSTEM:CONSOLE /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /ManifestFile:"x64\Develop\solveLinearSystem.exe.intermediate.manifest" /OPT:NOICF /ERRORREPORT:PROMPT /NOLOGO /LIBPATH:"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\tbb\lib\intel64\vc14" /LIBPATH:"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\mkl\lib\intel64_win" /LIBPATH:"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\compiler\lib\intel64_win" /TLBID:1 

I'm happy to share the matrices if that's helpful. I have them in plain text, zipped to ~60MB each.

Essex_E_
Beginner

I just updated MKL:

mkl_version.h says:

#define __INTEL_MKL_BUILD_DATE 20170413

#define __INTEL_MKL__ 2017
#define __INTEL_MKL_MINOR__ 0
#define __INTEL_MKL_UPDATE__ 3

#define INTEL_MKL_VERSION 20170003

Results are unchanged.

Ying_H_Intel
Employee

Hi Essex, 

If possible, could you please tell us the machine's CPU type and share the input matrices with us so we can debug the issue?

If the data is private, you can send us a private message or submit the issue at http://supporttickets.intel.com/

Best Regards,

Ying 

Essex_E_
Beginner

CPU: Intel Core i7-5960X @ 3.00 GHz. Detailed information about the entire computer is in this text file.

I zipped the matrices (using 7zip) and put them on Google Drive: the slow matrix and the fast matrix. They are both in plain-text format with 1 line listing the number of rows, 1 line listing the number of columns, and then the remaining lines list triplets of ROW, COL, VALUE.  The tests above were done by loading these files into a little test program then factoring/solving them, so these are exactly the matrices I was testing above.
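
In case it saves anyone a step, a minimal loader for that triplet format could look roughly like this (the file name, error handling, and growth strategy are just placeholders; converting the triplets into the CSR arrays PARDISO expects is a separate step):

#include <stdio.h>
#include <stdlib.h>

/* Reads the plain-text format described above: one line with the number of
   rows, one line with the number of columns, then "row col value" triplets. */
int load_triplets(const char *path, int *nrows, int *ncols,
                  int **rows, int **cols, double **vals, long *nnz)
{
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (fscanf(f, "%d %d", nrows, ncols) != 2) { fclose(f); return -1; }

    long cap = 1024, n = 0;
    int *r = malloc(cap * sizeof *r);
    int *c = malloc(cap * sizeof *c);
    double *v = malloc(cap * sizeof *v);

    int ri, ci;
    double x;
    while (fscanf(f, "%d %d %lf", &ri, &ci, &x) == 3) {
        if (n == cap) {              /* grow the arrays as needed */
            cap *= 2;
            r = realloc(r, cap * sizeof *r);
            c = realloc(c, cap * sizeof *c);
            v = realloc(v, cap * sizeof *v);
        }
        r[n] = ri; c[n] = ci; v[n] = x; ++n;
    }
    fclose(f);
    *rows = r; *cols = c; *vals = v; *nnz = n;
    return 0;
}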

Essex_E_
Beginner

On an unrelated topic, I have another problem (link) that I keep trying to post in this forum, but it doesn't show up anywhere. Can someone explain to me how to add a new topic in this forum? (The "new topic" button isn't doing it for me.) Alternatively, can a mod please move my post into this forum? This obviously isn't the right place for this question, but since I can't make a new post, I don't know where else to ask.

Paul_N_2
Beginner

I also have an issue with the weighted matching in the Pardiso solver.  And, like the current poster, my topic also does not show up on the MKL forum.

Ying_H_Intel
Employee

Hi Essex.

Thank you for the test cases. We will look into them.

Hi Paul,

It seems you haven't set the Intel Math Kernel Library product category. You may also create a ticket at https://supporttickets.intel.com/?lang=en-US, which is Intel's official software support channel.

Best Regards,

Ying

 

Ying_H_Intel
Employee

Hi Essex,

We replied to your thread at https://software.intel.com/en-us/node/738341 and at https://supporttickets.intel.com/?lang=en-US.

Not sure if you have seen them, so I am attaching the replies here again.

Here are some more details and hopefully a resolution for now. When I looked more closely at your matrix sparsity pattern, I observed that it looks similar to the structure of some saddle point problems, where we often have a block system with one or more of the main diagonal blocks being zero blocks. For example:

M = [ A    B ]
    [ B^T  0 ]

You are treating your matrix as a real symmetric indefinite matrix (mtype = -2), which it is, but weighted matching with scaling is known to have some trouble with this type of matrix. The goal is to get large elements close to the diagonal, but when the matrix has these zero blocks, sometimes it works and sometimes it doesn't. The workaround is to instead treat the matrix as a real unsymmetric matrix (mtype = 11). This means you need to read in all the elements of each row for the CSR format instead of just the upper triangular part.

Otherwise, use the same parameters as before and it should handle the zero block without issue:

iparm[0] = 1;   // use user-provided values
iparm[1] = 2;   // METIS reordering
iparm[9] = 8;   // 1e-8 pivoting perturbation
iparm[10] = 1;  // enable scaling
iparm[12] = 1;  // enable weighted matching
iparm[17] = -1; // output nnz in factors
iparm[20] = 1;  // use Bunch-Kaufman pivoting
iparm[26] = 1;  // check the input matrix

You can observe the number of perturbed pivots from this weighted matching in iparm(14) (iparm[13] for C code); see the iparm(10) description. The unsymmetric algorithm can handle the zeros on the diagonal, and so I observe the number of perturbed pivots to be ~500 with mtype = -2 but 0 with mtype = 11. Your residual will now be in the range desired. Mine is around ~1e-8.
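
For illustration only, expanding the stored upper triangle into full unsymmetric CSR storage could look roughly like the sketch below. This is not an MKL routine, just a sketch assuming zero-based CSR input (iparm[34] = 1); shift the indices accordingly if you keep one-based storage:

#include <stdlib.h>

/* Expands an upper-triangular CSR matrix (zero-based indices, columns sorted
   within each row) into full CSR storage suitable for mtype = 11.
   Output arrays are allocated here; the caller frees them. */
void upper_to_full(int n, const int *ia, const int *ja, const double *a,
                   int **ia_full, int **ja_full, double **a_full)
{
    int i, k;
    int *count = calloc(n + 1, sizeof *count);

    /* Count the entries of the full matrix in each row. */
    for (i = 0; i < n; ++i)
        for (k = ia[i]; k < ia[i + 1]; ++k) {
            count[i + 1]++;                        /* the stored (i, j) entry   */
            if (ja[k] != i) count[ja[k] + 1]++;    /* its mirrored (j, i) entry */
        }

    int *rp = malloc((n + 1) * sizeof *rp);        /* row pointers (prefix sum) */
    rp[0] = 0;
    for (i = 0; i < n; ++i) rp[i + 1] = rp[i] + count[i + 1];

    int *cols = malloc(rp[n] * sizeof *cols);
    double *vals = malloc(rp[n] * sizeof *vals);
    int *next = malloc(n * sizeof *next);          /* insertion cursor per row  */
    for (i = 0; i < n; ++i) next[i] = rp[i];

    /* Scatter each stored entry into its own row and into the mirrored row.
       Because the outer loop runs over rows in increasing order, the columns
       in each output row come out sorted. */
    for (i = 0; i < n; ++i)
        for (k = ia[i]; k < ia[i + 1]; ++k) {
            int j = ja[k];
            cols[next[i]] = j; vals[next[i]] = a[k]; next[i]++;
            if (j != i) { cols[next[j]] = i; vals[next[j]] = a[k]; next[j]++; }
        }

    free(count); free(next);
    *ia_full = rp; *ja_full = cols; *a_full = vals;
}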

I hope this helps!

Spencer

Ying_H_Intel
Employee

Hi Essex, Paul,

Regarding the reordering with weighted matching, please see Spencer's investigation notes:

I have played around with this timing issue. The two matrices do have the same sparsity structure, but when we use weighted matching the values also come into play. This is expected behavior with our current implementation of weighted matching. The goal is to move large elements of the matrix to the diagonal, so it does take the actual values of the matrix into account. This explains the difference between the two matrices when using weighted matching.

 

These matrices are somewhat large (1.3 million x 1.3 million with 9 million entries), so creating the internal data structures to house that does add up. The slow matrix is doing a lot more internal work to get those large elements to the diagonal, and moving large amounts of data around can be expensive time-wise.

 

I tested this timing with mtype = -2 and mtype = 11. The result is consistent: one matrix requires more time than the other, so there is no help from that change.

 

So generally speaking,  this is expected behavior and there really isn’t much we can do at the moment to speed it up. 

You may turn weighted matching off, and the accuracy problem mentioned above may be resolved by using the other matrix type (mtype = 11).

 

Hope this helps.

Ying

Essex_E_
Beginner

Okay, thanks for looking into it. 
