Hi,
Once again I'm having some trouble with the Direct Sparse Solver for Clusters. I am getting the following error when running on a single process:
entering matrix solver
*** Error in PARDISO ( insufficient_memory) error_num= 1
*** Error in PARDISO memory allocation: MATCHING_REORDERING_DATA, allocation of 1 bytes failed
total memory wanted here: 142 kbyte

=== PARDISO: solving a real structurally symmetric system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON

Summary: ( reordering phase )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.000005 s
Time spent in reordering of the initial matrix (reorder) : 0.000000 s
Time spent in symbolic factorization (symbfct) : 0.000000 s
Time spent in allocation of internal data structures (malloc) : 0.000465 s
Time spent in additional calculations : 0.000080 s
Total time spent : 0.000550 s
Statistics:
===========
Parallel Direct Factorization is running on 1 OpenMP
< Linear system Ax = b >
number of equations: 6
number of non-zeros in A: 8
number of non-zeros in A (%): 22.222222
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 128
number of independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
number of supernodes: 0
size of largest supernode: 0
number of non-zeros in L: 0
number of non-zeros in U: 0
number of non-zeros in L+U: 0

ERROR during solution: 4294967294
It just hangs when running on a single process. Below are the CSR form of my matrix and the provided RHS to solve for.
CSR row values
0
2
6
9
12
16
18
CSR col values
0
1
0
1
2
3
1
2
4
1
3
4
2
3
4
5
4
5
Rank 0 RHS vector:
1
0
0
0
0
1
Now my calling file looks like:
void SolveMatrixEquations(MKL_INT numRows, MatrixPointerStruct &cArrayStruct,
                          const std::pair<MKL_INT, MKL_INT> &rowExtents)
{
    double pressureSolveTime = -omp_get_wtime();
    MKL_INT mtype = 1;    /* set matrix type to "real structurally symmetric" */
    MKL_INT nrhs = 1;     /* number of right hand sides */
    void *pt[64] = { 0 }; /* internal memory pointer */

    /* Cluster Sparse Solver control parameters. */
    MKL_INT iparm[64] = { 0 };
    MKL_INT maxfct, mnum, phase = 13, msglvl, error;

    /* Auxiliary variables. */
    float ddum;   /* float dummy */
    MKL_INT idum; /* integer dummy */
    MKL_INT i, j;

    /* -------------------------------------------------------------------- */
    /* .. Init MPI.                                                         */
    /* -------------------------------------------------------------------- */
    int mpi_stat = 0;
    int comm, rank;
    mpi_stat = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    comm = MPI_Comm_c2f(MPI_COMM_WORLD);

    /* -------------------------------------------------------------------- */
    /* .. Setup Cluster Sparse Solver control parameters.                   */
    /* -------------------------------------------------------------------- */
    iparm[0] = 0;   /* Solver default parameters overridden with those provided in iparm */
    iparm[1] = 3;   /* Use METIS for fill-in reordering */
    //iparm[1] = 10; /* Use parMETIS for fill-in reordering */
    iparm[5] = 0;   /* Write solution into x */
    iparm[7] = 2;   /* Max number of iterative refinement steps */
    iparm[9] = 8;   /* Perturb the pivot elements with 1E-13 */
    iparm[10] = 0;  /* Don't use non-symmetric permutation and scaling MPS */
    iparm[12] = 0;  /* Switch on Maximum Weighted Matching algorithm (default for non-symmetric) */
    iparm[17] = 0;  /* Output: number of non-zeros in the factor LU */
    iparm[18] = 0;  /* Output: Mflops for LU factorization */
    iparm[20] = 0;  /* Change pivoting for use in symmetric indefinite matrices */
    iparm[26] = 1;
    iparm[27] = 0;  /* Single precision mode of Cluster Sparse Solver */
    iparm[34] = 1;  /* Cluster Sparse Solver uses C-style indexing for ia and ja arrays */
    iparm[39] = 2;  /* Input: matrix/rhs/solution stored on master */
    iparm[40] = rowExtents.first + 1;
    iparm[41] = rowExtents.second + 1;

    maxfct = 3;     /* Maximum number of numerical factorizations */
    mnum = 1;       /* Which factorization to use */
    msglvl = 1;     /* Print statistical information */
    error = 0;      /* Initialize error flag */
    //cout << "Rank " << rank << ": " << iparm[40] << " " << iparm[41] << endl;

#ifdef UNIT_TESTS
    //msglvl = 0;
#endif

    phase = 11;
#ifndef UNIT_TESTS
    if (rank == 0) printf("Restructuring system...\n");
    cout << "Restructuring system...\n" << endl;
#endif
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &numRows, &ddum,
                          cArrayStruct.rowIndexArray, cArrayStruct.colIndexArray,
                          &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &comm, &error);
    if (error != 0) {
        cout << "\nERROR during solution: " << error << endl;
        exit(error);
    }

    phase = 23;
#ifndef UNIT_TESTS
    //if (rank == 0) printf("\nSolving system...\n");
    printf("\nSolving system...\n");
#endif
    cluster_sparse_solver_64(pt, &maxfct, &mnum, &mtype, &phase, &numRows,
                             cArrayStruct.valArray, cArrayStruct.rowIndexArray,
                             cArrayStruct.colIndexArray, &idum, &nrhs, iparm, &msglvl,
                             cArrayStruct.rhsVector, cArrayStruct.pressureSolutionVector,
                             &comm, &error);
    if (error != 0) {
        cout << "\nERROR during solution: " << error << endl;
        exit(error);
    }

    phase = -1; /* Release internal memory. */
    cluster_sparse_solver_64(pt, &maxfct, &mnum, &mtype, &phase, &numRows, &ddum,
                             cArrayStruct.rowIndexArray, cArrayStruct.colIndexArray,
                             &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &comm, &error);
    if (error != 0) {
        cout << "\nERROR during release memory: " << error << endl;
        exit(error);
    }

    /* Check residual */
    pressureSolveTime += omp_get_wtime();
#ifndef UNIT_TESTS
    //cout << "Pressure Solve Time: " << pressureSolveTime << endl;
#endif
    //TestPrintCsrMatrix(cArrayStruct, rowExtents.second - rowExtents.first + 1);
}
This is based on the format of one of the examples. I am trying to use the ILP64 interface because my real system is very large (16 billion non-zeros). I am using the Intel C++ Compiler 2017 as part of the Intel Composer XE Cluster Edition Update 1, and the following link lines in my CMake files:
TARGET_COMPILE_OPTIONS(${MY_TARGET_NAME} PUBLIC "-mkl:cluster" "-DMKL_ILP64" "-I$ENV{MKLROOT}/include")
TARGET_LINK_LIBRARIES(${MY_TARGET_NAME} "-Wl,--start-group $ENV{MKLROOT}/lib/intel64/libmkl_intel_ilp64.a $ENV{MKLROOT}/lib/intel64/libmkl_intel_thread.a $ENV{MKLROOT}/lib/intel64/libmkl_core.a $ENV{MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl")
What is interesting is that this same code runs perfectly fine on my Windows development machine; porting it to my Linux cluster is causing the issues. Any ideas?
I am currently waiting on the terribly long download of the Update 4 Composer XE package, but I don't have much hope of that fixing it, because this code used to run fine on this system.
I'm having a similar problem with the function mkl_dcsrcoo().
Input COO matrix (row, column, value):
0 0 1
0 1 0
1 0 52745.6
1 1 -135815
1 2 41534.7
1 3 41534.7
2 1 41534.7
2 2 -83069.4
2 4 41534.7
3 1 41534.7
3 3 -83069.4
3 4 41534.7
4 2 41534.7
4 3 41534.7
4 4 -135815
4 5 52745.6
5 4 52745.6
5 5 -52745.6
Output CSR row indexes: 17179869184 30064771079 30064771079 7 0 0 0
Hi William,
I checked the main program with your input parameters. I haven't seen the exact same error as yours, but there do seem to be some issues in these parameters.
For example, iparm[34] = 1 means 0-based indexing ("Cluster Sparse Solver use C-style indexing for ia and ja arrays"), but in the solver's output it reports 1-based:
=== PARDISO: solving a real structurally symmetric system ===
1-based array indexing is turned ON
I attached the main code. Please check it and let me know whether it works in your environment.
Best Regards,
Ying
build command :
yhu5@kbl01-ub:~/Cluster_pardiso/cluster_sparse_solverc$ mpiicc -Wall -DMKL_ILP64 -I/opt/intel/compilers_and_libraries_2018.0.098/linux/mkl/include w_solver.cpp -Wl,--start-group "/opt/intel/compilers_and_libraries_2018.0.098/linux/mkl/lib/intel64"/libmkl_blacs_intelmpi_ilp64.a "/opt/intel/compilers_and_libraries_2018.0.098/linux/mkl/lib/intel64"/libmkl_intel_ilp64.a "/opt/intel/compilers_and_libraries_2018.0.098/linux/mkl/lib/intel64"/libmkl_core.a "/opt/intel/compilers_and_libraries_2018.0.098/linux/mkl/lib/intel64"/libmkl_intel_thread.a -Wl,--end-group -L "/opt/intel/compilers_and_libraries_2018.0.098/linux/mkl/../compiler/lib/intel64" -liomp5 -mt_mpi -lm
and run command:
yhu5@kbl01-ub:~/Cluster_pardiso/cluster_sparse_solverc$ mpirun -n 1 ./a.out
Restructuring system...
=== PARDISO: solving a real structurally symmetric system ===
Matrix checker is turned ON
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Summary: ( reordering phase )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.000023 s
Time spent in reordering of the initial matrix (reorder) : 0.000041 s
Time spent in symbolic factorization (symbfct) : 0.000270 s
Time spent in data preparations for factorization (parlist) : 0.000000 s
Time spent in allocation of internal data structures (malloc) : 0.000052 s
Time spent in additional calculations : 0.000015 s
Total time spent : 0.000401 s
Statistics:
===========
Parallel Direct Factorization is running on 4 OpenMP
< Linear system Ax = b >
number of equations: 6
number of non-zeros in A: 18
number of non-zeros in A (%): 50.000000
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 128
number of independent subgraphs: 0
number of supernodes: 3
size of largest supernode: 3
number of non-zeros in L: 19
number of non-zeros in U: 5
number of non-zeros in L+U: 24
Reordering completed ...
Solving system...
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===
Percentage of computed non-zeros for LL^T factorization
42 % 52 % 100 %
=== PARDISO: solving a real structurally symmetric system ===
Single-level factorization algorithm is turned ON
Summary: ( starting phase is factorization, ending phase is solution )
================
Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 0.000078 s
Time spent in direct solver at solve step (solve) : 0.000021 s
Time spent in allocation of internal data structures (malloc) : 0.000036 s
Time spent in additional calculations : 0.000001 s
Total time spent : 0.000136 s
Statistics:
===========
Parallel Direct Factorization is running on 4 OpenMP
< Linear system Ax = b >
number of equations: 6
number of non-zeros in A: 18
number of non-zeros in A (%): 50.000000
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 128
number of independent subgraphs: 0
number of supernodes: 3
size of largest supernode: 3
number of non-zeros in L: 19
number of non-zeros in U: 5
number of non-zeros in L+U: 24
gflop for the numerical factorization: 0.000000
gflop/s for the numerical factorization: 0.000718
The solution of the system is:
x [0] = 0.000000
x [1] = 0.000000
x [2] = 0.000000
x [3] = 0.000000
x [4] = 0.000000
x [5] = 0.000000
Relative residual = -nan
TEST PASSED
OK, so I tried running your solver, and it works in its current state. But setting iparm[1] = 10 causes a failure in the messaging system, although I am still getting a 0 residual. This is strange to me.
Also, the code seems to work fine on smaller matrices of my data. It fails on larger matrices; this was my motivation for sending test data.
I also have some more sample codes, from the Intel MKL examples, that fail when iparm[1] = 10. I am using the 2017 Update 4 compilers, not the 2018 edition.
My compile line is similar to yours; I just changed it to use the environment variable MKLROOT:
Serial Version
compile line - mpiicpc -Wall -DMKL_ILP64 -I$MKLROOT/include cl_solver_unsym_c.c -Wl,--start-group "$MKLROOT/lib/intel64"/libmkl_blacs_intelmpi_ilp64.a "$MKLROOT/lib/intel64"/libmkl_intel_ilp64.a "$MKLROOT/lib/intel64"/libmkl_core.a "$MKLROOT/lib/intel64"/libmkl_intel_thread.a -Wl,--end-group -L "$MKLROOT/../compiler/lib/intel64" -liomp5 -mt_mpi -lm
Output -
ERROR during symbolic factorization: -2
TEST FAILED
Distributed Version
compile line - mpiicpc -Wall -DMKL_ILP64 -I$MKLROOT/include cl_solver_unsym_distr_c.c -Wl,--start-group "$MKLROOT/lib/intel64"/libmkl_blacs_intelmpi_ilp64.a "$MKLROOT/lib/intel64"/libmkl_intel_ilp64.a "$MKLROOT/lib/intel64"/libmkl_core.a "$MKLROOT/lib/intel64"/libmkl_intel_thread.a -Wl,--end-group -L "$MKLROOT/../compiler/lib/intel64" -liomp5 -mt_mpi -lm
Output -
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 43904 RUNNING AT smic1
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764
===================================================================================
If this works on your machine, then there might be a problem with the underlying system, which I have minimal control over. If that is the case, does the Intel compiler use system-installed libraries for C++/C/Fortran? If so, what versions do you have? Also, what version of Linux? We are on Red Hat Enterprise Linux.
Hi William,
Just to let you know, I can reproduce the problems you reported.
For example:
Building cl_solver_unsym_distr_c.c gets "= EXIT CODE: 11".
Building cl_solver_unsym_c.c gets "ERROR during symbolic factorization: -2" and "TEST FAILED".
The issue was escalated to our developers; I will keep you updated if there is any news.
Best Regards,
Ying
Ying,
Sorry to bump this, but I am getting to the point of no return on my project. Do you have any updates on the progress of this fix? If it is not going to be repaired soon, I will need to switch solvers (and rework a significant portion of my program).
Thanks,
Will
Hi Will,
The issue is fixed. It is targeted to be released in MKL 2018 Update 1. You may watch for the announcement of the release in the forum.
If you have any questions, please go to the online service center, http://www.intel.com/supporttickets, for more information.
Thanks
Ying
I will try to join up with the beta program. This would save me a large headache.
OK, last question, related to the beta program: how long before this fix will appear in the beta releases? I just installed the beta compilers and am still getting the same error with your example.
William,
The beta program finished one month ago. Right now we are in the final stage of preparing the newest version of MKL 2018, which we are planning to release within a few weeks. We will post an announcement at the top of this forum when that happens.
wbr, Gennady
OK, last bump, I promise.
Is there any chance the patch was released as part of 2017 Update 5? I am kind of under the gun with completing my project, and I think the Intel solver is the only way to do it.
Hi William,
I checked again. The issue is supposed to be fixed in 2018 Update 1, which should be ready in November. Is that OK for your project?
If it is very urgent, please create a ticket in the official Online Service Center: https://software.intel.com/en-us/support/online-service-center.
Best Regards,
Ying
OK, I ran the posted examples that are broken, again using the 2018 Update 1 Parallel Studio XE Cluster Edition. They failed on both Windows and Linux. Are we sure the patch got released?