Using MKL functionality to replace IPP small-matrix QR solver

Weiss__Daniel · ‎12-05-2016

Hi,

moving from IPP 7.0 to IPP 2017, I have to replace the discontinued small-matrix functions ippmQRDecomp_m_64f() and ippmQRBackSubst_mv_64f() with suitable functionality from the MKL (as per Intel suggestion). I am using these functions to solve min(||A*x - b||_2) problems.

With IPP 7.0, I have (a) called ippmQRDecomp_m_64f() to QR-decompose the matrix A, (b) stored the resulting decomposition, and (c) repeatedly used the decomposition afterwards (calling ippmQRBackSubst_mv_64f()) to solve for specific vectors b.

For performance reasons, I want to maintain this approach of doing the decomposition only once, and using it repeatedly afterwards, so probably calling the ?gels driver routine is suboptimal. The MKL manpage for ?geqrf suggests calling the MKL functions LAPACKE_dgeqrf, LAPACKE_dormqr and cblas_dtrsm to solve linear least-squares problems.

I haven't used LAPACK before and didn't find a sample for this on the net. Is there sample code showing how to do this?

Thanks for your help,
Daniel

mecej4 · ‎12-05-2016

Please see https://software.intel.com/en-us/node/521004#1AE9C452-7029-425E-90DB-1E4A962E2069 . You can do the decomposition once with a call to ?geqrf(), and then repeatedly call ORMQR to compute y = Q^Tb for each vector b as you see it, followed by a call to ?trsm() to solve R x = y. You can also assemble the b vectors into a matrix B, and compute Y = Q^TB and solve R X = Y in a "solve them all when ready" approach.

Weiss__Daniel · ‎12-05-2016

Hi,

thank you so much. As I said, I did find the ?geqrf manpage where it says all that. I was hoping to find some sample code, especially for ?trsm, where the manpage is somewhat confusing. For example, is the output generated by ?geqrf unit triangular? ?trsm wants to know.

Thank you,

Daniel

mecej4 · ‎12-06-2016

There is an example in Fortran in the MKL examples/lapackf/ directory.

The source file is "source/sormqrx.f" and the data is "data/sormqrx.d". There is a utility library in examples/lapackf/lib that is needed by this example.

You may be find a similar example in C by searching the Web. One hitch is that the Fortran routines SGEQRF and SORMQR are in Lapack, whereas STRSM is in BLAS, so the corresponding C routines may be split between Lapack_e and CBLAS.

For example, is the output generated by ?geqrf unit triangular?

The Q-R factorization produces an orthogonal matrix Q, i.e., Q satisfies Q^TQ = I. If c is a scalar, and A₁ = c A, the Q-R factorizations of A and A₁ produce the same Q. The scaling is completely contained in the R factor. Therefore, R does not have a unit diagonal except in special cases.

mecej4 · ‎12-06-2016

I assembled an example in C for you. Here is the test program, which does a two-part calculation. In the first part, the Q-R factorization is performed, and then used to obtain the solutions for multiple right hand sides. In the second part, the previously computed factors are used repeatedly to obtain solutions for different (possibly new) right hand sides, one at a time.

#include <stdlib.h>
#include <stdio.h>
#include "mkl_lapacke.h"
#include "mkl.h"

/* Auxiliary routines prototypes */
extern void print_matrix( char* desc, MKL_INT m, MKL_INT n, float* a, MKL_INT lda );
extern void print_vector_norm( char* desc, MKL_INT m, MKL_INT n, float* a, MKL_INT lda );

/* Parameters */
#define M 6
#define N 4
#define NRHS 2
#define LDA N
#define LDB NRHS

/* Main program */
int main() {
	/* Locals */
	MKL_INT m = M, n = N, nrhs = NRHS, lda = LDA, ldb = LDB, info;
	int i,j;
	/* Local arrays */
	float A[LDA*M] = {
      -0.57, -1.28, -0.39,  0.25,
      -1.93,  1.08, -0.31, -2.14,
       2.30,  0.24,  0.40, -0.35,
      -1.93,  0.64, -0.66,  0.08,
       0.15,  0.30,  0.15, -2.13,
      -0.02,  1.03, -1.43,  0.50
      };
	float B[LDB*M] = {
      -3.15,  2.19,
      -0.11, -3.64,
       1.99,  0.57,
      -2.70,  8.23,
       0.26, -6.35,
       4.50, -1.48 
	 };
    float tau,b,Bsav[LDB*M];

	 printf( "LAPACKE Least Squares Example Program\n" );
	 /* Solve the overdetermined equations A X = B using computational routines */
	 /* Part-1: equivalent to calling LAPACKE_sgels */
	 /* Phase-1: Q-R decomposition */
	info = LAPACKE_sgeqrf( LAPACK_ROW_MAJOR, m, n, A, lda, tau);
	if( info < 0) {
		printf( "The %d-th argument to Lapacke_SGEQRF had an error.\n",-info );
		exit( 1 );
	}
	/* Print details of QR factorization */
	print_matrix( "Details of QR factorization", m, n, A, lda );
	/* Keep a copy of B for later use */
	for(i=0; i<NRHS*M; i++)Bsav=B;
	
	/* Phase-2 : Compute C = Q' B */

	info = LAPACKE_sormqr(LAPACK_ROW_MAJOR,'L','T',m,nrhs,n,A,lda,tau,B,ldb);
	if (info ) {
		 printf("The %d-th argument to Lapacke_SORMQR had an error.\n",-info);
		 exit(1);
		}
	print_matrix( "Q^T x B",n,nrhs,B,ldb);
	
	/* Phase-3: Solve R X = C and compute norm of residuals */
	
	cblas_strsm(CblasRowMajor,CblasLeft,CblasUpper,CblasNoTrans,CblasNonUnit,
	               n,nrhs,1.0f,A,lda,B,ldb);

	print_matrix( "Least squares solution", n, nrhs, B, ldb );
	print_vector_norm( "Residual sum of squares for the solution", m-n, nrhs,
			&B[n*ldb], ldb );
	
    /*	Part-2: Repeat phases 2 and 3 for one column of B at a time, in any order */
	for(i=0; i<NRHS; i++){
		for(j=0; j<M; j++)b=Bsav[j*NRHS+i];
		info = LAPACKE_sormqr(LAPACK_ROW_MAJOR,'L','T',m,1,n,A,lda,tau,b,1);
		cblas_strsm(CblasRowMajor,CblasLeft,CblasUpper,CblasNoTrans,CblasNonUnit,
	                   n,1,1.0f,A,lda,b,1);
	    printf("\n Least squares solution for data set %d\n",i+1);
		for(j=0; j<n; j++)printf(" %6.2f",b); printf("\n");
		print_vector_norm(" RSSQ for this set",m-n,1,b+n,1);
	    }
	exit( 0 );
}

/* Auxiliary routine: printing a matrix */
void print_matrix( char* desc, MKL_INT m, MKL_INT n, float* a, MKL_INT lda ) {
	MKL_INT i, j;
	printf( "\n %s\n", desc );
	for( i = 0; i < m; i++ ) {
		for( j = 0; j < n; j++ ) printf( " %6.2f", a[i*lda+j] );
		printf( "\n" );
	}
}

/* Auxiliary routine: printing norms of matrix columns */
void print_vector_norm( char* desc, MKL_INT m, MKL_INT n, float* a, MKL_INT lda ) {
	MKL_INT i, j;
	float norm;
	printf( "\n %s\n", desc );
	for( j = 0; j < n; j++ ) {
		norm = 0.0;
		for( i = 0; i < m; i++ ) norm += a[i*lda+j] * a[i*lda+j];
		printf( " %6.2f", norm );
	}
	printf( "\n" );
}

Weiss__Daniel · ‎12-07-2016

Hi mecej4,

thank you very much for the sample. It turns out I got the 6th parameter of trsm wrong, must have mixed up my matrices. Making the changes according to your sample, I now get a polynom fitter with results within numerical accuracy of MATLAB's A\b, so that's alright.

Having not used LAPACK before, for a first-time user I think the library performance is harder to unlock than say for the IPP library. The habit of overwriting matrices is also disconcerting. I guess that is how they did it back when FORTRAN was the new kid on the block...

Thanks again,

Daniel

Weiss__Daniel · ‎12-07-2016

Hi,

one more thing: how come the IPP guys managed to store all QR data in a same-size matrix (no additional storage required for ippmQRDecomp_m()), and LAPACK ?geqrf requires the additional tau vector?

Puzzled,

Daniel

mecej4 · ‎12-07-2016

Short answer: History and Culture.

Lapack and BLAS were created in the 1970s and 1980s, almost exclusively in Fortran. C was not widely used for numerical work at the time. In Fortran 66 and 77, the only way to manage large work arrays was to allocate a huge common block in the main program and manage indices into various blocks of that memory. That is why you will see lots of Lapack and BLAS routines with two puzzling (to C programmers and very young Fortran programmers) types of arguments: work arrays and "leading dimension" of multidimensional array arguments. These are not required in C or modern Fortran, but it is not easy to "fix" old stuff without breaking a lot of working code that controls nuclear missiles, reactors, medical information systems, etc.

IPP has no such history, and is entirely under Intel's control. It is most easy to use from C, which has always had malloc() and other ways of allocating memory.

Most MKL BLAS and Lapack routines now have Fortran 95 interfaces, as you can see in the MKL Fortran documentation. In these versions, the argument lists are shortened, some arguments are optional, and work arrays are managed behind the scenes.

All this, I am sure, is a cause for much confusion to C programmers, but I see no way of avoiding these issues. "Go to war with the army you have! Work with the president we elected!", and so on is what we must do.

Another discussion of this and related topics: http://drsfenner.org/blog/2015/12/into-the-weeds-i-lapack-lapacke-and-three-paths/ .

Weiss__Daniel · ‎12-08-2016

Hi mecej4,

thanks for the link to the blog, good stuff! The math was the same in 1950, though, so I am still wondering where that tau array goes in the IPP QR decomp implementation. There just seems to be no way to store it in-place.

As a not-so-young C programmer, I can say Intel SW support has always been great for me. Some years ago, the IPP developers even put in new functionality just for us... Of course, they've been taking that out again when they dropped the "small matrix and rendering" library.

MKL looks to be a good army too. Don't drop your punch cards,

Daniel

mecej4 · ‎12-08-2016

(name withheld) wrote:

The math was the same in 1950, though, so I am still wondering where that tau array goes in the IPP QR decomp implementation. There just seems to be no way to store it in-place.

I am not an IPP user, but the documentation of ippmQRDecomp shows that there is an argument pBuffer, which probably corresponds to the array tau[]. The Lapack Q-R decomposition does not actually return Q in the lower triangle, but the sequence of Householder reflectors from which Q can be computed, or applied to an RHS vector or matrix, as needed. The Householder reflectors are

Hn = I - (2/r_n) Vn * VnT

The multipliers r_n need space, and that is the role of the arrays tau or pBuffer. However, as I noted earlier, a least-squares solver that does both the factorization and the back-substitution, such as Lapacke_sgels(), does not need to be provided tau as an argument. Similarly, as you can see at https://software.intel.com/en-us/node/469160 , the Fortran-95 call can be as simple as call gels(a, b) .

Additional references:

1. https://www.inf.ethz.ch/personal/gander/papers/qrneu.pdf

2. http://blogs.mathworks.com/cleve/2015/06/15/algol-60-pl0-and-matlab/#5df6e13c-9e12-4979-b96d-1b4b7f9f9c06

3. http://www.dm.unibo.it/~guerrini/html/an_09_10/QR_50_years_later.pdf

Weiss__Daniel · ‎12-08-2016

Hi mecej4,

in IPP 7, the pBuffer array is described as "pre-allocated auxiliary array to be used for internal computations" (manpage for QRDecomp). So this is pre-allocated storage that doesn't have to be allocated by IPP inside QRDecomp, saving time and possibly memory fragmentation if you have a large number of calls. It's just scratch space provided by the function caller. When calling the back-substitution function, the content of the pBuffer array can be _arbitrary_ (specifically, different from what QRDecomp wrote there).

The tau array from ?geqrf needs to be stored and provided to ?ormqr for the back-substitution. So in this case, the tau array is additional and required storage for the decomposition.

Best regards,

Daniel

mecej4 · ‎12-08-2016

I see. The r_n can be recomputed on the fly, and that may require a small overhead for the small matrices that IPP is intended to be used with. Lapack, however, is intended for big matrices, such as those that arise from discretizing a partial differential equation in two-dimensional space, so they had to avoid costly re-computations as well as minimize the memory consumption.

The mainframes of the 1970s ran Fortran programs in which all variables were, in C terms, "static". I would not be surprised if the big-iron hardware had no stack support, and recursion was prohibited. There were even CPUs with no registers. In such an environments, lots of things were done differently.

Weiss__Daniel · ‎12-09-2016

Hi mecej4,

I can confirm that the tau array returned by ?geqrf does indeed contain the factors 2/(norm(V_i)^2), with V_i the vectors stored (except for 1 and 0 entries) in the lower left part of the matrix A returned by ?geqrf. So tau can be recomputed from this data, as you said.

Thanks for the help, and the additional reading material. This has helped me a lot.

Best regards,

Daniel