Help with LLS

vikrantca · ‎10-14-2009

Problem: DGELSS does not seem to work with large datasets.

I am using the following: MKL 10.2 with VS 2008, Managed C++.

The call from C++ to MKL is fine - tested it with small datasets (upto 100 rows and 5 columns) and verified the results (LLS coefficients) with Matlab and Excel Regression.

Attached is the dataset that I am having trouble with. It has 16383 rows and I get very strange coefficients from MKL.
The first 5 columns are the coefficient matrix and the last column is the y value. I am basically trying to fit the following equation: y = a+b*x + c*x^2 + d*x^3 + e*x^4.

I get drastically different coefficients when I use Excel Data Analysis Toolpak Regression feature.

Any guidance is greatly appreciated. Thanks in advance.

Gennady_F_Intel · ‎10-16-2009

Quoting - vikrantca

Problem: DGELSS does not seem to work with large datasets.

I am using the following: MKL 10.2 with VS 2008, Managed C++.

The call from C++ to MKL is fine - tested it with small datasets (upto 100 rows and 5 columns) and verified the results (LLS coefficients) with Matlab and Excel Regression.

Attached is the dataset that I am having trouble with. It has 16383 rows and I get very strange coefficients from MKL.
The first 5 columns are the coefficient matrix and the last column is the y value. I am basically trying to fit the following equation: y = a+b*x + c*x^2 + d*x^3 + e*x^4.

I get drastically different coefficients when I use Excel Data Analysis Toolpak Regression feature.

Any guidance is greatly appreciated. Thanks in advance.

Vikrant,
Could you check how this code will works with C++?
--Gennady

vikrantca · ‎10-16-2009

Quoting - Gennady Fedorov (Intel)

Vikrant,
Could you check how this code will works with C++?
--Gennady

I am not sure, I understand your question. Please can you provide more info.

Below is my code:

Int32 SolveGELSS(Int32 inM, Int32 inN, array^A, array^ B)
{

int m = inM;
int n = inN;

int nrhs = 1;
int lda = m;
int ldb = Math::Max(m,n);
double rcond = -1;
int lwork = 2;
int rank =0;
int i=0;

double *a = new double[m*n];
this->ConvertToColMajor(A, m,n,a);

int info = 0;
double *work = new double[lwork];

// Use ldb here, since it is max(m,n);
double *s = new double[ldb];

// Copy b to local array
double *b = new double;
for(i = 0; i < m; i++)
b = B;

// Workspace query

lwork = -1;
DGELSS(&m,&n, &nrhs, a,&lda, b, &ldb,s, &rcond, &rank, work, &lwork, &info);

lwork = (int)work[0];
delete[]work;

work = new double[lwork];

DGELSS(&m,&n, &nrhs, a,&lda, b, &ldb,s, &rcond, &rank, work, &lwork, &info);

if(info == 0)
{
for(i = 0; i < m; i++)
B = b;
}

// Memory clean up

}

Many thanks.

Michael_C_Intel4 · ‎10-23-2009

You're right, coefficients seem to be bogus, I get somewhat like

-2.492143e-17 6.400757e-18 -8.539396e-19 -6.115223e-18 2.546998e-17

This can be explained by tooimbalanced matrix: the values are different in order by more than machineprecision (~1.0e-16). The algorithm becomes highly unstable. But it can be easily worked around. I would recommend you balancing the matrix before applying DGELSS - for instance, in this case:

for( i = 0; i < M; i++ ) A[1] *= 1.0e-6;
for( i = 0; i < M; i++ ) A[2] *= 1.0e-11;
for( i = 0; i < M; i++ ) A[3] *= 1.0e-16;
for( i = 0; i < M; i++ ) A[4] *= 1.0e-21;
for( i = 0; i < M; i++ ) B *= 1.0e-6;

Then you would get results,which seem to be correct (after back scaling):

-4.103898e+04 4.109750e-02 -4.553319e-07 4.735998e-12 -1.904054e-18

Michael.

vikrantca · ‎10-26-2009

Works nicely. Many thanks Michael.

vikrantca · ‎10-28-2009

Quoting - vikrantca

Works nicely. Many thanks Michael.

One naive question - On output from GELSS, the rows of matrix B (from n+1 to m) will also need to be inverse-scaled to get the correct residuals, right?

Many thanks.

Michael_C_Intel4 · ‎10-29-2009

Quoting - vikrantca

One naive question - On output from GELSS, the rows of matrix B (from n+1 to m) will also need to be inverse-scaled to get the correct residuals, right?

Many thanks.

Yes, it should be scaled back. Here's the theory:

- initial system of equations is: A*x = b, you need solution to minimize r = || A*x - b || and residual r.

- you solve the system A*cA*x' = b*cb instead, where cA: n-by-n, cA = diag( cA1, ..., cAn );cb: 1-by-1 scalar

- the solution of the modifed system: x' = inv( cA )*x*cb, that is x = inv( cb )*x'*cA

- by substituting solution x' into modified system's residualwe get r' = || A*cA*x' - b*cb || = || A*x*cb - b*cb || = || A*x - b || * cb = r*cb, that is r = r'*inv( cb ), so

you collect sum-of-squares from DGELSS (that is,of modified system - r'), than scale it back with the coefficient you scaled b vector with.

Michael_C_Intel4 · ‎10-29-2009

To be more precise, you need to scale back not r', but r'^2 (square of r'), so apparently it should be scaled by inv( cb )^2, that is r^2 = r'^2 * inv( cb )^2