Parallel execution, CPU time of PARDISO for multiple rhs and wrong result of LAPACK dgetrf

Jinwook_Kim · ‎02-15-2011

Hello,

First of all, please understand my poor english.

I have two problems.

1. CPU time of PARDISO

I use the PARDISO to solve a sparse matrix.

Just one sparse matrix and multiple nrhs...

(typically, 10k~10M dimension and 100~10k rhs)

So, phase=33 is used to solve iteratively.

Problem is that,, PARDISO gives just 1.6X faster result whan compared to sequential mode (4 core used).

(dimension 45k, 35 rhs, time solve: 0.13s vs 0.23s, time total: 0.5s vs 0.8s)

I expected more than 3X faster result because each column of B can be solved independently from each other.

What is problem?

If I execute PARDISO in parallel for each column, can i get better speed?

Can I share arguments for PARDISO when i execute it in that manner?

eg>

phase = 12

PARDISO(...)

for each column of B

PARDISO(...) <-- parallel for each column

end

2. Wrong result of LAPACK dgetri when liked with thread library

I use dgetrf and dgetri to inverse a unsymmetric dense matrix.

Problem is that, when linked with intel_mkl_thread.a, dgetri gives wrong results.

Other LAPACK functions give correct results except dgetri.

My code is...

[cpp]typedef struct MATRIX{
        double *data;

        int nRow;
        int nCol;
        int isSet;

        int ld;
        int mCol;
} MATRIX;
int inverseMatrix(MATRIX *M){/*{{{*/
        int *ipiv;
        int lwork;
        int info;
        int ispec = 1;
        double *work;

        if (M->nRow != M->nCol) return 0;
        ipiv = (int *) calloc (M->nRow, sizeof(int));
        dgetrf(&(M->nRow), &(M->nCol), M->data, &(M->ld), ipiv, &info);
        if (info < 0){
                fprintf(stderr, "%d element of Matrix is illegal value!\n", -info);
                return 0;
        }
        else if (info > 0){
                fprintf(stderr, "Factored matrix is singular!\n");
                return 0;
        }

        lwork = ilaenv(&ispec, "dgetri", "", \
                                        &(M->nRow), &(M->nCol), &info, &info);
        lwork = lwork * M->nRow;
        work = (double *) calloc(lwork, sizeof(double));
        dgetri (&(M->nRow), M->data, &(M->ld), ipiv, work, &lwork, &info);
        if (info < 0){
                fprintf(stderr, "%d element of Matrix is illegal value!\n", -info);
                free(work);
                free(ipiv);
                return 0;
        }
        else if (info > 0){
                fprintf(stderr, "Factored matrix is singular!");
                free(work);
                free(ipiv);
                return 0;
        }
        free(work);
        free(ipiv);
        return 1;

}/*}}}*/[/cpp]

When i checked line-by-line using gdb, dgetrf (linked with threaded lib) gave correct result

and both threaded and sequential programs gave same results.

however, results of dgetri were different, and threaded dgetri was wrong.

What is the problem?

My system is...

Intel Core2Quad Q6600

6GB memory

Ubuntu 10.10 x86_64 (Linux 2.6.35-26-generic)

icc 12.0

MKL 10.3

compile option = -g -m64 -w -Wall

linking =-L/opt/intel/mkl/lib/intel64 -Wl,--start-group "/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.a" "/opt/intel/mkl/lib/intel64/libmkl_intel_thread.a" "/opt/intel/mkl/lib/intel64/libmkl_core.a" -Wl,--end-group -L"/opt/intel/mkl/../compiler/lib/intel64" -liomp5 -lpthread -lm

Thanks,

Jinwook Kim.

Konstantin_A_Intel · ‎02-15-2011

HiJinwook Kim,

Let me try to answer your 1st question about PARDISO NRHS.

In general, the bigger NRHS value (number of RHS passed to PARDISO), the better performance can be expected. It also makes sense to choose NRHS so that NRHS = K * num_threads, where K is integer. So, I would recommend you to improve NRHS up to 64 or 128. Of course, it won't be possible for dimension=10M task due to high memory overhead, but at least try for task up to 1M.

I ran similar test, dimension=50K. It was ~2x solve speed-up on NRHS=32 and more than 3x on NRHS=128.

Also I should note that you use rather old processor (without integrated memory controller yet) that has less efficient memory subsystem. For example, running the same test on X5650 system (2x6 cores) lunched in Q1'2010 I obtained almost 4x speed-up on 4 threads and 8x speed-up on 12 threads for NRHS=128.

Regards,

Konstantin

Jinwook_Kim · ‎02-16-2011

Hi Konstantin,

Thanks for your kind reply.

usually, dimension of matrix exceeds 1M, and QR decomposition process is followed to solved RHS, so large number of rhs causes speed degradation and memory problem. (usually, 10~20 is appropriate)

i'm going to try to set rhs to multiple of number of threads.

Thanks.

Gennady_F_Intel · ‎02-16-2011

Hi Jinwook,

regarding the LAPACK's problem. That's pretty hard to say what's going wrong with dgetri. As for me, the code looks correct. Can You give us the standalone example which we can check on our side?

--Gnnady