Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- Software
- Software Development SDKs and Libraries
- Intel® oneAPI Math Kernel Library
- Parallel execution, CPU time of PARDISO for multiple rhs and wrong result of LAPACK dgetrf

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Jinwook_Kim

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

02-15-2011
11:21 AM

89 Views

Parallel execution, CPU time of PARDISO for multiple rhs and wrong result of LAPACK dgetrf

First of all, please understand my poor english.

I have two problems.

I use the PARDISO to solve a sparse matrix.

Just one sparse matrix and multiple nrhs...

(typically, 10k~10M dimension and 100~10k rhs)

So, phase=33 is used to solve iteratively.

Problem is that,, PARDISO gives just 1.6X faster result whan compared to sequential mode (4 core used).

(dimension 45k, 35 rhs, time solve: 0.13s vs 0.23s, time total: 0.5s vs 0.8s)

I expected more than 3X faster result because each column of B can be solved independently from each other.

What is problem?

If I execute PARDISO in parallel for each column, can i get better speed?

Can I share arguments for PARDISO when i execute it in that manner?

eg>

phase = 12

PARDISO(...)

for each column of B

PARDISO(...) <-- parallel for each column

end

I use dgetrf and dgetri to inverse a unsymmetric dense matrix.

Problem is that, when linked with intel_mkl_thread.a, dgetri gives wrong results.

Other LAPACK functions give correct results except dgetri.

My code is...

[cpp]typedef struct MATRIX{ double *data; int nRow; int nCol; int isSet; int ld; int mCol; } MATRIX; int inverseMatrix(MATRIX *M){/*{{{*/ int *ipiv; int lwork; int info; int ispec = 1; double *work; if (M->nRow != M->nCol) return 0; ipiv = (int *) calloc (M->nRow, sizeof(int)); dgetrf(&(M->nRow), &(M->nCol), M->data, &(M->ld), ipiv, &info); if (info < 0){ fprintf(stderr, "%d element of Matrix is illegal value!\n", -info); return 0; } else if (info > 0){ fprintf(stderr, "Factored matrix is singular!\n"); return 0; } lwork = ilaenv(&ispec, "dgetri", "", \ &(M->nRow), &(M->nCol), &info, &info); lwork = lwork * M->nRow; work = (double *) calloc(lwork, sizeof(double)); dgetri (&(M->nRow), M->data, &(M->ld), ipiv, work, &lwork, &info); if (info < 0){ fprintf(stderr, "%d element of Matrix is illegal value!\n", -info); free(work); free(ipiv); return 0; } else if (info > 0){ fprintf(stderr, "Factored matrix is singular!"); free(work); free(ipiv); return 0; } free(work); free(ipiv); return 1; }/*}}}*/[/cpp]

When i checked line-by-line using gdb, dgetrf (linked with threaded lib) gave correct result

and both threaded and sequential programs gave same results.

however, results of dgetri were different, and threaded dgetri was wrong.

What is the problem?

My system is...

Intel Core2Quad Q6600

6GB memory

Ubuntu 10.10 x86_64 (Linux 2.6.35-26-generic)

icc 12.0

MKL 10.3

compile option = -g -m64 -w -Wall

linking =-L/opt/intel/mkl/lib/intel64 -Wl,--start-group "/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.a" "/opt/intel/mkl/lib/intel64/libmkl_intel_thread.a" "/opt/intel/mkl/lib/intel64/libmkl_core.a" -Wl,--end-group -L"/opt/intel/mkl/../compiler/lib/intel64" -liomp5 -lpthread -lm

Thanks,

Jinwook Kim.

Link Copied

3 Replies

Konstantin_A_Intel

Employee

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

02-15-2011
09:10 PM

89 Views

Let me try to answer your 1st question about PARDISO NRHS.

In general, the bigger NRHS value (number of RHS passed to PARDISO), the better performance can be expected. It also makes sense to choose NRHS so that NRHS = K * num_threads, where K is integer. So, I would recommend you to improve NRHS up to 64 or 128. Of course, it won't be possible for dimension=10M task due to high memory overhead, but at least try for task up to 1M.

I ran similar test, dimension=50K. It was ~2x solve speed-up on NRHS=32 and more than 3x on NRHS=128.

Also I should note that you use rather old processor (without integrated memory controller yet) that has less efficient memory subsystem. For example, running the same test on X5650 system (2x6 cores) lunched in Q1'2010 I obtained almost 4x speed-up on 4 threads and 8x speed-up on 12 threads for NRHS=128.

Regards,

Konstantin

Jinwook_Kim

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

02-16-2011
03:36 AM

89 Views

Thanks for your kind reply.

usually, dimension of matrix exceeds 1M, and QR decomposition process is followed to solved RHS, so large number of rhs causes speed degradation and memory problem. (usually, 10~20 is appropriate)

i'm going to try to set rhs to multiple of number of threads.

Thanks.

Gennady_F_Intel

Moderator

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

02-16-2011
05:53 AM

89 Views

regarding the LAPACK's problem. That's pretty hard to say what's going wrong with dgetri. As for me, the code looks correct. Can You give us the standalone example which we can check on our side?

--Gnnady

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

For more complete information about compiler optimizations, see our Optimization Notice.