Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

multiple right hand side

apocalx
Beginner
2,197 Views
I'm using Intel MKL 10.3 update 1.
When I cmpare time to solve Ax=b with Pardiso, I get slower time with multiple right hand side than in single rhs.
I made many tests with different matrix size and number of non zero values and I get always same behavior.
Matrix is symmetrical
Matrix and vector B are fill with complex number
Matrix size : 10000x10000
Number of non zero value per row : 3
All non zero values are on the diagonal and near the diagonal
Vector B is fill with 1+j0 at each index. (to simplify the test)
Reordoring and Factorisation are runned only 1 time.
Time to solve 10000 times the system with 1 rhs : 4.485 sec.
Time to solve 10000/8 times the system with 8 rhs: 15.89 sec.
Time to solve 10000/32 times the system with 32 rhs: 7.672 sec.
Time to solve 10000/128 times the system with 128 rhs: 6.89 sec.
Is it normal that the faster result is always wit 1rhs no matter the matrix size, the number of non zero and values in B vector ?
I'm on windows XP, visual studio 2005 and 2010
Intel Core 2 Quad CPU Q9550 @ 2.83GHz 3.25 GB RAM
0 Kudos
21 Replies
Chao_Y_Intel
Moderator
2,032 Views

Hello,

From the tests, the result seems fine. The test repeated on the same data set. The memory access performance is important on the solver stage. If you only have one right hand date, this data will be easier to keep into the cache after the first run, which will have better performance, than it has more multiple right hands data. I am checking with the function owner for more details on this.

Thanks,
Chao

0 Kudos
apocalx
Beginner
2,032 Views
Any news?
In fact, my problem is the same than Bosun Hwang posted last year (http://software.intel.com/en-us/forums/showthread.php?t=76809&o=a&s=lr) but with MKL version 10.3 update 3.
So, is it normal than many right hand side was slower than 1 rhs in mkl 10.3 update 3?
Thanks?
MarcB
0 Kudos
Sergey_Solovev__Inte
New Contributor I
2,032 Views
I agree with Chao about cache. Is behavior the same on other sizes? For example: matrix size 100 000x100000; 1, 2, 4, 8 rhs and 100,50,25,12 times. Or smaller matrix 1000x1000; 1, 2, 4, 8 rhs and 100 000,50 000,25 000,12 000 times?
SergeyS.
0 Kudos
apocalx
Beginner
2,032 Views
The behavior is the same with matrix 5000x5000, 10000x10000, 50000x50000
n solve * (1 rhs) is always faster thann/k solve * (k rhs)
This logical is true with k <= 256. Bigger than that, I have some memory problems.
where"k" is the number of right hand side
In my mind, multiple rhs should be faster than 1 rhs, no?
MarcB
0 Kudos
apocalx
Beginner
2,032 Views
Anyone is this forum can reproduce same behavior than me with mutiple rhs? Or another behavior?

Because, with my conclusion of multiple rhs, I don't see interest to use it instead of many single rhs.
Thanks
MarcB
0 Kudos
Sergey_Solovev__Inte
New Contributor I
2,032 Views

We reproduced this issue on test with small number of non- zero value per row (number is 3). There is no problem with tests which have many number of non zero value per row. Do you have problem if non zero value per row more than 100?

0 Kudos
apocalx
Beginner
2,032 Views
Times seem better with a big number of non zero values.
Here are some benchmark result with symmetrical matrix and complex number :

non zero values per line = number of non zero values in diagonal and upper triangle per line
Matrix size 5000 x 5000 with 2 non zero values per line :
rhs = 1 => 1.172 sec.
rhs = 2 => 10.515sec.
rhs = 4 => 5.718sec.
rhs = 8 => 3.000sec.
rhs = 16 => 2.188sec.
rhs = 32 => 1.593sec.
rhs = 64 => 1.469sec.
rhs = 128 => 1.563sec.
Conclusion : Slower with multiple rhs than single rhs no matter number of rhs
Matrix size 5000 x 5000 with 10 non zero values per line :
rhs = 1 => 9.734 sec.
rhs = 2 => 33.140 sec.
rhs = 4 => 19.734 sec.
rhs = 8 => 15.687 sec.
rhs = 16 => 13.203 sec.
rhs = 32 => 12.406 sec.
rhs = 64 => 12.531 sec.
rhs = 128 => 15.734 sec.
Conclusion : Slower with multiple rhs than single rhsno matter number of rhs
Matrix size 5000 x 5000 with 100 non zero values per line :
rhs = 1 => 116.248sec.
rhs = 2 => 126.904sec.
rhs = 4 => 89.264sec.
rhs = 8 => 72.514sec.
rhs = 16 => 67.468sec.
rhs = 32 => 60.061sec.
rhs = 64 => 58.186sec.
rhs = 128 => 90.030sec.
rhs = 256 => 117.185sec.
Conclusion : Faster with multiple rhs than single rhs, but multiple rhs must be <= 64 and != 2
Matrix size 20000 x 20000 with 50 non zero values per line :
rhs = 1 => 1143.416sec.
rhs = 2 => 1451.363sec.
rhs = 4 => 955.747sec.
rhs = 8 => 789.063sec.
rhs = 16 => 695.283sec.
rhs = 32 => 632.784sec.
rhs = 64 => 626.066sec.
rhs = 128 => 831.969sec.
Conclusion : Faster with multiple rhs than single rhs, but multiple rhs must be <= 64and != 2
The more nnz values per line are in matrix, faster is multiple rhs solving.
Unfortunately, in my case all my matrices are highly sparse. (Between 3 and 10 non zero values per line no matter the matrix size) and I have to solve it with many different B, so I need good performance with multiple right hand side with small number of nnz per line.
Thanks
MarcB
0 Kudos
Sergey_Solovev__Inte
New Contributor I
2,032 Views

Thanks for the detailed response!
Could you provide us with info about you? We will create request and add this info into this request.
It would be nice if you provide us with your test case. We will add it in request as well.
Sergey Solovev.

0 Kudos
apocalx
Beginner
2,032 Views
Name: Marc Belletete
Company: Cyme
Email: marc.belletete@cyme.com
Envioronment: Visual Studio 2010
I attached a benchmark .cpp file.
Here is another subject :
All our matrices are highly sparse (between 3 and 10 nnz values per line) and almost all nnz values are on diagonal or very close of diagonal.
With this kind of matrix, I saw that KLU solver ( non symmetrical ) is, maybe, 1.5x faster than intel mkl pardiso (symmetrical).
However, KLU have problems that Pardiso don't have (problem to solve singular matrix).
I would like your opinion about that.
Note: KLU is a free solver.
Thanks
MarcB
0 Kudos
Gennady_F_Intel
Moderator
2,032 Views
Which comparative resultsyou have withthese solvers (KLU vs Pardiso)?
0 Kudos
apocalx
Beginner
2,032 Views
I made many tests with matrix size 20000x20000 and 50000x50000
My matrices were symmetrical, so I used Pardiso symmetrical and KLU non-symmetrical (not sure that KLU symmetrical exist). The number of non zero values were between 3 and 10 per line). Almost all my non zero values were on the diagonal and close to the diagonal.
MarcB
0 Kudos
Sergey_Solovev__Inte
New Contributor I
2,032 Views
Thanks for detailed info!
We added it into our request.
0 Kudos
Andrey_Kuzmin__Intel
2,032 Views

Hello Marc,

I did investigation of the problem and found much better instrument for this kind of matrices.

The reason of relatively poor performance of multiple RHS solve is simple.

We have two types of parallelism inside PARDISO solving step. The first is used for problems with 1 RHS (the work for the only RHS is split between cores) and the second is used for systems with many RHS (many RHS are solved in parallel, but the sequential algorithm is used for each of them). There is no mixed version for the problems with few RHS. For instance, if we have 2 right hand sides and 4 cores, the problem will be solved in parallel using 2 cores, but 2 cores will remain idle.

The second reason of poor performance for such matrices is that amount of work for each RHS is too small. We dont even compensate the threading overhead. Indeed, the sequential version of PARDISO is faster than parallel for such problems (set OMP_NUM_THREADS=1 to check it). If we increase the number of RHS or nnz per line, the parallel version shows better performance as you mentioned.

In fact, matrices of size 10000x10000 are considered to be very small for PARDISO. If you always have all non-zero elements very close to diagonal, I would recommend you to use MKL LAPACK solver with band matrix storage scheme (zgbtrf and zgbtrs stay for factorization and solve correspondingly). Unfortunately, there is no storage scheme in LAPACK that takes advantage of both banded structure and symmetry (the latter has to be sacrificed), but the memory consumption seems not to be critical here. I rewrote the benchmark so that it does exactly the same computations, but uses these two functions instead of PARDISO (see attachment). It demonstrates much higher performance (at least few times faster, but sometimes even an order of magnitude faster). In fact there is the same issue with 2 RHS on 4 cores, but overall results situation is much better.

Best regards,

Andrey Kuzmin.
0 Kudos
Andrey_Kuzmin__Intel
2,032 Views

Here is the attachment.

0 Kudos
apocalx
Beginner
2,032 Views
In my case, maybe 90% of non zeros values are on diagonal or near the diagonal. The other 10% are anywhere in the upper triangle. (number of non zero values is still about 7 per lines)
Is it correct if I use LAPACK equivalent functions, with Packed storage or Full Storage?
With this kind of storage (instead of Band storage), do multiple rhs will give me interesting time? (Faster the pardiso and comparable to band storage with multiple rhs)?
Thanks
Marc
0 Kudos
Andrey_Kuzmin__Intel
2,032 Views

Marc,

I see your question andwill investigate it soon.

Sincerely,
Andrey.

0 Kudos
apocalx
Beginner
2,032 Views
Jus to resume the situation :
With MKL, I want to solve Ax=b with many b with the same A where A is a higly sparse matrix (symmetrical or not).
1- I can't use MKL LAPACK functions because LAPACK don't have a schematic for sparse matrix, and it will take too much memory to have a schematic of all values in my matrix (including all zero values)
2- I have to use Pardiso to factorize only 1 time de matrix and to solve it many time aftre that. (work fine but slow)
3- I can use PARIDSO multiple right hand side to solve but it will be slower than solving 1 right hand side a time for reason explain in this post.
4- In my case, I only need some values in x solution vector (position that I need change for each solve). I can use PARDISO partial solution, but it will be slower than solve all values in x vector because I need to re-factorize the matrix for each solve if I use partial solution.
So, I don't know if there is another solution for my kind of situation?
I try some other solver andI don't know how KLU solver work, but if I try to solve the same system :
1 factorization
solve many B vector
1 right hand side a time
Using only 1 processor (no parallelisme)
KLU is 3 times faster then Pardiso for this king of matrix. (http://www.cise.ufl.edu/research/sparse/klu/)
I'm waiting your comments
Thanks a lot for your time
Marc
0 Kudos
basel
Beginner
2,032 Views
Marc,

>4- In my case, I only need some values in x solution vector (position that I need change for each solve). I can >use PARDISO partial solution, but it will be slower than solve all values in x vector because I need to re-factorize >the matrix for each solve if I use partial solution.

You should try to use the release 4.1.3 from www.pardiso-project.org.

We have recently implemented a feature that exactly is a good solution for (4).

Olaf
0 Kudos
Alexander_K_Intel2
2,032 Views
HI Marc,
About 4th question: Am I right that numbers of components solution vector that you need to find using PARDISO differ from step to step? I've asked it because if these numbers doesn't change you could use partial solution in pardiso solver with only one call factorization and reordering steps.
With best regards,
Alexander Kalinkin
0 Kudos
apocalx
Beginner
1,819 Views
For each solve, I only need 3 psotions in solution vector. But these 3 positions change for each solve.
Marc
0 Kudos
Reply