From the tests, the result seems fine. The test was repeated on the same data set. Memory access performance is important in the solve stage. If you have only one right-hand side, its data is easier to keep in the cache after the first run, which gives better performance than with multiple right-hand sides. I am checking with the function owner for more details on this.
We reproduced this issue on a test with a small number of non-zero values per row (3). There is no problem with tests that have many non-zero values per row. Do you still see the problem when there are more than 100 non-zero values per row?
Thanks for the detailed response!
Could you tell us a bit about yourself? We will create a request and add this information to it.
It would also be helpful if you could provide your test case. We will add it to the request as well.
I investigated the problem and found a much better tool for this kind of matrix.
The reason for the relatively poor performance of the multiple-RHS solve is simple.
We have two types of parallelism inside the PARDISO solve step. The first is used for problems with 1 RHS (the work for the single RHS is split between cores), and the second is used for systems with many RHS (the RHS are solved in parallel, but a sequential algorithm is used for each of them). There is no mixed version for problems with few RHS. For instance, with 2 right-hand sides and 4 cores, the problem is solved in parallel using 2 cores, while the other 2 cores remain idle.
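To make the core-count arithmetic concrete, here is a toy Python model of the two modes described above. It is only a sketch of the behavior as stated in this post, not PARDISO's actual scheduling logic, and the function name `busy_cores` is made up for illustration.

```python
def busy_cores(n_rhs: int, n_cores: int) -> int:
    """Toy model: number of cores doing useful work in the solve step.

    Assumption (taken from the description in this thread, not from the
    PARDISO sources): with 1 RHS the work is split across all cores; with
    several RHS each one is solved sequentially on its own core.
    """
    if n_rhs < 1 or n_cores < 1:
        raise ValueError("n_rhs and n_cores must be positive")
    if n_rhs == 1:
        return n_cores              # single RHS: work split between cores
    return min(n_rhs, n_cores)      # many RHS: one sequential solve per core


for n_rhs in (1, 2, 4, 8):
    busy = busy_cores(n_rhs, 4)
    print(f"{n_rhs} RHS on 4 cores: {busy} busy, {4 - busy} idle")
```

In this model the 2 RHS on 4 cores case is the worst one shown: half of the cores have nothing to do.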
The second reason for the poor performance on such matrices is that the amount of work per RHS is too small: it does not even compensate for the threading overhead. Indeed, the sequential version of PARDISO is faster than the parallel one for such problems (set OMP_NUM_THREADS=1 to check this). If we increase the number of RHS or the nnz per row, the parallel version shows better performance, as you mentioned.
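One way to run that check is to launch the same benchmark with different OMP_NUM_THREADS values and compare wall-clock times. The Python sketch below assumes the benchmark is a separate executable; the helper name `time_with_threads` is hypothetical, and the command used here is only a placeholder that prints the variable, so replace it with your own benchmark.

```python
import os
import subprocess
import sys
import time


def time_with_threads(cmd, n_threads):
    """Run cmd with OMP_NUM_THREADS=n_threads; return wall-clock seconds."""
    # Copy the current environment and override only the thread count.
    env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder command: substitute the path to your benchmark executable.
    cmd = [sys.executable, "-c",
           "import os; print('OMP_NUM_THREADS =', os.environ['OMP_NUM_THREADS'])"]
    for n in (1, 4):
        elapsed = time_with_threads(cmd, n)
        print(f"{n} thread(s): {elapsed:.3f} s")
```

Since process start-up time is included, this is only meaningful when the solve itself dominates the run time.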
In fact, matrices of size 10000x10000 are considered very small for PARDISO. If all of your non-zero elements are always very close to the diagonal, I would recommend using the MKL LAPACK solver with the band matrix storage scheme (zgbtrf and zgbtrs perform the factorization and the solve, respectively). Unfortunately, there is no storage scheme in LAPACK that takes advantage of both the banded structure and the symmetry (the latter has to be sacrificed), but memory consumption does not seem to be critical here. I rewrote the benchmark so that it does exactly the same computations but uses these two functions instead of PARDISO (see attachment). It demonstrates much higher performance (at least a few times faster, and sometimes even an order of magnitude faster). The same issue with 2 RHS on 4 cores is in fact still there, but the overall situation is much better.
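For reference, the general band layout that zgbtrf expects can be sketched in a few lines. The Python below is illustrative only: it is pure Python, makes no MKL call, and the helper name `pack_band` is hypothetical. With kl sub-diagonals and ku super-diagonals, entry A(i, j) goes to row kl + ku + i - j of column j (0-based) in an array with 2*kl + ku + 1 rows; the top kl rows are left free for the fill-in produced by pivoting.

```python
def pack_band(a, kl, ku):
    """Pack a dense square matrix into LAPACK general band storage.

    Returns ab with 2*kl + ku + 1 rows and n columns, where
    ab[kl + ku + i - j][j] = a[i][j] for entries inside the band.
    Only the index mapping is shown; a real call would pass the same
    data to LAPACK as a column-major array.
    """
    n = len(a)
    ldab = 2 * kl + ku + 1
    ab = [[0j] * n for _ in range(ldab)]
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            ab[kl + ku + i - j][j] = a[i][j]
    return ab


# Small symmetric tridiagonal complex matrix as an example (kl = ku = 1).
n = 4
a = [[0j] * n for _ in range(n)]
for i in range(n):
    a[i][i] = 4 + 1j
    if i + 1 < n:
        a[i][i + 1] = a[i + 1][i] = -1 + 0j

ab = pack_band(a, kl=1, ku=1)
for row in ab:
    print(row)
```

Both the sub-diagonal and the super-diagonal are stored even though the matrix is symmetric, which is the sacrifice mentioned above. The storage is still (2*kl + ku + 1) * n entries rather than n * n.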
Best regards, Andrey Kuzmin.