Threaded band-triangular solvers in MKL

Magnus_G_ · ‎05-02-2017

We have an application where we repeatedly need to solve systems of equations where the coefficient matrix is on triangular band form, and is symmetric positive definite. Thus we factorize once with DPBTRF and solve with DPBTRS. The problem we have is that DPBTRS does not seem to get any benefit at all from running on multiple threads. The factorization scales nicely, but as we call DPBTRS thousands of times, the solution becomes the main bottleneck in the code and it does not scale at all with the number of threads used. We do this within an ARPACK iteration scheme, so we can only solve for a single right hand side at a time, and we typically solve for hundreds of eigenpairs,

A typical example in our case has a matrix size of 200 000 and a bandwidth of 2500. This is quite narrow, so we do understand that there is not much room for parallelism here. However, in order to understand our situation, we have tried to experiment with different matrix sizes and bandwidths but we see no parallel improvements whatsoever. According to the MKL Developer Guide, PBTRS should be threaded. We also tried to write our own test code for comparison. With a simple text-book parallelization of the forward- and backward substitution i PBTRS we achieved a speedup for bandwidths of 4000 and larger, whereas MKL gives no speedup for any bandwidth we tried (in our tests, the largest bandwidth we could fit in memory was 64000). We have also tried other banded solvers with similar results.

We have run our tests on a dual socket Intel Xeon E5-2640 system with a total of 12 cores and 96 MB of memory. We have tried Composer XE 2016 and 2017 with similar results.

My question is: When can we expect parallel improvements in banded solvers?

Gennady_F_Intel · ‎05-04-2017

Magnus, we are going to add this improvements into one of the nearest updates We will keep you informed. thanks Gennady

Magnus_G_ · ‎05-04-2017

Thank you for the prompt reply. Does this mean that you confirm that the current version of DPBTRS is not threaded? If it is, under what circumstances does it run on multiple threads? Is there anything we can do to improve the performance of DPBTRS in the current version of MKL? Is it worthwhile using PARDISO for this operation?

Konstantin_A_Intel · ‎05-04-2017

Hi Magnus,

DPBTRS is parallelized over NRHS right now in MKL LAPACK, so for NRHS=1 the sequential version is used. On the other hand, sequential version is blocked and tries to call BLAS routines of relatively large sizes (e.g., for KD=2500, block size can be 512 or so). However, I found out some issues preventing threading of that routines in the context of DPBTRS.

In my experiments, I can see up to 3x performance improvement on Intel(R) Xeon(R) CPU E5-2699 v4. I would like to get your feedback about it - is it enough? Please take into account that for NRHS=1 this algorithm is fairly memory bound, and scalability is quite limited anyway.

I'm afraid that there's no workaround allowing you to improve performance without updating to newer MKL version where it (hopefully) be fixed. I'm also skeptical about PARDISO for band matrices. The fill-in will introduce a lot of additional operations.

One question: do you store matrices in Lower or Upper form?

Regards,

Konstantin

Magnus_G_ · ‎05-05-2017

Konstantin,

At the moment, PBTRS is the most time consuming part of our solver, so any improvements we can get are welcome. As I mentioned, we call this routine within an ARPACK iteration scheme, which is also memory bound, and with threaded MKL we typically see speedups of 2.5x-3x at best. We cannot solve for more than one RHS at a time, so I guess we are out of luck then with the current version of MKL. As always with memory bound applications, I think you should be happy for the parallel improvements you can get, so 3x sounds good enough to me.

We use lower triangular storage for our factorized matrices.

I guess we will have to keep our eyes open for MKL updates and hope for this issue to be resolved in the near future. At least we understand now how the solver is parallelized and why we don't see any parallel improvements.

Magnus