Possibility to parallelize Cholesky decomposition routine?

ZlamalJakub · ‎01-25-2018

I have old FORTRAN code for Cholesky decomposition of symmetrical FEM matrix (attached file). Subroutine seems to be the slowest part of conjugate gradient solver. I want to parallelize it, but it seems to me that it is impossible. Number of unknowns (variable "is") is typically 200 thousands to several millions. Variable "i2" is typically bellow 1000.

Has anybody have an idea how to rewrite routine to enable its parallel processing?

Thanks for any advice

andrew_4619 · ‎01-25-2018

I can see lots of ways it could be changed to allow better opimisation/vectorisation however why spend any time and effort, just try the mkl version which should be much faster anyway.

ZlamalJakub · ‎01-25-2018

Thanks for reply. I have searched for suitable MKL routine, but I have found only ?pbtrf for band matrix, but it needs much more storage than only diagonals of nonzero elements of matrix.

Can You give me a hint which routine will be suitable (considering also memory demands)?

jimdempseyatthecove · ‎01-26-2018

Jakub,

The code you provided is relatively old. Nothing wrong with this. However, it may have been written prior to CPUs having cache (or very large cache). And it was written before CPUs had small vector instructions. Significant improvements to this code, and likely much code elsewhere in your application, by recomposing the code, principally loops, to take advantage of these features. For example:

      d(1)=1.d0
      d1(1)=p1(1)
      d2(1)=p2(1)
      d3(1)=p3(1)
      d4(1)=p4(1)
      do 1 i=2,i2
         d(i)=(1.d0-d(i-1)*d1(i-1)**2)
         d1(i)=p1(i)/d(i)
         d4(i)=p4(i)/d(i)
         d3(i)=(p3(i)-d(i-1)*d1(i-1)*d4(i-1))/d(i)
         d2(i)=(p2(i)-d(i-1)*d3(i-1)*d1(i-1))/d(i)
    1 continue

Can be rewritten (untested code) to:

      d(1)=1.d0
      d1(1)=p1(1)
c This loop has loop carry dependencies and is not suitable for vectorization
      do 1 i=2,i2
         d(i)=(1.d0-d(i-1)*d1(i-1)**2)
         d1(i)=p1(i)/d(i)
    1 continue
      
      d4(1)=p4(1)
c This loop will vectorize, with d(i) likely coming out of cache
      do 11 i=2,i2
         d4(i)=p4(i)/d(i)
    11 continue
         
      d3(1)=p3(1)
c This loop will vectorize
c with d(i-1), d1(i-1), d4(i-1) and d(i) potentially coming out of cache
c or taking advantage of hardware prefetch
      do 12 i=2,i2
         d3(i)=(p3(i)-d(i-1)*d1(i-1)*d4(i-1))/d(i)
    12 continue
         
      d2(1)=p2(1)
c This loop will vectorize
c with d(i-1), d3(i-1), d1(i-1) and d(i) potentially coming out of cache
c or taking advantage of hardware prefetch
      do 13 i=2,i2
         d2(i)=(p2(i)-d(i-1)*d3(i-1)*d1(i-1))/d(i)
    13 continue

I did not take the time to look at the DO 2 loop to see if there are similar optimization opportunities. Cursory look seems to indicate too many loop carry dependencies.

As stated above, old code like this may have many opportunities like this.

*** OpenMP

The loops that do not have loop carry dependencies (11, 12, 13) can be parallelized if i2 is sufficiently large enough. Less than 1000 is not large enough.

Jim Dempsey

ZlamalJakub · ‎01-26-2018

Thanks for your suggestions Jim. I will try to redesign the loops as you suggested.

John_Campbell · ‎01-29-2018

This solver appears to be an "approx." decomposed result for a banded matrix, where:

c  B(i) = P4(i-i2-1) * V(i-i2-1)
c       + P3(i-i2)   * V(i-i2)
c       + P2(i-i2+1) * V(i-i2+1)
c       + P1(i-1)    * V(i-1)
c       + P (i)      * V(i)
c       + P1(i)      * V(i+1)
c       + P2(i)      * V(i+i2-1)
c       + P3(i)      * V(i+i2)
c       + P4(i+i2+1) * V(i+i2+1)

where is ~ 200,000 and i2 ~ 1000. The symmetric matrix appears to be 3 diagonal bands of 3 equations, separated by i2 equations.

The approximation appears to be that matrix infill is ignored and a rough solution is accepted, presumably for multiple iterations of the solution / reformulation process. You don't indicate how slow this process is.

This would produce a matrix of 1.6 gb if accurate decomposition was performed (?) Have you tried full decomposition, using a suitable MKL banded solver routine ? This would give you a timing comparison, (which may not be acceptable). Vectorising and multi-threading may provide a 20 x performance, which would probably not approach i2=1000 overhead. Still, it would be worth knowing.

To improve performance of this algorithm, you may be able to change your arrays to "real*8 P(0:4,IS), D(0:4,IS)" and utilise vector instructions for part of the calculation.

jimdempseyatthecove · ‎01-30-2018

>>"real*8 P(0:4,IS), D(0:4,IS)"

Don't you mean (1:4,IS) or (0:3,IS)? (and align the array)

IOW to have vector width stride per IS increment.

This may improve cache line utilization at the expense of 33% more memory.

Jim Dempsey

John_Campbell · ‎01-30-2018

Jakub,

I also tried to restyle the code, to better understand what was happening. In my attached _a code at lines 84:86 (DO 2 loop), I suspect these lines are out of order and you would not be getting the correct answer, However, as the result is an approximate reduction, it may be hard to tell, assuming I understand what is being performed. Perhaps option _c may be required, based on the general loop ?

Jim, there are 5 Dx() arrays so D(0:4,IS) would be required, although I have not changed the code to see if vector instructions are a possibility. As with all linear equation solvers, it is difficult to split the loops, as d1, d2 and d3 are all inter-related

ZlamalJakub · ‎01-30-2018

Dear Jim and John,

thanks for your effort. I will try your suggestions but it takes me some time. It seems that I do not fully understand whole code and I should study it carefuly and understand what it is doing.

It will not be as easy as I expected to speed it up (may be it is fast in comparison to other methods).

ZlamalJakub · ‎02-20-2018

I have more thoroughly studied my FORTRAN code, set timers to routines of ICCG method and found that most of time is spent in different routine than Cholesky decomposition (incomplete as mentioned John Campbell). I am sorry that Jim and John have spent their time with optimizing this routine due to my mistake. However I have used their suggestions in my optimization of code.

The most time consuming routine was preconditioning i.e. solving system of linear equations (foerward and backward substitution). At the end of my research I recomposed some do loops according to Jim Dempsey suggestions (about cached arrays and removing carry dependencies). I still did not tested one big array D(0:4,IS) instead 5 separed arrays D,...,D4.

Combining paralellizing of do loops and using AVX2 instructions the computation time of ICCG method was decreased by about 20 % which is from my point of view very good result. Greater improvement of calculation speed cannot be probably expected (ICCG method in old FORTRAN code is probably very fast itself).

Using complete Cholesky decomposition did not save the computation time.

I also tested MKL iccg method with ILUT preconditioner but it was much slower than old FORTRAN code.