omp threading of sparse matrix operation does not yield expected seed up. reason?

may_ka · ‎09-18-2018

Hi all,

I have implemented a kind of a gauss seidel algorithm for a sparse upper triangular coefficient matrix, where multiplication of a vector with a the sparse matrix (forward propagation) is a core feature. For some application I have to do this process for several independent vectors (memory location), where the spares coefficient matrix remains unchanged (read only access). Therefore I thought this process might be executable in parallel depending on the number of vectors. I expected an almost linear increase in speed as long as the number of vectors does not exceed the number of cores. However, I observed only a marginal increase in speed. Here is the code:

Module mod_tmp
  Type :: csrmatrixuppertri
    integer :: nrows, ncols
    integer, allocatable :: colpos(:), rowpos(:)
    real, allocatable :: value(:)
  end type csr
contains
  Subroutine SubGlobal(a,c)
    implicit none
    Type(csr), intent(in) :: c
    Real, Intent(inout) :: a(:,:)
    Integer :: isnt, c1
    !$ isnt=minval((/omp_get_max_threads(),size(a,2)/))
    !$OMP PARALLEL DO PRIVATE(c1) num_threads(isnt)
    Do c1=1,size(a,2)
      call sublocal(a=a(:,c1),c=c)
    End Do
    !$OMP END PARALLEL DO
  End Subroutine SubGlobal
  Subroutine SubLocal(av,c)
    Implicit none
    Type(csr), intent(in) :: c
    Real, Intent(inout) :: av(:)
    Integer :: c2, offdiagstart, offdiagend, diag
    Do c2=1,c%nrows
      !!get locations in csr
      diag=c%rowpos(c2)
      offdiagstart=c%rowpos(c2)+1
      offdiagend=c%rowpos(c2+1)-1
      !!calculate a new value
      av(c2)=av(c2)/c%value(diag)
      !!forward propagate
      av(c%colpos(offdiagstart:offdiagend))=&
        &av(c%colpos(offdiagstart:offdiagend))+&
        &c%value(offdiagstart:offdiagend)*(av(c2))
    End do
  End Subroutine SubLocal
End Module mod_tmp

Threading sets in in "SubGlobal" by calling "SubLocal" as many times as "a" as columns, operations are carried out in "SubLocal". All writeable memory in "SubLocal" is contiguous and should accessible only be one thread. The csr matrix should be public to all threads, but only readable.

Using a matrix "a" of dimension "a(c%nrows,10), I am wondering why I don't see a decrease in processing time to roughly 1/10 when going from single core to 10 cores (with 36 cores being available). I suspect that the shared read is hampering the speed, but I might be wrong. Giving every thread a local copy of "c" is not feasible due to memory limitations (c is actually >100GB).

Any suggestions are highly appreciated.

Cheers