Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.
6536 Discussions

omp threading of sparse matrix operation does not yield expected seed up. reason?


Hi all,

I have implemented a kind of a gauss seidel algorithm for a sparse upper triangular coefficient matrix, where multiplication of a vector with a the sparse matrix (forward propagation) is a core feature. For some application I have to do this process for several independent vectors (memory location), where the spares coefficient matrix remains unchanged (read only access). Therefore I thought this process might be executable in parallel  depending on the number of vectors. I expected an almost linear increase in speed as long as the number of vectors does not exceed  the number of cores. However, I observed only a marginal increase in speed. Here is the code:

Module mod_tmp
  Type :: csrmatrixuppertri
    integer :: nrows, ncols
    integer, allocatable :: colpos(:), rowpos(:)
    real, allocatable :: value(:)
  end type csr
  Subroutine SubGlobal(a,c)
    implicit none
    Type(csr), intent(in) :: c
    Real, Intent(inout) :: a(:,:)
    Integer :: isnt, c1
    !$ isnt=minval((/omp_get_max_threads(),size(a,2)/))
    !$OMP PARALLEL DO PRIVATE(c1) num_threads(isnt)
    Do c1=1,size(a,2)
      call sublocal(a=a(:,c1),c=c)
    End Do
  End Subroutine SubGlobal
  Subroutine SubLocal(av,c)
    Implicit none
    Type(csr), intent(in) :: c
    Real, Intent(inout) :: av(:)
    Integer :: c2, offdiagstart, offdiagend, diag
    Do c2=1,c%nrows
      !!get locations in csr
      !!calculate a new value
      !!forward propagate
    End do
  End Subroutine SubLocal
End Module mod_tmp

Threading sets in in "SubGlobal" by calling "SubLocal" as many times as "a" as columns, operations are carried out in "SubLocal". All writeable memory in "SubLocal" is contiguous and should accessible only be one thread. The csr matrix should be public to all threads, but only readable.

Using a matrix "a" of dimension "a(c%nrows,10), I am wondering why I don't see a decrease in processing time to roughly 1/10 when going from single core to 10 cores (with 36 cores being available). I suspect that the shared read is hampering the speed, but I might be wrong. Giving every thread a local copy of "c" is not feasible due to memory limitations (c is actually >100GB).

Any suggestions are highly appreciated.


0 Kudos
0 Replies