Re: mkl_sparse_sp2m output order problem when called repeately

may_ka · ‎09-28-2020

Hi MKL team,

the little program below implements the "mkl_sparse_sp2m" routine with two calls using the same arguments.

From the manual I understood that after an initial call using "SPARSE_STAGE_FULL_MULT" to create all necessary structures the routine can be called repeatedly using "SPARSE_STAGE_FINALIZE_MULT" as long as arrays A and B do not change their non-zero pattern (though the values in A and B may change).

In the example below the result from the first call lines up with the result from a dense matmul operation IF the routine "mkl_sparse_order" is called after the "mkl_sparse_sp2m". However, for the second call, using the same A and B, results do not line up as values in C are not ordered. Calling "mkl_sparse_order" has no effect because the second "mkl_sparse_sp2m" call does not seem to affect the values in the column index array (the remain sorted), but put the values into the value array in a manner which aligns with the initial unsorted column index values.

A possible work around might be to always call "mkl_sparse_sp2m" with argument "SPARSE_STAGE_FULL_MULT" after destroying and recreating the result handle, but my understanding is that this would contradict the idea of the two stage approach allowing to reuse memory.

Is this a wanted behavior or a bug?

Find the program below.

BTW: mkl and intel fortran compiler are those shipped with parallel_studio_xe_2020.2.108, operation system is linux, kernel version 5.8.11

Thank a lot

#include "mkl_spblas.f90"
program test
  use mkl_spblas
  use, intrinsic :: iso_fortran_env, only: int64, real64,int32
  use iso_c_binding, only: c_ptr, c_int, c_long_long, c_double,&
    & c_f_pointer
  implicit none
  !!the matrix
  !!0.5,0.0,0.1
  !!0.0,0.4,0.3
  !!0.4,0.1,0.2
  !!colum positions (/1,3,2,3,1,2,3/)
  !!row positions (/1,3,5,8/)
  !!values (/0.5,0.1,0.4,0.3,0.4,0.1,0.2/)
  type(sparse_matrix_t), allocatable :: a,b,c
  type(matrix_descr), allocatable :: a_d,b_d
  integer(int64), allocatable :: rp1(:),rp2(:),rp3(:), cp1(:),cp2(:),cp3(:)
  real(real64), allocatable :: val1(:), val2(:),val3(:)
  integer(int32) :: stat,index,a1,a2
  integer :: nrows,ncols
  integer(int64), pointer, contiguous :: rpp1(:)=>null(),rpp2(:)&
    &=>null(), cp(:)=>null()
  real(real64), pointer, contiguous :: val(:)=>null()
  type(c_ptr) :: rows_start,rows_end,col_indx,values
  real(real64), allocatable :: rto(:,:), rt1(:,:),rex(:,:)
  !!@@@@@@@@@@@@@@@@@
  !!build a dense matrix for comparison
  allocate(rto(3,3))
  rto(1,:)=(/0.5,0.0,0.1/)
  rto(2,:)=(/0.0,0.4,0.3/)
  rto(3,:)=(/0.4,0.1,0.2/)
  rt1=matmul(rto,rto)
  !!@@@@@@@@@@@@@@@@@@@@@@
  !!initialize a an b and descriptors a_d and b_d
  allocate(a,b,c,a_d,b_d)
  a_d%type=sparse_matrix_type_general
  a_d%mode=SPARSE_FILL_MODE_FULL
  a_d%diag=sparse_diag_non_unit
  b_d%type=sparse_matrix_type_general
  b_d%mode=SPARSE_FILL_MODE_FULL
  b_d%diag=sparse_diag_non_unit
  !!@@@@@@@@@@@@@@@@@@@@@@@@@y
  !!create matrices a and b
  rp1=(/1,3,5,8/);rp2=(/1,3,5,8/)
  cp1=(/1,3,2,3,1,2,3/);cp2=(/1,3,2,3,1,2,3/)
  val1=(/0.5,0.1,0.4,0.3,0.4,0.1,0.2/)
  val2=(/0.5,0.1,0.4,0.3,0.4,0.1,0.2/)
  stat=MKL_SPARSE_D_CREATE_CSR(&
    &a,&
    &sparse_index_base_one,&
    &3,&
    &3,&
    &rp1(1:(size(rp1)-1)),&
    &rp1(2:size(rp1)),&
    &cp1,&
    &val1)
  if(stat/=0) call se("err_1")
  stat=MKL_SPARSE_D_CREATE_CSR(&
    &b,&
    &sparse_index_base_one,&
    &3,&
    &3,&
    &rp2(1:(size(rp2)-1)),&
    &rp2(2:size(rp2)),&
    &cp2,&
    &val2)
  if(stat/=0) call se("err_2")
  !!@@@@@@@@@@@@@@@@@@@@@@@@@
  stat=mkl_sparse_sp2m(&
    &opA=sparse_operation_non_transpose,&
    &descrA=a_d,&
    &A=a,&
    &opB=sparse_operation_non_transpose,&
    &descrB=b_d,&
    &B=b,&
    &req=SPARSE_STAGE_FULL_MULT,&
    &C=c)
  if(stat/=0) call se("err_2")
  stat=mkl_sparse_order(c)
  index=sparse_index_base_one
  stat=MKL_SPARSE_D_EXPORT_CSR(&
      &c,&
      &index,&
      &nrows,&
      &ncols,&
      &rows_start,&
      &rows_end,&
      &col_indx,&
      &values)
  if(stat/=0) call se("err_3")
  call c_f_pointer(rows_start,rpp1,[nrows])
  call c_f_pointer(rows_end,rpp2,[nrows])
  call c_f_pointer(col_indx,cp,[rpp2(nrows)-1])
  call c_f_pointer(values,val,[rpp2(nrows)-1])
  write(*,*) "first call"
  write(*,"(*(g0:"",""))") rpp1
  write(*,"(*(g0:"",""))") rpp2
  write(*,"(*(g0:"",""))") cp
  write(*,"(*(g0:"",""))") val
  !!expand the sparse result
  allocate(rex(nrows,ncols),source=0._real64)
  do a1=1,nrows
    do a2=rpp1(a1),rpp2(a1)-1
      rex(a1,cp(a2))=val(a2)
    end do
  end do
  if(maxval(abs(rex-rt1))>10e-12) call se("err_4")
  do a1=1,size(rex,1)
    write(52,"(*(g0:"",""))") rex(a1,:)
  end do
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  !!second call
  stat=mkl_sparse_sp2m(&
    &opA=sparse_operation_non_transpose,&
    &descrA=a_d,&
    &A=a,&
    &opB=sparse_operation_non_transpose,&
    &descrB=b_d,&
    &B=b,&
    &req=SPARSE_STAGE_FINALIZE_MULT,&
    &C=c)
  if(stat/=0) call se("err_2")
  !!@@@@@@@@@@@@@@@@@@
  !!with or without reordering and exporting makes no difference
  stat=mkl_sparse_order(c)
  index=sparse_index_base_one
  stat=MKL_SPARSE_D_EXPORT_CSR(&
      &c,&
      &index,&
      &nrows,&
      &ncols,&
      &rows_start,&
      &rows_end,&
      &col_indx,&
      &values)
  if(stat/=0) call se("err_5")
  call c_f_pointer(rows_start,rpp1,[nrows])
  call c_f_pointer(rows_end,rpp2,[nrows])
  call c_f_pointer(col_indx,cp,[rpp2(nrows)-1])
  call c_f_pointer(values,val,[rpp2(nrows)-1])
  write(*,*) "second call"
  write(*,"(*(g0:"",""))") rpp1
  write(*,"(*(g0:"",""))") rpp2
  write(*,"(*(g0:"",""))") cp
  write(*,"(*(g0:"",""))") val
  !!@@@@@@@@@@@@@@@@@@@@@@@
  !!expand the sparse result
  rex(:,:)=0._real64
  do a1=1,nrows
    do a2=rpp1(a1),rpp2(a1)-1
      rex(a1,cp(a2))=val(a2)
    end do
  end do
  if(maxval(abs(rex-rt1))>10e-12) then
    do a1=1,size(rt1,1)
      write(61,"(*(g0:"",""))") rt1(a1,:)
    end do
    do a1=1,size(rex,1)
      write(62,"(*(g0:"",""))") rex(a1,:)
    end do
    call se("err_6")
  end if
  !!@@@@@@@@@@@@@@@@@@@@
  !!@@@@@@@@@@@@@@@@@@@@@
contains
  subroutine se(c0e)
    character(*), intent(in) :: c0e
    write(*,*) c0e
    call exit(1)
  end subroutine se
end program test

Kirill_V_Intel · ‎09-30-2020

Hi @may_ka,

Currently, the observed behavior is "by design" and expected. The algorthm used in MKL currently can re-use the information about the output pattern only if this information (as well as the patterns of the inputs) has not been tampered by smth like mkl_sparse_order.

If your workflow is such that you first compute, say, 100 matrices and then you want to use them with sorted indices in subsequent computations, the fastest would be to call sp2m so that you get one unordered ja and lots of values (for different values of A and B) and then write yourself a code which will sort ja and swap all arrays with values. Of course, this is a solution outside of IE SpBLAS API.

In other cases I don't see a workaround which would allow avoiding extra memory (say, for creating a copy of column indices for each output matrix) and would be using sp2m.

The problem here is that the current algorithm produces the unsorted column indices. There exist algorithms which produce sorted outputs though or which can do FINALIZE_MULT phase and take the sorted column indices as input.

If the case you mentioned is critical for your application, please let us know. We may open a FR to change the implementation of the FINALIZE_MULT phase to allow the use case you descirbed.

I hope this helps.

Best,
Kirill

may_ka · ‎10-05-2020

Hi Kirill,

thanks for the elaborate response.

In general I am looking for maximum efficiency when implementing repeated matrix multiplications. Avoidable memory allocations or data copying can produce overheads which may render algorithms infeasible which would have been feasible otherwise. As an example, mkl_sparse_sp2m is used repeatedly to produce array C from A*B, where the non-zero patterns in A, B and C do not change, but the values in A and B do. Subsequently C is fed into pardiso for factorization etc. Having a copying/sorting overhead may become a serious overhead depending on the time for other operations in the workflow.

Therefore it would be much appreciated if the inspector executor frame would provide api facilities which produce output readily usable with other mkl facilities. Futher, as elaborated on in another thread already, it would be great if the api contains options for the memory management of array C by the user.

best regards

Kirill_V_Intel · ‎10-09-2020

Hi @may_ka,

I agree with your point about the issue in the described workflow. Also I agree that such a workflow makes sense.

I think there is a way currently to avoid the memory allocations, through the recently introduced graph functionality (with multistage properly functioning in MKL 2020u2, u3 and MKL 2020u4, our most recent MKL release).

What this approach can do: it allows the user to allocate memory for the output. So you can compute C1 = A * B in two-stage where you will allocate memory for all output arrays, then initialize C2 with ia and ja from C1 (it would not do any copying, just take in your arrays) and newly allocate values, change the values of A and B, call mkl_graph_mxm with request = MKL_GRAPH_REQUEST_FILL_ENTRIES and it will compute ja and values for C2. And, what is important, mkl_graph_mxm computes the sorted column indices in the output.

As the patterns of A and B remain the same, the pattern of C1 and C2 will be equal so re-computed ja will be the same.

So, this way of using MKL Graph API allows you to avoid creating unnecessary buffers for ja (you will use the same memory for ia and ja and only allocate new arrays for the values) and get sorted column indices for the result of sparse matrix-matrix multiplication. Do you think it makes sense for your workflow?

If you are interested in looking at an example code, we can ask @Gennady_F_Intel to help us with it. In MKL we currently have an example graphc_mxm_multistage which demonstrated how to use multistage approach with graph APIs.

Best,
Kirill