Solved: mkl_sparse_sp2m problem

may_ka · ‎08-30-2020

Hi all,

the program below implements the two stage sparse matrix multiplication in mkl_sparse_sp2m. After stage 1 the handle of C is destroyed and C is rebuild as a csr matrix with all necessary memory allocated. Then stage 2 is called. For some unknown reason the column pointer of C is all zero after the second stage although I understand the manual such that stage 2 actually generates the column indices. Further, the values in the output array do not seem to be sorted within row ........ which hampers any subsequent use in mkl sparse solvers.

Any comment elucidating the cause of the problem is much appreciated.

Operation system was linux, compiler and mkl version were 2020, update 2, compiler line was

ifort -fpp -i8 -mkl test.f90

Thanks a lot.

#include "mkl_spblas.f90"
program test
  use mkl_spblas
  use, intrinsic :: iso_fortran_env, only: int64, real64,int32
  use iso_c_binding, only: c_ptr, c_int, c_long_long, c_double,&
    & c_f_pointer
  implicit none
  !!the matrix
  !!0.5,0.0,0.1
  !!0.0,0.4,0.3
  !!0.4,0.1,0.2
  !!colum positions (/1,3,2,3,1,2,3/)
  !!row positions (/1,3,5,8/)
  !!values (/0.5,0.1,0.4,0.3,0.4,0.1,0.2/)
  type(sparse_matrix_t), allocatable :: a,b,c
  type(matrix_descr), allocatable :: a_d,b_d
  integer(int64), allocatable :: rp1(:),rp2(:),rp3(:), cp1(:),cp2(:),cp3(:)
  real(real64), allocatable :: val1(:), val2(:),val3(:)
  integer(int32) :: stat,index,a1,a2
  integer :: nrows,ncols
  integer(int64), pointer, contiguous :: rpp1(:)=>null(),rpp2(:)=>null()
  type(c_ptr) :: rows_start,rows_end,col_indx,values
  real(real64), allocatable :: rt(:,:)
  !!@@@@@@@@@@@@@@@@@
  !!build a dense matrix for comparison
  allocate(rt(3,3))
  rt(1,:)=(/0.5,0.0,0.1/)
  rt(2,:)=(/0.0,0.4,0.3/)
  rt(3,:)=(/0.4,0.1,0.2/)
  rt=matmul(rt,rt)
  !!@@@@@@@@@@@@@@@@@@@@@@
  !!initialize a an b and descriptors a_d and b_d
  allocate(a,b,c,a_d,b_d)
  a_d%type=sparse_matrix_type_general
  a_d%mode=SPARSE_FILL_MODE_FULL
  a_d%diag=sparse_diag_non_unit
  b_d%type=sparse_matrix_type_general
  b_d%mode=SPARSE_FILL_MODE_FULL
  b_d%diag=sparse_diag_non_unit
  !!@@@@@@@@@@@@@@@@@@@@@@@@@y
  !!create matrices a and b
  rp1=(/1,3,5,8/);rp2=(/1,3,5,8/)
  cp1=(/1,3,2,3,1,2,3/);cp2=(/1,3,2,3,1,2,3/)
  val1=(/0.5,0.1,0.4,0.3,0.4,0.1,0.2/)
  val2=(/0.5,0.1,0.4,0.3,0.4,0.1,0.2/)
  stat=MKL_SPARSE_D_CREATE_CSR(&
    &a,&
    &sparse_index_base_one,&
    &3,&
    &3,&
    &rp1(1:(size(rp1)-1)),&
    &rp1(2:size(rp1)),&
    &cp1,&
    &val1)
  if(stat/=0) call se("err_1")
  stat=MKL_SPARSE_D_CREATE_CSR(&
    &b,&
    &sparse_index_base_one,&
    &3,&
    &3,&
    &rp2(1:(size(rp2)-1)),&
    &rp2(2:size(rp2)),&
    &cp2,&
    &val2)
  if(stat/=0) call se("err_2")
  !!@@@@@@@@@@@@@@@@@@@@@@@@@
  !!get nnz
  stat=mkl_sparse_sp2m(&
    &opA=sparse_operation_non_transpose,&
    &descrA=a_d,&
    &A=a,&
    &opB=sparse_operation_non_transpose,&
    &descrB=b_d,&
    &B=b,&
    &req=SPARSE_STAGE_NNZ_COUNT,&
    &C=c)
  if(stat/=0) call se("err_2")
  index=sparse_index_base_one
  stat=MKL_SPARSE_D_EXPORT_CSR(&
      &c,&
      &index,&
      &nrows,&
      &ncols,&
      &rows_start,&
      &rows_end,&
      &col_indx,&
      &values)
  if(stat/=0) call se("err_3")
  call c_f_pointer(rows_start,rpp1,[nrows])
  call c_f_pointer(rows_end,rpp2,[nrows])
  write(*,"(*(g0:"",""))") rpp1
  write(*,"(*(g0:"",""))") rpp2
  !!@@@@@@@@@@@@@@@@@@@@
  !!rebuild a new C with all necessary memory
  stat=mkl_sparse_destroy(c)
  if(stat/=0) call se("err_4")
  deallocate(c);allocate(c)
  allocate(rp3(size(rpp1)+1));rp3(1:size(rpp1))=rpp1
  rp3(size(rp3))=rpp2(size(rpp2))
  allocate(cp3(rpp2(size(rpp2))-1),source=0_int64)
  allocate(val3(rpp2(size(rpp2))-1),source=0._real64)
  stat=MKL_SPARSE_D_CREATE_CSR(&
    &c,&
    &sparse_index_base_one,&
    &3,&
    &3,&
    &rp3(1:(size(rp3)-1)),&
    &rp3(2:size(rp3)),&
    &cp3,&
    &val3)
  if(stat/=0) call se("err_5")
  !!@@@@@@@@@@@@@@@@@@@@@@@
  !!perform multiplication using preallocated memory
  stat=mkl_sparse_sp2m(&
    &opA=sparse_operation_non_transpose,&
    &descrA=a_d,&
    &A=a,&
    &opB=sparse_operation_non_transpose,&
    &descrB=b_d,&
    &B=b,&
    &req=SPARSE_STAGE_FINALIZE_MULT,&
    &C=c)
  if(stat/=0) call se("err_6")
  !!column pointer contains zeros only
  write(*,"(*(g0:"",""))") cp3
  !!result vector seems to be unordered within row
  write(*,"(*(g0:"",""))") val3
  !!correct result from dense matrix operation
  write(*,"(*(g0:"",""))") ((rt(a1,a2),a2=1,size(rt,2)),a1=1,size(rt,1))
  !!@@@@@@@@@@@@@@@@@@@@@
contains
  subroutine se(c0e)
    character(*), intent(in) :: c0e
    write(*,*) c0e
    call exit(1)
  end subroutine se
end program test

Kirill_V_Intel · ‎08-31-2020

Hello @may_ka,

Current API of sp2m does not allow your usage model. Specifically, it is not allowed to allocate memory for the output matrix on the side of the user.
This is why you see that column indices are not computed in your example since sp2m detects an already existing pointer for the column indices (which is the one you allocated). Also, if you allow MKL to allocate the memory for the output (remove your export, destroy and create_csr calls inbetween the phases), then the results will be the same for parallel vs sequential code and the same which you called "correct".

The API of sp2m however allows you to recompute the values if the structure remain unchanged (via SPARSE_STAGE_FINALIZE_MULT).

Notes:
1) I looked at the docs and would agree that it is not clear, we will try to improve it.
2) We already have a request for allowing the user to allocate all memory for the output matrix. You can submit a feature request if you want to increase priority of this task.
3) If you were a C user, I would suggest looking at mkl_graph_mxm https://software.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/graph-routines/graph-functionality/graph-operations/mkl-graph-mxm.html which covers most of the functionality of mkl_sparse_sp2m and allows the user to allocate the output memory, via a multistage executon. It is supported starting with oneMKL beta-08 (and the coming MKL 2020u3).

I hope this helps. Let me know if you have further questions/comments.

Best,
Kirill

View solution in original post

may_ka · ‎08-30-2020

Further to my above post, the above program yields different results when linked against threaded and non-threaded version of the mkl. with makefile

INTELMKL=/opt/intel/compilers_and_libraries_2020.2.254/linux/
MKLROOT=$(INTELMKL)/mkl
MKLPATH=$(MKLROOT)/lib/intel64
MKLPARA=$(MKLPATH)/libmkl_blas95_ilp64.a \
	$(MKLPATH)/libmkl_lapack95_ilp64.a -Wl,--start-group \
	$(MKLPATH)/libmkl_intel_ilp64.a \
	$(MKLPATH)/libmkl_core.a \
	$(MKLPATH)/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -ldl

MKLSEQ=$(MKLPATH)/libmkl_blas95_ilp64.a \
	$(MKLPATH)/libmkl_lapack95_ilp64.a -Wl,--start-group \
	$(MKLPATH)/libmkl_intel_ilp64.a \
	$(MKLPATH)/libmkl_core.a \
	$(MKLPATH)/libmkl_sequential.a -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -lm -ldl

source:=test.f90
all: para seq
para:
	ifort -i8 -mkl=parallel -fpp -o para test.f90 $(MKLPARA)
seq:
	ifort -i8 -mkl=sequential -fpp -o seq test.f90 $(MKLSEQ)
clean:
	rm para seq

version "seq" yields a result vector (the value vector of C) of:

.2900000011920929
.7000000134110451E-01
.1000000029802323E01
.3100000129640104
.1800000098347665
.000000000000000
.3400000071525574
.1100000040233136
.000000000000000

whereas version "para" yields:

.2900000011920929
.7000000134110451E-01
.1000000029802323E01
.1900000064074994
.1800000098347665
.1200000065565110
.2800000053644180
.1100000040233136
.6000000178813936E-01

where the latter lines up with the results from dense multiplication.

Kirill_V_Intel · ‎08-31-2020

Hello @may_ka,

Current API of sp2m does not allow your usage model. Specifically, it is not allowed to allocate memory for the output matrix on the side of the user.
This is why you see that column indices are not computed in your example since sp2m detects an already existing pointer for the column indices (which is the one you allocated). Also, if you allow MKL to allocate the memory for the output (remove your export, destroy and create_csr calls inbetween the phases), then the results will be the same for parallel vs sequential code and the same which you called "correct".

The API of sp2m however allows you to recompute the values if the structure remain unchanged (via SPARSE_STAGE_FINALIZE_MULT).

Notes:
1) I looked at the docs and would agree that it is not clear, we will try to improve it.
2) We already have a request for allowing the user to allocate all memory for the output matrix. You can submit a feature request if you want to increase priority of this task.
3) If you were a C user, I would suggest looking at mkl_graph_mxm https://software.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/graph-routines/graph-functionality/graph-operations/mkl-graph-mxm.html which covers most of the functionality of mkl_sparse_sp2m and allows the user to allocate the output memory, via a multistage executon. It is supported starting with oneMKL beta-08 (and the coming MKL 2020u3).

I hope this helps. Let me know if you have further questions/comments.

Best,
Kirill

may_ka · ‎09-03-2020

Thanks for the explanation.

Gennady_F_Intel · ‎09-03-2020

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only