Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- Software
- Software Development SDKs and Libraries
- Intel® oneAPI Math Kernel Library
- mkl_sparse_sp2m output order problem when called repeately

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

may_ka

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-28-2020
11:40 PM

312 Views

mkl_sparse_sp2m output order problem when called repeately

Hi MKL team,

the little program below implements the "mkl_sparse_sp2m" routine with two calls using the same arguments.

From the manual I understood that after an initial call using "SPARSE_STAGE_FULL_MULT" to create all necessary structures the routine can be called repeatedly using "SPARSE_STAGE_FINALIZE_MULT" as long as arrays A and B do not change their non-zero pattern (though the values in A and B may change).

In the example below the result from the first call lines up with the result from a dense matmul operation IF the routine "mkl_sparse_order" is called after the "mkl_sparse_sp2m". However, for the second call, using the same A and B, results do not line up as values in C are not ordered. Calling "mkl_sparse_order" has no effect because the second "mkl_sparse_sp2m" call does not seem to affect the values in the column index array (the remain sorted), but put the values into the value array in a manner which aligns with the initial unsorted column index values.

A possible work around might be to always call "mkl_sparse_sp2m" with argument "SPARSE_STAGE_FULL_MULT" after destroying and recreating the result handle, but my understanding is that this would contradict the idea of the two stage approach allowing to reuse memory.

Is this a wanted behavior or a bug?

Find the program below.

BTW: mkl and intel fortran compiler are those shipped with parallel_studio_xe_2020.2.108, operation system is linux, kernel version 5.8.11

Thank a lot

```
#include "mkl_spblas.f90"
program test
use mkl_spblas
use, intrinsic :: iso_fortran_env, only: int64, real64,int32
use iso_c_binding, only: c_ptr, c_int, c_long_long, c_double,&
& c_f_pointer
implicit none
!!the matrix
!!0.5,0.0,0.1
!!0.0,0.4,0.3
!!0.4,0.1,0.2
!!colum positions (/1,3,2,3,1,2,3/)
!!row positions (/1,3,5,8/)
!!values (/0.5,0.1,0.4,0.3,0.4,0.1,0.2/)
type(sparse_matrix_t), allocatable :: a,b,c
type(matrix_descr), allocatable :: a_d,b_d
integer(int64), allocatable :: rp1(:),rp2(:),rp3(:), cp1(:),cp2(:),cp3(:)
real(real64), allocatable :: val1(:), val2(:),val3(:)
integer(int32) :: stat,index,a1,a2
integer :: nrows,ncols
integer(int64), pointer, contiguous :: rpp1(:)=>null(),rpp2(:)&
&=>null(), cp(:)=>null()
real(real64), pointer, contiguous :: val(:)=>null()
type(c_ptr) :: rows_start,rows_end,col_indx,values
real(real64), allocatable :: rto(:,:), rt1(:,:),rex(:,:)
!!@@@@@@@@@@@@@@@@@
!!build a dense matrix for comparison
allocate(rto(3,3))
rto(1,:)=(/0.5,0.0,0.1/)
rto(2,:)=(/0.0,0.4,0.3/)
rto(3,:)=(/0.4,0.1,0.2/)
rt1=matmul(rto,rto)
!!@@@@@@@@@@@@@@@@@@@@@@
!!initialize a an b and descriptors a_d and b_d
allocate(a,b,c,a_d,b_d)
a_d%type=sparse_matrix_type_general
a_d%mode=SPARSE_FILL_MODE_FULL
a_d%diag=sparse_diag_non_unit
b_d%type=sparse_matrix_type_general
b_d%mode=SPARSE_FILL_MODE_FULL
b_d%diag=sparse_diag_non_unit
!!@@@@@@@@@@@@@@@@@@@@@@@@@y
!!create matrices a and b
rp1=(/1,3,5,8/);rp2=(/1,3,5,8/)
cp1=(/1,3,2,3,1,2,3/);cp2=(/1,3,2,3,1,2,3/)
val1=(/0.5,0.1,0.4,0.3,0.4,0.1,0.2/)
val2=(/0.5,0.1,0.4,0.3,0.4,0.1,0.2/)
stat=MKL_SPARSE_D_CREATE_CSR(&
&a,&
&sparse_index_base_one,&
&3,&
&3,&
&rp1(1:(size(rp1)-1)),&
&rp1(2:size(rp1)),&
&cp1,&
&val1)
if(stat/=0) call se("err_1")
stat=MKL_SPARSE_D_CREATE_CSR(&
&b,&
&sparse_index_base_one,&
&3,&
&3,&
&rp2(1:(size(rp2)-1)),&
&rp2(2:size(rp2)),&
&cp2,&
&val2)
if(stat/=0) call se("err_2")
!!@@@@@@@@@@@@@@@@@@@@@@@@@
stat=mkl_sparse_sp2m(&
&opA=sparse_operation_non_transpose,&
&descrA=a_d,&
&A=a,&
&opB=sparse_operation_non_transpose,&
&descrB=b_d,&
&B=b,&
&req=SPARSE_STAGE_FULL_MULT,&
&C=c)
if(stat/=0) call se("err_2")
stat=mkl_sparse_order(c)
index=sparse_index_base_one
stat=MKL_SPARSE_D_EXPORT_CSR(&
&c,&
&index,&
&nrows,&
&ncols,&
&rows_start,&
&rows_end,&
&col_indx,&
&values)
if(stat/=0) call se("err_3")
call c_f_pointer(rows_start,rpp1,[nrows])
call c_f_pointer(rows_end,rpp2,[nrows])
call c_f_pointer(col_indx,cp,[rpp2(nrows)-1])
call c_f_pointer(values,val,[rpp2(nrows)-1])
write(*,*) "first call"
write(*,"(*(g0:"",""))") rpp1
write(*,"(*(g0:"",""))") rpp2
write(*,"(*(g0:"",""))") cp
write(*,"(*(g0:"",""))") val
!!expand the sparse result
allocate(rex(nrows,ncols),source=0._real64)
do a1=1,nrows
do a2=rpp1(a1),rpp2(a1)-1
rex(a1,cp(a2))=val(a2)
end do
end do
if(maxval(abs(rex-rt1))>10e-12) call se("err_4")
do a1=1,size(rex,1)
write(52,"(*(g0:"",""))") rex(a1,:)
end do
!!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
!!second call
stat=mkl_sparse_sp2m(&
&opA=sparse_operation_non_transpose,&
&descrA=a_d,&
&A=a,&
&opB=sparse_operation_non_transpose,&
&descrB=b_d,&
&B=b,&
&req=SPARSE_STAGE_FINALIZE_MULT,&
&C=c)
if(stat/=0) call se("err_2")
!!@@@@@@@@@@@@@@@@@@
!!with or without reordering and exporting makes no difference
stat=mkl_sparse_order(c)
index=sparse_index_base_one
stat=MKL_SPARSE_D_EXPORT_CSR(&
&c,&
&index,&
&nrows,&
&ncols,&
&rows_start,&
&rows_end,&
&col_indx,&
&values)
if(stat/=0) call se("err_5")
call c_f_pointer(rows_start,rpp1,[nrows])
call c_f_pointer(rows_end,rpp2,[nrows])
call c_f_pointer(col_indx,cp,[rpp2(nrows)-1])
call c_f_pointer(values,val,[rpp2(nrows)-1])
write(*,*) "second call"
write(*,"(*(g0:"",""))") rpp1
write(*,"(*(g0:"",""))") rpp2
write(*,"(*(g0:"",""))") cp
write(*,"(*(g0:"",""))") val
!!@@@@@@@@@@@@@@@@@@@@@@@
!!expand the sparse result
rex(:,:)=0._real64
do a1=1,nrows
do a2=rpp1(a1),rpp2(a1)-1
rex(a1,cp(a2))=val(a2)
end do
end do
if(maxval(abs(rex-rt1))>10e-12) then
do a1=1,size(rt1,1)
write(61,"(*(g0:"",""))") rt1(a1,:)
end do
do a1=1,size(rex,1)
write(62,"(*(g0:"",""))") rex(a1,:)
end do
call se("err_6")
end if
!!@@@@@@@@@@@@@@@@@@@@
!!@@@@@@@@@@@@@@@@@@@@@
contains
subroutine se(c0e)
character(*), intent(in) :: c0e
write(*,*) c0e
call exit(1)
end subroutine se
end program test
```

Link Copied

3 Replies

Kirill_V_Intel

Employee

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-30-2020
08:45 PM

289 Views

Hi @may_ka,

Currently, the observed behavior is "by design" and expected. The algorthm used in MKL currently can re-use the information about the output pattern only if this information (as well as the patterns of the inputs) has not been tampered by smth like mkl_sparse_order.

If your workflow is such that you first compute, say, 100 matrices and then you want to use them with sorted indices in subsequent computations, the fastest would be to call sp2m so that you get one unordered ja and lots of values (for different values of A and B) and then write yourself a code which will sort ja and swap all arrays with values. Of course, this is a solution outside of IE SpBLAS API.

In other cases I don't see a workaround which would allow avoiding extra memory (say, for creating a copy of column indices for each output matrix) and would be using sp2m.

The problem here is that the current algorithm produces the unsorted column indices. There exist algorithms which produce sorted outputs though or which can do FINALIZE_MULT phase and take the sorted column indices as input.

If the case you mentioned is critical for your application, **please let us know**. We may open a FR to change the implementation of the FINALIZE_MULT phase to allow the use case you descirbed.

I hope this helps.

Best,

Kirill

may_ka

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

10-05-2020
07:18 AM

266 Views

Hi Kirill,

thanks for the elaborate response.

In general I am looking for maximum efficiency when implementing repeated matrix multiplications. Avoidable memory allocations or data copying can produce overheads which may render algorithms infeasible which would have been feasible otherwise. As an example, mkl_sparse_sp2m is used repeatedly to produce array C from A*B, where the non-zero patterns in A, B and C do not change, but the values in A and B do. Subsequently C is fed into pardiso for factorization etc. Having a copying/sorting overhead may become a serious overhead depending on the time for other operations in the workflow.

Therefore it would be much appreciated if the inspector executor frame would provide api facilities which produce output readily usable with other mkl facilities. Futher, as elaborated on in another thread already, it would be great if the api contains options for the memory management of array C by the user.

best regards

Kirill_V_Intel

Employee

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

10-09-2020
01:04 PM

254 Views

Hi @may_ka,

I agree with your point about the issue in the described workflow. Also I agree that such a workflow makes sense.

I think there is a way currently to avoid the memory allocations, through the recently introduced graph functionality (with multistage properly functioning in MKL 2020u2, u3 and MKL 2020u4, our most recent MKL release).

What this approach can do: it allows the user to allocate memory for the output. So you can compute C1 = A * B in two-stage where you will allocate memory for all output arrays, then initialize C2 with ia and ja from C1 (it would not do any copying, just take in your arrays) and newly allocate values, change the values of A and B, call mkl_graph_mxm with request = MKL_GRAPH_REQUEST_FILL_ENTRIES and it will compute ja and values for C2. And, what is important, mkl_graph_mxm computes the sorted column indices in the output.

As the patterns of A and B remain the same, the pattern of C1 and C2 will be equal so re-computed ja will be the same.

So, this way of using MKL Graph API allows you to avoid creating unnecessary buffers for ja (you will use the same memory for ia and ja and only allocate new arrays for the values) and get sorted column indices for the result of sparse matrix-matrix multiplication. Do you think it makes sense for your workflow?

If you are interested in looking at an example code, we can ask @Gennady_F_Intel to help us with it. In MKL we currently have an example graphc_mxm_multistage which demonstrated how to use multistage approach with graph APIs.

Best,

Kirill

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

For more complete information about compiler optimizations, see our Optimization Notice.