Bug in mkl_sparse_spmm (?)

Schouten__Doug · ‎06-24-2019

Hello,

I am finding very strange results from mkl_sparse_spmm()

My code looks like below.

The strange thing is:

1. if I pass the routine a "small" pair of sparse matrices (A[m x k] dot B[k x n], m = 40,000, k = 1000, n = 40,000, sparsity is ~ 99% and 65% for B and A, respectively), then the result is correct; by correct, I mean that the diagonal of the resultant sparse matrix has the correct spectrum (known a priori), and element by element comparison to the result from scipy.sparse matrix product is within floating point error

2. however, if I pass it a larger pair of matrices (m ~ 70k, n ~ 70k, similar sparsity) then the result is strange; notably, "number of 0 elements" printed in the code is almost as large as the total number of elements returned (i.e., rows_e[n-1] - nzero ~ few 10k out of O(100M) total), which means that mkl_sparse_spmm is reporting a large number of non-zero elements that are actually set to zero in the output array; the diagonal elements are all crazy, and comparison to scipy.sparse matrix product is waaaay off

Are there any guesses what I might be doing wrong? Here are my compiler settings (gcc v6.3):

CFLAGS = -march=native -m64 -fPIC -O3 -fopenmp -DMKL_ILP64

LDFLAGS = -pthread -L$(EXTERN)/usr/lib -lgomp -lboost_python -lboost_numpy -lpython$(PYVERS) \
-L/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 \
-lmkl_intel_ilp64 -lmkl_core -lmkl_def -lmkl_rt -lmkl_gnu_thread

SYSTEM: MKL version is 2019.4.243; system is Intel(R) Xeon(R) CPU E5-2630 v4 dual socket with 64GB RAM, g++ (GCC) 6.3.0 / CENTOS 7

Output on the Python side:

G shape = (1296, 64827)
Range(G) = [3.76e-06, 4.31]
Sparsity of G = 99.721%
Sparsity of G_g[64827 x 1296] = 0.529
Range of G_g = [-0.00227, 0.000266]

>>>  p = numeric.mkl.sp_mm_csr(m, k, n, G_g.count_nonzero(), G_g.data, G_g.indptr.astype(int), G_g.indices.astype(int), G.count_nonzero(), G.data, G.indptr.astype(int), G.indices.astype(int))
number of 0 elements: 881856376

>>> Rp = csr_matrix(p(), shape=(m,n))
>>> Rp.count_nonzero()
9134960
>>> R = G_g * G
>>> R.count_nonzero()
890991336

## NOTE: 890991336 - 9134960 = 881856376

Somewhat redacted C++ side code is below (I don't think I have removed anything that is relevant ...).

sp_mm_csr(MKL_INT m, MKL_INT k, MKL_INT n, MKL_INT nnz_a, np::ndarray a_vals, np::ndarray a_rows, np::ndarray a_cols, MKL_INT nnz_b, np::ndarray b_vals, np::ndarray b_rows, np::ndarray b_cols) {

      // np::ndarray are boost::python::numpy arrays; omitted code ensures that 
      // all arrays have correct types (np.int64 and double) and dimensions

      mkl_sparse_d_create_csr(
        &csr_a, SPARSE_INDEX_BASE_ZERO, m, k,
        reinterpret_cast<MKL_INT *>(a_rows.get_data()),
        reinterpret_cast<MKL_INT *>(a_rows.get_data()) + 1,
        reinterpret_cast<MKL_INT *>(a_cols.get_data()),
        reinterpret_cast<sp_mm_csr_res::real_t *>(a_vals.get_data()));

      mkl_sparse_d_create_csr(
        &csr_b, SPARSE_INDEX_BASE_ZERO, k, n,
        reinterpret_cast<MKL_INT *>(b_rows.get_data()),
        reinterpret_cast<MKL_INT *>(b_rows.get_data()) + 1,
        reinterpret_cast<MKL_INT *>(b_cols.get_data()),
        reinterpret_cast<sp_mm_csr_res::real_t *>(b_vals.get_data()));

     sparse_matrix_t csr_c = NULL;
     auto status =
        mkl_sparse_spmm(SPARSE_OPERATION_NON_TRANSPOSE, csr_a, csr_b, &csr_c);

    // m_XX variables are all MKL_INT or real_t = double class members

    MKL_INT *rows_b;
    MKL_INT *rows_e;
    mkl_sparse_d_export_csr(csr_c, &m_indexing, &m_nrows, &m_ncols, &rows_b,
                            &rows_e, &m_mat_cols, &m_mat_vals);

    m_nnz = rows_e[m_nrows - 1];

    MKL_INT nzero = 0;
    for(MKL_INT i=0; i < m_nnz; ++i) {
      if ((*(m_mat_vals + i)) == 0) {
        nzero += 1;
      }
    }
    std::cout << "number of 0 elements: " << nzero << std::endl;

    // ... 
}

Gennady_F_Intel · ‎06-24-2019

It might be a threading problem. Could you try to link with -lmkl_sequential dll instead of lmkl_gnu_thread and check if the problem will still exist.

Schouten__Doug · ‎06-25-2019

Thanks for the suggestion; I tried to link to mkl_sequential, but I get the same behaviour ... :(

ldd numeric/mkl/libmkl.so # my library ...
        linux-vdso.so.1 => (0x00007ffcaa7f1000)
        libgomp.so.1 => /opt/sw/gcc6/usr/lib64/libgomp.so.1 (0x00007f87dd9d7000)
        libboost_python.so.1.63.0 => /opt/sw/gcc6/usr/lib/libboost_python.so.1.63.0 (0x00007f87dd78e000)
        libboost_numpy.so.1.63.0 => /opt/sw/gcc6/usr/lib/libboost_numpy.so.1.63.0 (0x00007f87dd582000)
        libpython2.7.so.1.0 => /opt/sw/gcc6/usr/lib/libpython2.7.so.1.0 (0x00007f87dd163000)
        libmkl_intel_ilp64.so => /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/libmkl_intel_ilp64.so (0x00007f87dc6a1000)
        libmkl_core.so => /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/libmkl_core.so (0x00007f87d83cc000)
        libmkl_def.so => /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/libmkl_def.so (0x00007f87d5bab000)
        libmkl_rt.so => /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/libmkl_rt.so (0x00007f87d54c4000)
        libmkl_sequential.so => /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/libmkl_sequential.so (0x00007f87d3f2b000)
        libstdc++.so.6 => /opt/sw/gcc6/usr/lib64/libstdc++.so.6 (0x00007f87d3baa000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f87d38a8000)
        libgcc_s.so.1 => /opt/sw/gcc6/usr/lib64/libgcc_s.so.1 (0x00007f87d3691000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f87d3475000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f87d30a8000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f87d2ea4000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f87d2ca1000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f87d2a99000)
        libz.so.1 => /opt/sw/gcc6/usr/lib/libz.so.1 (0x00007f87d287d000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f87dde17000)

Kirill_V_Intel · ‎06-25-2019

Hello Doug,

You have a link line which looks suspicious to me. In particular I am not sure whether you need any mkl libraries if you're linking mkl_rt. What I'm used to is what is recommended by the MKL Link Line Advisor (but it is not for Python), https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/.

Also, your way of computing the number of actual zeros in the resulting matrix is error-prone, since you're comparing floating point numbers with 0. I'd replace that by a check whether the absolute value is less than some small tolerance.

To understand whether it causes the incorrect result, I'd try to run a spmm example (shipped with mkl) with your matrices as a standalone C code. It would be easy if you can provide your matrices to us in some simple format (say, ia.txt, ja.txt, vala.txt for CSR arrays of matrix A and similar for B) and arrays from scipy with presumably correct values for the output C.

Best,
Kirill

Schouten__Doug · ‎06-25-2019

I understand that "== 0" for floating point numbers is not generally a good way of testing for zero, but if it evaluates to true then the number really is exactly 0; since I am interested in whether there are many exactly 0 entries that should be !=0, regardless of how far from 0 they are, this is a vaild test. Furthermore, when I inspect the matrix diagonals and use the matrix for further calculations, I can see that the results are incorrect, so this is just a post-error diagnosis step. The underlying error is somewhere else.

I looked at the link line advisor and I will update my linking arguments to match this exactly. I will look into uploading the sparse matrices here for others to try, but the attachments might be too big. I will try myself with the standalone program as well.

thanks! Doug

Schouten__Doug · ‎06-25-2019

Ok, this is strange ... https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/ gives me this link line:

-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl

I modify my build, to use this link line:

-Wall -Wl,-z,defs -Werror -fPIC -shared -L/opt/sw/gcc6/usr/lib -lboost_python -lboost_numpy -lpython2.7 \
-L/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl

and recompile/rebuild, and now I get this error when I run my Python script that calls the spmm C++ routine that I wrote (see above):

INTEL MKL ERROR: /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/libmkl_avx2.so: undefined symbol: mkl_sparse_optimize_bsr_trsm_i8.
Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.

So, that is weird ... maybe a clue? Note that the link line advisor doesn't have my exact version of MKL (2019.4, it only goes up to 2019.0).

The matrices are at the links below, in Matrix Market format:

Matrix A: https://drive.google.com/file/d/1qjts4gS1GVZ8r73ZQ5clWUaPhGRvCKCK/view?usp=sharing

Matrix B: https://drive.google.com/file/d/1nl3Ul-PbjcAIlG4O7xgJ6MaURGYDm49n/view?usp=sharing

(trying to calculate A * B)

Schouten__Doug · ‎06-26-2019

Saved the input matrices in Python in Matrix Market format (.mtx files). Wrote a separate test program below. Compiled with:

g++ -std=c++11 -fPIC -I/opt/intel/compilers_and_libraries_2019.4.243/linux/mkl/include \
-DMKL_ILP64 -o test_mkl.o -c test_mkl.cpp
g++ -o test_mkl.exe -L/opt/intel/compilers_and_libraries_2019.4.243/linux/mkl/lib/intel64 \
-Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lpthread -lm -ldl -lgomp test_mkl.o mmio.o

Output of the program is:

Reading matrix A: [===========]
Reading matrix B: [===========]
# elements == 0 is 881856376
DONE

Exactly the same as the original code above; so its not something in the Python/C++ iface I had written.

#include <cstdio>
#include <iostream>

#include "mmio.h"
#include "mkl.h"

int main(int argc, char *argv[])
{
	int ret_code;
	MM_typecode matcode;

	FILE *fA = fopen("A.mtx", "r");
	FILE *fB = fopen("B.mtx", "r");

	if (fA == NULL || fB == NULL) {
		return -1;
	}

	/* ============== A ============== */

	/* peek matrix A .... */

	if (mm_read_banner(fA, &matcode) != 0) {
		std::cout << "could not process header A" << std::endl;
		return -1;
	}


	/* find out size of sparse matrix .... */

	int im_a, in_a, innz_a;

	if ((ret_code = mm_read_mtx_crd_size(fA, &im_a, &in_a, &innz_a)) != 0)
		return ret_code;

	MKL_INT m_a = im_a;
	MKL_INT n_a = in_a;
	MKL_INT nnz_a = innz_a;

	/* reserve memory for matrices */

	MKL_INT* irow_a = new MKL_INT[nnz_a];
	MKL_INT* icol_a = new MKL_INT[nnz_a];
	double* val_a = new double[nnz_a];

	/* fill the arrays */

	std::cout << "Reading matrix A: [" << std::flush;

	for (MKL_INT i = 0; i < nnz_a; i++)
	{
		fscanf(fA, "%d %d %lg\n", &irow_a, &icol_a, &val_a);
		irow_a--;  /* adjust from 1-based to 0-based */
		icol_a--;
		if (i % (nnz_a / 10) == 0) {
			std::cout << "=" << std::flush;
		}
	}
	std::cout << "]" << std::endl;

	sparse_matrix_t coo_a = NULL;
	if (SPARSE_STATUS_SUCCESS != mkl_sparse_d_create_coo(
	            &coo_a, SPARSE_INDEX_BASE_ZERO, m_a, n_a,
	            nnz_a, irow_a, icol_a, val_a)) {
		return -1;
	}

	sparse_matrix_t csr_a = NULL;
	if (SPARSE_STATUS_SUCCESS != mkl_sparse_convert_csr(coo_a, SPARSE_OPERATION_NON_TRANSPOSE, &csr_a)) {
		return -1;
	}


	/* ============== B ============== */

	/* peek matrix B .... */

	if (mm_read_banner(fB, &matcode) != 0) {
		std::cout << "could not process header A" << std::endl;
		return -1;
	}

	/* find out size of sparse matrix .... */

	int im_b, in_b, innz_b;

	if ((ret_code = mm_read_mtx_crd_size(fB, &im_b, &in_b, &innz_b)) != 0)
		return ret_code;

	MKL_INT m_b = im_b;
	MKL_INT n_b = in_b;
	MKL_INT nnz_b = innz_b;

	/* reserve memory for matrices */

	MKL_INT* irow_b = new MKL_INT[nnz_b];
	MKL_INT* icol_b = new MKL_INT[nnz_b];
	double* val_b = new double[nnz_b];

	/* fill the arrays */

	std::cout << "Reading matrix B: [" << std::flush;

	for (MKL_INT i = 0; i < nnz_b; i++)
	{
		fscanf(fB, "%d %d %lg\n", &irow_b, &icol_b, &val_b);
		irow_b--;  /* adjust from 1-based to 0-based */
		icol_b--;
		if (i % (nnz_b / 10) == 0) {
			std::cout << "=" << std::flush;
		}
	}
	std::cout << "]" << std::endl;

	sparse_matrix_t coo_b = NULL;
	if (SPARSE_STATUS_SUCCESS != mkl_sparse_d_create_coo(
	            &coo_b, SPARSE_INDEX_BASE_ZERO, m_b, n_b,
	            nnz_b, irow_b, icol_b, val_b)) {
		return -1;
	}

	sparse_matrix_t csr_b = NULL;
	if (SPARSE_STATUS_SUCCESS != mkl_sparse_convert_csr(coo_b, SPARSE_OPERATION_NON_TRANSPOSE, &csr_b)) {
		return -1;
	}

	sparse_matrix_t csr_c = NULL;
	auto status =
	    mkl_sparse_spmm(SPARSE_OPERATION_NON_TRANSPOSE, csr_a, csr_b, &csr_c);
	if (status != SPARSE_STATUS_SUCCESS) {
		switch (status) {
		case SPARSE_STATUS_ALLOC_FAILED:
			std::cerr << "mkl error: memory allocation fault" << std::endl;
			return -1;
			break;
		case SPARSE_STATUS_EXECUTION_FAILED:
			std::cerr << "mkl error: execution fault" << std::endl;
			return -1;
			break;
		case SPARSE_STATUS_INVALID_VALUE:
			std::cerr << "mkl error: invalid value" << std::endl;
			return -1;
			break;
		default:
			break;
		}
	}

	if (csr_c == NULL || status != SPARSE_STATUS_SUCCESS) {
		std::cerr << "unknown mkl error" << std::endl;
		return -1;
	}


	sparse_index_base_t indexing_c;
	MKL_INT nrows_c, ncols_c;
	MKL_INT nnz_c;
	MKL_INT *rows_c;
	MKL_INT *cols_c;
	double *vals_c;
	
	MKL_INT *rows_beg;
	MKL_INT *rows_end;

	mkl_sparse_d_export_csr(csr_c, &indexing_c, &nrows_c, &ncols_c, &rows_beg, &rows_end, &cols_c, &vals_c);

	rows_c = new MKL_INT[nrows_c + 1];
	for (MKL_INT i = 0; i < nrows_c; ++i) {
		rows_c = rows_beg;
	}
	rows_c[nrows_c] = rows_end[nrows_c - 1];
	nnz_c = rows_c[nrows_c];

	MKL_INT nzero = 0;
	for (MKL_INT i = 0; i < nnz_c; ++i) {
		if ((*(vals_c + i)) == 0.) {
			nzero += 1;
		}
	}
	std::cout << "# elements == 0 is " << nzero << std::endl;
	std::cout << "DONE" << std::endl;
	return 0;
}

Schouten__Doug · ‎07-08-2019

Anyone?

Kirill_V_Intel · ‎07-09-2019

Hi Doug,

Sorry for the late response. I've looked at your example (managed to run it, thanks for making a standalone reproducer) but I actually see the reported number of zeros equal to 0 with the latest internal version of MKL as well as for the future MKL 2019u5 (but 882472701 for MKL 2019u4). So, maybe it has been fixed already.

Not to bother about comparison with 0.0, could you specify which entry of the output matrix is incorrect? You said earlier that some of diagonal values are wrong. can you specify which ones (actually, one entry is enough).

Best,
Kirill