Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

mkl_sparse_spmm gives wrong results when a small matrix multiplies a big matrix

yan__zixi
Beginner
1,045 Views

Hello,

I am trying to multiply two matrices.

Matrix A: 1 row, 7180 columns, 1341 nonzero elements.

Matrix B: 7180 rows, 10001 columns, 372623 nonzero elements.

But a segmentation fault occurs when calling mkl_sparse_spmm.

The error occurs when calling mkl_sparse_spmm with both matrices in CSR format. Both matrices appear to be in the correct format when I print them via mkl_sparse_s_export_csr, although they are too big for me to verify this by hand.
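
Since both matrices are too large to check by eye, the structural part of that check can be automated. The following is a minimal sketch and not part of the original post: `validate_csr` is a hypothetical helper (not an MKL function) that assumes zero-based, 3-array CSR data with 64-bit indices and verifies that the row pointers are monotone and that every column index is in range.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not an MKL function): structural check of a
// zero-based, 3-array CSR matrix. Returns an empty string if the
// structure is consistent, otherwise a description of the first problem.
std::string validate_csr(std::int64_t rows, std::int64_t cols,
                         const std::vector<std::int64_t>& row_ptr,
                         const std::vector<std::int64_t>& col_idx,
                         std::size_t num_values) {
    if (rows < 0 || cols < 0) return "negative dimension";
    if (row_ptr.size() != static_cast<std::size_t>(rows) + 1)
        return "row_ptr must have rows + 1 entries";
    if (row_ptr.front() != 0) return "row_ptr[0] must be 0";
    for (std::int64_t r = 0; r < rows; ++r)
        if (row_ptr[r] > row_ptr[r + 1])
            return "row_ptr decreases at row " + std::to_string(r);
    // row_ptr is monotone and starts at 0, so the last entry is the nnz.
    const std::int64_t nnz = row_ptr.back();
    if (col_idx.size() != static_cast<std::size_t>(nnz) ||
        num_values != static_cast<std::size_t>(nnz))
        return "col_idx/values length differs from row_ptr[rows]";
    for (std::int64_t k = 0; k < nnz; ++k)
        if (col_idx[k] < 0 || col_idx[k] >= cols)
            return "column index out of range at position " + std::to_string(k);
    return "";
}
```

Running such a check on both inputs before the multiplication at least rules out malformed index arrays as the cause of a crash inside the library.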

However, I cannot reproduce it with a simple case.

Any idea what the problem is? Is there any useful information?

Thank you very much in advance.

Zixi

The stack trace is as follows:
[kudu57:09298] *** Process received signal ***
[kudu57:09298] Signal: Segmentation fault (11)
[kudu57:09298] Signal code:  (128)
[kudu57:09298] Failing at address: (nil)
[kudu57:09298] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fe587bbe390]
[kudu57:09298] [ 1] /Program/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_avx2.so(mkl_sparse_s_csr__g_n_spmm_notr_row_i8_avx2+0x822)[0x7fe5729cefc2]
[kudu57:09298] [ 2] /Program/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so(+0x597153)[0x7fe58c70f153]
[kudu57:09298] [ 3] /Program/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so(+0x598815)[0x7fe58c710815]
[kudu57:09298] [ 4] /Program/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so(mkl_sparse_s_csr__g_n_spmm_i8+0x76b)[0x7fe58c710f9b]
[kudu57:09298] [ 5] /Program/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_avx2.so(mkl_sparse_s_do_spmm_i8_avx2+0x47b)[0x7fe5728d14fb]
[kudu57:09298] [ 6] ./libsparse[0x40dd8f]
[kudu57:09298] [ 7] ./libsparse[0x40ccf3]
[kudu57:09298] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fe586423830]
[kudu57:09298] [ 9] ./libsparse[0x40d209]
[kudu57:09298] *** End of error message ***
Segmentation fault (core dumped)

Compile options:
-std=c++17 -DMKL_ILP64 -m64 -I${MKLROOT}/include -Wall -fopenmp
-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core 
-lgomp -lpthread -lm -ldl -lboost_system -lboost_filesystem -lmpi

 

Khang_N_Intel
Employee

Hi Zixi,

 

Can you provide the test code so that we can reproduce the issue?

 

Thanks,

Khang

yan__zixi
Beginner

Nguyen, Khang T (Intel) wrote:

> Hi Zixi,
>
> Can you provide the test code so that we can reproduce the issue?
>
> Thanks,
> Khang

Hello Khang,

The original code is too complicated, so I don't think posting it here would help.

So I implemented a simpler version. It differs somewhat from the original, but I think it reproduces part of the original problem.

This simpler version also crashes with a core dump inside mkl_sparse_spmm when built with g++ 5.4.0, but works fine with g++ 7.3.0.

However, the original code also hits the error with g++ 7.3.0.

What's more, this error seems to occur only when a small matrix multiplies a big matrix. Matrix A in the data has only one row.
When I duplicate matrix A to 1000 rows, the error does not happen. But the error still happens when I duplicate matrix A to only a few rows.

I think it is very strange.

The MKL version is 2019 Update 1. I also tried 2018.4, but it fails in the same way.

OS: Ubuntu 16.04

The code and matrix data are attached below.

Thanks,
Zixi

yan__zixi
Beginner

Update:

Although the debug code above does not cause a segmentation fault when compiled with g++ 7.3.0, its result is incorrect.

I also rewrote it in C and compiled it with gcc 5.4.0. It does not cause a segmentation fault, but its result is also incorrect. The result of the C code is the same as the result of the original C++ debug code under g++ 7.3.0.

This is quite strange.

Here is the C code.

Thanks,

Zixi

 

yan__zixi
Beginner

Hi all,

Is there any progress?

Kirill_V_Intel
Employee

Hello, yan!

Sorry for the late reply. Could you explain once more what exactly is not working? I managed to run your C example, and it worked fine. How did you determine that the result was incorrect? Did the difference come from compiling with different compilers?

Unfortunately, your previous posts did not give me a clear understanding of which setup works and which does not.

Best,
Kirill

 

yan__zixi
Beginner

Hello Kirill,

Thanks for the response.

I suppose you ran test_example.c (with the data in debug.tar.xz) and found no segmentation fault.

The problem is that the matrix-c calculated by mkl_sparse_spmm is wrong.

It is printed by test_example.c at the end of its output. Part of matrix-c is correct, but some elements are clearly missing, and there are many meaningless trailing zeros in the values and col_indx arrays.
The full log and the correct matrix-c are attached.

Some more information:

1. At first, a program hit a segmentation fault that seemed to be due to a memory problem, but that program is too complicated, so I simplified the situation to test_example.c. I believe the matrix-c error and the program's segmentation fault have the same cause.

2. I found that, except for the values and col_indx arrays, all other results are correct. So my personal guess is that mkl_sparse_spmm splits the matrix multiplication into two steps (calculating the number of nonzeros, then calculating the values), as many other algorithms do, and that something goes wrong in step 2.
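
The two-step structure guessed at in point 2 is a common textbook pattern for sparse matrix products. The sketch below illustrates it with a plain row-by-row (Gustavson-style) product. It is not MKL's implementation, and `Csr` and `spgemm_two_phase` are hypothetical names chosen for illustration. Phase 1 only counts the distinct columns in each row of C so the output can be sized; phase 2 fills `col_idx` and `values`. In a scheme like this, any slot that phase 2 failed to write would keep its initial zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Zero-based 3-array CSR storage (hypothetical struct for illustration).
struct Csr {
    std::int64_t rows;
    std::int64_t cols;
    std::vector<std::int64_t> row_ptr;  // size rows + 1
    std::vector<std::int64_t> col_idx;  // size nnz
    std::vector<float> values;          // size nnz
};

// C = A * B in two phases. Column indices within a row of C come out in
// first-touched order, so they are not necessarily sorted.
Csr spgemm_two_phase(const Csr& A, const Csr& B) {
    Csr C{};
    C.rows = A.rows;
    C.cols = B.cols;
    C.row_ptr.assign(A.rows + 1, 0);

    // Phase 1 (symbolic): count distinct columns touched in each row of C.
    std::vector<std::int64_t> marker(B.cols, -1);
    for (std::int64_t i = 0; i < A.rows; ++i) {
        std::int64_t count = 0;
        for (std::int64_t p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
            const std::int64_t k = A.col_idx[p];
            for (std::int64_t q = B.row_ptr[k]; q < B.row_ptr[k + 1]; ++q) {
                const std::int64_t j = B.col_idx[q];
                if (marker[j] != i) { marker[j] = i; ++count; }
            }
        }
        C.row_ptr[i + 1] = C.row_ptr[i] + count;
    }

    // Phase 2 (numeric): allocate zero-filled output and accumulate values.
    C.col_idx.assign(C.row_ptr[A.rows], 0);
    C.values.assign(C.row_ptr[A.rows], 0.0f);
    std::vector<std::int64_t> slot(B.cols, -1);  // position of column j in C
    for (std::int64_t i = 0; i < A.rows; ++i) {
        std::int64_t next = C.row_ptr[i];
        for (std::int64_t p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
            const std::int64_t k = A.col_idx[p];
            const float a = A.values[p];
            for (std::int64_t q = B.row_ptr[k]; q < B.row_ptr[k + 1]; ++q) {
                const std::int64_t j = B.col_idx[q];
                if (slot[j] < C.row_ptr[i]) {  // first touch in this row
                    slot[j] = next++;
                    C.col_idx[slot[j]] = j;
                }
                C.values[slot[j]] += a * B.values[q];
            }
        }
    }
    return C;
}
```

For a 1 x 3 matrix A = [1 0 2] and a 3 x 3 matrix B with rows [0 3 0], [0 0 0], [4 5 0], this produces the single row [8 13 0] of C, stored with column 1 before column 0.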

Thanks,

Zixi

Kirill_V_Intel
Employee

Hello, yan!

Now I see.

1) Keep in mind that trailing zeros are in general OK; they can appear because of the particular algorithm used to implement spmm.

2) mkl_sparse_spmm is not guaranteed to return sorted output; that is, the column indices within one row may be unsorted. To sort them, call mkl_sparse_order.
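
When comparing an spmm result against a reference, both points can also be applied to the exported arrays directly. The sketch below is plain C++ rather than MKL code: `CsrArrays` and `canonicalize_csr` are hypothetical names, and it assumes the zeros are explicitly stored entries inside the rows. It sorts the column indices within each row and drops stored zeros, so that two CSR representations of the same matrix compare equal.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Zero-based 3-array CSR data (hypothetical struct for illustration).
struct CsrArrays {
    std::vector<std::int64_t> row_ptr;
    std::vector<std::int64_t> col_idx;
    std::vector<float> values;
};

// Returns a copy with explicit zeros removed and the column indices of
// every row in ascending order.
CsrArrays canonicalize_csr(const CsrArrays& in) {
    CsrArrays out;
    out.row_ptr.push_back(0);
    const std::size_t rows = in.row_ptr.empty() ? 0 : in.row_ptr.size() - 1;
    for (std::size_t r = 0; r < rows; ++r) {
        // Collect the nonzero (column, value) pairs of this row.
        std::vector<std::pair<std::int64_t, float>> entries;
        for (std::int64_t k = in.row_ptr[r]; k < in.row_ptr[r + 1]; ++k)
            if (in.values[k] != 0.0f)
                entries.emplace_back(in.col_idx[k], in.values[k]);
        // Pairs sort by column first, which orders the row by column index.
        std::sort(entries.begin(), entries.end());
        for (const auto& e : entries) {
            out.col_idx.push_back(e.first);
            out.values.push_back(e.second);
        }
        out.row_ptr.push_back(static_cast<std::int64_t>(out.col_idx.size()));
    }
    return out;
}
```

After canonicalizing both the computed and the reference matrix, the three arrays can be compared element by element (with a tolerance on the values if the summation order may differ).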

I took both points into account and got the same matrix C as in matrix-c from your archive (if I understood you correctly, matrix-c contains the correct matrix values). The modified test example is attached.

I hope this helps.

Best,
Kirill

yan__zixi
Beginner

Hello Kirill!

I ran your test code and got the same matrix C except for the zeros. The log printed in my environment is attached. As you can see, the matrix C is not correct: it is missing some nonzero elements.

Can you attach the log that test.cpp prints in your environment? If your matrix C is correct, I guess the difference may be due to the environment or the compilation.

Environment: MKL 2019.2, gcc 5.4.0, Linux 4.9.20-040920-generic

The makefile is also attached.

What's more, I think the trailing zeros are abnormal, because in this case all elements of the correct matrix C are nonzero, and the result is missing some nonzero elements.

Note that the number of zeros plus the number of nonzeros equals the number of nonzeros in the correct matrix C.

A reasonable explanation is that the zeros in matrix C should actually be nonzero elements.
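
This counting argument can be checked mechanically for a single row. The following is a minimal sketch with a hypothetical helper, `missing_columns`, which assumes that every entry of the reference row is nonzero (as stated above). It lists the reference columns that have no nonzero entry in the computed row; if the explanation is right, the number of such columns equals the number of zero-valued slots in the computed row.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns, in ascending order, the columns that hold
// a nonzero in the reference row but are absent or zero in the computed row.
std::vector<std::int64_t> missing_columns(
        const std::vector<std::int64_t>& ref_cols,
        const std::vector<std::int64_t>& got_cols,
        const std::vector<float>& got_vals) {
    // Columns that actually carry a nonzero value in the computed row.
    std::unordered_set<std::int64_t> present;
    for (std::size_t k = 0; k < got_cols.size(); ++k)
        if (got_vals[k] != 0.0f) present.insert(got_cols[k]);

    std::vector<std::int64_t> missing;
    for (std::int64_t c : ref_cols)
        if (present.find(c) == present.end()) missing.push_back(c);
    std::sort(missing.begin(), missing.end());
    return missing;
}
```

Comparing `missing_columns(...).size()` with the count of zeros in the computed values array tests the hypothesis directly.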

Thanks,

Zixi

Kirill_V_Intel
Employee

Hello Zixi,

It's really strange, I'll try your setup. I've attached my output.

Best,
Kirill

 

yan__zixi
Beginner

Hello Kirill,

Thanks for your patience.

 

Here is some information that may help.

1. Environment details:

  glibc version: (Ubuntu GLIBC 2.23-0ubuntu10) 2.23

  gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

  Ubuntu 16.04.5 LTS

  MKL: 2019 Update 2

2. I also tried verbose mode, but I don't see the call lines that you do. Does it matter?

 

Thanks,

Zixi

Kirill_V_Intel
Employee

Hello, Zixi,

It's fine that your output is a bit different, I was looking at my local version which is different from MKL 2019u2.

I managed to reproduce the issue. We'll investigate it.
Thank you very much for all the information you gave.

 

Best,
Kirill

 

Gennady_F_Intel
Moderator

Thanks for reporting this case. This is a bug, and it has been escalated for further investigation. We will keep this post updated with the status of the issue.

yan__zixi
Beginner

Hello all,

Can you please tell me under what conditions this bug occurs? Then maybe I can work around it with some tricks. Or does it only happen in a specific environment?

Thanks,

Zixi

Kirill_V_Intel
Employee

Hello Zixi,

I'm afraid we cannot say anything useful until we investigate it properly. At the moment I don't think you can get rid of it by changing the environment, but this is just my feeling.

What are you doing with the product of the matrices? As a temporary workaround, you could avoid mkl_sparse_spmm altogether and not form the product explicitly (say, instead of C * x, compute A * (B * x)). I understand that this is not the best option, but it might be a quick temporary fix.
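
The workaround relies on associativity: (A * B) * x equals A * (B * x), so the product C never has to be formed. A minimal sketch of the idea in plain C++ follows; `CsrMatrix`, `spmv`, and `apply_product` are hypothetical names for illustration, and with MKL itself the two steps would presumably be two sparse matrix-vector calls such as `mkl_sparse_s_mv`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Zero-based 3-array CSR storage (hypothetical struct for illustration).
struct CsrMatrix {
    std::int64_t rows;
    std::int64_t cols;
    std::vector<std::int64_t> row_ptr;  // size rows + 1
    std::vector<std::int64_t> col_idx;  // size nnz
    std::vector<float> values;          // size nnz
};

// y = M * x for a CSR matrix M and a dense vector x of length M.cols.
std::vector<float> spmv(const CsrMatrix& M, const std::vector<float>& x) {
    std::vector<float> y(M.rows, 0.0f);
    for (std::int64_t i = 0; i < M.rows; ++i)
        for (std::int64_t k = M.row_ptr[i]; k < M.row_ptr[i + 1]; ++k)
            y[i] += M.values[k] * x[M.col_idx[k]];
    return y;
}

// Computes (A * B) * x as A * (B * x), without ever forming C = A * B.
std::vector<float> apply_product(const CsrMatrix& A, const CsrMatrix& B,
                                 const std::vector<float>& x) {
    return spmv(A, spmv(B, x));
}
```

This costs two matrix-vector products per application instead of one, which is usually acceptable when the product is only applied to a few vectors.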

Thanks,
Kirill

Gennady_F_Intel
Moderator

Zixi, the fix for this issue is available in MKL 2019 u5, which we released last week. Please try this update and let us know how it works on your side.

yan__zixi
Beginner

Thanks for the update.

I will update here once I get the result.
