yan__zixi · ‎01-15-2019

Hello,

I am trying to multiply two sparse matrices.

Matrix A : 1 row & 7180 columns & 1341 elements.

Matrix B : 7180 rows & 10001 columns & 372623 elements.

But a segmentation fault occurs when calling SpMM.

The error occurs when calling mkl_sparse_spmm with both matrices in CSR format. Both matrices appear to be in the correct format when I print them (via mkl_sparse_s_export_csr), although they are too big for me to confirm this fully.
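For matrices too large to inspect by eye, the structural part of such a check can be automated. Below is a minimal sketch, not part of the original post, assuming zero-based indexing, the 3-array CSR layout (one row pointer array with rows + 1 entries), and 64-bit indices to match the -DMKL_ILP64 build. The function name csr_validate and its return codes are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if the CSR structure is consistent, otherwise a nonzero code
 * naming the first failed check. Assumes zero-based, 3-array CSR. */
int csr_validate(int64_t rows, int64_t cols,
                 const int64_t *row_ptr, const int64_t *col_idx)
{
    if (rows < 0 || cols < 0 || row_ptr == NULL) return 1;
    if (row_ptr[0] != 0) return 2;                  /* must start at 0 */
    for (int64_t i = 0; i < rows; ++i) {
        if (row_ptr[i + 1] < row_ptr[i]) return 3;  /* must be non-decreasing */
    }
    int64_t nnz = row_ptr[rows];
    if (nnz > 0 && col_idx == NULL) return 4;
    for (int64_t k = 0; k < nnz; ++k) {
        /* every column index must address a real column */
        if (col_idx[k] < 0 || col_idx[k] >= cols) return 5;
    }
    return 0;
}
```

Running this on the arrays of both A and B before the multiplication would rule out out-of-range indices as the cause of the crash.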

However, I cannot reproduce it in a simple case.

Any idea what the problem is? Is there any useful information?

Thank you very much in advance.

Zixi

The stack trace is as follows:
[kudu57:09298] *** Process received signal ***
[kudu57:09298] Signal: Segmentation fault (11)
[kudu57:09298] Signal code:  (128)
[kudu57:09298] Failing at address: (nil)
[kudu57:09298] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fe587bbe390]
[kudu57:09298] [ 1] /Program/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_avx2.so(mkl_sparse_s_csr__g_n_spmm_notr_row_i8_avx2+0x822)[0x7fe5729cefc2]
[kudu57:09298] [ 2] /Program/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so(+0x597153)[0x7fe58c70f153]
[kudu57:09298] [ 3] /Program/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so(+0x598815)[0x7fe58c710815]
[kudu57:09298] [ 4] /Program/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_gnu_thread.so(mkl_sparse_s_csr__g_n_spmm_i8+0x76b)[0x7fe58c710f9b]
[kudu57:09298] [ 5] /Program/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_avx2.so(mkl_sparse_s_do_spmm_i8_avx2+0x47b)[0x7fe5728d14fb]
[kudu57:09298] [ 6] ./libsparse[0x40dd8f]
[kudu57:09298] [ 7] ./libsparse[0x40ccf3]
[kudu57:09298] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fe586423830]
[kudu57:09298] [ 9] ./libsparse[0x40d209]
[kudu57:09298] *** End of error message ***
Segmentation fault (core dumped)

Compile options:
-std=c++17 -DMKL_ILP64 -m64 -I${MKLROOT}/include -Wall -fopenmp
-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core 
-lgomp -lpthread -lm -ldl -lboost_system -lboost_filesystem -lmpi

Khang_N_Intel · ‎01-15-2019

Hi Zixi,

Can you provide the test code so that we can reproduce the issue?

Thanks,

Khang

yan__zixi · ‎01-16-2019

Nguyen, Khang T (Intel) wrote:
Hi Zixi,

Can you provide the test code so that we can reproduce the issue?

Thanks,
Khang

Hello Khang,

The original code is too complicated, so I don't think posting it here would help.

So I implemented a simpler version. It differs somewhat from the original, but I think it reproduces part of its behavior.

This simpler version also crashes (core dump) inside mkl_sparse_spmm under G++ 5.4.0 but works fine under G++ 7.3.0.

However, the error also occurs with the original code under G++ 7.3.0.

What's more, this error seems to occur only when a small matrix is multiplied by a big matrix. The matrix A in the data has only one row.
When I duplicate matrix A to 1000 rows, the error does not happen. But the error still happens when I duplicate matrix A to only a few rows.

I think it is very strange.

The MKL version is 2019 Update 1. I also tried 2018.4, but it does not work either.

OS: ubuntu 16.04

The code and matrix data are attached below.

Thanks,
Zixi

yan__zixi · ‎01-18-2019

Update:

Although the debug code above does not cause a segmentation fault when compiled with G++ 7.3.0, its result is incorrect.

I also rewrote it in C and compiled it with gcc 5.4.0. It does not cause a segmentation fault, but its result is also incorrect. The result of the C code is the same as the result of the original C++ debug code under G++ 7.3.0.

This is quite strange.

Here is the code using C.

Thanks,

Zixi

yan__zixi · ‎02-14-2019

Hi all,

Is there any progress?

Kirill_V_Intel · ‎02-15-2019

Hello, yan!

Sorry for the late reply. Could you explain once more what exactly is not working? I managed to run your C example, and it worked fine. How did you determine that the result was incorrect? Did the difference come from compiling with different compilers?

Unfortunately, I didn't get a clear understanding of the setup which works and the setup which doesn't from your previous posts.

Best,
Kirill

yan__zixi · ‎02-16-2019

Hello Kirill,

Thanks for the response.

I suppose you ran test_example.c (with the data in debug.tar.xz) and found no segmentation fault.

The problem is that the matrix-c calculated by mkl_sparse_spmm is wrong.

test_example.c prints it at the end of its output. Part of matrix-c is correct, but some elements are clearly missing, and there are many meaningless trailing zeros in the values and col_indx arrays.
The full log and the correct matrix-c are attached.

Some more information:

1. At first, a program had a segmentation fault that seemed to be due to a memory problem, but that program is too complicated, so I reduced the situation to test_example.c. I believe the matrix-c error and the program's segmentation fault have the same cause.

2. I found that, except for the values and col_indx arrays, all other results are correct. So my personal guess is that mkl_sparse_spmm splits the matrix multiplication into 2 steps (calculating the number of nonzeros, then calculating the values), as many other algorithms do, and that something goes wrong in step 2.
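To illustrate the two-step scheme guessed at in point 2, here is a minimal sketch of a generic sparse product in plain C. It is not MKL's implementation, which is not described in this thread; it is a textbook row-by-row variant that could also serve as an independent reference for checking matrix-c. The csr_t struct and the csr_spgemm name are hypothetical, and zero-based 3-array CSR with 64-bit indices is assumed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical zero-based 3-array CSR container, used only in this sketch. */
typedef struct {
    int64_t rows, cols;
    int64_t *row_ptr;  /* rows + 1 entries */
    int64_t *col_idx;  /* row_ptr[rows] entries */
    float   *val;      /* row_ptr[rows] entries */
} csr_t;

/* Computes C = A * B in two passes. Returns 0 on success, -1 on a shape
 * mismatch or allocation failure. The caller frees the three arrays of C. */
int csr_spgemm(const csr_t *a, const csr_t *b, csr_t *c)
{
    if (a->cols != b->rows) return -1;
    const int64_t m = a->rows, n = b->cols;
    int64_t nnz = 0;
    int64_t *col_idx = NULL;
    float *val = NULL;
    int64_t *mark = malloc((size_t)(n + 1) * sizeof *mark);
    float *acc = malloc((size_t)(n + 1) * sizeof *acc);
    int64_t *row_ptr = malloc((size_t)(m + 1) * sizeof *row_ptr);
    if (!mark || !acc || !row_ptr) goto fail;

    /* Step 1: count the distinct columns touched in each row of C. */
    for (int64_t j = 0; j < n; ++j) mark[j] = -1;
    row_ptr[0] = 0;
    for (int64_t i = 0; i < m; ++i) {
        int64_t count = 0;
        for (int64_t p = a->row_ptr[i]; p < a->row_ptr[i + 1]; ++p) {
            int64_t k = a->col_idx[p];
            for (int64_t q = b->row_ptr[k]; q < b->row_ptr[k + 1]; ++q) {
                int64_t j = b->col_idx[q];
                if (mark[j] != i) { mark[j] = i; ++count; }
            }
        }
        row_ptr[i + 1] = row_ptr[i] + count;
    }

    nnz = row_ptr[m];
    col_idx = malloc((size_t)(nnz + 1) * sizeof *col_idx);
    val = malloc((size_t)(nnz + 1) * sizeof *val);
    if (!col_idx || !val) goto fail;

    /* Step 2: fill in column indices and accumulate the values. */
    for (int64_t j = 0; j < n; ++j) mark[j] = -1;
    for (int64_t i = 0; i < m; ++i) {
        int64_t next = row_ptr[i];
        for (int64_t p = a->row_ptr[i]; p < a->row_ptr[i + 1]; ++p) {
            int64_t k = a->col_idx[p];
            for (int64_t q = b->row_ptr[k]; q < b->row_ptr[k + 1]; ++q) {
                int64_t j = b->col_idx[q];
                if (mark[j] != i) {      /* first touch of column j in row i */
                    mark[j] = i;
                    col_idx[next++] = j;
                    acc[j] = 0.0f;
                }
                acc[j] += a->val[p] * b->val[q];
            }
        }
        for (int64_t t = row_ptr[i]; t < next; ++t) val[t] = acc[col_idx[t]];
    }

    free(mark);
    free(acc);
    c->rows = m; c->cols = n;
    c->row_ptr = row_ptr; c->col_idx = col_idx; c->val = val;
    return 0;

fail:
    free(mark); free(acc); free(row_ptr); free(col_idx); free(val);
    return -1;
}
```

Note that this sketch emits column indices in first-touch order, so the rows of C are generally unsorted, and an entry that cancels to zero is still stored. If step 1 were right and step 2 skipped some accumulations, the row pointers would be correct while values stayed zero, which would match the symptom described.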

Thanks,

Zixi

Kirill_V_Intel · ‎02-16-2019

Hello, yan!

Now I see.

1) Keep in mind that trailing zeros are in general OK; they can appear because of the particular algorithm used to implement spmm.

2) mkl_sparse_spmm is not guaranteed to return sorted output; that is, the column indices within one row may be unsorted. To sort them, one should call mkl_sparse_order.
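When comparing exported arrays against a reference outside of MKL, the same normalization can be done by hand. The following is a minimal sketch, assuming zero-based 3-array CSR, 64-bit indices and single-precision values; csr_sort_rows is a hypothetical helper, not an MKL routine, and it uses a simple insertion sort for brevity.

```c
#include <stdint.h>

/* Sorts the entries of every row by column index, moving each value together
 * with its column index. Insertion sort: fine for a check, slow for long rows. */
void csr_sort_rows(int64_t rows, const int64_t *row_ptr,
                   int64_t *col_idx, float *val)
{
    for (int64_t i = 0; i < rows; ++i) {
        const int64_t begin = row_ptr[i], end = row_ptr[i + 1];
        for (int64_t p = begin + 1; p < end; ++p) {
            const int64_t c = col_idx[p];
            const float v = val[p];
            int64_t q = p - 1;
            /* shift larger column indices one slot to the right */
            while (q >= begin && col_idx[q] > c) {
                col_idx[q + 1] = col_idx[q];
                val[q + 1] = val[q];
                --q;
            }
            col_idx[q + 1] = c;
            val[q + 1] = v;
        }
    }
}
```

After sorting both the computed and the reference matrix this way, the rows can be compared entry by entry.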

I took both points into account and got the same matrix C as in matrix-c from your archive (if I understood you correctly, matrix-c contains the correct matrix values). The modified test example is attached.

I hope this helps.

Best,
Kirill

yan__zixi · ‎02-16-2019

Hello Kirill!

I ran your test code and got the same matrix C except for the zeros. The log printed in my environment is attached. As you can see, the matrix C is not correct. It is missing some nonzero elements.

Can you attach the log printed by test.cpp in your environment? If your matrix C is correct, I guess the difference may be due to the environment or the compilation.

Environment: mkl 2019.2, gcc 5.4.0, Linux 4.9.20-040920-generic

The makefile is also attached.

What's more, I think the trailing zeros are abnormal, because in this case all elements of the correct matrix C are nonzero, and the computed result is missing some nonzero elements.

Notice that the number of zeros + the number of nonzeros = the number of nonzeros in the correct matrix C.

A reasonable explanation is that the zeros in matrix C should have been nonzero elements.
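The tally above is easy to reproduce on the exported values array. A minimal sketch, assuming single-precision values and that nnz is the number of stored entries reported for C; the function name is hypothetical.

```c
#include <stdint.h>

/* Counts stored entries of a CSR values array that are exactly zero. */
int64_t csr_count_stored_zeros(const float *val, int64_t nnz)
{
    int64_t zeros = 0;
    for (int64_t k = 0; k < nnz; ++k) {
        if (val[k] == 0.0f) ++zeros;
    }
    return zeros;
}
```

If the returned count plus the number of nonzero stored entries equals the nonzero count of the reference matrix, the structure size was computed correctly and only the values were left unfilled.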

Thanks,

Zixi

Kirill_V_Intel · ‎02-17-2019

Hello Zixi,

It's really strange, I'll try your setup. I've attached my output.

Best,
Kirill

yan__zixi · ‎02-17-2019

Hello Kirill,

Thanks for your patience.

Here is some information that may help.

1. Environment details:

glibc version: (Ubuntu GLIBC 2.23-0ubuntu10) 2.23

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

Ubuntu 16.04.5 LTS

MKL: 2019 Update 2

2. I also tried verbose mode, but I don't see the call lines that you do. Does it matter?

Thanks,

Zixi

Kirill_V_Intel · ‎02-17-2019

Hello, Zixi,

It's fine that your output is a bit different, I was looking at my local version which is different from MKL 2019u2.

I managed to reproduce the issue. We'll investigate it.
Thank you very much for all the information you gave.

Best,
Kirill

Gennady_F_Intel · ‎02-17-2019

Thanks for reporting this case. This is a bug, and it has been escalated for further investigation. We will keep this post updated with the status of this issue.

yan__zixi · ‎02-21-2019

Hello all,

Can you please tell me under what conditions this bug occurs? Then maybe I can avoid it with some tricks. Or does it happen only in a specific environment?

Thanks,

Zixi

Kirill_V_Intel · ‎02-21-2019

Hello Zixi,

I'm afraid we cannot say anything useful until we investigate it properly. At the moment I don't think you can get rid of it by changing the environment, but this is just my feeling.

What are you doing with the product of the matrices? As a temporary workaround, you could possibly avoid mkl_sparse_spmm and simply not form this product (say, instead of C * x, compute A * (B * x)). I understand that it is not the best option, but it might be a quick temporary fix.
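The suggested workaround can be sketched in plain C as two sparse matrix-vector products. This is only an illustration under assumptions: zero-based 3-array CSR, 64-bit indices, single precision, and hypothetical names (csr_t, csr_matvec, csr_apply_product). With MKL handles, each product would presumably go through a sparse matrix-vector routine such as mkl_sparse_s_mv instead.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical zero-based 3-array CSR container, used only in this sketch. */
typedef struct {
    int64_t rows, cols;
    const int64_t *row_ptr;  /* rows + 1 entries */
    const int64_t *col_idx;  /* row_ptr[rows] entries */
    const float   *val;      /* row_ptr[rows] entries */
} csr_t;

/* y = M * x for a CSR matrix M; x has M.cols entries, y has M.rows entries. */
static void csr_matvec(const csr_t *m, const float *x, float *y)
{
    for (int64_t i = 0; i < m->rows; ++i) {
        float sum = 0.0f;
        for (int64_t p = m->row_ptr[i]; p < m->row_ptr[i + 1]; ++p) {
            sum += m->val[p] * x[m->col_idx[p]];
        }
        y[i] = sum;
    }
}

/* y = A * (B * x) without ever forming C = A * B.
 * Returns 0 on success, -1 on a shape mismatch or allocation failure. */
int csr_apply_product(const csr_t *a, const csr_t *b,
                      const float *x, float *y)
{
    if (a->cols != b->rows) return -1;
    float *tmp = malloc((size_t)(b->rows + 1) * sizeof *tmp);
    if (!tmp) return -1;
    csr_matvec(b, x, tmp);   /* tmp = B * x */
    csr_matvec(a, tmp, y);   /* y   = A * tmp */
    free(tmp);
    return 0;
}
```

This only helps when C is needed solely for products with vectors; if the entries of C themselves are required, the product still has to be formed some other way.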

Thanks,
Kirill

Gennady_F_Intel · ‎09-15-2019

Zixi, the fix for this issue is available in MKL 2019 u5, which we released last week. Please check this update and let us know how it works on your side.

yan__zixi · ‎09-28-2019

Thanks for the update.

I will update here once I get the result.

mkl_sparse_spmm wrong when small matrix multiply big matrix