Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Segfault in multithreaded dcsrmv

asd__asdqwe
Beginner
425 Views

Hello,

I have a weird problem in the attached code. With OMP_NUM_THREADS=1, there is no segmentation fault. With OMP_NUM_THREADS>1, it segfaults unless I uncomment lines 56 to 60 (i.e., unless I first compute Ax and then A^T x'). Do you see where the problem might be? I'm using icpc version 13.0.0 (gcc version 4.7.0 compatibility) on a Debian machine. Could this be linked to my other, currently unresolved problem (http://software.intel.com/en-us/forums/topic/336409)?

Thanks in advance for your help.

Compiler is called as follows: icpc dcsrmv_segfault.cpp -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5

LDD returns:

        linux-vdso.so.1 =>  (0x00007fffe37ff000)
        libmkl_intel_lp64.so => XXX/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007fc6f1c4a000)
        libmkl_intel_thread.so => XXX/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_intel_thread.so (0x00007fc6f0cd5000)
        libmkl_core.so => XXX/composer_xe_2013.0.079/mkl/lib/intel64/libmkl_core.so (0x00007fc6efada000)
        libiomp5.so => XXX/composer_xe_2013.0.079/compiler/lib/intel64/libiomp5.so (0x00007fc6ef7e2000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fc6ef54b000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc6ef244000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc6ef02e000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc6eeca6000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc6eeaa2000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc6ee886000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fc6f2398000)

Edit: new .tar.gz

0 Kudos
11 Replies
Gennady_F_Intel
Moderator
I checked how it works on Win7, 64-bit, and in all cases, even when the sequential version of MKL was used, I saw a lot of NaN values in the output. Please check whether the CSR representation is correct.
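A structural check of the CSR arrays, as suggested above, could look like the following minimal sketch. It assumes the zero-based, 3-array CSR variant (one row-pointer array with rows + 1 entries); `check_csr` and its parameter names are hypothetical and are not part of MKL or of the attached program.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an MKL routine): checks a zero-based, 3-array CSR
// matrix with `rows` rows and `cols` columns. Returns an empty string if the
// structure is consistent, otherwise a description of the first problem found.
std::string check_csr(int rows, int cols,
                      const std::vector<int>& row_ptr,
                      const std::vector<int>& col_idx,
                      const std::vector<double>& values) {
    assert(rows >= 0 && cols >= 0);  // precondition on the dimensions
    if (row_ptr.size() != static_cast<std::size_t>(rows) + 1)
        return "row_ptr must have rows + 1 entries";
    if (row_ptr.front() != 0)
        return "row_ptr[0] must be 0 for zero-based indexing";
    if (col_idx.size() != values.size())
        return "col_idx and values must have the same length";
    if (static_cast<std::size_t>(row_ptr.back()) != values.size())
        return "row_ptr[rows] must equal the number of stored entries";
    for (int i = 0; i < rows; ++i)
        if (row_ptr[i] > row_ptr[i + 1])
            return "row_ptr must be non-decreasing";
    for (int c : col_idx)
        if (c < 0 || c >= cols)
            return "column index out of range";
    return "";
}
```

An empty return value only means the index arrays are self-consistent; it says nothing about the numerical values or about how a particular library interprets the arrays.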
asd__asdqwe
Beginner
Hello,

If you take my program "as is" and simply add a printf at the end of main(), of course you will get NaN, since 'x' and 'xT' are uninitialized arrays. This is just a test program I wrote to upload here; I don't really care about the results stored in 'x' and 'xT' in this piece of code. I uploaded a new version that initializes both working vectors, and I don't see NaN anymore.

To be honest, I don't think my CSR is wrong: I'm using this kind of matrix in a much larger production code, and with OMP_NUM_THREADS=1 the results are correct.

By the way, I just got a little more than "Segmentation fault". Here is the stderr:

        a.out: malloc.c:3096: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
Gennady_F_Intel
Moderator
Quote: "I uploaded a new version that initializes both working vectors (and I don't see NaN anymore)..."

Where is the new version of your code?
asd__asdqwe
Beginner
It is updated in my first post (dcsrmv-initialized.tar.gz), sorry if I wasn't clear enough.
asd__asdqwe
Beginner
Hello. Is anyone able to reproduce my problem?
Gennady_F_Intel
Moderator
The updated example works without problems on my system with MKL 11.0 update 1. The attached log shows the results I got.
asd__asdqwe
Beginner
Hello,

Are you sure this works even with OMP_NUM_THREADS > 1? As far as I can see, I'm not the only one having this issue, cf. http://software.intel.com/en-us/forums/topic/344909

Here is my backtrace.

OMP_NUM_THREADS=1:

        [New Thread 10154 (LWP 10154)]
        n = 5100 nz = 29748 I.n = 5101 I(I.n) = 29748 5100 x 9915
        Program exited normally.

OMP_NUM_THREADS=2:

        [New Thread 1095 (LWP 1095)]
        n = 5100 nz = 29748 I.n = 5101 I(I.n) = 29748 5100 x 9915
        [New Thread 1272 (LWP 1272)]
        [New Thread 1273 (LWP 1273)]
        Program received signal SIGSEGV
        mkl_spblas_lp64_mc3_dcsr0tg__c__mvout_par () in XXX/mkl/lib/intel64/libmkl_mc3.so
        (idb) bt
        #0 0x00002ba77859d90f in mkl_spblas_lp64_mc3_dcsr0tg__c__mvout_par () in XXX/mkl/lib/intel64/libmkl_mc3.so
        #1 0x00002ba772978eac in mkl_spblas_lp64_dcsr0tg__c__mvout_omp () in XXX/mkl/lib/intel64/libmkl_intel_thread.so
        #2 0x00002ba7729791b1 in mkl_spblas_lp64_dcsr0tg__c__mvout_omp () in XXX/mkl/lib/intel64/libmkl_intel_thread.so
        #3 0x00002ba772793cdc in mkl_spblas_lp64_mkl_dcsrmv () in XXX/mkl/lib/intel64/libmkl_intel_thread.so
        #4 0x00002ba7736e3d19 in mkl_dcsrmv () in XXX/mkl/lib/intel64/libmkl_rt.so
        #5 0x0000000000405f73 in main () at XXX/dcsrmv_segfault.cpp:67
        #6 0x000000328141ecdd in __libc_start_main () in /lib64/libc-2.12.so

In both cases, mkl_spblas_lp64_dcsr0tg__c__mvout_omp seems to go wrong (the other thread concerns dcsCmv, but it seems to call some dcsRmv subroutines).

Thanks for your help.

By the way, on another piece of code, on Sandy Bridge-E, I also get segfaults in libmkl_avx.so(mkl_spblas_lp64_avx_dcsr0tg__c__mvout_par+0x281). I think there is something wrong in dcsr0tg, whether with SSE or AVX instructions (once again, the problem doesn't show up when OMP_NUM_THREADS=1).
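Because the crash appears only with more than one thread, a plain serial reference for Ax and A^T x can help cross-check whatever a threaded run returns. The following is a minimal sketch under the assumption of zero-based, 3-array CSR storage; `csr_matvec` and `csr_matvec_transposed` are hypothetical names, not MKL routines, and the code is not taken from the attached program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference y = A * x for a zero-based CSR matrix with `rows` rows and
// `cols` columns. x must have `cols` entries; the result has `rows` entries.
std::vector<double> csr_matvec(int rows, int cols,
                               const std::vector<int>& row_ptr,
                               const std::vector<int>& col_idx,
                               const std::vector<double>& values,
                               const std::vector<double>& x) {
    assert(x.size() == static_cast<std::size_t>(cols));
    std::vector<double> y(rows, 0.0);
    for (int i = 0; i < rows; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            y[i] += values[k] * x[col_idx[k]];  // gather along row i
    return y;
}

// Reference y = A^T * x for the same storage. x must have `rows` entries;
// the result has `cols` entries.
std::vector<double> csr_matvec_transposed(int rows, int cols,
                                          const std::vector<int>& row_ptr,
                                          const std::vector<int>& col_idx,
                                          const std::vector<double>& values,
                                          const std::vector<double>& x) {
    assert(x.size() == static_cast<std::size_t>(rows));
    std::vector<double> y(cols, 0.0);
    for (int i = 0; i < rows; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            y[col_idx[k]] += values[k] * x[i];  // scatter into column entries
    return y;
}
```

In the transposed product, different rows write to the same output entries, which is why it is the harder of the two to parallelize; whether that is related to the crash reported here is not established by this thread.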
asd__asdqwe
Beginner
Please, I really don't know what to do here. Could you at least tell me whether it works on your side (on Linux, with more than one thread)?
Gennady_F_Intel
Moderator
Yes, I accidentally checked only the single-threaded case and didn't see the problem. We now see the problem when the number of threads is greater than 1. We will investigate and let you know of any update.
asd__asdqwe
Beginner
All right, that is great news! I guess fixing this problem will also fix the dcscmv problem encountered here: http://software.intel.com/en-us/forums/topic/344909. Please let me know as soon as possible about a patch or fix. Thanks a lot for your help.
Gennady_F_Intel
Moderator
Hello, 

This issue has been fixed in MKL v.11.0 update 2 released yesterday.

You can download this update from the Intel Registration Center and check the problem on your side.

--Gennady