Dear all,
I am experiencing a segfault on Linux with an application of mine when I link it against the MKL shipped with composer_xe_2013.3.163 (Update 3 - March 2013), which should be MKL 11.0.3 according to http://software.intel.com/en-us/articles/which-version-of-the-intel-ipp-intel-mkl-and-intel-tbb-libraries-are-included-in-the-intel.
My application is multi-threaded and it uses pthreads. The segfault happens in cblas_dgemm when I spawn 8 threads: runs with 1, 2 or 4 threads work fine. I am linking against libmkl_intel_lp64.so, libmkl_core.so, libmkl_sequential.so. I have the following environment:
MKL_DISABLE_FAST_MM=1
MKL_SERIAL=YES
MKL_NUM_THREADS=1
The same binary compiled with composer_xe_2013.3.163 runs perfectly on 8 threads if I point LD_LIBRARY_PATH to the MKL libraries shipped with Intel Compilers version 11.1.069. So it really seems to be a version-specific issue.
I have tried to set:
ulimit -s unlimited
MKL_DOMAIN_NUM_THREADS="MKL_DOMAIN_ALL=1"
OMP_NUM_THREADS=1
OMP_DYNAMIC=FALSE
MKL_DYNAMIC=FALSE
OMP_NESTED=FALSE
but it makes no difference. Here I attach the valgrind trace:
==20175== Thread 3:
==20175== Invalid read of size 8
==20175== at 0x53DB0DA: mkl_serv_malloc (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_core.so)
==20175== by 0x860C01B: mkl_blas_mc_dgemm_get_bufs (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_mc.so)
==20175== by 0x8684768: mkl_blas_mc_xdgemm_par (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_mc.so)
==20175== by 0x8683B4B: mkl_blas_mc_xdgemm (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_mc.so)
==20175== by 0x53ED8DB: mkl_blas_xdgemm (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_core.so)
==20175== by 0x662A7CE: mkl_blas_dgemm (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_sequential.so)
==20175== by 0x4CF0AA8: DGEMM (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_intel_lp64.so)
==20175== by 0x4D02452: cblas_dgemm (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_intel_lp64.so)
==20175== by 0x455204: pred_y_values (in /mnt/XI/home/toscopa1/open3dtools/bin/open3dqsar)
==20175== by 0x4689F7: lmo_cv_thread (in /mnt/XI/home/toscopa1/open3dtools/bin/open3dqsar)
==20175== by 0x3B0A00683C: start_thread (in /lib64/libpthread-2.5.so)
==20175== by 0x3B094D4F8C: clone (in /lib64/libc-2.5.so)
==20175== Address 0xd0 is not stack'd, malloc'd or (recently) free'd
==20175==
==20175==
==20175== Process terminating with default action of signal 11 (SIGSEGV)
==20175== Access not within mapped region at address 0xD0
==20175== at 0x53DB0DA: mkl_serv_malloc (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_core.so)
==20175== by 0x860C01B: mkl_blas_mc_dgemm_get_bufs (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_mc.so)
==20175== by 0x8684768: mkl_blas_mc_xdgemm_par (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_mc.so)
==20175== by 0x8683B4B: mkl_blas_mc_xdgemm (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_mc.so)
==20175== by 0x53ED8DB: mkl_blas_xdgemm (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_core.so)
==20175== by 0x662A7CE: mkl_blas_dgemm (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_sequential.so)
==20175== by 0x4CF0AA8: DGEMM (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_intel_lp64.so)
==20175== by 0x4D02452: cblas_dgemm (in /mnt/XI/prog/compilers/intel/ics_2013/composer_xe_2013.3.163/mkl/lib/intel64/libmkl_intel_lp64.so)
==20175== by 0x455204: pred_y_values (in /mnt/XI/home/toscopa1/open3dtools/bin/open3dqsar)
==20175== by 0x4689F7: lmo_cv_thread (in /mnt/XI/home/toscopa1/open3dtools/bin/open3dqsar)
==20175== by 0x3B0A00683C: start_thread (in /lib64/libpthread-2.5.so)
==20175== by 0x3B094D4F8C: clone (in /lib64/libc-2.5.so)
I consistently get this error at address 0xd0.
As I mentioned, my program works perfectly when linked against older Intel MKL versions, as well as against ATLAS or the Sun Performance Library.
I would be very glad if you could indicate a way to solve my problem.
Thanks, best regards,
Paolo
Just wish to add a detail which may help: the issue happens only when I make repeated calls to cblas_dgemm during an iteration. The iteration is in a function where I call cblas_dgemm 3-4 times, and the crash does not always happen on the same call, but in a random fashion. Calling mkl_free_buffers() after each iteration does not make a difference (as expected, since I'm running with MKL_DISABLE_FAST_MM=1).
Paolo
Sorry for replying to myself, but I just realized that the problem disappears after updating to the latest composer_xe_2013 bundle, 2013.5.192, so it really looks like it was a bug in 2013.3.163.
Thanks all the same, cheers
Paolo
Yes, it may be the well-known issue introduced in MKL 11.0 Update 4. Please see more details here: http://software.intel.com/en-us/articles/svd-multithreading-bug-in-mkl
Dear Gennady,
even using the latest ICC compiler and MKL libraries, I still get a bunch of valgrind warnings about possible data races (I am using 8 threads). The same program compiled with gcc and linked against the ATLAS libraries does not raise any valgrind warning. The two programs give exactly the same results, and the results do not change when running on 1 thread or 8 threads, so I would conclude the warnings on the ICC build are harmless. Could you please confirm that? Can you guess what might be the reason for those warnings?
Many thanks in advance, best regards
Paolo
$ valgrind --tool=helgrind open3dqsar.icc2013 -i sample_input_MM2.inp -o sample_input_MM2.out.icc2013
==9381== Helgrind, a thread error detector
==9381== Copyright (C) 2007-2011, and GNU GPL'd, by OpenWorks LLP et al.
==9381== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==9381== Command: /home/ptosco/open3dtools/bin/open3dqsar.icc2013 -i sample_input_MM2.inp -o /dev/null
==9381==
==9382== Warning: invalid file descriptor 1014 in syscall close()
==9383== Warning: invalid file descriptor 1014 in syscall close()
==9384== Warning: invalid file descriptor 1014 in syscall close()
==9385== Warning: invalid file descriptor 1014 in syscall close()
==9381== ---Thread-Announcement------------------------------------------
==9381==
==9381== Thread #2 was created
==9381== at 0x36B8EE769E: clone (in /lib64/libc-2.12.so)
==9381== by 0x36B960673F: do_clone.clone.0 (in /lib64/libpthread-2.12.so)
==9381== by 0x36B9606C21: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.12.so)
==9381== by 0x4A0B97C: pthread_create_WRK (hg_intercepts.c:255)
==9381== by 0x4A0BA90: pthread_create@* (hg_intercepts.c:286)
==9381== by 0x4224C6: calc_field (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x4174ED: parse_input (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x406940: main (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381==
==9381== ---Thread-Announcement------------------------------------------
==9381==
==9381== Thread #4 was created
==9381== at 0x36B8EE769E: clone (in /lib64/libc-2.12.so)
==9381== by 0x36B960673F: do_clone.clone.0 (in /lib64/libpthread-2.12.so)
==9381== by 0x36B9606C21: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.12.so)
==9381== by 0x4A0B97C: pthread_create_WRK (hg_intercepts.c:255)
==9381== by 0x4A0BA90: pthread_create@* (hg_intercepts.c:286)
==9381== by 0x4224C6: calc_field (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x4174ED: parse_input (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x406940: main (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381==
==9381== ----------------------------------------------------------------
==9381==
==9381== Possible data race during read of size 8 at 0x7E4B20 by thread #2
==9381== Locks held: none
==9381== at 0x4FB9C0: __svml_rint2 (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x4259D8: calc_mm_thread (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x4A0BB19: mythread_wrapper (hg_intercepts.c:219)
==9381== by 0x36B9607850: start_thread (in /lib64/libpthread-2.12.so)
==9381== by 0x7A646FF: ???
==9381==
==9381== This conflicts with a previous write of size 8 by thread #4
==9381== Locks held: none
==9381== at 0x4FBA22: __svml_rint2_dispatch_table_init (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381==
==9381== ---Thread-Announcement------------------------------------------
==9381==
==9381== Thread #19 was created
==9381== at 0x36B8EE769E: clone (in /lib64/libc-2.12.so)
==9381== by 0x36B960673F: do_clone.clone.0 (in /lib64/libpthread-2.12.so)
==9381== by 0x36B9606C21: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.12.so)
==9381== by 0x4A0B97C: pthread_create_WRK (hg_intercepts.c:255)
==9381== by 0x4A0BA90: pthread_create@* (hg_intercepts.c:286)
==9381== by 0x446816: parallel_cv (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x46A2B5: uvepls (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x40E059: parse_input (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x406940: main (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381==
==9381== ---Thread-Announcement------------------------------------------
==9381==
==9381== Thread #18 was created
==9381== at 0x36B8EE769E: clone (in /lib64/libc-2.12.so)
==9381== by 0x36B960673F: do_clone.clone.0 (in /lib64/libpthread-2.12.so)
==9381== by 0x36B9606C21: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.12.so)
==9381== by 0x4A0B97C: pthread_create_WRK (hg_intercepts.c:255)
==9381== by 0x4A0BA90: pthread_create@* (hg_intercepts.c:286)
==9381== by 0x446816: parallel_cv (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x46A2B5: uvepls (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x40E059: parse_input (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x406940: main (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381==
==9381== ----------------------------------------------------------------
==9381==
==9381== Possible data race during read of size 4 at 0x6575EDC by thread #19
==9381== Locks held: none
==9381== at 0x53E2207: mkl_serv_lock (in /opt/intel/composer_xe_2013.5.192/mkl/lib/intel64/libmkl_core.so)
==9381== by 0x7F: ???
==9381==
==9381== This conflicts with a previous write of size 4 by thread #18
==9381== Locks held: none
==9381== at 0x53E2220: mkl_serv_unlock (in /opt/intel/composer_xe_2013.5.192/mkl/lib/intel64/libmkl_core.so)
==9381== by 0x7F: ???
==9381==
==9381== ----------------------------------------------------------------
==9381==
==9381== Possible data race during read of size 4 at 0x6575EA0 by thread #19
==9381== Locks held: none
==9381== at 0x53DFD5D: mkl_serv_malloc (in /opt/intel/composer_xe_2013.5.192/mkl/lib/intel64/libmkl_core.so)
==9381== by 0x4D846BD: DGETRF (in /opt/intel/composer_xe_2013.5.192/mkl/lib/intel64/libmkl_intel_lp64.so)
==9381== by 0x458970: pred_y_values (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x46C1C7: lmo_cv_thread (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x4A0BB19: mythread_wrapper (hg_intercepts.c:219)
==9381== by 0x36B9607850: start_thread (in /lib64/libpthread-2.12.so)
==9381== by 0x98676FF: ???
==9381==
==9381== This conflicts with a previous write of size 4 by thread #18
==9381== Locks held: none
==9381== at 0x53DFD88: mkl_serv_malloc (in /opt/intel/composer_xe_2013.5.192/mkl/lib/intel64/libmkl_core.so)
==9381== by 0x4D846BD: DGETRF (in /opt/intel/composer_xe_2013.5.192/mkl/lib/intel64/libmkl_intel_lp64.so)
==9381== by 0x458970: pred_y_values (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x46C1C7: lmo_cv_thread (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x4A0BB19: mythread_wrapper (hg_intercepts.c:219)
==9381== by 0x36B9607850: start_thread (in /lib64/libpthread-2.12.so)
==9381== by 0xA2686FF: ???
==9381==
==9381== ----------------------------------------------------------------
==9381==
==9381== Possible data race during write of size 4 at 0x6575EA0 by thread #19
==9381== Locks held: none
==9381== at 0x53DFD88: mkl_serv_malloc (in /opt/intel/composer_xe_2013.5.192/mkl/lib/intel64/libmkl_core.so)
==9381== by 0x4D846BD: DGETRF (in /opt/intel/composer_xe_2013.5.192/mkl/lib/intel64/libmkl_intel_lp64.so)
==9381== by 0x458970: pred_y_values (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x46C1C7: lmo_cv_thread (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x4A0BB19: mythread_wrapper (hg_intercepts.c:219)
==9381== by 0x36B9607850: start_thread (in /lib64/libpthread-2.12.so)
==9381== by 0x98676FF: ???
==9381==
==9381== This conflicts with a previous write of size 4 by thread #18
==9381== Locks held: none
==9381== at 0x53DFD88: mkl_serv_malloc (in /opt/intel/composer_xe_2013.5.192/mkl/lib/intel64/libmkl_core.so)
==9381== by 0x4D846BD: DGETRF (in /opt/intel/composer_xe_2013.5.192/mkl/lib/intel64/libmkl_intel_lp64.so)
==9381== by 0x458970: pred_y_values (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x46C1C7: lmo_cv_thread (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
==9381== by 0x4A0BB19: mythread_wrapper (hg_intercepts.c:219)
==9381== by 0x36B9607850: start_thread (in /lib64/libpthread-2.12.so)
==9381== by 0xA2686FF: ???
==9381==
==9381== ----------------------------------------------------------------
==9381==
==9381== Possible data race during write of size 4 at 0x6575EDC by thread #19
==9381== Locks held: none
==9381== at 0x53E2220: mkl_serv_unlock (in /opt/intel/composer_xe_2013.5.192/mkl/lib/intel64/libmkl_core.so)
==9381== by 0x7F: ???
==9381==
==9381== This conflicts with a previous write of size 4 by thread #18
==9381== Locks held: none
==9381== at 0x53E2220: mkl_serv_unlock (in /opt/intel/composer_xe_2013.5.192/mkl/lib/intel64/libmkl_core.so)
==9381== by 0x7F: ???
==9381==
These cases might be just false positives. We have noticed similar cases before, and sometimes the Valgrind team confirmed them as such. I would recommend checking the problem with Intel Inspector ( http://software.intel.com/en-us/intel-inspector-xe ) - you can evaluate it and verify the problem with this tool.
==9381== Possible data race during read of size 8 at 0x7E4B20 by thread #2
==9381== Locks held: none
==9381== at 0x4FB9C0: __svml_rint2 (in /home/ptosco/open3dtools/bin/open3dqsar.icc2013)
Dear Sergey,
doesn't this mean that the data race was inside __svml_rint2()? I don't get this warning when I build with gcc (which uses the glibc rint()). So I thought this was related to the ICC libraries, though I am pretty sure it is harmless, as Gennady also said.
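If my reading of the trace is right, what Helgrind flags looks like a lazy one-time initialization of a CPU-dispatch table: the first caller writes a function pointer, later callers read it without synchronization. This is a sketch of that pattern under my assumptions (the names are invented, this is not the actual SVML code), together with the pthread_once idiom that would keep Helgrind quiet:

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical dispatch slot: the real __svml_rint2 presumably selects a
   CPU-specific implementation in a similar way; this is only an
   illustration. */
static double (*rint_impl)(double);

static double rint_generic(double x)
{
    /* trivial round-half-away-from-zero, good enough for the demo */
    long long n = (long long)(x >= 0.0 ? x + 0.5 : x - 0.5);
    return (double)n;
}

/* Racy lazy init: thread A's write to rint_impl and thread B's read are
   unsynchronized -- exactly the read/write pair Helgrind reports, even
   though the race is benign (every thread writes the same value). */
static double rint_racy(double x)
{
    if (rint_impl == NULL)          /* unsynchronized read  */
        rint_impl = rint_generic;   /* unsynchronized write */
    return rint_impl(x);
}

/* Race-free variant: pthread_once guarantees exactly-once initialization
   with a proper happens-before edge for all callers. */
static pthread_once_t once = PTHREAD_ONCE_INIT;
static void init_impl(void) { rint_impl = rint_generic; }

static double rint_safe(double x)
{
    pthread_once(&once, init_impl);
    return rint_impl(x);
}
```

Both variants compute the same result; the only difference is whether the initialization is visible to a race detector, which would explain a warning that is real in the data-race sense but harmless in practice.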
Cheers,
p.
Dear Sergey,
I am using single-threaded MKL, since I set OMP_NUM_THREADS=1 and MKL_NUM_THREADS=1. There are no performance issues, and cblas_dgemm is not involved in the valgrind complaints, which are only about __svml_rint2 and dgetrf. The problem with cblas_dgemm was solved by updating to the latest MKL version. Here I was just wondering whether some multithreading-related issue was still present (in spite of correctly computed results), since I got a few warnings from Valgrind. But admittedly Valgrind often complains about Intel binaries that work just fine, so I guess they are false alarms. I think Valgrind works best with gcc-compiled binaries.
Thanks for your interest in this matter, best regards
Paolo
