gemm segfault with simple test in version 10.2.5.035 linux 64

Rodolfo_Gonzalez · ‎06-16-2010

Hi,

We were using version 10.1 of MKL in our products and are trying to migrate to version 10.2.5.035, but some of our tests are crashing. After some debugging we have observed that the problem appears in calls to the gemm routines (cblas_dgemm, for instance). The following code reproduces the problem in our environment:

#include <iostream>  // std::cout
#include <mkl.h>     // cblas_dgemm, CBLAS_ORDER, CBLAS_TRANSPOSE

void multiplyMatrices (double * matrixA, const bool transposeA, double * matrixB,
                       const bool transposeB, const unsigned int mDim, const unsigned int nDim,
                       const unsigned int kDim, double * matrixC)
{
    CBLAS_ORDER order = CblasColMajor;
    CBLAS_TRANSPOSE transA = CblasNoTrans;
    CBLAS_TRANSPOSE transB = CblasNoTrans;
    // Column-major leading dimensions: rows of each matrix as stored
    int lda = mDim ;
    int ldb = kDim ;
    int m = (int)mDim;
    int n = (int)nDim;
    int k = (int)kDim;

    if (transposeA)
    {
        transA = CblasTrans;
        lda = kDim ;
    }

    if (transposeB)
    {
        transB = CblasTrans;
        ldb = nDim ;
    }
    double alpha = 1.0;
    double beta = 0.0;
    int ldc = mDim ;

    cblas_dgemm (order, transA, transB, m, n, k, alpha, matrixA, lda, matrixB, ldb, beta, matrixC, ldc);
}

int main(int argc, char ** argvc)
{

    double matrixA [3] = {3.0, 6.0, 9.0};
    double matrixB [3] = {4.0, 5.0, 3.0};
    double matrixC1 [9] = {33.3, 33.3, 33.3, 33.3, 33.3, 33.3, 33.3, 33.3, 33.3};
    double expectedC1 [9] = {12.0, 24.0, 36.0, 15.0, 30.0, 45.0, 9.0, 18.0, 27.0};
    double matrixC2 [1] = {77.7};

    multiplyMatrices (matrixA, false, matrixB, false, 1, 1, 3, matrixC2);
    //BOOST_CHECK_CLOSE ((double)matrixC2[0], (double)69.0, 0.001);

    std::cout << "Matrixc2 " << matrixC2[0] << std::endl;

    multiplyMatrices (matrixA, true, matrixB, true, 3, 3, 1, matrixC1);

    unsigned int i;
    for (i=0; i < 9; i++)
    {
        //BOOST_CHECK_CLOSE ((double)matrixC1[i], (double)expectedC1[i], 0.001);
    }
}

Compiling with the following gcc command:

/usr/bin/c++ -Wall -O2 -pipe -I/opt/intel/mkl/10.2.5.035/include ../../src/KMeansTest.cpp -L/opt/intel/mkl/10.2.5.035/lib/em64t -Wl,-Bdynamic -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -Wl,-rpath,/usr/local/lib:/opt/intel/mkl/10.2.5.035/lib/em64t -o test

and executing, we get a core dump with the following stack trace:

#0 0x00007f59926f234d in mkl_blas_dgemm_mscale () from /opt/intel/mkl/10.2.5.035/lib/em64t/libmkl_mc.so
#1 0x00007f5994b25948 in mkl_blas_dgemm_mscale () from /opt/intel/mkl/10.2.5.035/lib/em64t/libmkl_sequential.so
#2 0x00007f5992892777 in mkl_blas_xdgemm () from /opt/intel/mkl/10.2.5.035/lib/em64t/libmkl_mc.so
#3 0x00007f5994b26f79 in mkl_blas_xdgemm () from /opt/intel/mkl/10.2.5.035/lib/em64t/libmkl_sequential.so
#4 0x00007f5994c18a51 in mkl_blas_dgemm () from /opt/intel/mkl/10.2.5.035/lib/em64t/libmkl_sequential.so
#5 0x00007f599533bbba in dgemm_ () from /opt/intel/mkl/10.2.5.035/lib/em64t/libmkl_intel_ilp64.so
#6 0x00007f599534c439 in cblas_dgemm () from /opt/intel/mkl/10.2.5.035/lib/em64t/libmkl_intel_ilp64.so
#7 0x0000000000400bb3 in multiplyMatrices ()
#8 0x0000000000400dc5 in main ()

We have tested this issue with different gcc versions (4.2.1 and 4.3.3) and different Linux distributions (Ubuntu and openSUSE), with the same results.
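
For reference, the expected values in the test (69.0 for the 1x1 product and the nine entries of expectedC1) can be checked without MKL. The following is only a minimal sketch: naiveGemm is a hypothetical helper, not an MKL routine, and it assumes column-major storage with alpha = 1 and beta = 0, as in multiplyMatrices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper, not part of MKL: computes
// C = op(A) * op(B) for column-major storage (alpha = 1, beta = 0).
std::vector<double> naiveGemm(const double* a, bool transposeA,
                              const double* b, bool transposeB,
                              std::size_t m, std::size_t n, std::size_t k)
{
    assert(a != nullptr && b != nullptr);

    // Leading dimensions follow the same rule as multiplyMatrices:
    // the number of rows of each matrix as it is stored.
    const std::size_t lda = transposeA ? k : m;
    const std::size_t ldb = transposeB ? n : k;

    std::vector<double> c(m * n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            double sum = 0.0;
            for (std::size_t p = 0; p < k; ++p) {
                // Element (i, p) of op(A) and element (p, j) of op(B).
                const double aip = transposeA ? a[p + i * lda] : a[i + p * lda];
                const double bpj = transposeB ? b[j + p * ldb] : b[p + j * ldb];
                sum += aip * bpj;
            }
            c[i + j * m] = sum;  // ldc = m
        }
    }
    return c;
}
```

Calling naiveGemm(matrixA, false, matrixB, false, 1, 1, 3) and naiveGemm(matrixA, true, matrixB, true, 3, 3, 1) with the arrays from the test gives values to compare against matrixC2 and matrixC1.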

Artem_V_Intel · ‎06-16-2010

Hello,

You should use the LP64 interface library in the link line instead of ILP64:

/usr/bin/c++ -Wall -O2 -pipe -I/opt/intel/mkl/10.2.5.035/include ../../src/KMeansTest.cpp -L/opt/intel/mkl/10.2.5.035/lib/em64t -Wl,-Bdynamic -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,-rpath,/usr/local/lib:/opt/intel/mkl/10.2.5.035/lib/em64t -o test

The MKL ILP64 interface assumes that integer arguments to MKL functions are 64 bits long, not 32 bits like the standard C type int. If you would like to use the ILP64 interface for huge array support, you should replace all ints with the MKL_INT type and define -DMKL_ILP64 when compiling your sources.
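
As a rough illustration of that replacement, here is a sketch that keeps every integer gemm argument in a single interface-sized type. So that it compiles without MKL, it uses stand-ins: sketch_int, SKETCH_ILP64, GemmDims and makeGemmDims are hypothetical names. With the real library you would include mkl.h, build with -DMKL_ILP64, and use MKL_INT where sketch_int appears.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for MKL_INT (assumption: mirrors how the real type changes
// width between the LP64 and ILP64 interfaces).
#ifdef SKETCH_ILP64
typedef std::int64_t sketch_int;  // ILP64 interface: 64-bit integers
#else
typedef int sketch_int;           // LP64 interface: 32-bit integers
#endif

// All integer arguments of one gemm call, in the interface integer type.
struct GemmDims {
    sketch_int m, n, k, lda, ldb, ldc;
};

// Same leading-dimension logic as multiplyMatrices in the first post
// (column-major storage), with no plain int left for the library to see.
GemmDims makeGemmDims(unsigned int mDim, unsigned int nDim, unsigned int kDim,
                      bool transposeA, bool transposeB)
{
    assert(mDim > 0 && nDim > 0 && kDim > 0);

    GemmDims d;
    d.m = static_cast<sketch_int>(mDim);
    d.n = static_cast<sketch_int>(nDim);
    d.k = static_cast<sketch_int>(kDim);
    d.lda = transposeA ? d.k : d.m;  // rows of A as stored
    d.ldb = transposeB ? d.n : d.k;  // rows of B as stored
    d.ldc = d.m;                     // rows of C
    return d;
}
```

The fields of the returned struct would then be passed to cblas_dgemm in place of the int variables m, n, k, lda, ldb and ldc.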

Best regards,
Artem

sdodds87 · ‎01-16-2011

I've had a similar problem to the original poster, and was hoping for some help. I'm using ifort and BLAS libraries to solve large systems of equations, and have found that everything worked perfectly when using an older version of the compiler and libraries (fce 10.1.015 and mkl 10.0.1.014). I upgraded our compiler and MKL libraries a few months ago and found that when linking with -lmkl_intel_ilp64 I was getting a seg fault in the dgemm routine, as the original post mentioned. When I switched to the LP64 interface the seg fault went away and the code runs, but there seem to be some roundoff errors that are preventing my Newton's method routine from converging. However, if I run with the old compiler/MKL suite, everything works just fine.

I haven't changed anything in the Fortran code that worked with the older version of the compilers, so I don't think it's a problem with the code; I suspect it is related to some mistake I'm making when linking the libraries. Any thoughts on the root of this problem and what I can do to fix it would be greatly appreciated!

With the older version of the compiler, I'm linking using -L$(MKLROOT)/ -lguide -lpthread -lmkl -lmkl_lapack

With the newer version of the compiler (11.1/046), I'm using -L$(MKLROOT)/ -Wl,--start-group -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread

as recommended by the mkl link line advisor

TimP · ‎01-16-2011

The link advisor should tell you more than what you quote in your link line. However, if you got a successful link, it's hard to blame numerical changes on the link.
It looks like you are trying to change too many things at one time. I guess you figured out you should postpone changing to 64-bit integer arguments until you have taken the other steps. You should be able to build with the old compiler and the new MKL, or vice versa.
If you were using the default (ia32) option of the 32-bit ifort 10.1, you should try the corresponding -mia32 option of the newer compiler. If your application depends on default promotion of single precision expressions to double, this could help you isolate the problem. Then you should fix the code to use explicit promotions to double. You should also be able to use the -xW option of 10.1, which corresponds to the current default. The older MKL would be using SSE2 code already.
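
To illustrate why the promotion matters, here is a small sketch in C++ rather than Fortran; the function names and the choice of a summation loop are only for illustration. It adds the same single-precision term many times, once with a single-precision running total and once with the total explicitly promoted to double. When the compiler generates strict single-precision (SSE) arithmetic, the two totals differ noticeably, and a difference of that kind can be enough to upset a convergence test.

```cpp
#include <cassert>
#include <cstddef>

// Adds `term` n times, keeping the running total in single precision.
float sumInSingle(float term, std::size_t n)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        total += term;
    }
    return total;
}

// Same loop, but the running total is explicitly promoted to double,
// so the result no longer depends on what the compiler does by default.
double sumInDouble(float term, std::size_t n)
{
    assert(n > 0);
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += static_cast<double>(term);
    }
    return total;
}
```

Comparing sumInSingle(0.1f, 1000000) with sumInDouble(0.1f, 1000000) shows the gap; writing the promotion explicitly, as in the second function, makes the result independent of the compiler's default.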