Solved: CSCMM for Armadillo Sparse dense multiplications

Ramakrishnan_K_ · ‎09-15-2014

Environment : armadillo 4.320.0 and 4.400
Compiler : Intel CPP compiler
OS : Ubuntu 12.04

I am trying to replace the Armadillo's native sparse dense multiplication with Intel MKL's CSCMM call. I wrote the following code.

#include <mkl.h>  
#define ARMA_64BIT_WORD
#include <armadillo>

using namespace std;
using namespace arma;

int  main(int argc, char *argv[])
{
   long long m = atoi(argv[1]);
   long long k = atoi(argv[2]);
   long long n = atoi(argv[3]);
   float density = 0.3;
   sp_fmat A = sprandn<sp_fmat>(m,k,density);
   fmat B = randu<fmat>(k,n);
   fmat C(m,n);
   C.zeros();
 //C = alpha * A * B + beta * C;
 //mkl_scscmm (char *transa, MKL_INT *m, MKL_INT *n, MKL_INT *k, float *alpha, char *matdescra,       
 //float *val, MKL_INT *indx, MKL_INT *pntrb, MKL_INT *pntre, float *b, MKL_INT *ldb, float *beta, 
//float *c, MKL_INT *ldc);
  char transa = 'N';
  float alpha = 1.0;
  float beta = 0.0;
  char* matdescra = "GUUC";
  long long ldb = k;
  long long ldc = m;
  cout << "b4 Input A:" << endl << A;
  cout << "b4 Input B:" << endl << B;
  mkl_scscmm (&transa,&m,&n,&k,&alpha,matdescra,
              const_cast<float *>(A.values), (long long *)A.row_indices,
             (long long *)A.col_ptrs,(long long *)(A.col_ptrs + 1),
             B.memptr(),&ldb,
             &beta, C, &ldc);
  cout << "Input A:" << endl << A;
  cout << "Input B:" << endl << B;
  cout << "Input C:" << endl << C;
  return 0;
}

I compiled the above code and ran it as "./testcscmm 10 4 6". I am getting a segmentation fault (core dumped).

(0, 0)         1.1123
 (4, 0)        -0.3453
 (8, 0)         0.6081
 (1, 1)         0.6410
 (4, 1)        -0.7121
 (5, 1)         1.1592
 (9, 1)        -1.7189
 (0, 2)         0.4175
 (2, 2)        -0.4001
 (4, 2)         2.2809
 (4, 3)        -2.2717
 (9, 3)         0.2251

b4 Input B:
0.1567   0.9989   0.6126   0.4936   0.5267   0.2833
0.4009   0.2183   0.2960   0.9728   0.7699   0.3525
0.1298   0.5129   0.6376   0.2925   0.4002   0.8077
0.1088   0.8391   0.5243   0.7714   0.8915   0.9190
Input A:
[matrix size: 13715672716573367337x13744746204899078486; n_nonzero: 12; density: 0.00%]

Segmentation fault (core dumped)

For some reason the structure of A is getting corrupted. I have the following questions.

Does MKL_CSCMM modify the input array? If not why should A get corrupted?
I changed the matrix C to native float. Still the error persists.
Valgrind shows some memory errors.

Let me know how to make an intel MKL call using Armadillo's matrix data structures. Especially Sparse dense multiplication.

Ramakrishnan_K_ · ‎09-23-2014

Thank you Yin. It worked excellently. I am documenting all the steps so that anyone refers in the future will have complete details.

Building and installing armadillo :

export CXX=icpc
export CC=icpc
export PATH=$PATH:/home/ramki/intel/bin:
Edit $armadillo_root/cmake_aux/Modules/ARMA_FindMKL.cmake, include the PATHS correctly.
Edit $armadillo_root/cmake_aux/Modules/ARMA_FindMKL.cmake, change mkl_lp64 to mkl_ilp64
Edit $armadillo_root/CMakeLists.txt and (1) Change CMAKE_SHARED_LINKER_FLAGS to include the link line by intel link advisor and (2) Change CMAKE_CXX_FLAGS as given by intel link advisor
Run ./configure and make sure MKL library is used for blas and lapack, icpc to be the compiler and the rest to be alright.
Run make .
Verify the linked libraries by running ldd libarmadillo.so. Mainly verify whether it is linked with mkl_ilp64 library and mkl blas and lapack libraries.
Now run make install DESTDIR=local path.

In the C++ program

Don't type case the const ptr of A.values to ptr A.values using const_cast function. It distorts the pointer and not sure whether we are passing the right pointer to cscmm. Instead duplicate just the values of A alone in a separate array.
Make sure you are setting the correct ldb and ldc.
pntrb and pntre could be A.col_ptrs and A.col_ptrs+1.
Use MKL_INT in the place of long long.

Hope this helps everyone.

View solution in original post

Ying_H_Intel · ‎09-22-2014

Hi Ramakrishnan K.

The function doesn't not modify the input array.

The problem is in the ldb and ldc, which is columns (in C array, zero-based). You may change them as ldb = n, ldb=c and get the result.

ldb : INTEGER. Specifies the leading dimension of b for one-based indexing, and

the second dimension of b for zero-based indexing, as declared in the
calling (sub)program

Best Regards,

Ying

( P.S I have issue to build with -Dmkl_ilp64 as i build libarmadillo.so with default option, it link mkl_lp64 libraries as below, it is conflict with -Dmkl_ilp64

if i change all of long long to MKL_INT and build with -mkl ( lp64), the program can run fine, please let me know if you get any result)

[yhu5@snb01 Debug]$ ldd /home/yhu5/armadillo-4.320.2/libarmadillo.so
linux-vdso.so.1 => (0x00007fffef75b000)
libmkl_intel_thread.so => /opt/intel/mkl/lib/intel64/libmkl_intel_thread.so (0x00007ffc58a14000)
libmkl_core.so => /opt/intel/mkl/lib/intel64/libmkl_core.so (0x00007ffc57343000)
libmkl_intel_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so (0x00007ffc56a3b000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007ffc5671f000)
libm.so.6 => /lib64/libm.so.6 (0x00007ffc5649a000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ffc56284000)
libc.so.6 => /lib64/libc.so.6 (0x00007ffc55f05000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007ffc55d00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003b8ea00000)

icc -I/home/yhu5/armadillo-4.320.2/include -L/home/yhu5/armadillo-4.320.2 -o "testprograms_LP64" ../cscmmtest.cpp -larmadillo -mkl -g -O0 -g3 -Wall -Wextra -pedantic

[yhu5@snb01 Debug]$ ./testprograms_LP64 3 2 1
b4 Input A:
[matrix size: 3x2; n_nonzero: 2; density: 33.33%]

(0, 0) 1.1123
(2, 1) -0.3453

b4 Input B:
0.7831
0.7984

Input A:
[matrix size: 3x2; n_nonzero: 2; density: 33.33%]

(0, 0) 1.1123
(2, 1) -0.3453

Input B:
0.7831
0.7984
Output C:
0.8710
0
-0.2757

Ramakrishnan_K_ · ‎09-23-2014

Thank you Yin. It worked excellently. I am documenting all the steps so that anyone refers in the future will have complete details.

Building and installing armadillo :

export CXX=icpc
export CC=icpc
export PATH=$PATH:/home/ramki/intel/bin:
Edit $armadillo_root/cmake_aux/Modules/ARMA_FindMKL.cmake, include the PATHS correctly.
Edit $armadillo_root/cmake_aux/Modules/ARMA_FindMKL.cmake, change mkl_lp64 to mkl_ilp64
Edit $armadillo_root/CMakeLists.txt and (1) Change CMAKE_SHARED_LINKER_FLAGS to include the link line by intel link advisor and (2) Change CMAKE_CXX_FLAGS as given by intel link advisor
Run ./configure and make sure MKL library is used for blas and lapack, icpc to be the compiler and the rest to be alright.
Run make .
Verify the linked libraries by running ldd libarmadillo.so. Mainly verify whether it is linked with mkl_ilp64 library and mkl blas and lapack libraries.
Now run make install DESTDIR=local path.

In the C++ program

Don't type case the const ptr of A.values to ptr A.values using const_cast function. It distorts the pointer and not sure whether we are passing the right pointer to cscmm. Instead duplicate just the values of A alone in a separate array.
Make sure you are setting the correct ldb and ldc.
pntrb and pntre could be A.col_ptrs and A.col_ptrs+1.
Use MKL_INT in the place of long long.

Hope this helps everyone.

Ying_H_Intel · ‎09-24-2014

Hi Ramkrishnan,

Thanks you very much for the details and tech sharing. I mark your answer as best reply.

Best Regards,

Ying

Ramakrishnan_K_ · ‎12-07-2014

Hi Ying,

I tested my algorithm with the cscmm and my algorithm fails because there appears to be something wrong with the mkl_cscmm call. I load a sparse identify matrix (small eye.mm) from the file and multiply with a random matrix. I am getting the wrong output.

For your reference, I am attaching the source code along with the sample file. Kindly let me know where is the problem. I changed the ldb and ldc with both the number of rows(m) and number of columns (.

./testprograms smalleye.mm 7
LoadMatrixMarketFile for file=smalleye.mm
mm file height=5 width=5 nnz=5
start loading the mm file
location=2x5 VAL=5x1
completed reading the file
CNorm=3.74208
DNorm=3.45878
diffs=35x1
Input A:
[matrix size: 5x5; n_nonzero: 5; density: 20.00%]

(0, 0) 1.0000
(1, 1) 1.0000
(2, 2) 1.0000
(3, 3) 1.0000
(4, 4) 1.0000

Input B:
0.8402 0.1976 0.4774 0.9162 0.0163 0.4009 0.5129
0.3944 0.3352 0.6289 0.6357 0.2429 0.1298 0.8391
0.7831 0.7682 0.3648 0.7173 0.1372 0.1088 0.6126
0.7984 0.2778 0.5134 0.1416 0.8042 0.9989 0.2960
0.9116 0.5540 0.9522 0.6070 0.1567 0.2183 0.6376
output C:
1.7924 0.8045 0.8391 0 0 0 0
1.3106 0.3515 0.6126 0 0 0 0
1.4188 0.9989 0.2960 0 0 0 0
1.5157 0.2183 0.6376 0 0 0 0
1.0532 0.5129 0 0 0 0 0
ArmaD:
0.8402 0.1976 0.4774 0.9162 0.0163 0.4009 0.5129
0.3944 0.3352 0.6289 0.6357 0.2429 0.1298 0.8391
0.7831 0.7682 0.3648 0.7173 0.1372 0.1088 0.6126
0.7984 0.2778 0.5134 0.1416 0.8042 0.9989 0.2960
0.9116 0.5540 0.9522 0.6070 0.1567 0.2183 0.6376

Ramki

Ying_H_Intel · ‎12-08-2014

Hi Ramakrishnan K.

I tried your code with lp64 armadillo ( the build last time)

I comment out the line //#define ARMA_64BIT_WORD in utils.h , then build with the below command. Everything looks fine. I can get right result.

I'm not sure if there is ilp64 problem, i will check it with C code.

Or there is some implicit problem with long long operation in your code. you may check it in details.

for example, try the pionter (MKL_INT *)(A.col_ptrs + 1) to ((MKL_INT *)(A.col_ptrs) + 1))

Best Regards,

Ying

[yhu5@snb01 cscmmtest]$ icc -I/home/yhu5/armadillo-4.320.2/include -I. -L/home/yhu5/armadillo-4.320.2 -o "cscmmtest" matrix_market_file.cpp cscmmtest.cpp -larmadillo -g -O0 -g3 -Wall -Wextra -pedantic -openmp
[yhu5@snb01 cscmmtest]$ ./cscmmtest smalleye.mm 7
LoadMatrixMarketFile for file=smalleye.mm
mm file height=5 width=5 nnz=5
start loading the mm file
location=2x5 VAL=5x1
completed reading the file
Avaldup
Avaldup
Avaldup
111Avaldup
1Avaldup
1CNorm=3.45878
DNorm=3.45878
diffs=0x1
Input A:
[matrix size: 5x5; n_nonzero: 5; density: 20.00%]

(0, 0) 1.0000
(1, 1) 1.0000
(2, 2) 1.0000
(3, 3) 1.0000
(4, 4) 1.0000

Input B:
0.8402 0.1976 0.4774 0.9162 0.0163 0.4009 0.5129
0.3944 0.3352 0.6289 0.6357 0.2429 0.1298 0.8391
0.7831 0.7682 0.3648 0.7173 0.1372 0.1088 0.6126
0.7984 0.2778 0.5134 0.1416 0.8042 0.9989 0.2960
0.9116 0.5540 0.9522 0.6070 0.1567 0.2183 0.6376
output C:
0.8402 0.1976 0.4774 0.9162 0.0163 0.4009 0.5129
0.3944 0.3352 0.6289 0.6357 0.2429 0.1298 0.8391
0.7831 0.7682 0.3648 0.7173 0.1372 0.1088 0.6126
0.7984 0.2778 0.5134 0.1416 0.8042 0.9989 0.2960
0.9116 0.5540 0.9522 0.6070 0.1567 0.2183 0.6376
ArmaD:
0.8402 0.1976 0.4774 0.9162 0.0163 0.4009 0.5129
0.3944 0.3352 0.6289 0.6357 0.2429 0.1298 0.8391
0.7831 0.7682 0.3648 0.7173 0.1372 0.1088 0.6126
0.7984 0.2778 0.5134 0.1416 0.8042 0.9989 0.2960
0.9116 0.5540 0.9522 0.6070 0.1567 0.2183 0.6376

Ying_H_Intel · ‎12-08-2014

And i check the common c code. the ILP 64 bit works too. So the problem is in the transaction. Please check the pointer one by one.

Best Regards,

Ying

build it in VS studio 2010, /D "MKL_ILP64" link with mkl_intel_ilp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib.
#include <stdio.h>
#include "mkl_types.h"
#include "mkl_spblas.h"

int main () {
//*******************************************************************************
// Definition arrays for sparse representation of the matrix A in
// the compressed sparse column format:
//*******************************************************************************
#define M 5
#define NNZ 5
#define MNEW 3
       MKL_INT       m = M, nnz = NNZ, mnew = MNEW;
float       values[NNZ]   = {1.0, 1.0, 1.0, 1.0, 1.0};
       MKL_INT       rows[NNZ]   = {0, 1, 2, 3, 4};
       MKL_INT       colIndex[M+1] = {0, 1, 2, 3, 4, 5};
MKL_INT       pointerB , pointerE;
       MKL_INT       i, j, k;

//*******************************************************************************
// Declaration of local variables :
//*******************************************************************************
#define N 7
MKL_INT       n = N;
float       sol   =    { 0.8402, 0.1976 , 0.4774 , 0.9162 , 0.0163, 0.4009 , 0.5129,
0.3944, 0.3352, 0.6289, 0.6357, 0.2429, 0.1298 , 0.8391,
0.7831 , 0.7682, 0.3648, 0.7173, 0.1372 , 0.1088, 0.6126,
0.7984, 0.2778, 0.5134 , 0.1416 , 0.8042, 0.9989 , 0.2960,
0.9116, 0.5540 , 0.9522 , 0.6070 , 0.1567 , 0.2183 , 0.6376};
       float       rhs   = {0.0};

char transa = 'N';
float alpha = 1.0;
float beta = 0.0;
char* matdescra = "GUNC";
MKL_INT ldb = n;
MKL_INT ldc = n;

printf("\n EXAMPLE PROGRAM FOR COMPRESSED SPARSE ROW FORMAT ROUTINES \n");

//*******************************************************************************
//Task 1. Obtain matrix-matrix multiply (L+D)' *sol --> rhs
// and solve triangular system (L+D)' *temp = rhs with multiple right hand sides
// Array temp must be equal to the array sol
//*******************************************************************************
printf(" \n");
printf(" INPUT DATA FOR MKL_SCSCMM \n");
printf(" WITH TRIANGULAR MATRIX \n");
printf(" M = %1.1i N = %1.1i\n", m, n);
printf(" ALPHA = %4.1f BETA = %4.1f \n", alpha, beta);
printf(" TRANS = '%c' \n", 'T');
printf(" Input matrix \n");
       for (i = 0; i < m; i++) {
           for (j = 0; j < n; j++) {
               printf("%7.3f", sol);
           };
           printf("\n");
       };

MKL_INT * colT = colIndex+1;

mkl_scscmm(&transa, &m, &n, &m, &alpha, matdescra, values, rows, colIndex, colT, &(sol[0][0]), &n, &beta, &(rhs[0][0]), &n);

       printf(" \n");
printf(" OUTPUT DATA FOR MKL_SCSCMM\n");
printf(" WITH TRIANGULAR MATRIX \n");
       for (i = 0; i < m; i++) {
           for (j = 0; j < n; j++) {
               printf("%7.3f", rhs);
           };
           printf("\n");
       };

fflush(stdout);
// sleep(2);
//C.save("out.txt",arma_ascii);

return 0;
}

EXAMPLE PROGRAM FOR COMPRESSED SPARSE ROW FORMAT R

INPUT DATA FOR MKL_SCSCMM
WITH TRIANGULAR MATRIX
M = 5 N = 7
ALPHA = 1.0 BETA = 0.0
TRANS = 'T'
Input matrix
0.840 0.198 0.477 0.916 0.016 0.401 0.513
0.394 0.335 0.629 0.636 0.243 0.130 0.839
0.783 0.768 0.365 0.717 0.137 0.109 0.613
0.798 0.278 0.513 0.142 0.804 0.999 0.296
0.912 0.554 0.952 0.607 0.157 0.218 0.638

OUTPUT DATA FOR MKL_SCSCMM
WITH TRIANGULAR MATRIX
0.840 0.198 0.477 0.916 0.016 0.401 0.513
0.394 0.335 0.629 0.636 0.243 0.130 0.839
0.783 0.768 0.365 0.717 0.137 0.109 0.613
0.798 0.278 0.513 0.142 0.804 0.999 0.296
0.912 0.554 0.952 0.607 0.157 0.218 0.638

Ramakrishnan_K_ · ‎12-09-2014

Hi Ying,

Very interesting observation. You are right. If I disable ARMA_64BIT_WORD and remove -DMKL_ILP64 while compiling everything works fine. It does not work good with both enabled. Armadillo for sparse representation stores row_indices, pntrb and pntre as unsigned long long * and MKL needs const long long *. So type casting should work and for some reason it is failing.

I will investigate further and let you know.

Ramki

Ramakrishnan_K_ · ‎12-16-2014

Hi Ying,

There is one problem in using armadillo's matrix for mkl. Armadillo is using column major ordering. However, MKL is expecting a row major order. But even after fixing this problem, while enabling MKL_ILP64 there is some problem. I am providing with two source files along with this post. You can replace these files in the tar files attached in the previous message. The cscmmtest.cpp, compares the armadillo's sparse multiplication with the mkl_scscmm implementation. I modified the input matrix of your code posted in your message 12/8 and name this as testmklcscmm.cpp. I could not upload a hop file. Hence rename the attached MKLSparseRepresentation.h as MKLSparseRepresentation.hpp before compiling. Disable -DMKL_ILP64 and every thing works fine. Enable MKL_ILP64 and find some problem with mkl_scscmm. Interesting the observation on the mkl_scscmm output is that, some rows matches with the armadillo multiplication, one row will be the sum of the rows and rest will be zero. For eg., in the case of a 3x3 output matrix, 2nd row will be correct output, the sum of rows 1&2 appears as row1 and third row of zeros.

MKL output C:
0.9857 0.4139 0.6839 <----- sum of rows 1&2
0.6253 0.2625 0.4338 <------ correct row 3.
0 0 0
ArmaD:
0.6708 0.2817 0.4654
0.3149 0.1322 0.2185
0.6253 0.2625 0.4338

Please let us know where we are making the mistake. Appreciate your response.

Ramki

Ying_H_Intel · ‎12-21-2014

Hi Ramki,

I rebuild the Armadillo with mkl ilp64 bit. The processing is

Edit $armadillo_root/build_aux/cmake/Modules/ARMA_FindMKL.cmake, change mkl_intel_lp64 to mkl_intel_ilp64

$>source /opt/intel/composer_xe_2013.5.192/bin/compilervars.sh intel64

>./configure

(it shows FOUND MKL, Compiler = GNU)

>./make

Then i build the test code you attached. As you see as below, everything runs fine.

So mkl lp64 library work with lp 64 code. and mkl ilp64 library work with ilp64 (ARMA_64BIT_WORD) code. If mix them, then i get core dumping. For most of users, the default setting build (lp64) should be work without problem.

Here is the test process. I also attach the whole code and exe file for your reference.

vi utils.h (change back the line #define ARMA_64BIT_WORD in utils.h )

>icc -I/home/yhu5/armd_ilp64/armadillo-4.320.2/include -L/home/yhu5/armd_ilp64/armadillo-4.320.2 -o "mklprograms_iLP64" testmklscscmm.cpp -larmadillo -g -O0 -g3 -Wall -Wextra -pedantic -DMKL_ILP64

vi cscmmtest.cpp (change int cscmmtest(int argc, char *argv[]), to int main(int argc, char *argv[]) ( for easy compiler)

>icc -I/home/yhu5/armd_ilp64/armadillo-4.320.2/include -L/home/yhu5/armd_ilp64/armadillo-4.320.2 -o "Aprograms_iLP64" cscmmtest.cpp matrix_market_file.cpp -larmadillo -g -O0 -g3 -Wall -Wextra -pedantic -DMKL_ILP64

>

[yhu5@snb01 armd_ilp64]$ ./mklprograms_iLP64

EXAMPLE PROGRAM FOR COMPRESSED SPARSE ROW FORMAT ROUTINES

INPUT DATA FOR MKL_SCSCMM
WITH TRIANGULAR MATRIX
M = 3 N = 3
ALPHA = 1.0 BETA = 0.0
TRANS = 'T'
Input matrix
0.798 0.335 0.554
0.912 0.768 0.477
0.198 0.278 0.629

OUTPUT DATA FOR MKL_SCSCMM
WITH TRIANGULAR MATRIX
0.671 0.282 0.465
0.315 0.132 0.218
0.625 0.263 0.434

[yhu5@snb01 armd_ilp64]$ ./Aprograms_iLP64 3 3 3

MKL Time : 0.00273395
CNorm=1.24836
Arma time:5.96046e-06
DNorm=1.24836
diffs=0x1
Input A:
[matrix size: 3x3; n_nonzero: 3; density: 33.33%]

(0, 0) 0.8402
(1, 0) 0.3944
(2, 0) 0.7831

Input B:
0.7984 0.3352 0.5540
0.9116 0.7682 0.4774
0.1976 0.2778 0.6289
output C:
0.6708 0.2817 0.4654
0.3149 0.1322 0.2185
0.6253 0.2625 0.4338
ArmaD:
0.6708 0.2817 0.4654
0.3149 0.1322 0.2185
0.6253 0.2625 0.4338
Recover A=C*inv(B)
8.4019e-01 2.9802e-08 -1.7881e-07
3.9438e-01 1.4901e-08 -8.9407e-08
7.8310e-01 5.9605e-08 -1.7881e-07

Recover A=armaD*inv(B)
8.4019e-01 2.9802e-08 -1.7881e-07
3.9438e-01 1.4901e-08 -8.9407e-08
7.8310e-01 5.9605e-08 -1.7881e-07

Ying_H_Intel · ‎12-21-2014

Attached the code and binary file