Question 1

Hello,

In C++, I need a function which enables the inversion on large matrixes (at the maximum, I need to inverse 120,000 x 120,000 size matrix).

I am trying with the workstation at work to perform this with LAPACKE dgetri and dgetrf routines with Intel OneAPI framework.

The workstation has 1TB of RAM and 2 GPU cards RTX A6000 of 48GB for each one.

Currently, the maximum that I can invert is roughly a matrix of 50,000 x 50,000. Over this size, the 1TB RAM if full and I get the following error :

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Command terminated by signal 6

Here the implemented function which inverses a matrix :

// Inversion Matrix : passing Matrixes by Reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {

  // Index for loop and arrays
  int i, j, ip, idx;

  // Size of F_matrix
  int N = F_matrix.size();
  cout << "m = " << N << endl;

  int *IPIV = new int[N];

  // Statement of main array to inverse
  double *arr = new double[N*N];

  // Statement of returned matrix of size N
  //vector<vector<double> > F_final(N, vector<double>(N));   

  // Output Diagonal block
  double *diag = new double[N];

  //#pragma omp parallel for num_threads(n_threads)
  for (i = 0; i<N; i++){
    for (j = 0; j<N; j++){
      idx = i*N + j;
      arr[idx] = F_matrix[i][j];
    }
  }

  // LAPACKE routines
  int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
  int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);

  //#pragma omp parallel for num_threads(n_threads)
  for (i = 0; i<N; i++){
    for (j = 0; j<N; j++){
      idx = i*N + j;
      F_output[i][j] = arr[idx];
    }
  }

  delete[] IPIV;
  delete[] arr;
}

Is there a workaround with LAPACKE to be able to invert a 120,000 x 120,000 matrix without having a bad alloc error, even if the RAM of workstation has 1TB?

PS: I have also tried to use MAGMA for GPU cards but I am also limited to 40,000 x 40,000 matrix size for one GPU. I couldn't have for the moment found a way to combine the powerful in the same time of both GPU.

EDIT :

Is there a way to pass by reference the F_matrix input variable as arguments for LAPACKE dgetrf and degetri? I remind that F_matrix has a type vector<vector<double>>.

Thanks in advance for your help.

Best regards

Question 2

Christophe,

please try to link against MKL ILP64 API. Check what MKL Linker Adviser (https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl/link-line-advisor.html) suggests to link with. In that case, don't forget to change all integers into your code as MKL_INT types.

the second way to solve such huge problems - use the distributed versions of these routines ( p?getri and p?getrf correspondingly) to run across many computing nodes. But in that case, You have to change the int datatypes to MKL_INT as well and link against ILP64 mkl's libraries as well.

-Gennady

Question 3

Hello Gennady,

Thanks for your quick answer.

Your first option didn't avoid to get 1TB RAM to be full and error bad alloc .

For your second suggestion, there a few examples on the web for pdgetri and pdgetf routines.

How to pass from :

// LAPACKE routines
MKL_INT info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
MKL_INT info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);

to pdgetri and pdgetrf ?

If someone could help me to implement these new routines for me, I didn't understand all the required arguments.

Best regards, chris

Question 4

MKL contains the Fortran API of PDGETRF routine example. You may find out this example in MKKLROOT/examples/scalapackf directory. There is no example of p?getri routine available.

Meantime, originally I wrongly estimate the memory allocation sizes for 120k x 120 k matrixes.

Actually, the allocated size of double-precision 120kx120k matrix == size * size* sizeof(double) ~ 110 Gb. You could see its feet with your 1TB RAM .

I made the trivial dgetrf example ( see attached),

Built it by using -DML_ILP64 compiler option and linked against ILP64 version of mkl libraries:

icc -DMKL_ILP64 -I/opt/intel/oneapi/mkl/2021.4.0/include test_getrf_huge.cpp \

-Wl,--start-group \

/opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_ilp64.a \

/opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_thread.a \

/opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_core.a \

-Wl,--end-group -liomp5 -lpthread -lm -ldl

Running this executable on the local AVX-512 based CPU with 256 Gb of RAM, plus MKL_VERBOSE mode, I see the getrf passed with Ok result. The execution time = 426 sec.

$ ./a.out 120000

MKL_VERBOSE oneMKL 2021.0 Update 4 Product build 20210904 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), EVEX-encoded AES and Carry-Less Multiplication Quadword instructions, Lnx 2.40GHz ilp64 intel_thread

MKL_VERBOSE DGETRF(120000,120000,0x150c472aa080,120000,0x1541ed9a9080,0) 426.99s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:72

...LAPACKE_dgetrf returns 0, SIZE = 120000

-Gennady

Question 5

I attached the test I used.

Question 6

Hello Gennady !

Your test code worked fine on the worstation. But runtime is relatively long compared to you (roughly 1800s). I have a processor AMD EPYC 64 cores (128 threads).

During the execution, I can see the 64 threads launched by the code (OMP_NUM_THREADS).

But for my code, RAM becomes quickly full and reach the 1TB, and one gets error "terminate called after throwing an instance of 'std::bad_alloc' ".

I think this is due to the fact I am using in other parts of the code 120k x 120k.

Tha's why I would like to pass to scaLAPACK, since I think a distributed computation is necessary in my case.

But as you mentioned it, there is only an example in Fortran, not C++.

Do you know by chance other sources or test codes for "pdgetrf" combined with "pdgetri" to carry out a 120k x 120k matrix inversion.

Here my last code with your suggestions :


// Passing Matrixes by Reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {

// Index for loop and arrays
MKL_INT i, j, ip, idx;

// Size of F_matrix
MKL_INT N = F_matrix.size();
cout << "m = " << N << endl;

//MKL_INT *IPIV = new MKL_INT[N];
MKL_INT *IPIV = (MKL_INT*)mkl_calloc( N, sizeof(MKL_INT), 64);

// Statement of main array to inverse
double *arr = (double*) mkl_calloc( N*N, sizeof(double), 64);

//#pragma omp parallel for num_threads(n_threads)
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
arr[idx] = F_matrix[i][j];
}
}

Compiled with intel.make :

CXX = icpc -std=c++11 -O3 -xHost -DMKL_ILP64
CXXFLAGS = -Wall -c -I${MKLROOT}/include -I/opt/intel/oneapi/compiler/latest/linux/compiler/include -qopenmp -qmkl=parallel
LDFLAGS = -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
SOURCES = main_intel.cpp XSAF_C_intel.cpp
EXECUTABLE = main_intel.exe

Any clues or suggestions ?

Thanks

Question 7

>> I think this is due to the fact I am using in other parts of the code 120k x 120k.

<< that's mean the problem is outside of MKLs calls and therefore you could somehow redesign the memory consumption of this application.

>> That's why I would like to pass to ScaLAPACK, since I think a distributed computation is necessary in my case.

<< I have only two suggestions: 1/ the C code of pdgetrf will look very similar to Fortran counterpart. A similar is true wrt pgetri routine. 2/ you might try to search this forum to see such kinds of examples.

Question 8

The thread is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

score 0 · Answer 1

5,894 Views

Christophe,

please try to link against MKL ILP64 API. Check what MKL Linker Adviser (https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl/link-line-advisor.html) suggests to link with. In that case, don't forget to change all integers into your code as MKL_INT types.

the second way to solve such huge problems - use the distributed versions of these routines ( p?getri and p?getrf correspondingly) to run across many computing nodes. But in that case, You have to change the int datatypes to MKL_INT as well and link against ILP64 mkl's libraries as well.

-Gennady

Copy link

score 0 · Answer 2

5,840 Views

Hello Gennady,

Thanks for your quick answer.

Your first option didn't avoid to get 1TB RAM to be full and error bad alloc .

For your second suggestion, there a few examples on the web for pdgetri and pdgetf routines.

How to pass from :

// LAPACKE routines
MKL_INT info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
MKL_INT info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);

to pdgetri and pdgetrf ?

If someone could help me to implement these new routines for me, I didn't understand all the required arguments.

Best regards, chris

Copy link

score 0 · Answer 3

MKL contains the Fortran API of PDGETRF routine example. You may find out this example in MKKLROOT/examples/scalapackf directory. There is no example of p?getri routine available.

Meantime, originally I wrongly estimate the memory allocation sizes for 120k x 120 k matrixes.

Actually, the allocated size of double-precision 120kx120k matrix == size * size* sizeof(double) ~ 110 Gb. You could see its feet with your 1TB RAM .

I made the trivial dgetrf example ( see attached),

Built it by using -DML_ILP64 compiler option and linked against ILP64 version of mkl libraries:

icc -DMKL_ILP64 -I/opt/intel/oneapi/mkl/2021.4.0/include test_getrf_huge.cpp \

-Wl,--start-group \

/opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_ilp64.a \

/opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_thread.a \

/opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_core.a \

-Wl,--end-group -liomp5 -lpthread -lm -ldl

Running this executable on the local AVX-512 based CPU with 256 Gb of RAM, plus MKL_VERBOSE mode, I see the getrf passed with Ok result. The execution time = 426 sec.

$ ./a.out 120000

MKL_VERBOSE oneMKL 2021.0 Update 4 Product build 20210904 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), EVEX-encoded AES and Carry-Less Multiplication Quadword instructions, Lnx 2.40GHz ilp64 intel_thread

MKL_VERBOSE DGETRF(120000,120000,0x150c472aa080,120000,0x1541ed9a9080,0) 426.99s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:72

...LAPACKE_dgetrf returns 0, SIZE = 120000

-Gennady

score 0 · Answer 4

5,809 Views

I attached the test I used.

2 KB

Copy link

score 0 · Answer 5

Hello Gennady !

Your test code worked fine on the worstation. But runtime is relatively long compared to you (roughly 1800s). I have a processor AMD EPYC 64 cores (128 threads).

During the execution, I can see the 64 threads launched by the code (OMP_NUM_THREADS).

But for my code, RAM becomes quickly full and reach the 1TB, and one gets error "terminate called after throwing an instance of 'std::bad_alloc' ".

I think this is due to the fact I am using in other parts of the code 120k x 120k.

Tha's why I would like to pass to scaLAPACK, since I think a distributed computation is necessary in my case.

But as you mentioned it, there is only an example in Fortran, not C++.

Do you know by chance other sources or test codes for "pdgetrf" combined with "pdgetri" to carry out a 120k x 120k matrix inversion.

Here my last code with your suggestions :


// Passing Matrixes by Reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {

// Index for loop and arrays
MKL_INT i, j, ip, idx;

// Size of F_matrix
MKL_INT N = F_matrix.size();
cout << "m = " << N << endl;

//MKL_INT *IPIV = new MKL_INT[N];
MKL_INT *IPIV = (MKL_INT*)mkl_calloc( N, sizeof(MKL_INT), 64);

// Statement of main array to inverse
double *arr = (double*) mkl_calloc( N*N, sizeof(double), 64);

//#pragma omp parallel for num_threads(n_threads)
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
arr[idx] = F_matrix[i][j];
}
}

Compiled with intel.make :

CXX = icpc -std=c++11 -O3 -xHost -DMKL_ILP64
CXXFLAGS = -Wall -c -I${MKLROOT}/include -I/opt/intel/oneapi/compiler/latest/linux/compiler/include -qopenmp -qmkl=parallel
LDFLAGS = -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
SOURCES = main_intel.cpp XSAF_C_intel.cpp
EXECUTABLE = main_intel.exe

Any clues or suggestions ?

Thanks

score 0 · Answer 6

5,728 Views

>> I think this is due to the fact I am using in other parts of the code 120k x 120k.

<< that's mean the problem is outside of MKLs calls and therefore you could somehow redesign the memory consumption of this application.

>> That's why I would like to pass to ScaLAPACK, since I think a distributed computation is necessary in my case.

<< I have only two suggestions: 1/ the C code of pdgetrf will look very similar to Fortran counterpart. A similar is true wrt pgetri routine. 2/ you might try to search this forum to see such kinds of examples.

Copy link

score 0 · Answer 7

5,560 Views

The thread is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Copy link

LAPACKE - dgetri and dgetrf for large matrix - avoid std::bad_alloc' error

EDIT :