Intel® oneAPI HPC Toolkit
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

LAPACKE - dgetri and dgetrf for a large matrix - avoiding the 'std::bad_alloc' error

chris6
Beginner
477 Views

 

Hello,

 

In C++, I need a function that can invert large matrices (at most, I need to invert a 120,000 x 120,000 matrix).

On the workstation at work, I am trying to do this with the LAPACKE dgetrf and dgetri routines from the Intel oneAPI toolkit.

The workstation has 1 TB of RAM and two RTX A6000 GPU cards with 48 GB each.

Currently, the largest matrix I can invert is roughly 50,000 x 50,000. Beyond this size, the 1 TB of RAM is full and I get the following error:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Command terminated by signal 6

Here is the function that inverts a matrix:

// Matrix inversion: passing matrices by reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {

  // Size of F_matrix
  int N = F_matrix.size();
  cout << "m = " << N << endl;

  // Pivot indices from the LU factorization
  int *IPIV = new int[N];

  // Flattened (row-major) copy of the matrix to invert;
  // size_t arithmetic avoids 32-bit overflow of N*N for large N
  double *arr = new double[(size_t)N * N];

  //#pragma omp parallel for num_threads(n_threads)
  for (int i = 0; i < N; i++){
    for (int j = 0; j < N; j++){
      arr[(size_t)i * N + j] = F_matrix[i][j];
    }
  }

  // LU factorization, then inversion from the factors
  int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
  int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);
  if (info1 != 0 || info2 != 0)
    cerr << "LAPACKE error: dgetrf = " << info1 << ", dgetri = " << info2 << endl;

  //#pragma omp parallel for num_threads(n_threads)
  for (int i = 0; i < N; i++){
    for (int j = 0; j < N; j++){
      F_output[i][j] = arr[(size_t)i * N + j];
    }
  }

  delete[] IPIV;
  delete[] arr;
}

Is there a workaround with LAPACKE to invert a 120,000 x 120,000 matrix without hitting the bad_alloc error, given that the workstation has 1 TB of RAM?

PS: I have also tried MAGMA on the GPU cards, but there I am also limited to a 40,000 x 40,000 matrix on one GPU. For the moment I have not found a way to combine the power of both GPUs at the same time.

EDIT :

Is there a way to pass the F_matrix input variable by reference as an argument to LAPACKE dgetrf and dgetri? Recall that F_matrix has type vector<vector<double>>.

Thanks in advance for your help.

Best regards

 

7 Replies
Gennady_F_Intel
Moderator
450 Views

Christophe,

please try to link against the MKL ILP64 API. Check what the MKL Link Line Advisor (https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl/link-line-advisor.html) suggests linking with. In that case, don't forget to change all integers in your code to the MKL_INT type.

The second way to solve such huge problems is to use the distributed versions of these routines (p?getrf and p?getri, respectively) and run across many compute nodes. In that case too, you have to change the int datatypes to MKL_INT and link against the ILP64 MKL libraries.

-Gennady



chris6
Beginner
396 Views

Hello Gennady,

 

Thanks for your quick answer.

Your first option did not prevent the 1 TB of RAM from filling up, and I still get the bad_alloc error.

 

For your second suggestion, there are only a few examples on the web for the pdgetri and pdgetrf routines.

How do I go from:

// LAPACKE routines
MKL_INT info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
MKL_INT info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);

to pdgetrf and pdgetri?

 

Could someone help me implement these new routines? I don't understand all the required arguments.

Best regards, chris

 

 

 

Gennady_F_Intel
Moderator
383 Views

MKL contains a Fortran API example for the PDGETRF routine. You can find it in the MKLROOT/examples/scalapackf directory. There is no example available for the p?getri routine.

 

Meanwhile, I initially misestimated the memory allocation size for 120k x 120k matrices.

Actually, the allocated size of a double-precision 120k x 120k matrix is size * size * sizeof(double) ≈ 110 GB, so it fits within your 1 TB of RAM.

 

I made a trivial dgetrf example (see attached), built it using the -DMKL_ILP64 compiler option, and linked against the ILP64 version of the MKL libraries:

 

icc -DMKL_ILP64 -I/opt/intel/oneapi/mkl/2021.4.0/include test_getrf_huge.cpp \
  -Wl,--start-group \
  /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_ilp64.a \
  /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_intel_thread.a \
  /opt/intel/oneapi/mkl/2021.4.0/lib/intel64/libmkl_core.a \
  -Wl,--end-group -liomp5 -lpthread -lm -ldl

 

Running this executable on a local AVX-512 based CPU with 256 GB of RAM, with MKL_VERBOSE mode enabled, I see that getrf passes with an OK result. The execution time was 426 s.

 

$ ./a.out 120000

MKL_VERBOSE oneMKL 2021.0 Update 4 Product build 20210904 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), EVEX-encoded AES and Carry-Less Multiplication Quadword instructions, Lnx 2.40GHz ilp64 intel_thread

MKL_VERBOSE DGETRF(120000,120000,0x150c472aa080,120000,0x1541ed9a9080,0) 426.99s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:72

...LAPACKE_dgetrf returns 0, SIZE = 120000

 

-Gennady



Gennady_F_Intel
Moderator
365 Views

I attached the test I used.

chris6
Beginner
298 Views

Hello Gennady !

Your test code worked fine on the workstation, but the runtime is relatively long compared to yours (roughly 1800 s). I have an AMD EPYC processor with 64 cores (128 threads).

During the execution, I can see the 64 threads launched by the code (OMP_NUM_THREADS).

But with my own code, the RAM quickly fills up to the full 1 TB, and I get the error "terminate called after throwing an instance of 'std::bad_alloc'".

I think this is because I am also using 120k x 120k matrices in other parts of the code.

 

That's why I would like to move to ScaLAPACK, since I think a distributed computation is necessary in my case.

 

But as you mentioned, there is only a Fortran example, not a C++ one.

Do you know of other sources or test codes for "pdgetrf" combined with "pdgetri" to carry out a 120k x 120k matrix inversion?

 

Here is my latest code with your suggestions:

 


// Passing matrices by reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {

  // Index for loops and arrays
  MKL_INT i, j, ip, idx;

  // Size of F_matrix
  MKL_INT N = F_matrix.size();
  cout << "m = " << N << endl;

  //MKL_INT *IPIV = new MKL_INT[N];
  MKL_INT *IPIV = (MKL_INT*)mkl_calloc(N, sizeof(MKL_INT), 64);

  // Main array to invert
  double *arr = (double*)mkl_calloc(N*N, sizeof(double), 64);

  //#pragma omp parallel for num_threads(n_threads)
  for (i = 0; i < N; i++){
    for (j = 0; j < N; j++){
      idx = i*N + j;
      arr[idx] = F_matrix[i][j];
    }
  }

 

Compiled with intel.make :

 

CXX = icpc -std=c++11 -O3 -xHost -DMKL_ILP64
CXXFLAGS = -Wall -c -I${MKLROOT}/include -I/opt/intel/oneapi/compiler/latest/linux/compiler/include -qopenmp -qmkl=parallel
LDFLAGS = -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
SOURCES = main_intel.cpp XSAF_C_intel.cpp
EXECUTABLE = main_intel.exe

 

Any clues or suggestions ?

 

Thanks

Gennady_F_Intel
Moderator
284 Views

>> I think this is because I am also using 120k x 120k matrices in other parts of the code.

<< That means the problem lies outside the MKL calls, so you could redesign this application's memory consumption.

>> That's why I would like to move to ScaLAPACK, since I think a distributed computation is necessary in my case.

<< I have only two suggestions: 1/ the C code for pdgetrf will look very similar to its Fortran counterpart, and the same is true for the p?getri routine. 2/ you might search this forum for such examples.



Gennady_F_Intel
Moderator
116 Views

The thread is closing and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.


