- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'm attempting to use MKL_DIRECT_CALL to speed-up solving a number of small linear systems. The documentation of this feature (https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2024-2/improving-performance-for-small-size-problems.html) states that:
"Intel® oneAPI Math Kernel Library (oneMKL) skips error checking and intermediate function calls if the problem size is small enough (for example: a call to a function that supports direct call, such as dgemm, with matrix ranks smaller than 50)." [emphasis added]
However, if we take a look at the following function,
// dc_dgesv.cpp
#define MKL_DIRECT_CALL_SEQ
#include <mkl.h>
void dc_dgesv_(int *n,
int *nrhs,
double *a,
int *lda,
int *ipiv,
double *b,
int *ldb
int *info ) {
dgetrf(n,n,a,lda,ipiv,info);
if (*info == 0) {
char trans = 'N';
dgetrs(&trans,n,nrhs,a,lda,ipiv,b,ldb,info);
}
}
after running the program through the preprocessor (icpc -E dc_dgesv.c) and adjusting the indentation for clarity we can see the substitution,
void dc_dgesv_(int *n,
int *nrhs,
double *a,
int *lda,
int *ipiv,
double *b,
int *ldb
int *info ) {
do {
const int temp_m = *(n);
const int temp_n = *(n);
const int temp_lda = *(lda);
if ((((temp_m) <= 5) && ((temp_n) <= 5) && 1)) {
mkl_dc_dgetrf(temp_m, temp_n, (a), temp_lda, (ipiv), (info));
} else {
dgetrf((n), (n), (a), (lda), (ipiv), (info));
}
} while (0);
if (*info == 0) {
char trans = 'N';
do {
const int temp_n = *(n);
const int temp_nrhs = *(nrhs);
if ((((temp_n) <= 5) && ((temp_nrhs) <= 10) && 1)) {
const char temp_trans = *(&trans);
const int temp_lda = *(lda);
const int temp_ldb = *(ldb);
mkl_dc_dgetrs(temp_trans, temp_n, temp_nrhs, (a), temp_lda, (ipiv), (b), temp_ldb, (info));
} else {
dgetrs((&trans), (n), (nrhs), (a), (lda), (ipiv), (b), (ldb), (info));
}
} while (0);
}
}
Notably, the direct call only occurs when both dimensions are smaller or equal to 5 (and not rank 50).
This brings me to my questions:
- Does the rank 50 apply to the other methods (e.g. the BLAS operations like ?gemm, but not LAPACK routines like ?getrf/?getrs)?
- Do the limits change between versions of oneMKL?
- Would the batched LAPACK routines(getrf_batch_xx, getrs_batch_xx) be suitable for this type of problem?
- Are there any JIT approaches I could use to speed-up dgetrf/dgetrs?
For more context on my problem, I'm interested in solving ~10^6 linear systems with sizes in the range from 20 to 80. Currently I'm using the sequential MKL library, and doing multi-threading at the outer level over the independent systems. This is part of the assembly phase of a sparse matrix problem (the solution of the linear systems are the rows of the sparse matrix).
(p.s. I'm impressed by the preprocessing and function inlining that goes into MKL_DIRECT_CALL
Link Copied
![](/skins/images/045A6C88D0527A93E76B179D7F5E2AFE/responsive_peak/images/icon_anonymous_message.png)
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page