Size limits of MKL_DIRECT_CALL

ivanp · ‎06-26-2024

I'm attempting to use MKL_DIRECT_CALL to speed-up solving a number of small linear systems. The documentation of this feature (https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2024-2/improving-performance-for-small-size-problems.html) states that:

"Intel® oneAPI Math Kernel Library (oneMKL) skips error checking and intermediate function calls if the problem size is small enough (for example: a call to a function that supports direct call, such as dgemm, with matrix ranks smaller than 50)." [emphasis added]

However, if we take a look at the following function,

// dc_dgesv.cpp

#define MKL_DIRECT_CALL_SEQ
#include <mkl.h>

void dc_dgesv_(int *n, 
               int *nrhs, 
               double *a, 
               int *lda, 
               int *ipiv,
               double *b,
               int *ldb
               int *info ) {

    dgetrf(n,n,a,lda,ipiv,info);
    if (*info == 0) {
        char trans = 'N';
        dgetrs(&trans,n,nrhs,a,lda,ipiv,b,ldb,info);
    }

}

after running the program through the preprocessor (icpc -E dc_dgesv.c) and adjusting the indentation for clarity we can see the substitution,

void dc_dgesv_(int *n, 
               int *nrhs, 
               double *a, 
               int *lda, 
               int *ipiv,
               double *b,
               int *ldb
               int *info ) {

    do { 
      const int temp_m = *(n); 
      const int temp_n = *(n); 
      const int temp_lda = *(lda); 
      if ((((temp_m) <= 5) && ((temp_n) <= 5) && 1)) { 
        mkl_dc_dgetrf(temp_m, temp_n, (a), temp_lda, (ipiv), (info)); 
      } else { 
        dgetrf((n), (n), (a), (lda), (ipiv), (info)); 
      } 
    } while (0);

    if (*info == 0) {
        char trans = 'N';
        do { 
          const int temp_n = *(n); 
          const int temp_nrhs = *(nrhs); 
          if ((((temp_n) <= 5) && ((temp_nrhs) <= 10) && 1)) { 
            const char temp_trans = *(&trans); 
            const int temp_lda = *(lda); 
            const int temp_ldb = *(ldb); 
            mkl_dc_dgetrs(temp_trans, temp_n, temp_nrhs, (a), temp_lda, (ipiv), (b), temp_ldb, (info)); 
          } else { 
            dgetrs((&trans), (n), (nrhs), (a), (lda), (ipiv), (b), (ldb), (info)); 
          } 
        } while (0);
    }
}

Notably, the direct call only occurs when both dimensions are smaller or equal to 5 (and not rank 50).

This brings me to my questions:

Does the rank 50 apply to the other methods (e.g. the BLAS operations like ?gemm, but not LAPACK routines like ?getrf/?getrs)?
Do the limits change between versions of oneMKL?
Would the batched LAPACK routines(getrf_batch_xx, getrs_batch_xx) be suitable for this type of problem?
Are there any JIT approaches I could use to speed-up dgetrf/dgetrs?

For more context on my problem, I'm interested in solving ~10^6 linear systems with sizes in the range from 20 to 80. Currently I'm using the sequential MKL library, and doing multi-threading at the outer level over the independent systems. This is part of the assembly phase of a sparse matrix problem (the solution of the linear systems are the rows of the sparse matrix).

(p.s. I'm impressed by the preprocessing and function inlining that goes into MKL_DIRECT_CALL )