Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Size limits of MKL_DIRECT_CALL

ivanp
New Contributor I
1,792 Views

I'm attempting to use MKL_DIRECT_CALL to speed up solving a large number of small linear systems. The documentation of this feature (https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2024-2/improving-performance-for-small-size-problems.html) states that:

"Intel® oneAPI Math Kernel Library (oneMKL) skips error checking and intermediate function calls if the problem size is small enough (for example: a call to a function that supports direct call, such as dgemm, with matrix ranks smaller than 50)." [emphasis added]

However, if we take a look at the following function,

 

// dc_dgesv.cpp

#define MKL_DIRECT_CALL_SEQ
#include <mkl.h>

void dc_dgesv_(int *n, 
               int *nrhs, 
               double *a, 
               int *lda, 
               int *ipiv,
               double *b,
               int *ldb,
               int *info ) {

    dgetrf(n,n,a,lda,ipiv,info);
    if (*info == 0) {
        char trans = 'N';
        dgetrs(&trans,n,nrhs,a,lda,ipiv,b,ldb,info);
    }

}

 

after running the file through the preprocessor (icpc -E dc_dgesv.cpp) and adjusting the indentation for clarity, we can see the substitution:

 

void dc_dgesv_(int *n, 
               int *nrhs, 
               double *a, 
               int *lda, 
               int *ipiv,
               double *b,
               int *ldb,
               int *info ) {

    do { 
      const int temp_m = *(n); 
      const int temp_n = *(n); 
      const int temp_lda = *(lda); 
      if ((((temp_m) <= 5) && ((temp_n) <= 5) && 1)) { 
        mkl_dc_dgetrf(temp_m, temp_n, (a), temp_lda, (ipiv), (info)); 
      } else { 
        dgetrf((n), (n), (a), (lda), (ipiv), (info)); 
      } 
    } while (0);

    if (*info == 0) {
        char trans = 'N';
        do { 
          const int temp_n = *(n); 
          const int temp_nrhs = *(nrhs); 
          if ((((temp_n) <= 5) && ((temp_nrhs) <= 10) && 1)) { 
            const char temp_trans = *(&trans); 
            const int temp_lda = *(lda); 
            const int temp_ldb = *(ldb); 
            mkl_dc_dgetrs(temp_trans, temp_n, temp_nrhs, (a), temp_lda, (ipiv), (b), temp_ldb, (info)); 
          } else { 
            dgetrs((&trans), (n), (nrhs), (a), (lda), (ipiv), (b), (ldb), (info)); 
          } 
        } while (0);
    }
}

 


Notably, the direct call only occurs when both dimensions are less than or equal to 5 (not rank 50).

This brings me to my questions:

  • Does the rank-50 limit apply only to certain routines (e.g. BLAS operations like ?gemm, but not LAPACK routines like ?getrf/?getrs)?
  • Do the limits change between versions of oneMKL?
  • Would the batched LAPACK routines (getrf_batch_xx, getrs_batch_xx) be suitable for this type of problem?
  • Are there any JIT approaches I could use to speed up dgetrf/dgetrs?


For more context on my problem, I'm interested in solving ~10^6 linear systems with sizes in the range from 20 to 80. Currently I'm using the sequential MKL library and doing multi-threading at the outer level over the independent systems. This is part of the assembly phase of a sparse matrix problem (the solutions of the linear systems form the rows of the sparse matrix).
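
To make the pattern concrete, here is a minimal C++/LAPACKE sketch of what I mean by outer-level threading over the independent systems (the array layout and names are illustrative, not my actual code):

// Sketch only: sequential MKL per call, threading at the outer loop level.
#include <mkl.h>
#include <vector>

void solve_all(int nsys, int n, int nrhs,
               std::vector<double> &A,   // nsys matrices of size n*n, stored back to back
               std::vector<double> &B)   // nsys right-hand-side blocks of size n*nrhs
{
    #pragma omp parallel
    {
        std::vector<lapack_int> ipiv(n);   // per-thread pivot array
        #pragma omp for
        for (int i = 0; i < nsys; ++i) {
            double *a = A.data() + static_cast<size_t>(i) * n * n;
            double *b = B.data() + static_cast<size_t>(i) * n * nrhs;
            lapack_int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, a, n, ipiv.data());
            if (info == 0)
                LAPACKE_dgetrs(LAPACK_COL_MAJOR, 'N', n, nrhs, a, n, ipiv.data(), b, n);
        }
    }
}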

(p.s. I'm impressed by the preprocessing and function inlining that goes into MKL_DIRECT_CALL )

 

 

0 Kudos
1 Solution
Mark_L_Intel
Moderator
View solution in original post

0 Kudos
13 Replies
ivanp
New Contributor I
1,682 Views

Just curious if anyone has had a chance to look at this?

0 Kudos
Vipin_S_Intel
Moderator
1,606 Views

Hi Ivan, we would like to inform you that we are routing your query to the dedicated team for further assistance.


0 Kudos
Mark_L_Intel
Moderator
1,556 Views

Hello @ivanp,

  • As you can see from the documentation ("for example: a call to a function that supports direct call, such as dgemm, with matrix ranks smaller than 50"), the parameter 50 is just an example for BLAS operations like ?gemm. For other operations such as ?getrf/?getrs the parameter may be different.
  • The optimizations (including the parameters used) can change between versions.
  • From what you describe (solving ~10^6 linear systems with sizes in the range from 20 to 80), batched routines, which are designed to handle multiple small linear systems efficiently, can improve performance.
  • Some MKL routines allow just-in-time code generation; see, for example, mkl_jit_create_?gemm (a minimal sketch follows below).
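
For illustration, here is a minimal sketch of the JIT GEMM flow (the sizes and leading dimensions are arbitrary examples; note that the JIT API covers ?gemm only, not ?getrf/?getrs):

// jit_dgemm_sketch.cpp -- minimal sketch of the mkl_jit_create_dgemm flow.
// Sizes and leading dimensions are arbitrary examples.
#include <mkl.h>

int main() {
    const MKL_INT m = 20, n = 20, k = 20;
    void *jitter = nullptr;

    // Generate a kernel specialized for these exact parameters.
    mkl_jit_status_t status = mkl_jit_create_dgemm(
        &jitter, MKL_COL_MAJOR, MKL_NOTRANS, MKL_NOTRANS,
        m, n, k, /*alpha=*/1.0, /*lda=*/m, /*ldb=*/k, /*beta=*/0.0, /*ldc=*/m);
    if (status == MKL_JIT_ERROR) return 1;

    // Retrieve the kernel pointer (falls back to standard dgemm if MKL_NO_JIT).
    dgemm_jit_kernel_t kernel = mkl_jit_get_dgemm_ptr(jitter);

    double A[20 * 20] = {0}, B[20 * 20] = {0}, C[20 * 20] = {0};
    kernel(jitter, A, B, C);   // C = alpha*A*B + beta*C with the baked-in parameters

    mkl_jit_destroy(jitter);
    return 0;
}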

       

0 Kudos
ivanp
New Contributor I
1,489 Views

Thanks Mark. 

I've implemented a batched version of my assembly procedure, but it runs at only about 2/3 of the speed I was able to achieve with the non-batched LAPACK routines.

Admittedly, I'm testing this on an older Mac with a 6-core Intel i7-9750H, so perhaps the outcome on a newer CPU / GPU or newer MKL version would be different. As far as I understand, the batched routines were primarily introduced to reduce overhead and increase parallelism when targeting GPUs.

In the non-batched version, my loop looked like this:

 

   ! Sparse Matrix Assembly Loop
   ! (loops over stencils corresponding to rows of the sparse matrix)

   !$omp parallel do default(shared) firstprivate(ld,x,y,coeffs,wrk,iwrk) private(ierr)
   do i = 1, nstencils
      ! 1. Gather xy coordinates of the i-th stencil of the unstructured mesh
      call gather_xy(nz,x,y,i,data)
      ! 2. Compute shape functions by solving a small linear system (PHS + poly)
      ! (Internally calls DGETRF/DGETRS from LAPACK)
      call rbf3_poly4(nz,x,y,q,xy_interp(:,1),xy_interp(:,2), &
               coeffs,ld,wrk,iwrk,ierr)
      ! 3. Pack coefficients into CSR matrix
      call pack_coeffs(nz,q,coeffs,ld,i,data)
   end do
   !$omp end parallel do

 

 

For the batched version, I wrote corresponding batched routines to create the systems and pack the coefficients (these routines use an OMP parallel do internally). This meant delegating the duties slightly differently between sub-procedures. The iteration range [1,nstencils] is broken into batches of size batch_size_max, which I set to 10000. In contrast to the non-batched version, forking occurs four times per batch:

 

do k = 1, nstencils, batch_size_max

    print *, "Processing stencils in range ", k, min(k+batch_size_max-1,nstencils)

    remaining = nstencils - k + 1
    batch_size = min(remaining, batch_size_max)
    offset = k - 1

    ! 1. Create batch of matrices and right-hand sides
    call rbfx_batch_shapegen(nz,nq,xq,yq, &
        A,ld,stride_a,B,ld,stride_b, &
        batch_size,offset,gather_xy,data)

    ! 2. Compute shape functions using batched LAPACK routines
    call dgetrf_batch_strided(n,n,A,ld,stride_a,ipiv,stride_ipiv,batch_size,info)
    call dgetrs_batch_strided('N',n,nq,A,ld,stride_a,ipiv,stride_ipiv, &
        B,ld,stride_b,batch_size,info)

    ! 3. Pack batch of coefficients into CSR matrix
    call pack_coeffs(k,k+batch_size-1,nz,nq,coeffs=B,ld=ld,data=data)

end do

 




 

0 Kudos
Mark_L_Intel
Moderator
1,414 Views

Hello @ivanp

Could you try to tune the batch size and also supply a reproducer?

 Please also see attached the reference material that might be helpful.

 

0 Kudos
ivanp
New Contributor I
1,267 Views

Thanks for the spec. It doesn't provide much guidance on the batch size (that is probably hardware-dependent anyway), but it does show that some time (and energy) savings are possible with this approach. I've created a standalone example in the attached archive.

I've found that the batch size certainly has an influence:

~/batched-assembly$ ./scan_batch_size.sh 
Batch Size Elapsed Time (s)     Stencils/s          
10         0.903                232744.504          
20         0.817                257205.403          
50         0.707                297391.510          
80         0.633                331958.476          
100        0.643                326753.241          
200        0.553                379917.871          
500        0.513                409836.057          
800        0.589                356823.029          
1000       0.674                311589.671          
2000       0.853                246323.933          
5000       0.856                245596.488          

However, it still runs about 20% slower than the non-batched version:

~/batched-assembly$ ./batched_assembly 
 --- IN REGULAR ASSEMBLY ---
 nstencils =       210152
 nz =           21
 nq =            9
 ld =           40
Assembly elapsed time (s):     0.396
Avg. assembly rate (stencils/s): 530094.963
Avg. assembly rate (nnz/s): 11131994.2
 --- EXITING REGULAR ASSEMBLY ---

First and last column of A:
a(1,:)   =     1.0000000     0.6203057     0.6048178     0.6426906     0.5447672     0.4624874     0.3779286     0.2583274     0.3148763
a(nnz,:) =     1.0000000     0.5233069     0.7125179     0.8136800     0.6883008     0.3853362     0.6387614     0.5710853     0.3331630

 

0 Kudos
ivanp
New Contributor I
1,258 Views

I've repeated the measurement on an Intel Core i7-11700K using a 2024 oneAPI release. Here I get matching runtimes from the batched and non-batched versions. With hyper-threading (16 threads) it achieves about 820k stencils/s. The best batch size is in the range 500-700.

The tests are a bit noisy, however, so perhaps a bigger dataset is needed. I also need to check with VTune how the elapsed time is distributed between the three phases (matrix generation, batched LU, and packing). In the non-batched version, the majority of time was spent in DGETRF/DGETRS (see VTune snapshot below):

 

[VTune snapshot: ivanp_0-1725900722178.png]

 

0 Kudos
Mark_L_Intel
Moderator
1,186 Views

Hello @ivanp,

 

Thank you for providing a reproducer and sharing your results. I was able to build it using the latest oneAPI after replacing the classic compilers with the new LLVM-based compilers:

#FC=ifort
FC=ifx
#CXX=icpc
CXX=icpx

Hope to see further updates on your experiments, maybe with a larger dataset, VTune data, etc. Meanwhile, I will ask for comments from our development team.

0 Kudos
Mark_L_Intel
Moderator
951 Views

Hello @ivanp, just a small update: the development team may start looking into the effect of different OpenMP scheduling on batched API performance.

0 Kudos
Mark_L_Intel
Moderator
748 Views

Hello @ivanp,

 

     OpenMP in batched and non-batched APIs

 

The non-batched dgetrf and dgetrs are threaded internally by default. However, if they are called from inside an OpenMP parallel loop, each call executes with a single thread, so each factorization/solve runs sequentially.

 

  For the batched APIs, oneMKL launches its own OpenMP parallel region, so multi-threading is done automatically internally. However, users can control the number of threads available to these parallel regions with the OMP_NUM_THREADS or MKL_NUM_THREADS environment variables. If these environment variables are not set, all the threads available on the system are used (unless the workload is small enough, in which case fewer threads are more practical).
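
For reference, the same control is also available programmatically through the MKL support functions; a small sketch (the thread counts are arbitrary examples):

// Sketch of programmatic thread control, equivalent in spirit to the
// environment variables above. Thread counts are arbitrary examples.
#include <mkl.h>
#include <cstdio>

int main() {
    // Cap the threads oneMKL may use inside its own parallel regions (e.g. batched APIs).
    mkl_set_num_threads(8);
    std::printf("MKL max threads: %d\n", mkl_get_max_threads());

    #pragma omp parallel
    {
        // Inside a user-level parallel region, keep each MKL call single-threaded
        // so that threading happens only at the outer loop level.
        int saved = mkl_set_num_threads_local(1);
        // ... per-thread dgetrf/dgetrs calls go here ...
        mkl_set_num_threads_local(saved);   // restore the previous (thread-local) setting
    }
    return 0;
}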

 

   The difference in batched and non-batched API timings

 

 

The oneMKL engineering team has confirmed the reported discrepancy in execution time. Still, it can only be partially attributed to the performance of the batched APIs compared to the non-batched ones. Other factors in the supplied reproducer also contribute as described in the next section.

 

First, regarding the performance of oneMKL batched vs. non-batched calls.

 

You are comparing the elapsed execution time of batched LU factorization and linear solve with the following two approaches: 

 

(1) Calling getrf_batch_strided (batched LU factorization) and getrs_batch_strided (batched linear solve) APIs

 

(2) Calling (non-batched) getrf (LU factorization) and getrs (linear solve) APIs inside an OpenMP parallel region.

 

When getr{f,s}_batch_strided are called in sequence, as you do in the batched assembly code path of the reproducer (approach 1), it is expected that calling getr{f,s} within a parallel region (approach 2) may perform better, mainly because the matrices are small enough to fit in cache. Approach 2 maintains data locality from the getrf call to the subsequent getrs call: for any given matrix, the LU factorization is computed and kept in cache memory for the subsequent linear solve.

 

   On the other hand, when calling the batched APIs, after a given matrix is LU factorized, it is evicted from the cache memory to make room for the LU factorization of another matrix in the batch (until the full batch is factorized). Then, the evicted matrices must be reloaded from the main memory for the subsequent linear solves, which is slow compared to already having the matrices in cache memory.

 

This pattern of data access and movement causes some of the discrepancies observed, and we could help this use case by introducing an API that does batch LU factorization and linear solve (gesv_batch_strided) within a single function call.

 

An internal ticket has been created to add a gesv_batch_strided CPU API. This API would improve performance for use cases where batched LU factorization and solve are needed together, as in your case.
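
Until such an API is available, one way to keep the factorization and solve fused per matrix is a plain ?gesv call inside your own parallel loop; a minimal C++ sketch, assuming the strided layout used in the reproducer (names are illustrative, not the future API):

// Sketch: fuse LU factorization and solve per matrix with dgesv inside a
// user-level OpenMP loop (interim alternative to a future gesv_batch_strided).
#include <mkl.h>

void solve_batch(int batch_size, int n, int nrhs, int ld,
                 double *A, int stride_a,        // stride_a >= ld*n
                 double *B, int stride_b,        // stride_b >= ld*nrhs
                 lapack_int *ipiv, int stride_ipiv,
                 lapack_int *info)
{
    #pragma omp parallel for
    for (int i = 0; i < batch_size; ++i) {
        info[i] = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs,
                                A + static_cast<size_t>(i) * stride_a, ld,
                                ipiv + static_cast<size_t>(i) * stride_ipiv,
                                B + static_cast<size_t>(i) * stride_b, ld);
    }
}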

 

Additional factors that contribute to timing differences.

 

However, optimizations from the above API will NOT address all discrepancies observed in the reproducer because the reproducer code includes additional factors unrelated to the batched API performance.

 

You may verify this by replacing the batched calls to getr{f,s}_batch_strided (in rbfx_assembly.F90, Lines 243-248) with their non-batched equivalents inside a parallel loop (as is done in the non-batched assembly code path). The two code paths still show a similar discrepancy in elapsed time, so the discrepancy is independent of batched vs. non-batched calls to MKL. The engineering team used this code to verify it:

!$omp parallel do                                                               
do i = 1, batch_size                                                            
    call dgetrf(m,n,A(1,1+(i-1)*n),ld,ipiv(1,i),info(i))                        
    call dgetrs('N',n,nq,A(1,1+(i-1)*n),ld,ipiv(1,i),B(1,1+(i-1)*nq),ld,info(i))
end do                                                                          
!$omp end parallel do  

This discrepancy can be attributed to data movement in the reproducer. The batched assembly code path launches separate parallel regions for the pre-processing and post-processing of data/results (the batched calls to MKL also have their own parallel regions).

 

Each of these parallel regions incurs launch overhead, and data locality is lost as execution moves from one region to the next (similar to the discussion above). So it is expected that the batched assembly approach is slower than the non-batched assembly code path, which performs all operations (pre-processing, LU factorization/solve, and post-processing) within a single parallel region and therefore maintains data locality from one stage to the next.

 

Generally, for this type of workload, where batched operations are made in sequence, we recommend that customers continue using approach 2 because it keeps data hot in the cache throughout the pre-processing, LU factorization+solve, and post-processing stages.

 

The engineering team is considering optimizations for getr{f,s} at the small matrix sizes used in this workload; this would benefit your use case more than optimizations to the batched APIs.

 

NUMA considerations

 

The engineering team observed a significant speed-up in both approaches when NUMA controls were used. If the platform has multiple NUMA nodes, you might consider using these controls.

 

0 Kudos
ivanp
New Contributor I
737 Views

Thanks for the analysis. So essentially for my workload, approach (2) minimizes data traffic and OpenMP forking overheads. I can think of a few ways I could potentially change the post- and pre-processing steps, but overall I don't think it will be profitable now.

The pre- and post-processing phases involve irregular and strided memory access patterns (gather and scatter). I guess that is why NUMA helps. I'll give it a try. Thanks for the tip.

 

 

 

0 Kudos
Mark_L_Intel
Moderator
636 Views

Hello @ivanp,

      It sounds like my previous post helped? Could we stop monitoring this issue internally?  

0 Kudos
ivanp
New Contributor I
565 Views

Yes, thanks for the advice. Okay to stop monitoring.

0 Kudos