I am working with the mkl_sparse_s_mm routine to perform:
C = A * B
where A is a sparse matrix and B and C are dense matrices. I would like some details about the algorithm for sparse-dense matrix multiplication implemented by mkl_sparse_s_mm, i.e., whether it uses cache-aware strategies, specific micro-kernel implementations to fully leverage the CPU registers, etc.
Just to provide an example, high-performance dense matrix multiplication (cblas_gemm) is usually implemented with a blocked, packing-based algorithm.
As Gennady said, this information is not public. The general answer is yes: we apply different (including hardware-aware) optimizations at different levels (multithreading, vectorization, and assembly) in order to achieve better performance, as we do throughout MKL.
You can find published articles on the different algorithms that can be used for sparse-dense matrix multiplication.
The sparsity of one of the inputs means that some optimization techniques used for dense gemm make much less sense, or none at all, while some additional ideas become applicable. It also helps to consider whether the functionality is compute-bound or memory-bound.