I am running experiments with dportf and doing some bibliography and I had a few questions / conjectures.
1. Litterature about Cholesky shows blocked versions and recursive block versions of Cholesky are the fastest.
I am assuming the MKL implementation follows this pattern: BIG dportf = smaller dpotrf + dtrsm + dsyrk
2. Then I was trying to determine the dpotrf block size in the decomposition. I conjectured it is the biggest power of 2 for which I would not see a substantial gain from 1 to 2 threads. On my Xeon 5520 which has 256K of L2 per core, I found a 128 x 128 block size for the dpotrf which takes more than half the L2 size.
3. Then, I conjecture if I were able to generate a faster 128x128 cholesky, and plugged it back into the mkl somehow, I should see some speedup.
Does this seem reasonable ?
MKL implementation includes several approaches and MKL switches between them according to problem size and environment trying to select the best one. Block sizes is also changing if applicable. Parallel implementationis significantly differ from sequential one, it utilizes top level parallelizm in Cholesky algorithm and calling sequential BLAS functions.
You could find out block sizes for sequential version by defaultNETLIB LAPACK way - via ILAENV call like:
NB = ILAENV( 1, 'DPOTRF', UPLO, N, -1, -1, -1 ). However you will see best blocksizes even not for NETLIB code, but for one of MKL code paths. Others are hidden from user.
Generally if you speed up for example Cholesky of128x128 blockyou will not see big speed up for larger matrices. Let's assume sequential version (1 thread) and let's use as example matrix of size 4096. Number of operations for Cholesky is (1/3)n^3. We will have 32 times of 128x128 cholesky for such matrix. This is 32*(1/3)*128^3 operations. Comparing to overall operations count it's just0.098% ( = (32*(1/3)*128^3) / ((1/3)*4096^3)). Even if local factorization for example is ten times slower than optimized DTRSM and DSYRK in terms of efficiency(GFLOPS), it will take 0.97% of overall execution time, and a speed up for it could be justlost in time measurement noise and wouldn't matter much.
However for parallel version these localfactorizations are on critical path, thus for example if DTRSM and DSYRK have 10x speed up on some parallel machine then sequential local factorization will take 8.9% of overall time and its speed up will be noticable. But for parallel version we could interleave local factorization and update operations. And also we could easily decrease the percentage by decreasing block size, although we could loose BLAS pefromance because the bigger is block the better for BLAS.
Hope I answered your question. Feel free to ask more.