The MKL implementation includes several approaches, and MKL switches between them according to problem size and environment, trying to select the best one. Block sizes also change where applicable. The parallel implementation differs significantly from the sequential one: it exploits top-level parallelism in the Cholesky algorithm and calls sequential BLAS functions.
You can find the block size for the sequential version the default NETLIB LAPACK way, via an ILAENV call like:
NB = ILAENV( 1, 'DPOTRF', UPLO, N, -1, -1, -1 ). However, the value you get back is the best block size for one of the MKL code paths, not for the NETLIB code. Block sizes for the other code paths are hidden from the user.
Generally, if you speed up, for example, the Cholesky factorization of a 128x128 block, you will not see a big speedup for larger matrices. Let's assume the sequential version (1 thread) and take a matrix of size 4096 as an example. The number of operations for Cholesky is (1/3)n^3. For such a matrix there are 32 factorizations of 128x128 diagonal blocks (4096/128 = 32), which is 32*(1/3)*128^3 operations. Compared to the overall operation count that is just 0.098% ( = (32*(1/3)*128^3) / ((1/3)*4096^3)). Even if the local factorization is, for example, ten times slower than the optimized DTRSM and DSYRK in terms of efficiency (GFLOPS), it takes only about 0.97% of the overall execution time, so a speedup there could be lost in time measurement noise and wouldn't matter much.
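To make the arithmetic above easy to re-check, here is a small Python sketch of the same estimate. It is only a back-of-the-envelope cost model, not anything taken from MKL: the function name diag_block_share and its parameters are made up for illustration, and it assumes time is proportional to flops divided by efficiency.

```python
def diag_block_share(n, nb, slowdown=1.0):
    """Estimated fraction of total time spent factoring the nb x nb diagonal blocks.

    Hypothetical cost model (an assumption, not MKL internals):
    time = flops * slowdown for the diagonal blocks, time = flops for the rest.
    """
    num_blocks = n // nb                       # 4096 // 128 = 32 diagonal blocks
    diag_flops = num_blocks * nb**3 / 3.0      # 32 * (1/3) * 128^3
    total_flops = n**3 / 3.0                   # (1/3) * 4096^3
    update_flops = total_flops - diag_flops    # DTRSM/DSYRK part
    diag_time = diag_flops * slowdown          # local factorization runs `slowdown` times slower
    return diag_time / (diag_time + update_flops)


if __name__ == "__main__":
    # Share of the operation count (same efficiency everywhere).
    print(f"flop share:  {100 * diag_block_share(4096, 128):.3f}%")
    # Share of run time if the local factorization is ten times less efficient.
    print(f"time share:  {100 * diag_block_share(4096, 128, slowdown=10.0):.2f}%")
```

With n = 4096 and nb = 128 this gives the 0.098% and 0.97% figures quoted above.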
However, for the parallel version these local factorizations are on the critical path. For example, if DTRSM and DSYRK get a 10x speedup on some parallel machine, then the sequential local factorization (still assumed ten times less efficient) takes 8.9% of the overall time, and speeding it up becomes noticeable. But in the parallel version we can interleave local factorizations with update operations. We can also easily decrease that percentage by decreasing the block size, although we could lose BLAS performance, because the bigger the block, the better for BLAS.
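The same kind of rough estimate can be written down for the parallel case, including the effect of the block size. Again this is only a sketch under stated assumptions: the name parallel_diag_share is hypothetical, the local factorization is assumed to stay ten times less efficient at every block size, the updates are assumed to get a flat 10x speedup, and interleaving is ignored. Real BLAS efficiency drops for small blocks, which this model does not capture.

```python
def parallel_diag_share(n, nb, slowdown=10.0, update_speedup=10.0):
    """Estimated fraction of run time spent in sequential diagonal-block factorizations.

    Hypothetical model (assumptions, not MKL internals):
    - diagonal blocks run sequentially, `slowdown` times less efficiently than BLAS;
    - DTRSM/DSYRK updates run `update_speedup` times faster thanks to parallelism;
    - no overlap between local factorizations and updates.
    """
    # (n/nb) blocks of (1/3)*nb^3 flops out of (1/3)*n^3 total gives (nb/n)^2.
    r = (nb / n) ** 2
    diag_time = r * slowdown
    update_time = (1.0 - r) / update_speedup
    return diag_time / (diag_time + update_time)


if __name__ == "__main__":
    for nb in (32, 64, 128, 256):
        share = parallel_diag_share(4096, nb)
        print(f"nb = {nb:3d}: local factorization ~ {100 * share:.1f}% of total time")
```

For nb = 128 the model gives the 8.9% mentioned above, and the share shrinks as the block size goes down, which is the trade-off against BLAS performance.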
Hope I answered your question. Feel free to ask more.