I've been doing some simple tests, turning on MKL_VERBOSE and comparing the times reported there against times measured in Python. My question is about the observed overheads.
The setup I used to generate these results is Ivy Bridge (2x12 cores, 24 cores total) on Edison at NERSC. I can run on Haswell too and I recall seeing similar overheads. I've installed IDP through Anaconda (Python 2.7) following these instructions:
See the attached script. For a 5k x 5k matrix, the following was typical output:
MKL_VERBOSE DSYRK(L,N,5000,5000,0x7fffffff4c70,0x2aaaf0000010,5000,0x7fffffff4c78,0x2aaafbebd010,5000) 335.77ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:24 WDiv:HOST:+0.000 MKL_VERBOSE DPOTRF(U,5000,0x2aab07d7a010,5000,0) 103.63ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:24 WDiv:HOST:+0.000 PYTHON_TIME 0.607532024384
The numbers from MKL_VERBOSE are obviously in ms, and my number (PYTHON_TIME, in seconds) is 6x longer. I've experimented with placing the time() calls directly around the call to potrf inside the scipy source but it just gives me the same answer. I haven't tried to dig deeper, and I am just wondering if (a) I am interpreting the MKL_VERBOSE result properly, and (b) whether you can suggest where the overheads is coming from (they don't appear to be fixed, they vary with matrix size).
The scipy.linalg.cholesky is implemented via a call into Fortran code (see scipy/linalg/decomp_cholesky.py#L27). The input matrix, which is generated in the C layout thus needs to be converted to the F-layout.
Changing the function in your benchmark to convert the input to F-layout prior to measuring timing of the Cholesky decomposition like so:
def one_iteration( size ) : A = numpy.random.uniform( size = ( size, size ) ) ATA = numpy.dot(A.T, A).copy(order='F') start = time.time() scipy.linalg.cholesky( ATA, overwrite_a = True, check_finite = False ) stop = time.time() return stop - start
results in better agreement between MKL's and Python's time measurements.
MKL_VERBOSE DPOTRF(U,10000,0x7fd1c050f010,10000,0) 1.64s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16 WDiv:HOST:+0.000 PYTHON_TIME 1.7553684711456299