Large overheads on scipy.linalg.cholesky calls

Rollin_T_ · ‎08-10-2016

I've been doing some simple tests, turning on MKL_VERBOSE and comparing the times reported there against times measured in Python. My question is about the observed overheads.

The setup I used to generate these results is Ivy Bridge (2x12 cores, 24 cores total) on Edison at NERSC. I can run on Haswell too and I recall seeing similar overheads. I've installed IDP through Anaconda (Python 2.7) following these instructions:

https://software.intel.com/en-us/articles/using-intel-distribution-for-python-with-anaconda

See the attached script. For a 5k x 5k matrix, the following was typical output:

MKL_VERBOSE DSYRK(L,N,5000,5000,0x7fffffff4c70,0x2aaaf0000010,5000,0x7fffffff4c78,0x2aaafbebd010,5000) 335.77ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24 WDiv:HOST:+0.000
MKL_VERBOSE DPOTRF(U,5000,0x2aab07d7a010,5000,0) 103.63ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24 WDiv:HOST:+0.000
PYTHON_TIME 0.607532024384

The numbers from MKL_VERBOSE are obviously in ms, and my number (PYTHON_TIME, in seconds) is 6x longer. I've experimented with placing the time() calls directly around the call to potrf inside the scipy source but it just gives me the same answer. I haven't tried to dig deeper, and I am just wondering if (a) I am interpreting the MKL_VERBOSE result properly, and (b) whether you can suggest where the overheads is coming from (they don't appear to be fixed, they vary with matrix size).

Oleksandr_P_Intel · ‎08-11-2016

The scipy.linalg.cholesky is implemented via a call into Fortran code (see scipy/linalg/decomp_cholesky.py#L27). The input matrix, which is generated in the C layout thus needs to be converted to the F-layout.

Changing the function in your benchmark to convert the input to F-layout prior to measuring timing of the Cholesky decomposition like so:

def one_iteration( size ) :
    A   = numpy.random.uniform( size = ( size, size ) )
    ATA = numpy.dot(A.T, A).copy(order='F')
    start = time.time()
    scipy.linalg.cholesky( ATA, overwrite_a = True, check_finite = False )
    stop = time.time()
    return stop - start

results in better agreement between MKL's and Python's time measurements.

MKL_VERBOSE DPOTRF(U,10000,0x7fd1c050f010,10000,0) 1.64s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:16 WDiv:HOST:+0.000
PYTHON_TIME 1.7553684711456299

Rollin_T_ · ‎08-12-2016

Ah of course, thanks, that brings the two numbers into agreement.