Parallelization of dpotri and dpotrf

Jochen_S_ · 10-31-2014

I measured the time needed to invert a symmetric positive definite matrix with dpotrf and dpotri in parallel on a 32-core Sandy Bridge machine and got quite surprising results:

Although the MKL documentation gives the flop count as 1/3 n^3 for dpotrf and 2/3 n^3 for dpotri, the runtimes were quite different:

On an 8192x8192 matrix dpotrf took 1.1 sec and dpotri took 20 sec. On other sizes dpotri always takes more than 10 times as long as dpotrf. This surprises me, because dpotri only has to do twice the flops and should be easier to parallelize (I don't know the MKL code, but I know the algebraic operation dpotri performs, and it doesn't look that difficult to parallelize, especially for the Intel experts).
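To put the two timings on a common scale, they can be converted to achieved flop rates. The following is a minimal sketch in Python, taking the flop counts as n^3/3 for dpotrf and 2n^3/3 for dpotri as quoted from the documentation; the function names are made up for illustration and are not part of MKL.

```python
# Flop counts as quoted in the post (assumed: n^3/3 for dpotrf, 2n^3/3 for dpotri).
def potrf_flops(n: int) -> float:
    return n**3 / 3.0


def potri_flops(n: int) -> float:
    return 2.0 * n**3 / 3.0


def gflops(flops: float, seconds: float) -> float:
    """Achieved rate in GFLOP/s for a given flop count and wall-clock time."""
    return flops / seconds / 1e9


n = 8192
t_potrf, t_potri = 1.1, 20.0  # seconds, as reported in the post

rate_potrf = gflops(potrf_flops(n), t_potrf)
rate_potri = gflops(potri_flops(n), t_potri)
time_ratio = t_potri / t_potrf
flop_ratio = potri_flops(n) / potrf_flops(n)

print(f"dpotrf: {rate_potrf:.1f} GFLOP/s")
print(f"dpotri: {rate_potri:.1f} GFLOP/s")
print(f"time ratio {time_ratio:.1f}x vs. flop ratio {flop_ratio:.1f}x")
```

With these numbers dpotrf runs at roughly 167 GFLOP/s and dpotri at roughly 18 GFLOP/s, so dpotri takes about 18 times as long for only twice the flops.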

I also tested on some other machines and always got similar results: potrf parallelizes very well, but potri seems to parallelize poorly. Has anyone else seen similar results, or does anyone know why I get them?

Thanks

Jochen

mecej4 · 10-31-2014

Investigation of anomalous behavior would be facilitated if you posted code (+data, if necessary) to reproduce the behavior.

I am interested, but if I have to write and debug code, and gather valid data, I would probably put it off ...

Jochen_S_ · 10-31-2014

The measurements were made inside a larger program. I'll try to build a minimal example which I can post here.

Jochen_S_ · 11-03-2014

Here is the minimal example for reproducing the behavior.

The measurements on a 32-core Sandy Bridge machine:

Cholesky Decomposition (dpotrf): about 0.67 +/- 0.01

Inversion (dpotri): 19.06 +/- 0.03

(Faster than the original measurement because of the different input data?)

These measurements were taken with the MKL version from Composer XE 2011. The same code linked against the MKL version from Composer XE 2015 produces different results for the inversion:

Cholesky Decomposition (dpotrf): about 0.61

Inversion (dpotri): 2.82 +/- 0.03

That is a nearly 7-fold improvement for the inversion. Still, the inversion step does only 2 times the work but needs about 4.5 times the time. I was not aware that MKL versions could differ that much. The measurements with the Composer XE 2015 version are less surprising than my original measurement.
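Timings of the form "mean +/- spread" like those above can be collected with a small harness. This is only a sketch using Python's standard library; it assumes the spread is a sample standard deviation over repeated runs, which the post does not state, and `placeholder_workload` is a hypothetical stand-in for the dpotrf or dpotri call, not the MKL routine itself.

```python
import statistics
import time


def time_repeated(fn, repeats=5):
    """Run fn() `repeats` times; return (mean, sample std dev) in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def placeholder_workload():
    # Hypothetical stand-in for the routine being measured.
    sum(i * i for i in range(100_000))


mean_s, std_s = time_repeated(placeholder_workload)
print(f"{mean_s:.4f} +/- {std_s:.4f} s")
```

Because dpotrf and dpotri overwrite their input matrix, a real benchmark would restore the input before each repetition and keep that copy outside the timed region.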

Ying_H_Intel · 11-04-2014

Hi Jochen,

Thank you for the report. I discussed this with our developers, and they replied as follows:

Right, your observations correspond to what is expected for the current version of MKL. We did optimize dpotri. That said, we usually don't recommend using dpotri and suggest using dpotrs, dposv, or dtrsm instead. However, according to recent usage statistics we were able to get, customers still widely use the *tri functions, so we are considering adding some extra optimizations for these in the future.

Best Regards,

Ying
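The suggestion above to prefer dpotrs over dpotri amounts to solving A x = b from the Cholesky factor with two triangular solves rather than forming the inverse of A explicitly. The following pure-Python sketch illustrates that idea only; `cholesky_lower` and `cholesky_solve` are illustrative stand-ins for the roles of dpotrf and dpotrs, not MKL's implementation or calling convention.

```python
import math


def cholesky_lower(a):
    """Return lower-triangular L with a = L * L^T (the role of dpotrf)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l


def cholesky_solve(l, b):
    """Solve (L L^T) x = b by two triangular solves (the role of dpotrs)."""
    n = len(l)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


# Small symmetric positive definite example; b is chosen so that x = [1, 1, 1].
a = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
b = [8.0, 10.0, 11.0]

l = cholesky_lower(a)
x = cholesky_solve(l, b)
residual = max(abs(sum(a[i][j] * x[j] for j in range(3)) - b[i]) for i in range(3))
print(x, residual)
```

Each right-hand side costs only two O(n^2) triangular solves once the factor exists, which is why an explicit inverse is often unnecessary when the goal is to solve linear systems.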