SVD speed of 'small' matrices in MKL 2018_0_124

MiauCat · ‎10-20-2017

I'm using SVD for least-squares fitting, typically operating on spectral data (1000-2000 data points) and fitting with very few parameters (2-5).

For this, I'm generally using a direct implementation of the SVD routines from "Numerical Recipes" (single-threaded).

When I started needing SVDs in other areas (bigger matrices with a less extreme aspect ratio, typically ~ 10000 x 1000), I started using MKL Lapacke, currently version 2017_4_210, and here the routines greatly outperform the NR routines.

So I also started using them for the fitting as described above. However, when applying it to the "extreme" data of only very few parameters ( typical matrix size 2048 x 3 ), the Lapacke routines fell behind and the NR routines are just faster.

Just as a "guideline": running the same (iterative) fitting on a typical standard data set, my profiler tells me I spend about 4 sec in the SVD routines when using NR and about 7 sec when using MKL.

Now, when MKL 2018 was announced a month ago, I was quite excited to read in the Release Notes (https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2018-release-notes):

LAPACK:

Added the following improvements and optimizations for small matrices (N<16):
Added ?gesvd, ?geqr/?gemqr, ?gelq/?gemlq optimizations for tall-and-skinny/short-and-wide matrices

So I gave it a try, but was quite disappointed. Not only did the NR routines still outperform the MKL routines, but for reasons not clear to me, the performance actually dropped significantly in MKL 2018_0_124 compared to version 2017_4_210.

The same data for guideline:
- NR routines: 4sec
- MKL 2017: 7sec
- MKL 2018: 14sec

The only changes I made when comparing the two variants were to re-compile/link against the newer version and to use the corresponding new DLLs.
Did I miss something? Or did I misunderstand the release notes? Does anybody have other comparative data for running SVDs on matrices of size ( 2048 x 3 ) that would help me figure out whether this is a problem with the library or with my use of it?

I ran my tests with all 8 logical cores enabled on a 4-core, hyper-threaded i7-4712HQ.

MiauCat · ‎10-20-2017

I'm also copying in here a few "measured" speeds for matrices of specific size.

The data in the matrices is uniform-random between 0 and 1000.

I'm using double (8 byte) floating point data arrays.
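A minimal sketch of how a measurement of this kind could be set up, assuming wall-clock timing via C11 timespec_get and a fresh copy of the input for every call (SVD routines overwrite their input). The names routine_fn, placeholder_routine, fill_uniform, time_routine and report are hypothetical and only for illustration; this is not the harness that produced the numbers in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature for the routine under test (an SVD wrapper in practice). */
typedef void (*routine_fn)(double *a, int m, int n);

/* Placeholder so the sketch runs without any SVD library. It is NOT an SVD:
   it only reads every element once. Replace it with a real wrapper. */
static void placeholder_routine(double *a, int m, int n)
{
    double sum = 0.0;
    for (long i = 0; i < (long)m * n; ++i)
        sum += a[i] * a[i];
    a[0] = sum;
}

/* Fill an m x n matrix with uniform random doubles in [0, 1000). */
static void fill_uniform(double *a, int m, int n)
{
    for (long i = 0; i < (long)m * n; ++i)
        a[i] = 1000.0 * rand() / ((double)RAND_MAX + 1.0);
}

/* Wall-clock time in seconds. clock() is avoided because on some platforms
   it reports CPU time summed over all threads. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Run fn `iters` times, each time on a fresh copy of src.
   Returns the total elapsed seconds (copy cost included), or -1.0 on failure. */
static double time_routine(routine_fn fn, const double *src, int m, int n, int iters)
{
    assert(iters > 0);
    size_t bytes = (size_t)m * (size_t)n * sizeof(double);
    double *scratch = malloc(bytes);
    if (scratch == NULL)
        return -1.0;

    double t0 = now_seconds();
    for (int k = 0; k < iters; ++k) {
        memcpy(scratch, src, bytes);
        fn(scratch, m, n);
    }
    double total = now_seconds() - t0;

    free(scratch);
    return total;
}

/* Print one result line: total seconds and seconds per operation. */
static void report(const char *label, double total, int iters)
{
    printf("   %-12s: %10.3f sec { %10.3g sec/op }\n", label, total, total / iters);
}
```

To use it, fill one matrix with fill_uniform, pass each SVD wrapper to time_routine with the same iteration count, and print the result with report. The memcpy is inside the timed loop, so for very small matrices the copy is a visible part of each per-operation figure.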

SPEED for [ 100 x 5 ]: Averaged over 10000 iterations
   NR        :      1.470 sec {   0.000147 sec/op }
   MKL svd 2017:      0.598 sec {   5.98e-05 sec/op }
   MKL svd 2018:      0.637 sec {   6.37e-05 sec/op }
   MKL sdd 2017:      0.600 sec {      6e-05 sec/op }
   MKL sdd 2018:      0.630 sec {    6.3e-05 sec/op }

SPEED for [ 100 x 5 ]: Averaged over 10000 iterations
   NR           :      1.470 sec {   0.000147 sec/op }
   MKL svd 2017:      0.597 sec {   5.97e-05 sec/op }
   MKL svd 2018:      0.604 sec {   6.04e-05 sec/op }
   MKL sdd 2017:      0.601 sec {   6.01e-05 sec/op }
   MKL sdd 2018:      0.606 sec {   6.06e-05 sec/op }

************************************************************

SPEED for [ 5 x 100 ]: Averaged over 10000 iterations
   NR        :      0.208 sec {   2.08e-05 sec/op }
   MKL svd 2017:      0.516 sec {   5.16e-05 sec/op }
   MKL svd 2018:      0.836 sec {   8.36e-05 sec/op }
   MKL sdd 2017:      0.601 sec {   6.01e-05 sec/op }
   MKL sdd 2018:      0.781 sec {   7.81e-05 sec/op }

SPEED for [ 5 x 100 ]: Averaged over 10000 iterations
   NR        :      0.215 sec {   2.15e-05 sec/op }
   MKL svd 2017:      0.510 sec {    5.1e-05 sec/op }
   MKL svd 2018:      0.832 sec {   8.32e-05 sec/op }
   MKL sdd 2017:      0.600 sec {      6e-05 sec/op }
   MKL sdd 2018:      0.741 sec {   7.41e-05 sec/op }

************************************************************

SPEED for [ 5 x 1000 ]: Averaged over 10000 iterations
   NR        :      1.860 sec {   0.000186 sec/op }
   MKL svd 2017:      1.260 sec {   0.000126 sec/op }
   MKL svd 2018:      2.720 sec {   0.000272 sec/op }
   MKL sdd 2017:      2.540 sec {   0.000254 sec/op }
   MKL sdd 2018:      3.650 sec {   0.000365 sec/op }

SPEED for [ 5 x 1000 ]: Averaged over 10000 iterations
   NR        :      1.940 sec {   0.000194 sec/op }
   MKL svd 2017:      1.240 sec {   0.000124 sec/op }
   MKL svd 2018:      2.690 sec {   0.000269 sec/op }
   MKL sdd 2017:      2.520 sec {   0.000252 sec/op }
   MKL sdd 2018:      3.620 sec {   0.000362 sec/op }

************************************************************

SPEED for [ 3 x 1000 ]: Averaged over 10000 iterations
   NR        :      0.657 sec {   6.57e-05 sec/op }
   MKL svd 2017:      0.740 sec {    7.4e-05 sec/op }
   MKL svd 2018:      2.090 sec {   0.000209 sec/op }
   MKL sdd 2017:      1.630 sec {   0.000163 sec/op }
   MKL sdd 2018:      2.910 sec {   0.000291 sec/op }

SPEED for [ 3 x 1000 ]: Averaged over 10000 iterations
   NR        :      0.669 sec {   6.69e-05 sec/op }
   MKL svd 2017:      0.754 sec {   7.54e-05 sec/op }
   MKL svd 2018:      2.070 sec {   0.000207 sec/op }
   MKL sdd 2017:      1.690 sec {   0.000169 sec/op }
   MKL sdd 2018:      2.870 sec {   0.000287 sec/op }

Gennady_F_Intel · ‎10-20-2017

Could you share an example of the code you use for this performance comparison?

How do you link? Which OS?

In any case, if the same routine from v.2018 runs slower than the one from v.2017, this is a problem.

Konstantin_A_Intel · ‎10-27-2017

Indeed, we introduced a degradation in MKL 2018. We will try to fix the problem ASAP, and will let you know when the fix is available.

Regards,

Konstantin

jr___shishu · ‎08-05-2018

I would like to know whether this problem has been fixed by now (version 2018 update 3).

thank you

Gennady_F_Intel · ‎08-07-2018

Yes, please try the latest update and let us know the result.

AndrewC · ‎08-10-2018

One thing I would suggest is that you be careful that the work arrays assigned to MKL are of sufficient size. If they are too small this can have a very significant effect on performance.

MiauCat · ‎09-06-2018

I have now done some comparisons with version 2018_3_210, and I can confirm that the slowdown of version 2018_0_124 for small matrices has been fixed.

2018_3_210 is pretty much on par with the speeds of 2017_4_210 for these matrices ( 5x100, 5x1000, 5x3000 ).

However, the single-threaded Numerical Recipes routines still beat MKL in these scenarios by far, so I'm still using them for some simple fitting.

SPEED for [ 5 x 100 ]: Averaged over 10000 iterations
   NR                :      0.224 sec {   2.24e-05 sec/op }
   MKL svd 2018_3_210:       1.64 sec {   0.000164 sec/op }
   MKL sdd 2018_3_210:       1.54 sec {   0.000154 sec/op }

SPEED for [ 5 x 1000 ]: Averaged over 10000 iterations
   NR                :       1.89 sec {   0.000189 sec/op }
   MKL svd 2018_3_210:        2.4 sec {    0.00024 sec/op }
   MKL sdd 2018_3_210:       3.34 sec {   0.000334 sec/op }

SPEED for [ 5 x 3000 ]: Averaged over 10000 iterations
   NR                :       5.47 sec {   0.000547 sec/op }
   MKL svd 2018_3_210:       4.29 sec {   0.000429 sec/op }
   MKL sdd 2018_3_210:       7.45 sec {   0.000745 sec/op }

Edit: I should add that the absolute numbers differ from those posted a year ago. I compared 2018_3_210, 2017_4_210 and 2018_0_124 completely anew on my system. There, 2017_4_210 and 2018_3_210 are similar, whereas 2018_0_124 is still worse than both. The values above are for my new setup.

MiauCat · ‎09-06-2018

vasci_ wrote:

One thing I would suggest is that you be careful that the work arrays assigned to MKL are of sufficient size. If they are too small this can have a very significant effect on performance.

Aren't the work arrays fixed anyway? I'm a bit surprised here. Could you give an example of a good vs. a bad call?

Following https://software.intel.com/en-us/mkl-developer-reference-c-gesvd

wouldn't I, for a 5x1000 matrix, just call

LAPACKE_dgesvd( matrix_layout, jobu, jobvt, m_, n_, a, lda_, s, u, ldu_, vt, ldvt_, superb );

with:

    matrix_layout = LAPACK_ROW_MAJOR
    jobu = 'O'
    jobvt = 'S'

    m_ = 1000
    n_ = 5
    lda_ = 5
    ldu_ = 5
    ldvt_ = 5

    a = array[5x1000]
    u = array[5x1000]
    s = array[5]
    vt = array[5x5]
    superb = array[4]

Would making any of the arrays bigger make a difference here?
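For reference, here is a small sketch of the minimum array sizes for a row-major dgesvd call, as I read the gesvd reference page linked above. The struct gesvd_sizes and the helper gesvd_row_major_sizes are hypothetical names, not part of MKL or LAPACKE, and the rules encoded here are my reading of the documentation rather than something checked against the library. A size of 0 means the array is not referenced for that job setting. Note that the high-level LAPACKE_dgesvd takes no workspace argument at all; the workspace is handled inside the call, and only the LAPACKE_dgesvd_work variant exposes work and lwork to the caller.

```c
#include <stddef.h>

/* Hypothetical helper type: minimum element counts and leading dimensions
   for a row-major dgesvd call. A count of 0 means "not referenced". */
typedef struct {
    size_t a, s, u, vt, superb;
    int lda, ldu, ldvt;
} gesvd_sizes;

static int imin(int x, int y) { return x < y ? x : y; }

/* m = rows, n = columns, jobu/jobvt are one of 'A', 'S', 'O', 'N'.
   Assumes row-major layout (LAPACK_ROW_MAJOR). */
static gesvd_sizes gesvd_row_major_sizes(int m, int n, char jobu, char jobvt)
{
    gesvd_sizes r;
    int k = imin(m, n);

    /* Input matrix: m rows of n contiguous elements. */
    r.lda = n;
    r.a = (size_t)m * (size_t)n;

    /* Singular values, and the superdiagonal scratch of the high-level call. */
    r.s = (size_t)k;
    r.superb = (size_t)(k > 1 ? k - 1 : 1);

    /* U: m x m for 'A', m x min(m,n) for 'S', unused for 'O' and 'N'. */
    int ucols = (jobu == 'A') ? m : (jobu == 'S') ? k : 0;
    r.ldu = ucols > 0 ? ucols : 1;
    r.u = (size_t)m * (size_t)ucols;

    /* VT: n x n for 'A', min(m,n) x n for 'S', unused for 'O' and 'N'. */
    int vtrows = (jobvt == 'A') ? n : (jobvt == 'S') ? k : 0;
    r.ldvt = vtrows > 0 ? n : 1;
    r.vt = (size_t)vtrows * (size_t)n;

    return r;
}
```

For m = 1000, n = 5, jobu = 'O', jobvt = 'S' this gives a = 5000, s = 5, vt = 25, superb = 4, and u unused, because with 'O' the left singular vectors overwrite a. Under this reading, allocating more than these minimums for the caller-visible arrays would not be expected to change the speed of the high-level call.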