logistic regression performance tuning

Zhuk__Michael · 11-21-2019

Hello all,
Can someone help me tune the performance of the DAAL daal::algorithms::optimization_solver::saga algorithm
for an optimization_solver::logistic_loss objective function?

We are trying to evaluate performance of a C++ DAAL implementation of logistic regression in comparison with the R glm method.
We expect DAAL performance to be comparable to that of R, but in our tests it is 100-1000 times slower.
Both R and DAAL are running on Linux machines.
With the float type (DAAL_ALGORITHM_FP_TYPE) in the C++ example, the solution does not match the solution from R, and the execution time is about 100 times longer.
However, the value of the objective function is only 0.3% worse.

With double precision, the optimal parameters match the R results; with tol=1e-12 they are almost the same. However, the execution time is ~1000 times longer.
We use NetBeans as our IDE. The Makefile with project settings generated by NetBeans is attached. We have also attached the R script and the C++ code that Intel recommended for logistic regression testing (saga_logistic_loss_dense_batch.cpp), with minor changes. The test dataset is attached as well.
What would you recommend to improve performance? The goal is to match the GLM results in terms of both performance and accuracy.

Here are our comparison results:

R Script (GLM):
optParam[1] = 7.3218047
optParam[2] = -7.8593305
optParam[3] = -4.1909632
optParam[4] = -5.2874307
optParam[5] = -0.6053190

Objective function: 0.018181727041912
   user  system elapsed
  0.014   0.008   0.031

DAAL C++, tol = 1e-8, float
Number of Iterations (nIter): 3170358
optParam[0] = 6.7605839 R -> (7.3218047)
optParam[1] = -7.2180438 R -> (-7.8593305)
optParam[2] = -3.8618107 R -> (-4.1909632)
optParam[3] = -4.8597941 R -> (-5.2874307)
optParam[4] = -0.53850645 R -> (-0.6053190)

Objective function: 0.018236298

Time taken: 2.48899542 sec

DAAL C++, tol = 1e-8, double

Number of Iterations (nIter): 6391659
optParam[0] = 7.1449418
optParam[1] = -7.6576166
optParam[2] = -4.0874448
optParam[3] = -5.152936
optParam[4] = -0.58438987

Objective function: 0.018186826

Time taken: 5.73909744 sec

DAAL C++, tol = 1e-12, double

Number of Iterations (nIter): 30863284
optParam[0] = 7.3217854
optParam[1] = -7.8593082
optParam[2] = -4.1909518
optParam[3] = -5.287416
optParam[4] = -0.60531688

Objective function: 0.018181728

Time taken: 27.59499854 sec

DAAL C++, tol = 1e-14, double

Number of Iterations (nIter): 43244720
optParam[0] = 7.321804 R -> (7.32180469)
optParam[1] = -7.8593297 R -> (-7.85933047)
optParam[2] = -4.1909628 R -> (-4.19096320)
optParam[3] = -5.2874303 R -> (-5.28743067)
optParam[4] = -0.60531908 R -> (-0.60531897)

Objective function: 0.018181728

Time taken: 44.17869788 sec

CPU Information:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Stepping: 1
CPU MHz: 2887.207
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 4788.61
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

free
              total        used        free      shared  buff/cache   available
Mem:      527922008   230442304    21134232     1486008   276345472   290434624
Swap:      16777212     4887416    11889796

Thanks! Your help is much appreciated,
Michael

Kirill_S_Intel · 11-23-2019

Hello,

By default, R glm uses the IRLS (Iteratively Reweighted Least Squares) method for fitting. It's not correct to compare performance across different optimization solvers.

The DAAL saga solver is the recommended solver only for L1-regularized logistic regression. For the general case of logistic loss optimization, the SGD momentum and LBFGS solvers are the recommended performance-oriented methods. By default, the DAAL logistic regression algorithm uses the SGD momentum optimization solver.
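
To illustrate why IRLS needs so few iterations on a small dense problem, here is a minimal sketch in plain C++. It is not DAAL code and not R's actual implementation. For logistic regression, IRLS amounts to a Newton step that solves one small weighted least-squares system per iteration. The names solveLinear, fitIrls and IrlsResult are made up for this example, and it assumes the data are not linearly separable (otherwise the estimates diverge).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Solve A x = b by Gaussian elimination with partial pivoting (A is small and dense).
Vec solveLinear(Mat A, Vec b) {
    const std::size_t n = b.size();
    for (std::size_t c = 0; c < n; ++c) {
        std::size_t p = c;
        for (std::size_t r = c + 1; r < n; ++r)
            if (std::fabs(A[r][c]) > std::fabs(A[p][c])) p = r;
        std::swap(A[c], A[p]);
        std::swap(b[c], b[p]);
        for (std::size_t r = c + 1; r < n; ++r) {
            const double f = A[r][c] / A[c][c];
            for (std::size_t k = c; k < n; ++k) A[r][k] -= f * A[c][k];
            b[r] -= f * b[c];
        }
    }
    Vec x(n);
    for (std::size_t i = n; i-- > 0;) {  // back substitution
        double s = b[i];
        for (std::size_t k = i + 1; k < n; ++k) s -= A[i][k] * x[k];
        x[i] = s / A[i][i];
    }
    return x;
}

struct IrlsResult {
    Vec beta;        // fitted coefficients
    int iterations;  // Newton steps actually taken
};

// One Newton/IRLS step: beta += (X^T W X)^{-1} X^T (y - p), with W = diag(p (1 - p)).
// Labels y are in {0, 1}; X should contain a column of ones if an intercept is wanted.
IrlsResult fitIrls(const Mat& X, const Vec& y, double tol = 1e-10, int maxIter = 25) {
    const std::size_t n = X.size(), d = X[0].size();
    Vec beta(d, 0.0);
    int it = 0;
    for (; it < maxIter; ++it) {
        Mat H(d, Vec(d, 0.0));  // X^T W X
        Vec g(d, 0.0);          // X^T (y - p)
        for (std::size_t i = 0; i < n; ++i) {
            double z = 0.0;
            for (std::size_t j = 0; j < d; ++j) z += X[i][j] * beta[j];
            const double p = 1.0 / (1.0 + std::exp(-z));
            const double w = p * (1.0 - p);
            for (std::size_t j = 0; j < d; ++j) {
                g[j] += X[i][j] * (y[i] - p);
                for (std::size_t k = 0; k < d; ++k) H[j][k] += w * X[i][j] * X[i][k];
            }
        }
        const Vec step = solveLinear(H, g);
        double maxStep = 0.0;
        for (std::size_t j = 0; j < d; ++j) {
            beta[j] += step[j];
            maxStep = std::max(maxStep, std::fabs(step[j]));
        }
        if (maxStep < tol) { ++it; break; }  // converged: count this step and stop
    }
    return {beta, it};
}
```

Each iteration here costs a full pass over the data plus a d-by-d solve, but only a handful of them are usually needed on a small, well-conditioned problem. A stochastic solver takes much cheaper steps and needs many more of them, which is why iteration counts and timings of the two are not directly comparable.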

Best regards,

Kirill

Zhuk__Michael · 11-25-2019

Hi Kirill,

Thanks for the information.

I have a couple of questions:

1. Is SAGA the only optimization solver in the DAAL library that works with L1 regularization (or other non-smooth parts of the objective function)?
2. Do you have any suggestions on how to tune the SAGA solver's performance for the given configuration (see my previous post)?
3. Do you have any suggestions on how to improve the accuracy of the SAGA solver for logistic regression without sacrificing performance (switching from float to double degrades performance significantly)?
4. Which linkage option with the DAAL library (static or dynamic) is better from a performance standpoint?

Best regards,

Michael

Kirill_S_Intel · 11-25-2019

Hello, Michael

1. For the L1-regularized logistic loss function, only the SAGA optimization solver is currently supported in the DAAL library. As far as I can see, with an L1 regularization term the algorithm converges much faster (nIterations: 49768, instead of ~3 million). For the L1-regularized MSE function, the Coordinate Descent optimization solver is supported, and support for it will be extended to all other functions in upcoming releases.

2. The SAGA solver is intended for use with L1 regularization, and we see much faster convergence with a nonzero L1 term. You could also try tuning learningRateSequence to reach faster convergence. Another option is to set the initial point closer to the optimum (we are not sure that R glm starts from the same initial point).

3. Only increasing the number of iterations and tightening the tolerance can help to improve accuracy (for float32, you could try setting the tolerance to zero so that the solver exits only when it reaches the maximum number of iterations).

4. Static linking is slightly faster for some algorithms.
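
As a rough illustration of points 1-3, here is a minimal sketch of SAGA with a proximal L1 step (soft-thresholding) in plain C++. It is not DAAL's implementation: it assumes batch size 1, a constant step size standing in for learningRateSequence, a fixed iteration budget instead of a tolerance check, and an L1 penalty applied to every coefficient including the intercept. The names softThreshold, objective and fitSagaL1 are made up for this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Soft-thresholding: the proximal operator of t * |v|.
double softThreshold(double v, double t) {
    if (v > t) return v - t;
    if (v < -t) return v + t;
    return 0.0;
}

// Average logistic loss plus l1 * ||beta||_1, for labels y in {0, 1}.
double objective(const Mat& X, const Vec& y, const Vec& beta, double l1) {
    double loss = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        double z = 0.0;
        for (std::size_t j = 0; j < beta.size(); ++j) z += X[i][j] * beta[j];
        // log(1 + exp(z)) - y * z, written to avoid overflow for large |z|
        loss += std::max(z, 0.0) + std::log1p(std::exp(-std::fabs(z))) - y[i] * z;
    }
    loss /= static_cast<double>(X.size());
    for (double b : beta) loss += l1 * std::fabs(b);
    return loss;
}

// SAGA starting from beta = 0. The gradient of term i is s_i * x_i with
// s_i = sigmoid(x_i . beta) - y_i, so only the scalar s_i is remembered per row.
Vec fitSagaL1(const Mat& X, const Vec& y, double l1, double step,
              int nIterations, unsigned seed = 42) {
    const std::size_t n = X.size(), d = X[0].size();
    Vec beta(d, 0.0);
    Vec stored(n);     // last seen s_i for every row
    Vec avg(d, 0.0);   // average of the remembered gradients
    for (std::size_t i = 0; i < n; ++i) {
        stored[i] = 0.5 - y[i];  // sigmoid(0) - y_i
        for (std::size_t k = 0; k < d; ++k) avg[k] += stored[i] * X[i][k] / n;
    }
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    for (int it = 0; it < nIterations; ++it) {
        const std::size_t j = pick(rng);
        double z = 0.0;
        for (std::size_t k = 0; k < d; ++k) z += X[j][k] * beta[k];
        const double fresh = 1.0 / (1.0 + std::exp(-z)) - y[j];
        const double delta = fresh - stored[j];
        for (std::size_t k = 0; k < d; ++k) {
            // Variance-reduced gradient step, then the L1 proximal step.
            const double v = beta[k] - step * (delta * X[j][k] + avg[k]);
            beta[k] = softThreshold(v, step * l1);
            avg[k] += delta * X[j][k] / n;  // keep the running average in sync
        }
        stored[j] = fresh;
    }
    return beta;
}
```

In this sketch the accuracy knobs are the ones mentioned above: the step size, the iteration budget, and the starting point. A common rule of thumb for the step is about 1/(3L), where L bounds the curvature of a single loss term (for logistic loss, a quarter of the largest squared row norm). A nonzero l1 lets the proximal step set coefficients exactly to zero.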

Best regards,

Kirill

Zhuk__Michael · 11-26-2019

Hi Kirill,

Thank you for the explanation.

One more question.

Our regression matrix is sparse. Is there any way to exploit the sparsity with DAAL to speed up the logistic regression computation and/or reduce memory usage?

Best regards,

Michael

Kirill_S_Intel · 11-26-2019

Hi Michael,

You can provide a sparse matrix (CSRNumericTable) as input, but the computational kernel of the objective function converts each computed batch to a dense representation. If the batch size is small there should not be much overhead, and all DAAL solvers support small batches. So with the current implementation, computation on a sparse matrix is not expected to be faster (HomogenNumericTable is the performance-oriented input type).
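
To make the batch conversion concrete, here is a minimal sketch in plain C++ of expanding a few rows of a CSR matrix into a dense row-major buffer. CsrMatrix and densifyBatch are hypothetical names for this example, not DAAL types or functions, and the sketch assumes zero-based indices, so please check the indexing convention of the actual table before reusing the idea.

```cpp
#include <cstddef>
#include <vector>

// Zero-based CSR storage (illustrative struct, not DAAL's CSRNumericTable).
struct CsrMatrix {
    std::size_t nRows;
    std::size_t nCols;
    std::vector<double> values;           // nonzero values, row by row
    std::vector<std::size_t> colIndices;  // column of each nonzero value
    std::vector<std::size_t> rowOffsets;  // nRows + 1 entries; row r spans [rowOffsets[r], rowOffsets[r + 1])
};

// Expand rows [firstRow, firstRow + batchSize) into a dense row-major buffer.
// Memory and work are proportional to batchSize * nCols, not to the number of nonzeros.
std::vector<double> densifyBatch(const CsrMatrix& m, std::size_t firstRow, std::size_t batchSize) {
    std::vector<double> dense(batchSize * m.nCols, 0.0);
    for (std::size_t r = 0; r < batchSize; ++r) {
        const std::size_t row = firstRow + r;
        for (std::size_t k = m.rowOffsets[row]; k < m.rowOffsets[row + 1]; ++k)
            dense[r * m.nCols + m.colIndices[k]] = m.values[k];
    }
    return dense;
}
```

With a conversion like this, the full matrix stays in CSR form, so memory for the input is saved, while the arithmetic on each batch still runs on dense data. That is consistent with sparse input saving memory without being faster to compute on.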

Best regards,

Kirill