Zhuk__Michael

Beginner


11-21-2019
01:28 PM

222 Views

logistic regression performance tuning

Hello all,

Can someone help me tune the performance of the DAAL daal::algorithms::optimization_solver::saga algorithm for an optimization_solver::logistic_loss objective function?

We are trying to evaluate performance of a C++ DAAL implementation of logistic regression in comparison with the R glm method.

We expect DAAL performance to be comparable to that of R but in our test it is 100-1000 times slower.

Both R and DAAL are running on Linux machines.

With the float type (DAAL_ALGORITHM_FP_TYPE) in the C++ example, the solution does not match the solution from R, and the execution time is about 100 times longer.

However, the value of the objective function is only 0.3% worse.

With double precision, the optimal parameters match the R results; with tol=1e-12 they are almost the same. However, the execution time is ~1000 times longer.
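The percentages and ratios quoted here follow from the figures in the comparison listed further down. A minimal C++ sketch of that arithmetic, where the helper names `percentWorse`, `slowdown`, and `printFloatRunSummary` are made up for illustration:

```cpp
#include <cstdio>

// Relative increase of `value` over `reference`, in percent.
double percentWorse(double value, double reference) {
    return 100.0 * (value - reference) / reference;
}

// How many times longer `elapsed` is than `baseline` (both in seconds).
double slowdown(double elapsed, double baseline) {
    return elapsed / baseline;
}

// Prints both ratios for the float run versus R glm, using the posted numbers.
void printFloatRunSummary() {
    std::printf("objective: %.2f%% worse\n",
                percentWorse(0.018236298, 0.018181727041912));
    std::printf("time: %.0fx slower\n", slowdown(2.48899542, 0.031));
}
```

With the posted numbers, the float run comes out at roughly 0.30% worse in objective value and about 80 times slower than the R elapsed time, and the tol = 1e-14 double run at about 1400 times slower.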

We use NetBeans as our IDE. A Makefile with project settings generated by NetBeans is attached. We also attached the R script and the C++ code that Intel recommended for logistic regression testing (saga_logistic_loss_dense_batch.cpp), with minor changes. The test dataset is attached as well.

What would be your recommendation on performance improvement? The goal is to match GLM results in terms of performance and accuracy.

Here are our comparison results:

R Script (GLM):

optParam[1] = 7.3218047

optParam[2] = -7.8593305

optParam[3] = -4.1909632

optParam[4] = -5.2874307

optParam[5] = -0.6053190

Objective function: 0.018181727041912

user system elapsed

0.014 0.008 0.031

DAAL C++, tol = 1e-8, float

Number of Iterations (nIter): 3170358

optParam[0] = 6.7605839 R -> (7.3218047)

optParam[1] = -7.2180438 R -> (-7.8593305)

optParam[2] = -3.8618107 R -> (-4.1909632)

optParam[3] = -4.8597941 R -> (-5.2874307)

optParam[4] = -0.53850645 R -> (-0.6053190)

Objective function: 0.018236298

Time taken: 2.48899542 sec

DAAL C++, tol = 1e-8, double

Number of Iterations (nIter): 6391659

optParam[0] = 7.1449418

optParam[1] = -7.6576166

optParam[2] = -4.0874448

optParam[3] = -5.152936

optParam[4] = -0.58438987

Objective function: 0.018186826

Time taken: 5.73909744 sec

DAAL C++, tol = 1e-12, double

Number of Iterations (nIter): 30863284

optParam[0] = 7.3217854

optParam[1] = -7.8593082

optParam[2] = -4.1909518

optParam[3] = -5.287416

optParam[4] = -0.60531688

Objective function: 0.018181728

Time taken: 27.59499854 sec

DAAL C++, tol = 1e-14, double

Number of Iterations (nIter): 43244720

optParam[0] = 7.321804 R -> (7.32180469)

optParam[1] = -7.8593297 R -> (-7.85933047)

optParam[2] = -4.1909628 R -> (-4.19096320)

optParam[3] = -5.2874303 R -> (-5.28743067)

optParam[4] = -0.60531908 R -> (-0.60531897)

Objective function: 0.018181728

Time taken: 44.17869788 sec

CPU Information:

lscpu

Architecture: x86_64

CPU op-mode(s): 32-bit, 64-bit

Byte Order: Little Endian

CPU(s): 56

On-line CPU(s) list: 0-55

Thread(s) per core: 2

Core(s) per socket: 14

Socket(s): 2

NUMA node(s): 2

Vendor ID: GenuineIntel

CPU family: 6

Model: 79

Model name: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz

Stepping: 1

CPU MHz: 2887.207

CPU max MHz: 3300.0000

CPU min MHz: 1200.0000

BogoMIPS: 4788.61

Virtualization: VT-x

L1d cache: 32K

L1i cache: 32K

L2 cache: 256K

L3 cache: 35840K

NUMA node0 CPU(s): 0-13,28-41

NUMA node1 CPU(s): 14-27,42-55

Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

free

total used free shared buff/cache available

Mem: 527922008 230442304 21134232 1486008 276345472 290434624

Swap: 16777212 4887416 11889796

Thanks! Your help is much appreciated,

Michael

5 Replies


Kirill_S_Intel

Employee


11-23-2019
02:17 AM


Hello,

By default, R glm uses the IRLS (Iteratively Reweighted Least Squares) method for fitting. It's not correct to compare performance across different optimization solvers.
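To illustrate why the two are not comparable, here is a minimal sketch of IRLS (Newton's method) for logistic regression in plain C++, with no DAAL dependency. It is not R's actual implementation, and the names `solveLinear` and `irlsLogistic` are hypothetical. Each iteration solves one small p × p linear system, so on a well-conditioned problem it typically needs only a handful of iterations, whereas a stochastic solver takes a very large number of cheap steps.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Solves A x = b by Gaussian elimination with partial pivoting (small dense A).
Vec solveLinear(Mat A, Vec b) {
    const std::size_t n = b.size();
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t piv = k;
        for (std::size_t i = k + 1; i < n; ++i)
            if (std::fabs(A[i][k]) > std::fabs(A[piv][k])) piv = i;
        std::swap(A[k], A[piv]);
        std::swap(b[k], b[piv]);
        for (std::size_t i = k + 1; i < n; ++i) {
            const double f = A[i][k] / A[k][k];
            for (std::size_t j = k; j < n; ++j) A[i][j] -= f * A[k][j];
            b[i] -= f * b[k];
        }
    }
    Vec x(n);
    for (std::size_t k = n; k-- > 0;) {
        double s = b[k];
        for (std::size_t j = k + 1; j < n; ++j) s -= A[k][j] * x[j];
        x[k] = s / A[k][k];
    }
    return x;
}

// IRLS for logistic regression. X holds one row per observation (add a column
// of ones for an intercept); y holds labels in {0, 1}. Starts from beta = 0.
// nIter receives the number of Newton steps taken.
Vec irlsLogistic(const Mat& X, const Vec& y, double tol, int maxIter, int& nIter) {
    const std::size_t n = X.size(), p = X[0].size();
    Vec beta(p, 0.0);
    for (nIter = 0; nIter < maxIter; ++nIter) {
        Mat H(p, Vec(p, 0.0));  // X^T W X, with W = diag(mu * (1 - mu))
        Vec g(p, 0.0);          // X^T (y - mu)
        for (std::size_t i = 0; i < n; ++i) {
            double eta = 0.0;
            for (std::size_t j = 0; j < p; ++j) eta += X[i][j] * beta[j];
            const double mu = 1.0 / (1.0 + std::exp(-eta));
            const double w = mu * (1.0 - mu);
            for (std::size_t j = 0; j < p; ++j) {
                g[j] += X[i][j] * (y[i] - mu);
                for (std::size_t k = 0; k < p; ++k) H[j][k] += w * X[i][j] * X[i][k];
            }
        }
        const Vec step = solveLinear(H, g);
        double maxStep = 0.0;
        for (std::size_t j = 0; j < p; ++j) {
            beta[j] += step[j];
            maxStep = std::max(maxStep, std::fabs(step[j]));
        }
        if (maxStep < tol) { ++nIter; break; }  // count the final step
    }
    return beta;
}
```

The sketch has no safeguards (step halving, handling of separable data, rank-deficient X), so it is only meant to show the shape of the method.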

The DAL saga solver is the recommended solver only for L1-regularized logistic regression. For the general case of logistic loss optimization, the SGD momentum and LBFGS solvers are the recommended performance-oriented methods. By default, the DAL logistic regression algorithm uses the SGD momentum optimization solver.

Best regards,

Kirill


Zhuk__Michael

Beginner


11-25-2019
12:39 PM


Hi Kirill,

Thanks for the information.

I have a couple of questions:

- Is SAGA the only optimization solver in the DAAL library that works with L1 regularization (or other non-smooth parts of the objective function)?
- Do you have any suggestions on how to tune SAGA solver performance for the given configuration (see my previous post)?
- Do you have any suggestions on how to improve the accuracy of the SAGA solver for logistic regression without sacrificing performance (switching from float to double degrades performance significantly)?
- Which linkage option with the DAAL library (static or dynamic) is better from a performance standpoint?

Best regards,

Michael


Kirill_S_Intel

Employee


11-25-2019
10:49 PM


Hello, Michael

1. For the L1-regularized logistic loss function, SAGA is currently the only optimization solver supported in the DAL library. As far as I can see, with an L1 regularization term the algorithm converges much faster (**nIterations: 49768.000**, instead of ~3 million). For the L1-regularized MSE function, the Coordinate Descent optimization solver is supported, and support for it will be extended to all other functions in upcoming releases.

2. The SAGA solver is intended for L1 regularization. We see much faster convergence with a nonzero L1 term. You could also try tuning learningRateSequence to reach faster convergence. Another option is to set the initial point closer to the optimum (we are not sure that R glm starts from the same initial point).

3. Only increasing the number of iterations and tightening the tolerance can help improve accuracy (for float32, you could try setting the tolerance to zero so that the solver exits on reaching the maximum number of iterations).

4. Static linkage is slightly faster for some algorithms.
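The knobs mentioned in points 1 to 3 (L1 term, learning rate, initial point, tolerance, maximum iterations) can be seen together in a minimal standalone sketch of SAGA with a soft-thresholding proximal step. This is plain C++ and not DAAL's implementation. The names `softThreshold`, `SagaResult`, and `sagaLogisticL1` are hypothetical, the constant step size stands in for a learning rate sequence, and the stopping rule (largest coefficient change over one pass of n steps) is an assumption of the sketch rather than the library's criterion.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Proximal operator of t * |v| (soft thresholding).
double softThreshold(double v, double t) {
    if (v > t) return v - t;
    if (v < -t) return v + t;
    return 0.0;
}

struct SagaResult {
    Vec beta;
    std::size_t nIter;
};

// SAGA for mean logistic loss + l1 * ||beta||_1 (all coefficients penalized).
// beta is the initial point; step is a constant learning rate.
// With tol == 0 the tolerance check never fires, so the loop runs to maxIter.
SagaResult sagaLogisticL1(const Mat& X, const Vec& y, Vec beta, double step,
                          double l1, double tol, std::size_t maxIter,
                          unsigned seed = 0) {
    const std::size_t n = X.size(), p = beta.size();
    // Scalar part of the gradient of sample i: sigmoid(x_i . beta) - y_i.
    auto residual = [&](std::size_t i) {
        double eta = 0.0;
        for (std::size_t j = 0; j < p; ++j) eta += X[i][j] * beta[j];
        return 1.0 / (1.0 + std::exp(-eta)) - y[i];
    };
    Vec table(n), avg(p, 0.0);  // stored residuals and their average gradient
    for (std::size_t i = 0; i < n; ++i) {
        table[i] = residual(i);
        for (std::size_t j = 0; j < p; ++j) avg[j] += table[i] * X[i][j] / n;
    }
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    double maxChange = 0.0;
    std::size_t it = 0;
    while (it < maxIter) {
        const std::size_t i = pick(rng);
        const double delta = residual(i) - table[i];
        for (std::size_t j = 0; j < p; ++j) {
            const double v = delta * X[i][j] + avg[j];  // SAGA gradient estimate
            const double next = softThreshold(beta[j] - step * v, step * l1);
            maxChange = std::max(maxChange, std::fabs(next - beta[j]));
            beta[j] = next;
            avg[j] += delta * X[i][j] / n;
        }
        table[i] += delta;
        ++it;
        if (it % n == 0) {  // check once per pass over n steps
            if (maxChange < tol) break;
            maxChange = 0.0;
        }
    }
    return {beta, it};
}
```

Passing a `beta` close to the optimum shortens the run, a larger `l1` zeroes out coefficients through the thresholding step, and `tol = 0` reproduces the "exit by maximum iterations" setup from point 3.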

Best regards,

Kirill


Zhuk__Michael

Beginner


11-26-2019
07:58 AM


Hi Kirill,

Thank you for the explanation.

One more question.

Our regression matrix is sparse. Is there any way to exploit the sparsity with DAAL to speed up the logistic regression computation and/or reduce memory usage?

Best regards,

Michael

Kirill_S_Intel

Employee


11-26-2019
09:16 PM


Hi Michael,

You can provide a sparse matrix (CSRNumericTable) as input, but the computational kernel of the objective function converts each computed batch to a dense representation. If the batch size is small, there should not be much overhead, and all DAL solvers support small batches. So, with the current implementation, computation is not expected to be faster on a sparse matrix (HomogenNumericTable is the performance-oriented input type).
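As a rough picture of what such a per-batch conversion involves, here is a minimal standalone C++ sketch that expands a range of rows from a CSR matrix into a dense row-major buffer. The `CsrMatrix` struct and `densifyBatch` function are hypothetical and are not DAAL types. The sketch uses zero-based indices, which may differ from the library's convention.

```cpp
#include <cstddef>
#include <vector>

// Compressed sparse row matrix with zero-based indices (assumption of this sketch).
struct CsrMatrix {
    std::size_t nCols;
    std::vector<double> values;           // nonzero values, row by row
    std::vector<std::size_t> colIndices;  // column of each nonzero
    std::vector<std::size_t> rowOffsets;  // size nRows + 1; row r spans
                                          // [rowOffsets[r], rowOffsets[r + 1])
};

// Expands rows [rowBegin, rowEnd) into a dense row-major buffer.
std::vector<double> densifyBatch(const CsrMatrix& m, std::size_t rowBegin,
                                 std::size_t rowEnd) {
    std::vector<double> dense((rowEnd - rowBegin) * m.nCols, 0.0);
    for (std::size_t r = rowBegin; r < rowEnd; ++r)
        for (std::size_t k = m.rowOffsets[r]; k < m.rowOffsets[r + 1]; ++k)
            dense[(r - rowBegin) * m.nCols + m.colIndices[k]] = m.values[k];
    return dense;
}
```

The dense buffer costs batchSize × nCols values regardless of sparsity, which is why a small batch keeps the overhead low while the arithmetic itself still runs on dense data.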

Best regards,

Kirill
