topic Hello, Michael in Intel® oneAPI Data Analytics Library & Intel® Data Analytics Acceleration Library
https://community.intel.com/t5/Intel-oneAPI-Data-Analytics/logistic-regression-performance-tuning/m-p/1144850#M568
<P>Hello all,<BR />Can someone help me tune/improve the performance of the DAAL daal::algorithms::optimization_solver::saga algorithm <BR />for the optimization_solver::logistic_loss type of objective function?</P><P>We are trying to evaluate the performance of a C++ DAAL implementation of logistic regression in comparison with the R glm method. <BR />We expect DAAL performance to be comparable to that of R, but in our test it is 100-1000 times slower.<BR />Both R and DAAL are running on Linux machines.<BR />With the float type (DAAL_ALGORITHM_FP_TYPE) in a C++ example, the solution does not match the solution from R, and execution time is 100 times longer.<BR />However, the value of the objective function is only 0.3% worse.</P><P>With double precision the optimal parameters match the R results. With tol=1e-12 they are almost the same. However, execution time is ~1000 times worse.<BR />We use NetBeans as our IDE. A Makefile with project settings generated by NetBeans is attached. We also attached the R script and the C++ code that Intel recommended for logistic regression testing (saga_logistic_loss_dense_batch.cpp) with minor changes. The test dataset is attached as well.<BR />What would be your recommendation on performance improvement?
The goal is to match the GLM results in terms of performance and accuracy.</P><P>Here are our comparison results:</P><P>R Script (GLM):<BR />optParam[1] = 7.3218047<BR />optParam[2] = -7.8593305<BR />optParam[3] = -4.1909632<BR />optParam[4] = -5.2874307<BR />optParam[5] = -0.6053190</P><P>Objective function: 0.018181727041912<BR /> user system elapsed <BR /> 0.014 0.008 0.031 </P><P>DAAL C++, tol = 1e-8, float<BR />Number of Iterations (nIter): 3170358<BR />optParam[0] = 6.7605839 R -> (7.3218047)<BR />optParam[1] = -7.2180438 R -> (-7.8593305)<BR />optParam[2] = -3.8618107 R -> (-4.1909632)<BR />optParam[3] = -4.8597941 R -> (-5.2874307)<BR />optParam[4] = -0.53850645 R -> (-0.6053190)</P><P>Objective function: 0.018236298</P><P>Time taken: 2.48899542 sec</P><P>DAAL C++, tol = 1e-8, double</P><P>Number of Iterations (nIter): 6391659<BR />optParam[0] = 7.1449418<BR />optParam[1] = -7.6576166<BR />optParam[2] = -4.0874448<BR />optParam[3] = -5.152936<BR />optParam[4] = -0.58438987</P><P>Objective function: 0.018186826</P><P>Time taken: 5.73909744 sec</P><P><BR />DAAL C++, tol = 1e-12, double</P><P>Number of Iterations (nIter): 30863284<BR />optParam[0] = 7.3217854<BR />optParam[1] = -7.8593082<BR />optParam[2] = -4.1909518<BR />optParam[3] = -5.287416<BR />optParam[4] = -0.60531688</P><P>Objective function: 0.018181728</P><P>Time taken: 27.59499854 sec</P><P>DAAL C++, tol = 1e-14, double</P><P>Number of Iterations (nIter): 43244720<BR />optParam[0] = 7.321804 R -> (7.32180469)<BR />optParam[1] = -7.8593297 R -> (-7.85933047)<BR />optParam[2] = -4.1909628 R -> (-4.19096320)<BR />optParam[3] = -5.2874303 R -> (-5.28743067)<BR />optParam[4] = -0.60531908 R -> (-0.60531897)</P><P>Objective function: 0.018181728</P><P>Time taken: 44.17869788 sec</P><P><BR />CPU Information:<BR />lscpu<BR />Architecture: x86_64<BR />CPU op-mode(s): 32-bit, 64-bit<BR />Byte Order: Little Endian<BR />CPU(s): 56<BR />On-line CPU(s) list: 0-55<BR />Thread(s) per core: 2<BR />Core(s) per socket: 
14<BR />Socket(s): 2<BR />NUMA node(s): 2<BR />Vendor ID: GenuineIntel<BR />CPU family: 6<BR />Model: 79<BR />Model name: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz<BR />Stepping: 1<BR />CPU MHz: 2887.207<BR />CPU max MHz: 3300.0000<BR />CPU min MHz: 1200.0000<BR />BogoMIPS: 4788.61<BR />Virtualization: VT-x<BR />L1d cache: 32K<BR />L1i cache: 32K<BR />L2 cache: 256K<BR />L3 cache: 35840K<BR />NUMA node0 CPU(s): 0-13,28-41<BR />NUMA node1 CPU(s): 14-27,42-55<BR />Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts</P><P>free<BR /> total used free shared buff/cache available<BR />Mem: 527922008 230442304 21134232 1486008 276345472 290434624<BR />Swap: 16777212 4887416 11889796</P><P>Thanks! Your help is much appreciated,<BR />Michael<BR /> </P>Thu, 21 Nov 2019 21:28:00 GMThttps://community.intel.com/t5/Intel-oneAPI-Data-Analytics/logistic-regression-performance-tuning/m-p/1144850#M568Zhuk__Michael2019-11-21T21:28:00ZHello,
https://community.intel.com/t5/Intel-oneAPI-Data-Analytics/logistic-regression-performance-tuning/m-p/1144851#M569
<P>Hello,</P><P>By default R glm uses the IRLS (Iteratively Reweighted Least Squares) method for fitting. It's not correct to compare performance across different optimization solvers.</P><P>The DAL SAGA solver is the recommended solver only for L1-regularized logistic regression. For the general case of logistic loss optimization, the SGD-momentum and LBFGS solvers are the recommended performance-oriented methods. By default, the DAL logistic regression algorithm uses the SGD-momentum optimization solver.</P><P>Best regards,</P><P>Kirill </P>Sat, 23 Nov 2019 10:17:07 GMThttps://community.intel.com/t5/Intel-oneAPI-Data-Analytics/logistic-regression-performance-tuning/m-p/1144851#M569Kirill_S_Intel2019-11-23T10:17:07ZHi Kirril,
https://community.intel.com/t5/Intel-oneAPI-Data-Analytics/logistic-regression-performance-tuning/m-p/1144852#M570
<P style="margin-left:0in; margin-right:0in">Hi Kirril,</P><P style="margin-left:0in; margin-right:0in">Thanks for the information.</P><P style="margin-left:0in; margin-right:0in">I have a couple of questions:</P><OL><LI>Is SAGA the only optimization solver in the DAAL library that works for L1 regularization (or other non-smooth parts of the objective function)?</LI><LI>Do you have any suggestions on how to tune SAGA solver performance for the given configuration (see my previous post)?</LI><LI>Do you have any suggestions on how to improve the accuracy of the SAGA solver for logistic regression without sacrificing performance (switching from float to double degrades performance significantly)?</LI><LI>Which linkage option with the DAAL library (static or dynamic) is better from the performance standpoint?</LI></OL><P style="margin-left:0in; margin-right:0in">Best regards,</P><P style="margin-left:0in; margin-right:0in"> </P><P style="margin-left:0in; margin-right:0in">Michael</P>Mon, 25 Nov 2019 20:39:06 GMThttps://community.intel.com/t5/Intel-oneAPI-Data-Analytics/logistic-regression-performance-tuning/m-p/1144852#M570Zhuk__Michael2019-11-25T20:39:06ZHello, Michael
https://community.intel.com/t5/Intel-oneAPI-Data-Analytics/logistic-regression-performance-tuning/m-p/1144853#M571
<P>Hello, Michael</P><P>1. For the L1-regularized logistic loss function, only the SAGA optimization solver is currently supported in the DAL library. And as I see, with an L1 regularization term the algorithm converges much faster (<STRONG>nIterations: 49768</STRONG>, instead of ~3M). For the L1-regularized MSE function the Coordinate Descent optimization solver is supported, and support will be extended to all other functions in upcoming releases.</P><P>2. The SAGA solver is intended to be used for L1 regularization; we can see much faster convergence with a non-zero L1 term. You could also try to pick a learningRateSequence that reaches faster convergence; another option is to set the initial point closer to the optimum (we are not sure that R glm starts from the same initial point). </P><P>3. Only increasing the number of iterations and tightening the tolerance can help improve accuracy (for float32 you could try to exit by reaching the maximum number of iterations with the tolerance set to zero).</P><P>4. Static linkage is a little bit faster for some algorithms.</P><P>Best regards,</P><P>Kirill</P>Tue, 26 Nov 2019 06:49:23 GMThttps://community.intel.com/t5/Intel-oneAPI-Data-Analytics/logistic-regression-performance-tuning/m-p/1144853#M571Kirill_S_Intel2019-11-26T06:49:23ZHi Kirill,
https://community.intel.com/t5/Intel-oneAPI-Data-Analytics/logistic-regression-performance-tuning/m-p/1144854#M572
<P style="margin-left:0in; margin-right:0in">Hi Kirill,</P><P style="margin-left:0in; margin-right:0in">Thank you for the explanation.</P><P style="margin-left:0in; margin-right:0in">One more question:</P><P style="margin-left:0in; margin-right:0in">Our regression matrix is sparse. Is there any way to exploit the sparsity with DAAL to speed up the logistic regression computation and/or minimize memory usage?</P><P style="margin-left:0in; margin-right:0in">Best regards,</P><P style="margin-left:0in; margin-right:0in">Michael</P>Tue, 26 Nov 2019 15:58:06 GMThttps://community.intel.com/t5/Intel-oneAPI-Data-Analytics/logistic-regression-performance-tuning/m-p/1144854#M572Zhuk__Michael2019-11-26T15:58:06ZHi Michael,
https://community.intel.com/t5/Intel-oneAPI-Data-Analytics/logistic-regression-performance-tuning/m-p/1144855#M573
<P>Hi Michael,</P><P>You can provide a sparse matrix (CSRNumericTable) as input, but in the computational kernel of the objective function the computed batch is converted to a dense representation (if the batch size is small there should not be much overhead; all DAL solvers support small batches). So faster computation on a sparse matrix is not expected with the current implementation (HomogenNumericTable is the performance-oriented input type).<BR /><BR />Best regards,</P><P>Kirill</P>Wed, 27 Nov 2019 05:16:15 GMThttps://community.intel.com/t5/Intel-oneAPI-Data-Analytics/logistic-regression-performance-tuning/m-p/1144855#M573Kirill_S_Intel2019-11-27T05:16:15Z