Intel® oneAPI Data Analytics Library
Community support for building compute-intensive applications that run fast on Intel® architecture.
Announcements
Welcome to the Intel Community. If you get an answer you like, please mark it as an Accepted Solution to help others. Thank you!
204 Discussions

Lipschitz constant for logistic loss function

Gusev__Dmitry
Beginner
1,332 Views

Hello,

Can anyone please help and verify the  Lipschitz constant formula at the bottom of  the page  https://software.intel.com/en-us/daal-programming-guide-logistic-loss ?

Currently, in the Developer Guide document the upper bound for logistic loss Lipschitz constant is specified as max||xi||2 + λ2/n

However,  it could be shown that logistic loss Hessian spectral norm upper bound (rough) is rather max||xi||22 + λ

Your clarification  is much appreciated. 

Thanks,

Dmitry

 

0 Kudos
4 Replies
AthiraM_Intel
Moderator
1,332 Views

Hi,

We are forwarding this case to technical experts.They will assist you shortly.

Thanks

Kirill_S_Intel
Employee
1,332 Views

Hello, Dmitry

Thanks for reporting this statement! Seems instead of  max||xi||2  we should use  max||xi||22 , it was misprint.
λ2 scaling was  done to align sklearn regularization penalty value. (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_sag.py#L78)

Note: Lipschitz constant is used only for SAGA algorithm, for calculation optimal step size.

https://hal.inria.fr/hal-00860051/document

 

Best regards,

Kirill 

Gusev__Dmitry
Beginner
1,332 Views

Hi Kirill,

Thanks for getting back to me.

Couple of thoughts about  Lipschitz constant:

  1. Please note, that in the formula for logistic loss (https://software.intel.com/en-us/daal-programming-guide-logistic-loss) the regularization term is not divided by n, as the preceding sum term.  As a result, if you plug in zero design matrix into the formula(with no intercept), you will get Hessian exactly 2*λ*E and norm of the Hessian is 2*λ - it is a sharp upper bound. Obviously,  λ/n is less than 2* λ. (In my first post I omitted multiplier 2 since normally quadratic term is divided by 2)
  2. In the original paper https://arxiv.org/pdf/1407.0202.pdf the L2 regularization term is a part of sum, as it is a smooth component and therefore divided by n. That could be the source of λ/n term.
  3.  The max||x||2 is a pretty much rough upper bound for the Lipschitz constant. As a result, the step size is small and the convergence rate may not be optimal.  In my experiments decreasing Lipschitz constant may improve SAGA algorithm performance.

I would like to know your opinion, since I am implementing a custom objective function for the SAGA algorithm and looking for high performance computing with large data sets.

Best regards,

Dmitry.

Kirill_S_Intel
Employee
1,332 Views

Hello, Dmitry

Currently, algorithm supports parameter 'learningRateSequence' (https://software.intel.com/en-us/daal-programming-guide-computation-13), if we see some performance improvements with another step size, DAAL library provides options to use this advantages. For reason of conformance with sklearn library we can`t change this automatic step selection, however 'learningRateSequence' parameter is provided.

About source of scaling λ term by n  I think it make sense.

 

Best regards,

Kirill

Reply