Intel® oneAPI Data Analytics Library

## Lipschitz constant for logistic loss function

Beginner
3,181 Views

Hello,

Can anyone please help verify the Lipschitz constant formula at the bottom of the page https://software.intel.com/en-us/daal-programming-guide-logistic-loss ?

Currently, in the Developer Guide document the upper bound for the logistic loss Lipschitz constant is specified as max_i ||x_i||_2 + 2λ/n.

However, it can be shown that a (rough) upper bound on the spectral norm of the logistic loss Hessian is rather max_i ||x_i||_2^2 + λ.
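For reference, a rough derivation of this bound (a sketch, assuming the penalty is written as λ||θ||_2^2 with no 1/n averaging; if the quadratic term carries a 1/2, the factor 2 in front of λ below disappears):

```latex
\nabla^2 F(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \sigma_i(1-\sigma_i)\, x_i x_i^{\mathsf T} \;+\; 2\lambda I,
\qquad \sigma_i = \sigma(\theta^{\mathsf T} x_i).

% Since 0 \le \sigma_i(1-\sigma_i) \le 1/4 and \|x_i x_i^{\mathsf T}\|_2 = \|x_i\|_2^2:
\|\nabla^2 F(\theta)\|_2 \;\le\; \frac{1}{4}\max_i \|x_i\|_2^2 + 2\lambda \;\le\; \max_i \|x_i\|_2^2 + 2\lambda.
```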

Thanks,

Dmitry

4 Replies
Moderator

Hi,

We are forwarding this case to our technical experts. They will assist you shortly.

Thanks

Employee

Hello, Dmitry

Thanks for reporting this! It seems that instead of max_i ||x_i||_2 we should use max_i ||x_i||_2^2; it was a misprint.
The 2λ scaling was done to align with the sklearn regularization penalty value (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/_sag.py#L78).

Note: the Lipschitz constant is used only by the SAGA algorithm, to compute the optimal step size.

https://hal.inria.fr/hal-00860051/document
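For context, a minimal sketch of how such a bound could feed the step size. The exact scaling of the λ term is an assumption following the discussion above, and γ = 1/(3L) is the flat-rate step from the SAGA paper (Defazio, Bach, Lacoste-Julien, 2014), not necessarily what the library implements:

```python
import numpy as np

def lipschitz_bound(X, lam):
    """Rough Lipschitz-constant bound discussed in this thread.

    Uses max_i ||x_i||_2^2 plus the regularization term scaled by n,
    matching the guide-style formula (an assumption about the exact
    scaling; drop the /n for the unaveraged penalty).
    """
    n = X.shape[0]
    max_sq_norm = float(np.max(np.sum(X**2, axis=1)))
    return max_sq_norm + 2.0 * lam / n

def saga_step_size(L):
    # gamma = 1/(3L), the flat step size from the SAGA paper.
    return 1.0 / (3.0 * L)

X = np.array([[3.0, 4.0], [0.0, 1.0]])
L = lipschitz_bound(X, lam=0.5)      # max ||x_i||^2 = 25, plus 2*0.5/2 = 0.5
gamma = saga_step_size(L)
```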

Best regards,

Kirill

Beginner

Hi Kirill,

Thanks for getting back to me.

Couple of thoughts about  Lipschitz constant:

1. Please note that in the formula for logistic loss (https://software.intel.com/en-us/daal-programming-guide-logistic-loss) the regularization term is not divided by n, unlike the preceding sum term. As a result, if you plug a zero design matrix into the formula (with no intercept), you get a Hessian of exactly 2λE, and the norm of that Hessian is 2λ; this is a sharp upper bound. Obviously, λ/n is less than 2λ. (In my first post I omitted the multiplier 2, since normally the quadratic term is divided by 2.)
2. In the original paper (https://arxiv.org/pdf/1407.0202.pdf) the L2 regularization term is part of the sum, as it is a smooth component, and is therefore divided by n. That could be the source of the λ/n term.
3. The max||x||_2 term is quite a rough upper bound for the Lipschitz constant. As a result, the step size is small and the convergence rate may not be optimal. In my experiments, decreasing the Lipschitz constant can improve SAGA performance.
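Point 3 above can be checked numerically. A small sketch, assuming the unscaled penalty λ||θ||_2^2, so the Hessian at θ = 0 is XᵀX/(4n) + 2λI:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 20
X = rng.standard_normal((n, p))
lam = 1e-3

# Exact Hessian spectral norm at theta = 0, where sigma(1 - sigma) = 1/4.
H = X.T @ X / (4 * n) + 2 * lam * np.eye(p)
exact = np.linalg.eigvalsh(H).max()

# Rough bound discussed in this thread: max_i ||x_i||_2^2 + 2*lambda.
bound = np.max(np.sum(X**2, axis=1)) + 2 * lam

print(exact, bound)  # the bound is typically far above the exact norm
```

Since λ_max(XᵀX) is at most the trace, which is at most n · max_i ||x_i||_2^2, the bound always dominates, and in practice by a wide margin, which is why a smaller (still valid) constant can speed SAGA up.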

I would like to hear your opinion, since I am implementing a custom objective function for the SAGA algorithm and am targeting high-performance computing with large data sets.

Best regards,

Dmitry.

Employee

Hello, Dmitry

Currently, the algorithm supports the parameter 'learningRateSequence' (https://software.intel.com/en-us/daal-programming-guide-computation-13), so if you see performance improvements with another step size, the DAAL library provides the option to take advantage of that. For conformance with the sklearn library we can't change the automatic step selection, but the 'learningRateSequence' parameter is provided.

Regarding the source of the λ term being scaled by n: I think that makes sense.

Best regards,

Kirill