From the C++ example code provided by DAAL, it seems that the only algorithm input for multivariate outlier detection is the data, and that estimates of the vector of means and (co)variance matrix are computed by the algorithm itself (i.e., the user does not provide them). Is that correct? If so, are the estimates of the vector of means and (co)variance matrix robust, and how are they computed?
The default method for multivariate outlier detection does not compute the estimates of the (robust) mean and covariance itself; it expects the input data and an initialization procedure as parameters of the algorithm. The initialization procedure sets the parameters of the algorithm: the vector of means, the variance-covariance matrix, and the scalar that defines the outlier region. It is the user's responsibility to define this procedure. Please see the Intel DAAL Developer Guide for additional details at https://software.intel.com/en-us/node/564655.
If you do not provide an initialization procedure, the algorithm relies on the default initialization available in the library: it initializes the vector of means with zeros, sets the covariance matrix to the identity matrix (ones on the main diagonal, zeros elsewhere), and sets the threshold for the outlier region to 3.
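For illustration, here is a minimal sketch (plain C++, not the DAAL API) of the decision rule implied by those defaults: an observation is flagged as an outlier when its Mahalanobis distance from the mean exceeds the threshold. The function name and signature are hypothetical; with the default zero mean and identity covariance, the distance reduces to the Euclidean norm of the observation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the decision rule, NOT the DAAL API: an observation x is
// flagged as an outlier when its Mahalanobis distance from the mean
// exceeds the threshold.  `covInv` is the inverse of the covariance
// matrix; with the default initialization (zero mean, identity
// covariance, threshold 3) the distance is just the Euclidean norm.
bool isOutlier(const std::vector<double>& x,
               const std::vector<double>& mean,
               const std::vector<std::vector<double>>& covInv,
               double threshold) {
    const std::size_t p = x.size();
    double d2 = 0.0;  // squared Mahalanobis distance (x-m)^T S^{-1} (x-m)
    for (std::size_t i = 0; i < p; ++i)
        for (std::size_t j = 0; j < p; ++j)
            d2 += (x[i] - mean[i]) * covInv[i][j] * (x[j] - mean[j]);
    return std::sqrt(d2) > threshold;
}
```

With the default parameters, a point such as (0.5, 0.5) lies well inside the outlier region boundary, while (3, 3), at Euclidean distance of about 4.24 from the origin, is flagged.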
The present version of Intel DAAL does not provide algorithms for the evaluation of robust mean and covariance. The statistical functions in Intel MKL do provide such methods; see https://software.intel.com/en-us/node/521928#98D2C49D-ACF1-4247-AC1B-8E9356CAA8C8 and https://software.intel.com/en-us/node/610608.
Intel DAAL provides another method for multivariate outlier detection, BACON. This method processes the input matrix and requires extra parameters such as stopping criteria, but it does not require estimates of the mean or covariance. However, the algorithmic flow relies on the assumption that the input data follow a multivariate Gaussian distribution.
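As a rough illustration of that flow (this is NOT the DAAL or MKL implementation, and it is heavily simplified): BACON starts from a small "basic subset" of points close to the bulk of the data, then repeatedly re-estimates location and scale from that subset and re-admits every point whose standardized distance falls below a cutoff, until the subset stabilizes; whatever remains outside is declared an outlier. To stay short, the sketch below uses only per-variable variances (a diagonal covariance) and a fixed cutoff in place of the full covariance and chi-square-based cutoff of the real algorithm; `baconOutliers` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Simplified sketch of the BACON iteration, NOT the DAAL/MKL code:
// uses per-variable variances (diagonal covariance) and a fixed
// cutoff; assumes non-degenerate data (no zero-variance columns).
std::vector<bool> baconOutliers(const std::vector<std::vector<double>>& data,
                                double cutoff, int maxIter) {
    const std::size_t n = data.size();
    const std::size_t p = data[0].size();

    // Standardized distance of every point from the mean/std-dev
    // estimated over the rows marked true in `subset`.
    auto distances = [&](const std::vector<bool>& subset) {
        std::vector<double> mean(p, 0.0), var(p, 0.0);
        std::size_t m = 0;
        for (std::size_t i = 0; i < n; ++i)
            if (subset[i]) {
                ++m;
                for (std::size_t j = 0; j < p; ++j) mean[j] += data[i][j];
            }
        for (std::size_t j = 0; j < p; ++j) mean[j] /= m;
        for (std::size_t i = 0; i < n; ++i)
            if (subset[i])
                for (std::size_t j = 0; j < p; ++j) {
                    const double d = data[i][j] - mean[j];
                    var[j] += d * d;
                }
        for (std::size_t j = 0; j < p; ++j) var[j] /= (m - 1);
        std::vector<double> dist(n, 0.0);
        for (std::size_t i = 0; i < n; ++i) {
            double d2 = 0.0;
            for (std::size_t j = 0; j < p; ++j) {
                const double d = data[i][j] - mean[j];
                d2 += d * d / var[j];
            }
            dist[i] = std::sqrt(d2);
        }
        return dist;
    };

    // Initial basic subset: the 2*p points closest to the full-sample mean.
    std::vector<bool> subset(n, true);
    const std::vector<double> d0 = distances(subset);
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return d0[a] < d0[b]; });
    std::fill(subset.begin(), subset.end(), false);
    for (std::size_t k = 0; k < std::min(2 * p, n); ++k) subset[idx[k]] = true;

    // Grow/shrink the basic subset until it stabilizes (stopping criterion).
    for (int it = 0; it < maxIter; ++it) {
        const std::vector<double> d = distances(subset);
        std::vector<bool> next(n);
        for (std::size_t i = 0; i < n; ++i) next[i] = (d[i] <= cutoff);
        if (next == subset) break;
        subset = next;
    }

    // Outliers are the points left outside the final basic subset.
    std::vector<bool> outlier(n);
    for (std::size_t i = 0; i < n; ++i) outlier[i] = !subset[i];
    return outlier;
}
```

Note that the mean and covariance are re-estimated internally at each step from the current basic subset, which is why, unlike the default method, no user-supplied estimates are needed.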
Please let me know if you need further clarification or details on the outlier detection algorithms available in Intel DAAL.
Thank you for the details. That could explain my results.
For BACON, it seems that the number of variables must not be 5 times greater than the number of observations in the MKL implementation. Is that also the case for Intel DAAL?
The Intel DAAL version of the BACON algorithm internally relies on Intel MKL, which applies all the respective checks on the input arguments, such as "number of observations < 5 * number of variables".