Usage of Outlier Detection

kingking_r_ · 11-17-2014

I am considering whether to use the outlier detection routine in the Vector Statistical Library (VSL).

Suppose I have an unknown monotonic nonlinear function f(x) that I have evaluated at 1000 different x points. Plotting the data shows an almost monotonic, nonlinear curve with one or two dozen outliers. Can I use the MKL routine for this? https://software.intel.com/en-us/node/497940

The routine operates on n observations of p variables, which sounds like it is meant for output from p random generators, for example a set of multivariate Gaussian data. If so, my case looks different, but I'm not sure. Could you please help me with this?

Thanks in advance!

Andrey_N_Intel · 11-17-2014

Hello,

Yes, the Intel MKL outlier detection algorithm relies on the assumption that the dataset comes from a multivariate Gaussian distribution. Thus, direct use of this outlier detection scheme does not appear to make sense for your specific case. At the same time, I wonder whether the following approach would work. Is it possible to represent your unknown function f as f = s + n, where s is the "useful signal" and n is the "noise"? Since f is available as a set of values at the x points, s can be estimated with something like a moving average applied to the data set. Subtracting s from f then gives an estimate of n. If we assume the noise follows a Gaussian distribution, the outlier detection algorithm can be applied to n.
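A minimal sketch of the f = s + n idea, in plain Python so it can be tried without MKL. The function names, the synthetic log-shaped data, the window of 11 points, and the outlier size are all hypothetical choices for illustration. The flagging step uses a simple modified z-score based on the median absolute deviation as a stand-in; it is not the MKL routine, which would be applied to the residuals n instead.

```python
import math
import random
import statistics

def moving_average(y, window):
    """Centered moving average; the ends are padded by repeating the edge values."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    padded = [y[0]] * half + list(y) + [y[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(y))]

def flag_outliers(y, window=11, threshold=3.5):
    """Flag points whose residual from the moving average is extreme."""
    s = moving_average(y, window)            # estimate of the signal s
    n = [yi - si for yi, si in zip(y, s)]    # estimate of the noise n = f - s
    med = statistics.median(n)
    mad = statistics.median([abs(ni - med) for ni in n])
    # Modified z-score: 0.6745 scales the MAD to be comparable to a Gaussian sigma.
    return [abs(0.6745 * (ni - med) / mad) > threshold for ni in n]

# Synthetic stand-in for the 1000 evaluations of an unknown monotonic f(x).
random.seed(0)
x = [10.0 * i / 999 for i in range(1000)]
y = [math.log1p(xi) + random.gauss(0.0, 0.02) for xi in x]
injected = list(range(25, 1000, 50))         # 20 hypothetical outlier positions
for i in injected:
    y[i] += 0.3

flags = flag_outliers(y)
print(sum(flags), "points flagged; all injected outliers found:",
      all(flags[i] for i in injected))
```

One caveat with this design: a moving average is itself pulled toward each outlier, so neighbors of a large outlier can be flagged as well. A moving median, or a second pass after removing the flagged points, may behave better.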

Thanks, Andrey

mecej4 · 11-17-2014

In normal data fitting work, one knows in advance the form of the function f(x), but that function has unknown coefficients. After fitting, the residuals are expected to be normally distributed. If, however, you take f(x) = 0, which is equivalent to setting the residuals equal to the data, or if the selected f(x) is not appropriate for the data, the residuals will probably not be normally distributed. The user has to select a form for f(x) that is appropriate to the application. If a single form is not evident, try a few reasonable variants of f(x).
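As a rough illustration of trying a few variants, the sketch below fits y = a + b*g(x) by ordinary least squares for several candidate shapes g and compares the residual sizes. The candidate set, the synthetic data, and the helper names are assumptions made for this example only; a real application would choose forms suggested by the problem itself.

```python
import math
import random

def fit_line(u, y):
    """Ordinary least squares for y = a + b*u; returns (a, b)."""
    n = len(u)
    mean_u, mean_y = sum(u) / n, sum(y) / n
    b = (sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, y))
         / sum((ui - mean_u) ** 2 for ui in u))
    return mean_y - b * mean_u, b

def residuals_for(g, x, y):
    """Residuals of the fit y = a + b*g(x)."""
    a, b = fit_line([g(xi) for xi in x], y)
    return [yi - (a + b * g(xi)) for xi, yi in zip(x, y)]

# Hypothetical candidate forms for a monotonic nonlinear curve.
candidates = {"linear": lambda t: t, "sqrt": math.sqrt, "log1p": math.log1p}

# Synthetic data generated from the log1p form plus Gaussian noise.
random.seed(2)
x = [10.0 * i / 199 for i in range(200)]
y = [1.0 + 2.0 * math.log1p(xi) + random.gauss(0.0, 0.05) for xi in x]

rms = {}
for name, g in candidates.items():
    r = residuals_for(g, x, y)
    rms[name] = math.sqrt(sum(ri * ri for ri in r) / len(r))

best = min(rms, key=rms.get)
print(rms, "smallest residuals:", best)
```

A small residual size alone does not show that the residuals are normally distributed; it only narrows down which forms are worth checking further.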

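A normal probability plot of the residuals is one way to judge whether a selected function is appropriate: if the residuals are normally distributed, the plotted points fall close to a straight line. See en.wikipedia.org/wiki/Normal_probability_plot.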
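The straightness of a normal probability plot can also be summarized numerically. The sketch below, assuming Python 3.8 or later for `statistics.NormalDist`, computes the correlation between the sorted residuals and standard normal quantiles; values very close to 1 are consistent with normality. The function name, the sample data, and the Blom plotting positions are illustrative choices, and other plotting positions are in common use.

```python
import random
import statistics

def normal_plot_correlation(residuals):
    """Correlation between sorted residuals and standard normal quantiles."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = statistics.NormalDist()
    # Blom plotting positions (i - 0.375) / (n + 0.25), one common choice.
    quantiles = [std_normal.inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]
    mean_q = statistics.fmean(quantiles)
    mean_r = statistics.fmean(ordered)
    cov = sum((q - mean_q) * (r - mean_r) for q, r in zip(quantiles, ordered))
    var_q = sum((q - mean_q) ** 2 for q in quantiles)
    var_r = sum((r - mean_r) ** 2 for r in ordered)
    return cov / (var_q * var_r) ** 0.5

# Hypothetical residual sets: one Gaussian, one clearly skewed.
random.seed(1)
gaussian = [random.gauss(0.0, 1.0) for _ in range(500)]
skewed = [random.expovariate(1.0) for _ in range(500)]

r_gaussian = normal_plot_correlation(gaussian)
r_skewed = normal_plot_correlation(skewed)
print("gaussian:", r_gaussian, "skewed:", r_skewed)
```

The skewed sample should give a visibly lower correlation than the Gaussian one, which is the numerical counterpart of a curved normal probability plot.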