Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

## MKL Summary Statistics computation of skewness and kurtosis seems to be broken when the standard deviation is more than four orders of magnitude smaller than the mean.
323 Views
Can anyone tell me how to fix this, please?

For example, given normal random samples around 2, I should get skewness and excess kurtosis close to zero. However, at a standard deviation of 1.0e-4, excess kurtosis jumps to -311. At 1.0e-5, skewness jumps to 11. At 1.0e-6, the absolute values of both skewness and kurtosis are gigantic.

In R, I get the correct values down to much smaller standard deviations.

I have confirmed that I can work around this problem by copying the data, subtracting the mean, and computing the moments with another MKL task.
But, man, this is ugly.
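A minimal sketch of that two-pass workaround in plain C++ (no MKL; the function name and layout here are illustrative, not the poster's actual code):

```cpp
#include <cmath>
#include <vector>

// Two-pass workaround sketch: pass 1 computes the mean, pass 2 computes
// the central moments from the demeaned data. This avoids the catastrophic
// cancellation that hits raw-moment formulas when the standard deviation
// is tiny relative to the mean.
struct Moments { double mean, skewness, excessKurtosis; };

Moments twoPassMoments(const std::vector<double>& x) {
    const double n = static_cast<double>(x.size());
    double mean = 0.0;
    for (double v : x) mean += v;            // pass 1: mean
    mean /= n;
    double m2 = 0.0, m3 = 0.0, m4 = 0.0;
    for (double v : x) {                     // pass 2: central moments
        const double d = v - mean;
        m2 += d * d;
        m3 += d * d * d;
        m4 += d * d * d * d;
    }
    m2 /= n; m3 /= n; m4 /= n;
    return { mean, m3 / std::pow(m2, 1.5), m4 / (m2 * m2) - 3.0 };
}
```

On data like 2 ± 1e-6 this stays accurate because the demeaned values carry the full precision of the deviations, not the precision left over after subtracting two numbers near 4.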

Here is my C++ code for computing summary stats with MKL
------------------------------------------------------------------
#include "mkl_vsl.h"

// Set up a Summary Statistics task pointing at the data
int status;
MKL_INT numRows = m_count;                      // number of observations
MKL_INT numCols = 1;                            // one variable
MKL_INT xstorage = VSL_SS_MATRIX_STORAGE_ROWS;
MKL_INT indices[1] = {1};                       // process the one variable

VSLSSTaskPtr task;
status = vsldSSNewTask( &task, &numCols, &numRows, &xstorage, m_data, NULL, indices );

// Set up 1st, 2nd and 3rd quartile computation
double quants[3];
double q_order[3] = {.25,.5,.75};
MKL_INT q_order_n = 3;
status = vsldSSEditQuantiles( task, &q_order_n, q_order, quants, NULL, NULL );

// Set up max and min computation; the estimates must be seeded with a data value
double min_est = m_data[0];
double max_est = m_data[0];
status = vsldSSEditTask( task, VSL_SS_ED_MIN, &min_est );
status = vsldSSEditTask( task, VSL_SS_ED_MAX, &max_est );

// Set up moment computation. Computing kurtosis requires all the other moments.
double mean_est, skewness_est, kurtosis_est;
double raw2mom, raw3mom, raw4mom;
double cen2mom, cen3mom, cen4mom;
status = vsldSSEditMoments( task, &mean_est, &raw2mom, &raw3mom, &raw4mom, &cen2mom, &cen3mom, &cen4mom );
status = vsldSSEditTask( task, VSL_SS_ED_SKEWNESS, &skewness_est );
status = vsldSSEditTask( task, VSL_SS_ED_KURTOSIS, &kurtosis_est );

// Request five number summary and moment computation
MKL_UINT64 mask = VSL_SS_MEAN | VSL_SS_KURTOSIS | VSL_SS_SKEWNESS |
                  VSL_SS_2R_MOM | VSL_SS_3R_MOM | VSL_SS_4R_MOM |
                  VSL_SS_2C_MOM | VSL_SS_3C_MOM | VSL_SS_4C_MOM |
                  VSL_SS_QUANTS | VSL_SS_MIN | VSL_SS_MAX;
status = vsldSSCompute( task, mask, VSL_SS_METHOD_FAST );
--------------------------------------------------------------------------

Here is the R code that I used to generate the test data as well as to check skewness and kurtosis:
-----------------------------------------------------------------------
library(moments)
setwd("C:/Users/stennican/Documents/Rcode")

N = 500
normDist = rnorm(N)
trial = seq(1,N)
twos = rep(2,N)
twoSd1 = twos + 1e-1*normDist
twoSd2 = twos + 1e-2*normDist
twoSd3 = twos + 1e-3*normDist
twoSd4 = twos + 1e-4*normDist
twoSd5 = twos + 1e-5*normDist
twoSd7 = twos + 1e-7*normDist
twoSd10 = twos + 1e-10*normDist
twoSd13 = twos + 1e-13*normDist
zeros = rep(0,N)
zeroSd1 = 1e-1*normDist
zeroSd2 = 1e-2*normDist
zeroSd3 = 1e-3*normDist
zeroSd4 = 1e-4*normDist
zeroSd5 = 1e-5*normDist
zeroSd7 = 1e-7*normDist
zeroSd10 = 1e-10*normDist
zeroSd13 = 1e-13*normDist

out = data.frame(trial,zeros,twos,
zeroSd1,zeroSd2,zeroSd3,zeroSd4,zeroSd5,
zeroSd7,zeroSd10,zeroSd13,
twoSd1,twoSd2,twoSd3,twoSd4,twoSd5,
twoSd7,twoSd10,twoSd13)

apply(out,2,sd)
apply(out,2,skewness)
apply(out,2,kurtosis)

options(digits = 22)
write.csv(out,"testB42382.csv",row.names=FALSE)
13 Replies
Hi Scott,
Thanks for the post and the detailed description of the issue. We will have a look at the dataset and the test case on our side. Can you please clarify what the value of the m_count variable in your code is? Is it equal to 500?
Thanks,
Andrey
Hi Andrey,

Yes, m_count is 500 because I just imported the data generated in the R code where N = 500.
I believe there is some change in the behavior if you vary sample size from 50 to 500 to 5000.
But, I wasn't testing in an organized fashion at the time. So, I'd have to retest to be sure.

thanks for your help, Scott
Hi Scott,

May I clarify one more thing? Is this behavior reproducible with different random sequences? In other words, could it simply be large statistical error due to the relatively small sample size?

Sergey
Hi Sergey,

Yes, it is reproducible with various random samples of various sizes.
Most would call 500 a large sample since the CLT is working for you big time at that point.
I haven't tried any small samples of less than 30 yet.
Let me know if you need me to do additional testing and send you the results.

thanks, Scott
Hi Scott,
We did an analysis of the datasets you work with, computed confidence intervals for the sample kurtosis of the dataset from the Gaussian distribution, and examined the output of the Summary Stats routines for kurtosis/skewness estimation.
The current conclusion is that it makes sense for us to investigate algorithms for computation of the moments/covariance that would be more tolerant to numerical errors, to address the case of datasets whose variance is several orders of magnitude smaller than their expectation.
I wonder if the workaround you mentioned earlier, which sounds like a two-pass algorithm, would work for your needs for now?
Also, can you please specify how often you encounter datasets with this structure in your work?
I would also appreciate it if you could clarify whether you work with in-memory datasets only, or whether data with this structure that does not fit into the computer's memory is also a case you need to handle.
Thanks,
Andrey
Hi Andrey,

Our tool makes it very easy for people with no statistical background to access any sort of database, transform columns of data from that database and then request summary stats for the column. So, while it doesn't make sense to care about the kurtosis of a variable with almost no variance, we will have to compute this at some point. For a constant, kurtosis is undefined and it is correct to say so. But, for small variance, kurtosis and skewness are well defined and we should report them correctly.

The data on which MKL operates is in an in-memory database.

As I said before, I can work around the problem by first computing the mean of the original data, subtracting the mean from a copy of the data and computing the higher moments using the demeaned data.
The extra costs in time and memory are acceptable in the short term.

How long do you think it will take for MKL to be improved to match the precision of R in this respect?

thanks, Scott
Hi Scott,
We will do a deeper analysis of your request within the next several weeks. Once the analysis is completed, we will understand the next steps.
Does that sound reasonable to you?
Thanks,
Andrey
Hi Andrey,

Sure, I'm not in a huge rush.
The workaround only triggers in this very special case.
So, it won't impact performance.

thanks, Scott
Hi Scott,

One more clarification to be 100% sure that we address your needs.

You indicated that the majority of your customers are not experts in statistics, so you would like to provide basic statistics capabilities that are as simple to use as possible. Does that mean you want to avoid giving customers controls that would let your software know about the properties of the incoming dataset?

For example, Andrey gave you an example of the one-pass method for the mean estimate, which is rather expensive (compared to the naive implementation) but more numerically robust. At the same time, for datasets with certain a priori known properties, the naive method may work quite well too (e.g. a relatively small dataset, a mean that isn't too far from 0, etc.).

Would you like us to consider such cases as well?

The opposite direction is also an option. The one-pass method described by Andrey has its own limitations. There are more robust, but more computationally intensive, methods for moment estimates. Should we be looking into those as well?

Many thanks!
Sergey
Hi Sergey,

Actually, yes, if you could.
If MKL could switch from naive to one-pass based on parameters of the data, it would be awesome.
In fact, the current vsldSSNewTask function requires me to specify the number of rows.
So, if you switched to the naive method whenever the number of rows is not large, that would be great.
I could code that myself, but it would be cleaner if it were all built in to MKL.

Also, I don't need this yet.
But, for future reference, how would I set up the computation if I don't yet know how many rows I will be feeding to the vsldSS task?

thanks, Scott

Hi Scott,

Perfect. That certainly helps us to define the scope of experiments.

> But, for future reference, how would I set up the computation if I don't yet know how many rows I will be feeding to the vsldSS task?

One of the Summary Statistics features is the ability to feed data in blocks, so-called streaming data processing. That may help. Andrey may have some comments on that as well.

Regards,
Sergey
Hi Scott,
You might want to have a look at the Summary Statistics Application Notes, http://software.intel.com/sites/products/documentation/hpc/mkl/sslnotes/index.htm, chapter "Processing Data in Blocks", which describes the usage model for streaming data processing. In the Intel MKL Manual, Chapter Statistical Functions / Summary Statistics, Section Task Computation Routines, you will find the table listing estimates that support streaming processing. It's worth noting that besides basic estimates like variance-covariance or mean, this feature is available for quantiles.
Thanks,
Andrey
Great. The example in the documentation will make this easy to implement when the time comes.

thanks, Scott 