
Can anyone tell me how to fix this, please?

For example, given normal random samples centered at 2, I should get skewness and excess kurtosis close to zero. However, at a standard deviation of 1.0e-4, the excess kurtosis jumps to -311. At 1.0e-5, the skewness jumps to 11. At 1.0e-6, the absolute values of both skewness and kurtosis are gigantic.

In R, I keep getting the correct values down to much smaller standard deviations.

I have confirmed that I can work around this problem by copying the data, subtracting the mean, and computing the moments with another MKL task.

But, man, this is ugly.

Here is my C++ code for computing summary stats with MKL:

```cpp
// Set up a Summary Statistics task pointing at the data.
int status;
VSLSSTaskPtr task;
MKL_INT numRows = m_count;
MKL_INT numCols = 1;
MKL_INT xstorage = VSL_SS_MATRIX_STORAGE_ROWS;
MKL_INT indices[1] = {1};
status = vsldSSNewTask( &task, &numCols, &numRows, &xstorage, &(m_data[0]), NULL, indices );

// Set up 1st, 2nd and 3rd quartile computation.
double quants[3];
double q_order[3] = {.25, .5, .75};
MKL_INT q_order_n = 3;
status = vsldSSEditQuantiles( task, &q_order_n, q_order, quants, NULL, NULL );

// Set up max and min computation.
double min_est = m_data[0];
double max_est = m_data[0];
status = vsldSSEditTask( task, VSL_SS_ED_MIN, &min_est );
status = vsldSSEditTask( task, VSL_SS_ED_MAX, &max_est );

// Set up moment computation. Computing kurtosis requires all the other moments.
double mean_est, skewness_est, kurtosis_est;
double raw2mom, raw3mom, raw4mom;
double cen2mom, cen3mom, cen4mom;
status = vsldSSEditTask( task, VSL_SS_ED_SKEWNESS, &skewness_est );
status = vsldSSEditTask( task, VSL_SS_ED_KURTOSIS, &kurtosis_est );
status = vsldSSEditMoments( task, &mean_est, &raw2mom, &raw3mom, &raw4mom,
                            &cen2mom, &cen3mom, &cen4mom );

// Request five-number summary and moment computation.
MKL_UINT64 mask = VSL_SS_MEAN | VSL_SS_KURTOSIS | VSL_SS_SKEWNESS |
                  VSL_SS_2R_MOM | VSL_SS_3R_MOM | VSL_SS_4R_MOM |
                  VSL_SS_2C_MOM | VSL_SS_3C_MOM | VSL_SS_4C_MOM |
                  VSL_SS_QUANTS | VSL_SS_MIN | VSL_SS_MAX;
status = vsldSSCompute( task, mask, VSL_SS_METHOD_FAST );

// Release the task's resources.
status = vslSSDeleteTask( &task );
```

Here is the R code that I used to generate the test data and to check skewness and kurtosis:

```r
library(moments)

setwd("C:/Users/stennican/Documents/Rcode")

N = 500
normDist = rnorm(N)
trial = seq(1, N)

twos = rep(2, N)
twoSd1 = twos + 1e-1*normDist
twoSd2 = twos + 1e-2*normDist
twoSd3 = twos + 1e-3*normDist
twoSd4 = twos + 1e-4*normDist
twoSd5 = twos + 1e-5*normDist
twoSd7 = twos + 1e-7*normDist
twoSd10 = twos + 1e-10*normDist
twoSd13 = twos + 1e-13*normDist

zeros = rep(0, N)
zeroSd1 = 1e-1*normDist
zeroSd2 = 1e-2*normDist
zeroSd3 = 1e-3*normDist
zeroSd4 = 1e-4*normDist
zeroSd5 = 1e-5*normDist
zeroSd7 = 1e-7*normDist
zeroSd10 = 1e-10*normDist
zeroSd13 = 1e-13*normDist

out = data.frame(trial, zeros, twos,
                 zeroSd1, zeroSd2, zeroSd3, zeroSd4, zeroSd5,
                 zeroSd7, zeroSd10, zeroSd13,
                 twoSd1, twoSd2, twoSd3, twoSd4, twoSd5,
                 twoSd7, twoSd10, twoSd13)

apply(out, 2, sd)
apply(out, 2, skewness)
apply(out, 2, kurtosis)

options(digits = 22)
write.csv(out, "testB42382.csv", row.names = FALSE)
```


13 Replies


Thanks for the post and the detailed description of the issue. We will have a look at the dataset and the test case on our side. Can you please clarify what the value of the m_count variable in your code is? Is it equal to 500?

Thanks,

Andrey


Yes, m_count is 500 because I just imported the data generated in the R code where N = 500.

I believe the behavior changes somewhat if you vary the sample size from 50 to 500 to 5000.

But I wasn't testing in an organized fashion at the time, so I'd have to retest to be sure.

Thanks for your help, Scott


May I clarify one more thing? Is this behavior reproducible with different random sequences? In other words, could it simply be large statistical error due to the relatively small sample size?

Sergey


Yes, it is reproducible with various random samples of various sizes.

Most would call 500 a large sample since the CLT is working for you big time at that point.

I haven't tried any small samples of less than 30 yet.

Let me know if you need me to do additional testing and send you the results.

thanks, Scott


We did an analysis of the datasets you work with, computed confidence intervals for sample kurtosis for the dataset from the Gaussian distribution, and examined the output of the Summary Stats routines for kurtosis/skewness estimation.

The current conclusion is that it makes sense for us to explore algorithms for computing the moments/covariance that would be more tolerant to numerical errors, to address the case of datasets whose variance is several orders of magnitude smaller than their expectation.

I wonder if the workaround you mentioned earlier, which sounds like a two-pass algorithm, would work for your needs for now?

Also, can you please specify how often you get datasets with this structure in your work?

I would also appreciate it if you could clarify whether you work with in-memory datasets only, or whether data with this structure that does not fit into the computer's memory is also a use case.

Thanks,

Andrey


Our tool makes it very easy for people with no statistical background to access any sort of database, transform columns of data from that database and then request summary stats for the column. So, while it doesn't make sense to care about the kurtosis of a variable with almost no variance, we will have to compute this at some point. For a constant, kurtosis is undefined and it is correct to say so. But, for small variance, kurtosis and skewness are well defined and we should report them correctly.

The data on which MKL operates is in an in-memory database.

As I said before, I can work around the problem by first computing the mean of the original data, subtracting the mean from a copy of the data and computing the higher moments using the demeaned data.

The extra cost in time and memory are acceptable in the short term.

How long do you think it will take for MKL to be improved to match the precision of R in this respect?

thanks, Scott


We will do a deeper analysis of your request within the next several weeks. Once the analysis is completed, we will determine the next steps.

Does that sound reasonable to you?

Thanks,

Andrey


Sure, I'm not in a huge rush.

The workaround only triggers in this very special case.

So, it won't impact performance.

thanks, Scott


One more clarification to be 100% sure that we address your needs.

You indicated that the majority of your customers are not experts in statistics, so you would like to provide basic statistics capabilities that are as simple to use as possible. Does that mean you want to avoid giving customers controls for telling your software about properties of the incoming dataset?

For example, Andrey gave you an example of the one-pass method for the mean estimate, which is rather expensive compared to the naive implementation but more numerically robust. At the same time, for datasets with certain properties known a priori, the naive method may work quite well (e.g. a relatively small dataset, a mean that isn't too far from 0, etc.).

Would you like us to consider such cases as well?

The opposite direction is also an option. The one-pass method described by Andrey has its own limitations; there are more robust, but also more computationally intensive, methods for moment estimates. Should we be looking into those as well?

Many thanks!

Sergey


Actually, yes, if you could.

If MKL could switch from naive to one-pass based on parameters of the data, it would be awesome.

In fact, the current vsldSSNewTask function requires me to specify the number of rows.

So, if you switched to the naive method whenever the number of rows is not large, that would be great.

I could code that myself, but it would be cleaner if it were all built into MKL.

Also, I don't need this yet.

But, for future reference, how would I set up the computation if I don't yet know how many rows I will be feeding to the vsldSS task?

thanks, scott


Hi Scott,

Perfect. That certainly helps us to define the scope of experiments.

One of the Summary Statistics features is the ability to feed data in blocks, so-called streaming data processing. That may help. Andrey may have some comments on that as well.

Regards,

Sergey


You might want to have a look at the Summary Statistics Application Notes, http://software.intel.com/sites/products/documentation/hpc/mkl/sslnotes/index.htm, chapter "Processing Data in Blocks", which describes the usage model for streaming data processing. In the Intel MKL Manual, chapter Statistical Functions / Summary Statistics, section Task Computation Routines, you will find the table listing the estimates that support streaming processing. It's worth noting that besides basic estimates like variance-covariance or the mean, this feature is also available for quantiles.

Thanks,

Andrey


thanks, Scott
