Solved: VSL Summary Statistics skewness and kurtosis are incorrect

tennican · ‎07-01-2011

I computed mean, variance, skewness and kurtosis for a test vector.
The mean was correct and the variance was just a little off.
But, the MKL estimates for skewness and kurtosis totally disagree with the R moments package.

R says:
> skewness(X)
[1] 2.400606
> kurtosis(X)
[1] 14.15447

MKL says:
skewness_est = -3.9303471475184684
kurtosis_est = 5.7271436907737581

Can anyone see what I did wrong in the code below:

int status;
VSLSSTaskPtr task;
MKL_INT xstorage = VSL_SS_MATRIX_STORAGE_ROWS;
MKL_INT numCols = 1;
MKL_INT numRows = 1419;
int indices[1] = {1};
status = vsldSSNewTask( &task, &numCols, &numRows, &xstorage, &(X[0]), 0, indices );
double mean_est = X[0];
double raw2mom = X[0];
double raw3mom = X[0];
double raw4mom = X[0];
double variance_est = X[0];
double cen3mom = X[0];
double cen4mom = X[0];
double skewness_est = X[0];
double kurtosis_est = X[0];
status = vsldSSEditTask(task, VSL_SS_ED_SKEWNESS, &skewness_est);
status = vsldSSEditTask(task, VSL_SS_ED_KURTOSIS, &kurtosis_est);
status = vsldSSEditMoments( task, &mean_est, &raw2mom, &raw3mom, &raw4mom,
&variance_est, &cen3mom, &cen4mom );
status = vsldSSCompute(task, VSL_SS_MEAN|VSL_SS_2C_MOM|VSL_SS_SKEWNESS|VSL_SS_KURTOSIS,
VSL_SS_METHOD_FAST);

X is composed of the attached data

mecej4 · ‎07-02-2011

You mixed up the sample variance and the second order central moment. The two, while being closely related, are not identical

I believe that there is a typographical error in the value reported to be given by R for the kurtosis.

Here is a complete program to read the data file and print the statictics (suggestion: put the data into a file and attach the file to your post rather than putting the voluminous data inline).

[cpp]#include 
#include 

int main(int argc,char *argv[]){        
  int status; double *X; int i;
  VSLSSTaskPtr task;
  MKL_INT xstorage = VSL_SS_MATRIX_STORAGE_ROWS;
  MKL_INT numCols = 1, numRows = 1419;
  int indices[1] = {1};
  unsigned MKL_INT64 mask = 0;
  
  X=(double *)malloc(numRows*sizeof(double));
  for(i=0; i With this program, I get

[bash]s:lang>icl /Qmkl /Ot stats.cpp

s:lang>stats < dat
mean =  1.0505e+004,  var =  1.0115e+000
skew =  2.3981e+000, kurt =  1.1135e+001

[/bash]

View solution in original post

mecej4 · ‎07-02-2011

You mixed up the sample variance and the second order central moment. The two, while being closely related, are not identical

I believe that there is a typographical error in the value reported to be given by R for the kurtosis.

Here is a complete program to read the data file and print the statictics (suggestion: put the data into a file and attach the file to your post rather than putting the voluminous data inline).

[cpp]#include 
#include 

int main(int argc,char *argv[]){        
  int status; double *X; int i;
  VSLSSTaskPtr task;
  MKL_INT xstorage = VSL_SS_MATRIX_STORAGE_ROWS;
  MKL_INT numCols = 1, numRows = 1419;
  int indices[1] = {1};
  unsigned MKL_INT64 mask = 0;
  
  X=(double *)malloc(numRows*sizeof(double));
  for(i=0; i With this program, I get

[bash]s:lang>icl /Qmkl /Ot stats.cpp

s:lang>stats < dat
mean =  1.0505e+004,  var =  1.0115e+000
skew =  2.3981e+000, kurt =  1.1135e+001

[/bash]

Andrey_N_Intel · ‎07-02-2011

Hello,
Thanks for the communication.
Please, give us some time to reproduce your results;we will then return back to you.
Best,
Andrey

Andrey_N_Intel · ‎07-04-2011

Hello,

We reproduced your results and compared MKLversion of simple statistics against R package "Moments". Below is the summary of the observations and comments:

1.In computation of variance R package uses normalization n, where n is size of dataset x = (x(i)), i=1,..,n. MKL uses normalization n-1 which provides unbiasedness of the estimate. To be more more precise, MKL uses B=W-sum (w(i)^2) / W, W= sum (w(i)) where w(i) is weight assigned to observation x(i). With all weights w(i) set to 1 B is equal to n-1.

2.MKLversion of kurtotis is based on the algorithm K(x) = C4(x) / V(x) ^ 2 - 3where C4 is the central moment of 4th order, V(x) - variance. R package basically uses the same formula but without term "-3". Term "-3" is set to have thisestimate equal to zero for Gaussian distribution.
MKL versions ofcentral moment of 4th order and variance use normalizations n and n-1, correspondingly, while R uses normalization n in both cases.This can contribute insome numerical differences between R and MKLestimates of kurtosis.

Please, see Section "Mathematical Notation and Definitions" of Chapter Statistical Functions/ Summary Statistics in Intel MKL Manual for the additional detailsbehind the algorithms of Summary Stats in MKL.

3.Theimplementation ofkurtosis in MKL contains the bug which appears in the settings of the test case in the first post. Welocalized and fixed it. The fix will be available inone of future releases of the library. Currently, the work-around for thebug is to request "simultaneous" computation of kurtosis and central moments of 2nd and 4th order.

Please, let me know if this answers yourquestions.
Please, feel free to submit morequestions/feedbacks on Statistical Component of MKL,and we will help.

Best,
Andrey

tennican · ‎07-05-2011

Thanks very much. You solved my problem.

I think the main problem was that I hadn't requested the computation of all the precursors to skewness and kurtosis which you do correctly in your program and Audrey mentions is necessary below.
And R reports kurtosis rather than excess kurtosis as Audrey says.

Also, when I export the correct data to validate in R, sd(sales) in R equals sqrt(cen2mom) from MKL.

The text below is crossed out to avoid confusion:
I'm not so sure about the variance being different from the 2nd central moment.
I think you thought I wanted to compute the coefficient of variation which is quite different from variance.

There are small differences in the estimates of R var and MLK cen2mom.
However, R does report the unbiased not the biased estimate of variance.
So, the difference is not due to that.
I computed the biased and unbiased variance by hand below and the unbiased one is exactly var(X).
Also, I computed the variance from the coefficient of variation given by MKL and got a number
quite close to the cen2mom computed by MKL as you would expect.

Anyway, the estimates are close enough for my current purpose. So, I'm happy.

> var(X)
[1] 112904318
> N = length(X)
> myvar = sum((X-mean(X))^2)/N
> myvar
[1] 112824752
> myvar*N/(N-1)
[1] 112904318
>
> cen2mom = 112904462.89915065
> variation_est = 1.0114540252081452
> (variation_est * abs(mean(X)))^2
[1] 112904489

Andrey_N_Intel · ‎07-06-2011

Hi Scott,

We ranthe test case with your data using double version of Summary Statistics functions on Nehalem CPU, Linux, in both IA-32 and Intel 64modesand obtained the followingoutput for mean and variance with MKL 10.3 (I do not provide the numbers forother estimates):

Mean = 10505.324172 Variance = 112904318.274247

That is, the value of variance is aligned with R output.
On the other hand, you communicate a different value: cen2mom = 112904462.89915065.
It might make sense to sync-up the results obtained on your and our sides.

Can you please clarify your environment:
1. MKL version
2. OS
3. CPU
4. Mode, IA-32 or Intel 64
5. Build/link line
6. Precision, double of float

Thanks,
Andrey

tennican · ‎07-06-2011

Hi Andrey,

I'm currently building within a larger project.
I did briefly try to build this example standalone but didn't want to waste time figuring out how to get around the fact that I don't want to install libiomp5md.dll on my computer.

### Environment
Using MKL 10.3 - ia32 and double precision 'd' functions
In C++ in MSVS 2010
Windows 7 Professional SP1 - 64-bit
Intel Xeon CPU dual-proc E5520 @ 2.27GHz

### To build
### I copied libiomp5md.dll and libiomp5md.lib into the mkl/lib/ia32 directory and used:

mkl_rt.lib

$(mkl-root)\include

$(mkl-root)\lib\ia32

$(mkl-include)%(AdditionalIncludeDirectories)

$(IntelMklLib);%(AdditionalDependencies)

$(mkl-lib)%(AdditionalLibraryDirectories)

thanks for your help, Scott

tennican · ‎07-06-2011

Hi Andrey,

The discrepancy is due to an data export error.
I just realized that the data I was exporting to validate in R was rounded to dollars.
Now that I have fixed that, I get exactly the same variance in R as with MKL.

sorry for the confusion, Scott

Andrey_N_Intel · ‎07-06-2011

Hi Scott,
It's good to know that we resolved the issue.
Please, feel free to askquestions on Statistical Component of Intel MKL, weare ready to help.
Best,
Andrey

mecej4 · ‎07-07-2011

To avoid confusing readers of this thread, and if you do not mind doing so, please edit your earlier post and make necessary corrections. The overstrike button may be used to cross-out incorrect parts.

Thank you.

Gennady_F_Intel · ‎09-07-2011

quote "3.Theimplementation ofkurtosis in MKL contains the bug which appears in the settings of the test case in the first post. Welocalized and fixed it. The fix will be available inone of future releases of the library. ....."

The problem has been already fixed in the latest version MKL 10.3.Update 6 available now for download.

The official anounce of this release you can find here.

--Gennady