
tennican

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

11-08-2011
05:30 PM

175 Views

MKL Summary Statistics computation of the mean of constant data is off

I would expect it to compute 2.0000000000000000, as does the by-hand implementation below, executing on the same data in the same place.

```cpp
double testMean = 0.0;
for ( UINT idx = 0; idx < m_data.size(); idx++ ) {
    testMean += m_data[idx];
}
testMean /= m_data.size();
```


12 Replies

Andrey_N_Intel

Employee


11-09-2011
05:21 AM


The MKL Summary Statistics algorithm for computation of the mean is based on a one-pass/online scheme: M(n+1) = 1/(n+1) * (n * M(n) + X(n+1)), where X(n) is the n-th observation and M(n) is the mean estimate for the dataset X(1), ..., X(n). Generally, the parameters in this formula are not exactly represented in FP arithmetic (for example, the coefficient 1/(n+1) = 1/3 for n = 2), thus you see a result that differs from the ideal one by ~10^(-15), which is aligned with double-precision FP error. On the other hand, the intention behind the algorithm is to produce more stable results for various types of datasets, not just results that are absolutely exact for this simple dataset.

We also need to pay attention to the following important fact: in some (not all) applications the statistical error dominates the numerical error. If we estimate the mean for a realistic dataset (one which contains a non-constant "signal"), the statistical error would be on the order of ~1/sqrt(500) ≈ 0.044, which makes the numerical error of 10^(-15) absolutely negligible.

Does it answer your question?

Thanks,

Andrey

SergeyKostrov

Valued Contributor II


11-09-2011
07:06 AM


Hi Andrey,

Look, here is a modified test for the '**float**' ( single-precision ) and '**double**' ( double-precision ) data types:

```cpp
float  fData[500] = { 0.0f };
double dData[500] = { 0.0 };

float  fTestMean = 0.0f;
double dTestMean = 0.0;

UINT idx;

for ( idx = 0; idx < 500; idx++ )
{
    fData[idx] = 2.0f;
    dData[idx] = 2.0;
}

for ( idx = 0; idx < 500; idx++ )
{
    fTestMean += fData[idx];
}
fTestMean /= ( float )500;

for ( idx = 0; idx < 500; idx++ )
{
    dTestMean += dData[idx];
}
dTestMean /= ( double )500;

printf( "Mean for 'Float test-case'  = %.32f\n", fTestMean );
printf( "Mean for 'Double test-case' = %.32f\n", dTestMean );
```

Output:

```
Mean for 'Float test-case'  = 2.00000000000000000000000000000000
Mean for 'Double test-case' = 2.00000000000000000000000000000000
```

As you see, the results from the **single-precision** ( float, 24-bit significand ) test-case and the **double-precision** ( double, 53-bit significand ) test-case are identical! I'm absolutely not convinced by your explanation:

>>...~10^(-15) (**what is aligned with double FP error**)...

There are no miracles in my test-case, because in the **IEEE 754 standard** the number **2.0** has an exact binary representation in both formats: 0x40000000 in single precision and 0x4000000000000000 in double precision.

Best regards,

Sergey

Andrey_N_Intel

Employee


11-09-2011
08:05 AM


Thanks for your inputs.

My comments above are related not to the straightforward summation followed by division, but to the one-pass scheme for computation of the mean, M(n+1) = 1/(n+1) * (n * M(n) + X(n+1)), n = 1, ..., N (N = 500), which is a different algorithm and which generally can return a result slightly different from 2.0 in this particular case; this is what Scott sees with MKL Summary Stats.

In your test case you just accumulate a sum of 2s (which have an exact representation) and then normalize it by N = 500. The result has an exact representation as long as the number of summands is less than or equal to a specific threshold defined by the data type; for single precision it is 2^24. Once you exceed the threshold, the estimate of the mean degrades: if N = 2^25 and the computations are done in single precision, the output of summation followed by division is 1.0, while the output of the one-pass algorithm is ~2.0.

Does it explain the difference?

Thanks,

Andrey

mecej4

Black Belt


11-09-2011
09:12 AM


The MKL algorithm, as he explained it, is a **one-pass** algorithm for computing not only the mean but other statistics such as the standard deviation.

Another way of looking at the MKL calculation of the mean is as an **update** algorithm for use with streaming data. As new data become available, the statistics are updated. As a side effect of how this calculation is done, as early as after taking in the third data point, the result is no longer exact, since 1/3 is not exact in floating point.

For real data, again as Andrey explained, the errors arising from the finite precision of floating point are negligible compared to the noise in the data. Therefore, the failure of MKL to deliver an exact result for a case where a different algorithm can yield an exact result is not a significant deficiency.

tennican

Beginner


11-09-2011
11:05 AM


Thanks for your explanation. I understand that the discrepancy is unavoidable for the single pass method.

Of course, when you display results for special cases like this to unsophisticated users, you need to give them what they expect. In this case, formatting the result to 14 decimal places does the job.

The reason I noticed is that the slight discrepancy in the mean broke the workaround code, described in my previous thread, for the special case of constant data.

The code below results in skewness_est = -1.#IND and kurtosis_est = 21530895567645716.

However, for this case both skewness_est and kurtosis_est should be -1.#IND.

And this is in fact the result when I replace mean_est with the exact mean, which yields data of all zeros.

None of this is a real problem because I can just interpret huge values of kurtosis as an error return.

thanks for your help, Scott

```cpp
if ( cen2mom == cen2mom && ( cen2mom < 1.0e-14 || abs( trueMean ) > 1.0e8 * cen2mom ) ) {
    vector<double> data;
    for ( int idx = 0; idx < m_count; idx++ ) {
        data.push_back( m_data[idx] - trueMean );
    }
    status = vsldSSNewTask( &task, &numCols, &numRows, &xstorage, &(data[0]), 0, indices );
    // Set up moment computation. Computing kurtosis requires all the other moments.
    status = vsldSSEditTask( task, VSL_SS_ED_SKEWNESS, &skewness_est );
    status = vsldSSEditTask( task, VSL_SS_ED_KURTOSIS, &kurtosis_est );
    status = vsldSSEditMoments( task, &mean_est, &raw2mom, &raw3mom, &raw4mom, &cen2mom, &cen3mom, &cen4mom );
    // Request moment computation
    mask = VSL_SS_MEAN | VSL_SS_KURTOSIS | VSL_SS_SKEWNESS |
           VSL_SS_2R_MOM | VSL_SS_3R_MOM | VSL_SS_4R_MOM |
           VSL_SS_2C_MOM | VSL_SS_3C_MOM | VSL_SS_4C_MOM;
    status = vsldSSCompute( task, mask, VSL_SS_METHOD_FAST );
    // Remove the task
    status = vslSSDeleteTask( &task );
}
```

SergeyKostrov

Valued Contributor II


11-09-2011
04:50 PM


Do you calculate it as biased or unbiased? Can I specify which?

Best regards,

Sergey

Andrey_N_Intel

Employee


11-09-2011
10:30 PM


As specified in the Intel MKL Manual (chapter Statistical Functions / Summary Statistics, section Math Notations and Definitions) and in the Summary Statistics Application Notes (chapter Algorithms and Interfaces in Summary Statistics, section Calculating Multiple Estimates), the library provides unbiased estimates for the expectation, variance, and variance-covariance matrix, and biased estimates for the central moments of 3rd and 4th order.

Thanks,

Andrey

tennican

Beginner


11-10-2011
11:04 AM


I believe the vsldSS task allows us to request the second central moment, which is the population variance, i.e., the biased sample variance.

But you can easily convert it to the unbiased sample variance by multiplying by N/(N-1), where N is the sample size.

Scott

Andrey_N_Intel

Employee


11-10-2011
10:00 PM


Just to make sure that we are on the same page: Intel MKL Summary Stats returns unbiased variance.

If for some reason we need the biased estimate, we can get it by multiplying by (N-1)/N.

Thanks,

Andrey

tennican

Beginner


11-11-2011
02:46 PM


I just tested and you are absolutely right.

Requesting VSL_SS_2C_MOM, I get the unbiased sample variance.

The 2C_MOM in the name led me to think of the second central moment, which usually refers to the biased sample variance.

thanks for straightening me out, Scott

SergeyKostrov

Valued Contributor II


11-14-2011
07:04 AM


I've spent extra time on this issue; I wanted to reproduce the rounding problem ( or precision loss ) with the mean in a simple C test-case.

On http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance there are 5 algorithms with pseudo-code to calculate the variance. I've implemented four of them in C to calculate the standard deviation.

The most interesting thing is that in the case of the One-Pass\Online algorithm I could not reproduce the problem!

The mean for a data set with 500 values of 2.0 was 2.0.

Take a look at my test-case for the One-Pass algorithm, and let me know if you're interested in getting the complete test-case with all 4 algorithms ( Naive, Two-Pass, Compensated and One-Pass ).

```cpp
...
//#define _RTTYPE float
#define _RTTYPE double
//#define _RTTYPE long double

RTint iN = 0;
RTint i  = 0;

_RTTYPE dX[500] = { 0.0 }; // Data Set 8

iN = 500;
for ( i = 0; i < iN; i++ )
    dX[i] = 2.0;

RTint iB = 0; // 0 - UnBiased \ 1 - Biased
...
// Sub-Test 5 - One-Pass Variance & StdDev
{
    /*
    Def One_Pass_Variance( Data ):
        N     = 0
        Mean  = 0
        M2    = 0
        Delta = 0
        for x in Data:
            N     = N + 1
            Delta = x - Mean
            Mean  = Mean + ( Delta / N )
            M2    = M2 + Delta * ( x - Mean )  # This expression uses the new value of Mean
        Variance = M2 / ( N - B )
        return Variance
    */
    _RTTYPE dVariance = 0.0;
    _RTTYPE dStdDev   = 0.0;
    _RTTYPE dMean     = 0.0;
    _RTTYPE dM2       = 0.0;
    _RTTYPE dDelta    = 0.0;

    for ( i = 0; i < iN; i++ )
    {
        dDelta = dX[i] - dMean;
        dMean  = dMean + ( dDelta / ( i+1 ) );
        dM2    = dM2 + dDelta * ( dX[i] - dMean );
    }
    dVariance = dM2 / ( iN - iB );
    dStdDev   = sqrt( dVariance );

    CrtPrintf( RTU("One-Pass StdDev: %.24f\n"), ( _RTTYPE )dStdDev );
    CrtPrintf( RTU("           Mean: %.24f\n"), ( _RTTYPE )dMean );
    CrtPrintf( RTU("\n") );
}
...
```

**Results:**

```
Naive StdDev:       0.000000000000000000000000
Mean:               2.000000000000000000000000

Two-Pass StdDev:    0.000000000000000000000000
Mean:               2.000000000000000000000000

Compensated StdDev: 0.000000000000000000000000
Mean:               2.000000000000000000000000

One-Pass StdDev:    0.000000000000000000000000
Mean:               2.000000000000000000000000
```

Fabio_G_

Beginner


03-23-2014
12:00 AM


@Sergey

Sorry, but what would you expect from your code? You are summing TWOs! Try with normally distributed random numbers and, as Andrey said, your error in computing the mean and variance will scale as ~1/sqrt(nsamples).

I'm running Monte Carlo simulations, and the accuracy of the normal RNG (in mean and variance) can be destructive to the fine structure of the physics. Increasing the sample space can be a viable solution, but sometimes (depending on the physics of the system) it can be even worse. A better choice would be to reject a random realization based on a threshold on its mean and/or variance, although that is VERY computationally expensive.
