Solved: Problem with geration method in random number generatior

stanleyimko · ‎07-27-2010

Hi, I am doing a linear regression simulation but found something confusing. I generate 1000 observations with the following design.

True model:

Model1: y_1 = x_1 + 0.9 * x_2 + 0.7 * x_3 + e_1
Model2: y_2 = x_1 + 0.9 * x_2 + 0.7 * x_3 + e_2

where x_1, x_2, x_3 following Uniform distribution in (0, 1), e_1 and e_2 are both Gaussian(0, 1).

I simulate e_1 and e_2 with different methods namely VSL_METHOD_DGAUSSIAN_BOXMULLER and VSL_METHOD_DGAUSSIAN_ICDF.

vdRngGaussian ( VSL_METHOD_DGAUSSIAN_BOXMULLER, stream, 1000, e_1, 0.0, 1.0 );

vdRngGaussian ( VSL_METHOD_DGAUSSIAN_ICDF, stream, 1000, e_2, 0.0, 1.0 );

And then I run two regressions. The first model gives a good estimate but the second one is very poor. Just a few trails:

Model1: 0.911859 0.995013 0.712112
Model2: 3.31478 -0.116816 - 0.232942

Model1: 0.945507 0.89906 0.746299
Model2: 3.40879 -0.149013 -0.306843

Model1: 1.10576 0.825391 0.694707
Model2: 3.32135 -0.122911 -0.276134

At the very beginning I use ICDF but the bias is very serious. It costs me a lot of time to figure out the source of the problem. But I still don't understand why it is so sensitive to the generating method. Thanks.

Andrey_N_Intel · ‎07-28-2010

Hi Stanley,

You initialize and remove the random stream immediately before and after the generation of random numbers. Output of the function time() is used to set the seed for initialization of the random stream. This could result in the same seed in the generation of both uniform and Guassian variates, the same uniform variates underneath Gaussian numbers and biased results.

May I suggest you to make initialization of the VSL random stream once, in the constructor of class sim? Deallocation of VSL RNG related resources can be done once in the destructor of the class. Thus, methods gaussian_1(), gaussian_2() and uniform() of the class sim would be just wrappers around MKL random number generators. Alternatively, you could use different seeds for generation of Gaussian and uniform numbers.This should help you to avoid potentially biased results in your simulations.

Please, let me know if this answers your question.

Thanks,
Andrey

View solution in original post

Andrey_N_Intel · ‎07-27-2010

Hello,

Can you please provide more details for your linear regression simulations:

1. Which Intel MKL version do you use? Which OS and mode, IA-32 or Intel 64?
2. How do you link against MKL?
3. Which basic random number generator(s) is(are) used for the simulations?
4. Do you provide the same sequence of uniform random numbers into both models, so the models are different in just source of Gaussian numbers?
5. Whatare the sample mean/other sample estimates in both models?Are theyclose to the theoretical?

Ideally, a short test case (the extract from your simulation application)that demonstrates the issue would be sufficient.

Thanks,
Andrey

stanleyimko · ‎07-27-2010

Hi, Andrey,

I am using MKL10.2 with Intel64, Intel C++ compiler 11.1, and my OS is OpenSUSE 11.2. I compile with the following link

-L$MKLPATH $MKLPATH/libmkl_solver_lp64.a -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -openmp -lpthread

and the following random number generator

Uniform: vdRngUniform ( VSL_METHOD_DUNIFORM_STD_ACCURATE, stream, 1000, x, 0.0, 1.0 );
Gaussian: vdRngGaussian ( VSL_METHOD_DGAUSSIAN_BOXMULLER, stream, 1000, e_1, 0.0, 1.0 );
vdRngGaussian ( VSL_METHOD_DGAUSSIAN_ICDF, stream, 1000, e_2, 0.0, 1.0 );

I use the same sequence of x's in two equations. all the sample mean and sd are consistent with the setting.

I just come up with the idea that ICDF is the inverse transformation form the uniform distribution. So my x's would be correlated with the e generating by ICDF. That's why the regression will be bias.

Thanks,

Stanley KO

Andrey_N_Intel · ‎07-27-2010

Hi Stanley,

Thank you much for the additional details. I still have a few questions below to clarify.
To my understanding, your computational scheme is as follows (please feel free to correct or/and add more details):

1. Create random stream

errcode = vslNewStream( &stream, BRNG, SEED );
Can you please specify value for BRNG and SEED parameters?

2. Generate random numbers

vdRngUniform ( VSL_METHOD_DUNIFORM_STD_ACCURATE, stream, 1000, x, 0.0, 1.0 );
vdRngGaussian ( method, stream, 1000, e, 0.0, 1.0 );

where method is VSL_METHOD_DGAUSSIAN_BOXMULLER or VSL_METHOD_DGAUSSIAN_ICDF and e is e_1 or e_2.

Both BoxMuller and ICDF methods transform uniform numbers to produce Gaussian sequence.
If you use pseudo-random number generator like MT19937 you would not expect that uniform numbers x_i and uniforms underneath Gaussian generator are correlated, and thus, no bias should be in your simulations. On the other hand, if you use SOBOL quasi-random number generator as the source of uniform random numbers generally you can run into correlated variates.
For those reasons it could be important to understand which BRNG and seed are used in the simulations.

3. Do regression computations.
According to the equations you provided you need 3 uniform numbers and one normal variate per one output. Does it make sense to generate 3000 uniforms and 1000 Gaussian numbers to obtain 1000 model outputs?

4. Delete stream

errcode = vslDeleteStream( &stream );

Thanks,
Andrey

stanleyimko · ‎07-27-2010

Dear Angrey,

I may post my source code. I have two files. One defines the class for random number generation and the other is the main file.

/******************CLASS FOR RANDOM NUMBER GENERATION*******************/
#include
#include "mkl.h"

#define BRNG VSL_BRNG_MT19937

using namespace std;

class sim
{
private:

VSLStreamStatePtr stream;

public:

int error_code;

void gaussian_1 ( int, double *, double, double );

void gaussian_2 ( int, double *, double, double );

void uniform ( int, double *, double, double );
};

void sim::gaussian_1 ( int N, double * r, double mean, double sigma ) // gaussian with BOXMULLER
{
vslNewStream( &stream, BRNG, time(0) );

error_code = vdRngGaussian ( VSL_METHOD_DGAUSSIAN_BOXMULLER, stream, N, r, mean, sigma );

vslDeleteStream ( &stream );
}

void sim::gaussian_2 ( int N, double * r, double mean, double sigma ) // gaussian with ICDF
{
vslNewStream( &stream, BRNG, time(0) );

error_code = vdRngGaussian ( VSL_METHOD_DGAUSSIAN_ICDF, stream, N, r, mean, sigma );

vslDeleteStream ( &stream );
}

void sim::uniform ( int N, double * r, double a, double b)
{
vslNewStream( &stream, BRNG, time(0) );

error_code = vdRngUniform ( VSL_METHOD_DUNIFORM_STD_ACCURATE, stream, N, r, a, b );

vslDeleteStream( &stream );
}

/*************************THE MAIN FILE***************************/

#include "./sim.h"

#define N 1000

int main()
{
double mean = 0.0;
double sigma = 1.0;

double a = 0.0;
double b = 1.0;

double e_1;
double e_2;

sim eps;

eps.gaussian_1 ( N, e_1, mean, sigma ); // generate e_1
eps.gaussian_2 ( N, e_2, mean, sigma ); // generate e_2

double x_1;
double x_2;
double x_3;

double * tem_x;
tem_x = new double[3*N];

eps.uniform ( 3*N, tem_x, a, b ); // generate x_1, x_2, x_3

for ( int i = 0; i < N; i++)
{
x_1 = tem_x;
x_2 = tem_x[N+i];
x_3 = tem_x[(2*N)+i];
}
delete[] tem_x;

double y_1;
double y_2;

for( int i = 0; i < N; i++ )
{
y_1 = x_1 + 0.9 * x_2 + 0.7 * x_3 + e_1; // model 1
y_2 = x_1 + 0.9 * x_2 + 0.7 * x_3 + e_2; // model 2
}

return 0;
}

Thanks for your time.

Stanley KO

Andrey_N_Intel · ‎07-28-2010

Hi Stanley,

You initialize and remove the random stream immediately before and after the generation of random numbers. Output of the function time() is used to set the seed for initialization of the random stream. This could result in the same seed in the generation of both uniform and Guassian variates, the same uniform variates underneath Gaussian numbers and biased results.

May I suggest you to make initialization of the VSL random stream once, in the constructor of class sim? Deallocation of VSL RNG related resources can be done once in the destructor of the class. Thus, methods gaussian_1(), gaussian_2() and uniform() of the class sim would be just wrappers around MKL random number generators. Alternatively, you could use different seeds for generation of Gaussian and uniform numbers.This should help you to avoid potentially biased results in your simulations.

Please, let me know if this answers your question.

Thanks,
Andrey

stanleyimko · ‎07-28-2010

Dear Andrey,

Thanks so much for your correction. I define a constructor and a destructor and everything looks fine. But I still don't understand why BOXMULLER will work but ICDF doesn't.

Thanks so much.

Regards,

Stanley KO

Andrey_N_Intel · ‎07-28-2010

Hi Stanley,

The output of the uniform generator is u= (u(i)), i=1,...,3*n. If the random stream is initialized with the same seed the same unifrom variates, n in case of ICDF and 2*n in case of Box-Muller would be used in Gaussian generator.

If ICDF method is used for generation of normal variates then the output of the Gaussian generator x=(x(i)), i=1,...,n, is just a one to one transformation of the same sequence u: x(i) = sqrt(2) erfinv( u(i) ) (section 9.3.4 of VSLNotes). In this case the simulations would result in biased results.

Box-Muller transformation is more complex and requires a pair of uniform variates to produce one Gaussian number:
x(i) = sqrt( -2 ln( u(2*i) ) ) sin (2 * pi * u(2*i+1) ) (section 9.3.2 of VSLNotes).

Let us consider your regression scheme in more details. To produce regression output you need 3 uniforms and 1 normal number that is, according to your testcase

y(0) needs (u(0), u(N+0), u(2N+0), x(0) ), x(0) uses ( u(0), u(1) )
y(1) needs (u(1), u(N+1), u(2N+1), x(1) ), x(1) uses ( u(2), u(3) )
y(2) needs (u(2), u(N+2), u(2N+2), x(2) ), x(2) uses ( u(4), u(5) )
y(3) needs (u(3), u(N+3), u(2N+3), x(3) ), x(3) uses ( u(6), u(7) )
etc

Thus, in case of Box-Muller transformation you would expect that the results are biased in "less degree", but the dependence between y(i)looks to be possible. So, it really makes sense to avoid the usage model of the test case that potentially can results in biased simulation outputs.

Please, let me know if this helps.

Thanks,
Andrey