multithreading vsl

aani · ‎11-17-2009

Hello,

Below is how I use VSL MT2203 in multi-threaded mode.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include "mkl_vsl.h"

#define samples 2001
#define nstreams 512

clock_t t1,t2;
int main()
{
    unsigned int init[4]={0x123, 0x234, 0x345, 0x456};
    int n=4,i=0,nThread=0;
    omp_set_num_threads(4);
    nThread = omp_get_num_threads();

    float *r;
    r = (float *)malloc(samples*(sizeof(float)));

    VSLStreamStatePtr stream[nstreams];

    /* Initializing: one MT2203 family member per stream */
    for(int k =0;k< nstreams;k++)
        vslNewStreamEx( &stream[k], VSL_BRNG_MT2203+k, n, init );

    /* Generating */
    t1 = clock();

    #pragma omp parallel for
    for(int k =0;k< nstreams;k++){
        vsRngUniform( VSL_METHOD_SUNIFORM_STD,stream[k], samples, (float *)r, 0.0f, 1.0f );
    }

    t2 = clock();
    printf(" t in seconds :%f \n",(t2-t1)*0.001);
    printf(" samples/sec:%E \n",(samples*nstreams)/((t2-t1)*0.001));

    /* Deleting the streams */
    for(int k =0;k< nstreams;k++)
        vslDeleteStream( &stream[k] );
    return 0;
}

With #pragma omp parallel for, the entire for loop is distributed among the threads. Suppose I am testing on a quad-core machine: if I set the number of OpenMP threads to 4 and the loop count is 100, each thread will iterate 25 times through the loop, right?

Is the method above a correct way to test VSL MT2203 in multi-threaded mode?

Also, how do I test MT19937 in multi-threaded mode?

VSLStreamStatePtr stream;

/* Initializing */
vslNewStreamEx( &stream, VSL_BRNG_MT19937, n, init );

/* Generating */
t1 = clock();

vsRngUniform( VSL_METHOD_SUNIFORM_STD,stream, samples, (float *)r, 0.0f, 1.0f );

t2 = clock();
printf(" t in seconds :%f \n",(t2-t1)*0.001);
printf(" samples/sec:%E \n",samples/((t2-t1)*0.001));

/* Deleting the streams */
vslDeleteStream( &stream );

I want all the threads to participate in the execution of vsRngUniform. For this, do I need to call mkl_set_num_threads?

i.e. mkl_set_num_threads(4);
vsRngUniform( VSL_METHOD_SUNIFORM_STD,stream, samples, (float *)r, 0.0f, 1.0f );

Is this correct?

Thanks in advance.

aani

Andrey_N_Intel · ‎11-17-2009

Hello,

Intel MKL VSL supports three methods for parallelizing an application at the user level: the block-splitting technique, the leapfrogging technique, and families of basic random number generators (BRNGs).
Block-splitting and leapfrogging split the original sequence of random numbers into subsequences and "assign" each subsequence to a separate thread. MT2203 and WH (Wichmann-Hill) are two examples of VSL generators that provide a family of BRNGs. In particular, the MT2203 BRNG is a set of 1024 BRNGs, which allows generating up to 1024 independent streams, while the Wichmann-Hill BRNG is a set of 273 BRNGs.
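
To make the difference between the two splitting techniques concrete, here is a minimal sketch of which elements of the original sequence each thread ends up consuming. The function names are hypothetical and are not part of VSL; in VSL itself the corresponding stream services are vslSkipAheadStream (block-splitting) and vslLeapfrogStream (leapfrogging).

```c
#include <assert.h>
#include <stddef.h>

/* Block-splitting: thread t owns one contiguous block of block_len numbers.
   Returns the position, in the original sequence, of the j-th number
   that thread t consumes. */
size_t block_split_index(size_t t, size_t block_len, size_t j)
{
    assert(j < block_len);
    return t * block_len + j;
}

/* Leapfrogging: thread t takes every nthreads-th number, starting at t.
   Returns the position, in the original sequence, of the j-th number
   that thread t consumes. */
size_t leapfrog_index(size_t t, size_t nthreads, size_t j)
{
    assert(t < nthreads);
    return t + j * nthreads;
}
```

With 4 threads and blocks of 5 numbers, thread 1 reads positions 5, 6, 7, 8, 9 under block-splitting and positions 1, 5, 9, 13, ... under leapfrogging.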

Your example correctly demonstrates how to use the MT2203 generator to obtain 512 random sequences in a multi-threaded environment. However, please make sure that the outputs of the different MT2203 streams are written to separate memory and do not overwrite each other.
If you use 4 threads and expect each thread to generate 512/4 = 128 random streams, you might want to use the "static" value for the OpenMP* "schedule" clause. You can expect roughly a 4x speed-up for this piece of the code versus its serial variant.
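
The two points above (separate output memory per stream, and a static schedule) can be sketched as follows. This is an illustration only: fill_from_stream is a hypothetical stand-in for the per-stream vsRngUniform call, so the sketch compiles without MKL, and the sizes are taken from the example in the question.

```c
#include <assert.h>
#include <stdlib.h>

#define NSTREAMS 512
#define SAMPLES  2001

/* Hypothetical stand-in for a per-stream generator call such as
   vsRngUniform(method, stream[k], SAMPLES, dst, 0.0f, 1.0f). */
static void fill_from_stream(int k, float *dst, int n)
{
    for (int j = 0; j < n; j++)
        dst[j] = (float)k;              /* tag each value with its stream id */
}

/* Fills out[NSTREAMS * SAMPLES]; returns 1 if no slice was overwritten. */
int generate_all(float *out)
{
    assert(out != NULL);

    /* schedule(static) without a chunk size gives each thread one
       contiguous range of k: with 4 threads, 128 streams per thread. */
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < NSTREAMS; k++)
        fill_from_stream(k, out + (size_t)k * SAMPLES, SAMPLES);

    /* Check that every slice still holds only its own stream's values. */
    for (int k = 0; k < NSTREAMS; k++)
        for (int j = 0; j < SAMPLES; j++)
            if (out[(size_t)k * SAMPLES + j] != (float)k)
                return 0;
    return 1;
}
```

The key detail is the destination pointer out + k * SAMPLES: stream k writes only to its own slice, so threads never touch the same memory.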

The VSL version of the MT19937 BRNG does not support the parallelization methods described above. Our suggestion is to use the MT2203 BRNG, or any other VSL generator that supports parallel computations and meets the requirements of your application.

The VSL library is not internally threaded, with the exception of the multivariate version of the Gaussian generator. For this reason, setting the number of threads with mkl_set_num_threads() has no effect on the behavior of the vsRngUniform generator, which remains serial.

More details about support for parallel computations are available in the VSL Notes at
http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vslnotes.pdf

Please, let me know if this answers your questions. Also, please, feel free to ask more questions if any.

Thanks,
Andrey

aani · ‎11-18-2009

Thanks a lot for your answer.

Regards,
aani

aani · ‎11-22-2009

Hello Andrey,

Below is the sample code for MRG32K3A in a multi-threaded environment.
Please let me know whether it is correct.

Thanks,
aani

###############################################

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include "mkl_vsl.h"

#define samples 33554432
#define BLOCKS 16384

clock_t t1,t2;

int main()
{
    int n=4,i=0,nThread=0;
    int seed =1,N_PER_BLOCK,RAND_SAMPLES;

    /* Round up so that BLOCKS * N_PER_BLOCK >= samples */
    N_PER_BLOCK = (samples / (BLOCKS)) + (((samples % (BLOCKS)) == 0)? 0 : 1);
    RAND_SAMPLES = BLOCKS * N_PER_BLOCK;

    float *r = (float *)malloc(N_PER_BLOCK*(sizeof(float)));
    float *Rand_VSL = (float *)malloc(RAND_SAMPLES * sizeof(float));

    VSLStreamStatePtr stream1;
    VSLStreamStatePtr *stream1data = (VSLStreamStatePtr *)malloc(BLOCKS*sizeof(VSLStreamStatePtr));

    // Initializing rng
    vslNewStream( &stream1, VSL_BRNG_MRG32K3A,seed );

    omp_set_num_threads(4);
    nThread = omp_get_num_threads();

    #pragma omp parallel for
    for (i = 0; i < BLOCKS; i++) {
        vslCopyStream(&(stream1data[i]),stream1);
        vslSkipAheadStream(stream1,N_PER_BLOCK);
    }

    // Generating rng
    t1 = clock();

    #pragma omp parallel for
    for (i = 0; i < BLOCKS; i++) {
        vsRngUniform( VSL_METHOD_SUNIFORM_STD,stream1data[i], N_PER_BLOCK, (float *)r, 0.0f, 1.0f );
        for(int j = 0; j < N_PER_BLOCK; j++){
            Rand_VSL[i*N_PER_BLOCK+j] = r[j];
        }
    }

    t2 = clock();
    printf(" t in seconds :%f \n",((t2-t1))*0.001);
    printf(" samples/sec:%E \n",(RAND_SAMPLES/((t2-t1)*0.001)));

    // Deleting the streams
    vslDeleteStream( &stream1);

    return 0;
}

Andrey_N_Intel · ‎11-23-2009

Hello Aani,

At first glance, you might want to modify several lines of your code.

1. When you initialize block-splitting for the MRG32K3A BRNG, please make sure that you skip the proper number of random variates. The modified code is below:

for (i = 0; i < BLOCKS; i++)
{
    vslCopyStream(&(stream1data[i]), stream1);
    vslSkipAheadStream(stream1, i*N_PER_BLOCK);
}

2. When you generate uniformly distributed random numbers, please make sure the results generated in different streams do not overwrite each other. So, the suggested change is

for (i = 0; i < BLOCKS; i++)
{
    vsRngUniform( VSL_METHOD_SUNIFORM_STD, stream1data[i], N_PER_BLOCK, &(Rand_VSL[i*N_PER_BLOCK]), 0.0f, 1.0f );
}

No additional data copying is required in this case.

3. Also, when generation is completed, please delete the streams that are stored in the stream1data array:

for (i = 0; i < BLOCKS; i++)
{
    vslDeleteStream( &(stream1data[i]) );
}

4. You would also need to free the memory your application allocated, using the system routine free().

Please, let me know how it works for you.

Best regards,
Andrey

aani · ‎11-23-2009

Hello Andrey,

Thanks for your reply. Below are my findings.

1. When you initialize block-splitting for MRG32K3A BRNG, please, make sure that you skip proper number of random variates. The modified code is below:

for (i = 0; i < BLOCKS; i++)
{
    vslCopyStream(&(stream1data[i]), stream1);
    vslSkipAheadStream(stream1, i*N_PER_BLOCK);
}

This will not work for my code. Skipping i*N_PER_BLOCK elements is required only if we initialize the stream inside the for loop; otherwise it will generate the same numbers in each stream block. Since I initialize the stream outside the for loop, the loop I posted earlier works fine for me.

/* Initializing */
vslNewStream( &stream1, VSL_BRNG_MRG32K3A,seed );

for (i = 0; i < BLOCKS; i++) {
    vslCopyStream(&(stream1data[i]),stream1);
    vslSkipAheadStream(stream1,N_PER_BLOCK);
}

Regards,
aani

Andrey_N_Intel · ‎11-24-2009

Hello Aani,

In the code I suggested for initialization of the block-splitting, the first parameter of the vslSkipAheadStream() routine should be changed: instead of stream1, we need to pass stream1data[i]. Below is the modified code.

for ( i = 0; i < BLOCKS; i++ )
{
    vslCopyStream( &(stream1data[i]), stream1 );
    vslSkipAheadStream( stream1data[i], i*N_PER_BLOCK );
}
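
As a way to check the copy-then-skip pattern without MKL, here is a minimal sketch that uses a toy 64-bit linear congruential generator in place of a VSL stream. Everything in it (toy_stream, toy_next, toy_skip_ahead, the LCG constants) is a hypothetical stand-in chosen for illustration; it is not MRG32K3A and not the VSL implementation. It copies a base state, advances each copy by i*n_per_block, and confirms that the blocks, read one after another, reproduce the sequential sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Toy LCG standing in for a VSL stream (assumption, not MRG32K3A). */
typedef struct { uint64_t state; } toy_stream;

#define LCG_A 6364136223846793005ULL
#define LCG_C 1442695040888963407ULL

static uint64_t toy_next(toy_stream *s)
{
    s->state = s->state * LCG_A + LCG_C;
    return s->state;
}

/* Advance s by n steps in O(log n) by repeatedly squaring the
   affine map x -> a*x + c (arithmetic is modulo 2^64). */
static void toy_skip_ahead(toy_stream *s, uint64_t n)
{
    uint64_t a = LCG_A, c = LCG_C;     /* map for the current power of two */
    uint64_t acc_a = 1, acc_c = 0;     /* accumulated map, starts as identity */
    while (n > 0) {
        if (n & 1) {
            acc_a = acc_a * a;
            acc_c = acc_c * a + c;
        }
        c = (a + 1) * c;               /* square the map: uses the old a */
        a = a * a;
        n >>= 1;
    }
    s->state = acc_a * s->state + acc_c;
}

/* Returns 1 if the block-split streams reproduce the sequential sequence. */
int block_split_matches_sequential(uint64_t seed, int blocks, int n_per_block)
{
    assert(blocks > 0 && n_per_block > 0);
    toy_stream base = { seed };
    toy_stream seq  = base;            /* reference: one long sequential run */

    for (int i = 0; i < blocks; i++) {
        toy_stream copy = base;                             /* like vslCopyStream */
        toy_skip_ahead(&copy, (uint64_t)i * n_per_block);   /* skip the copy, not the base */
        for (int j = 0; j < n_per_block; j++)
            if (toy_next(&copy) != toy_next(&seq))
                return 0;
    }
    return 1;
}
```

Because each iteration only modifies its own copy and leaves the base untouched, the iterations are independent of each other, which matters if the initialization loop carries #pragma omp parallel for. Advancing the shared base stream by N_PER_BLOCK after each copy produces the same starting offsets only when the loop runs serially.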

You also might want to have a look at the example vslskipaheadstream.c, which shows usage of the block-splitting technique and is available in the examples\vslc directory of the MKL tree.

Please, let me know if you have more questions.

Best regards,
Andrey

aani · ‎11-24-2009

Thanks Andrey