Trouble converting routine to use parallel RNG

james_B_8 · ‎09-25-2013

Hello

I have a code where I need to initialise a massive array with random numbers. For increasing problem size, this part of the program is becoming a drag. I want to use Intel VSL to do this, but I'm having trouble meshing the current problem with the indended use of the VSL routines.

Code:

[cpp]

        for (i = 0; i < N1; i++)
        {
            for (j = 0; j < N2; j++)
            {
                for (k = 0; k < N3; k++)
                {
                    DP0 = 0.0;
                    P0 = drand48() * 2.0 - 1.0;
                }
            }
        }
[/cpp]

This is my first attempt to implement it in parallel:

[cpp]

        /* Implement parallel RNG with Intel VSL */
        VSLStreamStatePtr* stream = (VSLStreamStatePtr *) malloc(nthreads * sizeof(VSLStreamStatePtr));

        // Initialise the streams
        for( i=0; i<nthreads; i++)
        {
            vslNewStream(&stream, VSL_BRNG_MT2203+i, SEED);
        }

        double t0 = omp_get_wtime();

        #pragma omp parallel for private(i,j,k,l)
        for (i = 0; i < N1; i++)
        {
            for (j = 0; j < N2; j++)
            {
                for (k = 0; k < N3; k++)
                {
                    DP0 = 0.0;
                    vdRngUniform( METHOD, stream[omp_get_thread_num()], 1, &P0, -1.0, 1.0 );
                }
            }
        }
[/cpp]

Two things are troubling me:

1. For test case of N1=N2=N3=512. The original with dran48() runs in 1.5s single threaded. The new version runs in 5s in seral. It scales with more threads. Now is this serial time difference to be expected since I'm using another BRNG? I have no idea about relative performance between the two.

2. VSL is about vectors! My implementation is scalar because I also want to initialise DP0 to 0.0 at the same time. Can the intel compiler vectorise the k loop so that the DP0 and P0 are vectorised? The vec-report is complaining about non-existance flow dependencies between DP0 and P0.

Thanks.
James

Andrey_N_Intel · ‎09-25-2013

Hello James,

The algorithms which are used underneath of Mersenne Twister and rand48 are different, and this is one reason of serial time difference. Another important reason is related to implementation of the algorithms.

The chart available at http://software.intel.com/sites/products/documentation/doclib/mkl_sa/111/vsl/functions/uniform_cont.htm shows how performance of Intel(R) MKL Uniform RNG depends on the vector length. To get additional performance gain from Statistical Component of MKL your code needs to rely on vector, not scalar way of MKL RNG use.

Appoaches to modification of the code are defined by requirements of your application, compute environment, and problem sizes. One quick way to use vector API of RNG might be as follows. The most inner loop can be split:

for ( k = 0; k < N3; k++ ) { DP0 = 0.0;}

vdRngUniform( METHOD, stream[], N3, &P0[0], -1.0, 1.0);

If N3 is say ~100,000 it would make sense to have additional loop for generation of the array of random numbers in chunks of size, say ~1000 or so, instead of one call to RngUniform with vector length 100,000.

If N3 is ~ a few dozens, it would make sense to consider other schemes to implementation of the whole loop.

Other approaches might relate to the requirement of reproducibility of the content of array P0 from run to run and specifics of the further use of random numbers. General high level suggestion is to process the next block of random numbers once it is returned by the generator, instead of generating and storing random numbers in the array, which would be processed somewhen later. This would help to improve data locality and perfomance. Whether data layout of arrays DP0 and P0 supports vectorization of the further related loops is another aspect to pay attention to.

Please, let me know, if this answers your question.

Andrey

james_B_8 · ‎09-26-2013

Hi Andrey,

Thanks for that page of performance data. The dimensions are typically range from 512 to even as high as 4096 (we have a lot of memory).

I switched the code for you suggestion:

[cpp]

for ( k = 0; k < N3; k++ ) { DP0 = 0.0;}

vdRngUniform( METHOD, stream[], N3, &P0[0], -1.0, 1.0);

[/cpp]

For the same test case that ran in 0.4s serially (so slightly faster than drand48 actually), with increased performance for increasing numbers of threads. Need to test for all problem sizes next to check scaling.

Thanks,

James