topic Hello Alina, in Intel® oneAPI Math Kernel Library

Should we generate random number (using VSL_RNG) on the fly or prior to the loop?

AThar2 — Mon, 09 Sep 2019 12:04:24 GMT

Hello,

I am currently learning about how to use random functions, and am using the mkl version VSL_RNG.

I have made this simple code which compares the efficiency with generating all random numbers at once or doing so on the fly. The code runs in parallel where I am using VSL_BRNG_WH+rank to generate a different generator for each MPI process.

For generating nmax=1e8 numbers I get the following:

time = 0.35 seconds for generating all numbers at once (n=1 setting in the code)

time = 16 seconds for generating on the fly (n=2 setting in the code)

Is this an expected behaviour. Is it generally expected that the speed is much faster for doing all generating numbers at once before entering a loop?

include 'mkl_vsl.f90'

program rnd_test

use MKL_VSL
use MKL_VSL_TYPE
use mpi

implicit none
   real(kind=8) t1,t2  ! buffer for random numbers
      real(kind=8) s        ! average
      real(kind=8) a, sigma ! parameters of normal distribution
      real(kind=8), allocatable :: r(:) ! buffer for random numbers

      TYPE (VSL_STREAM_STATE)::stream

      integer errcode
      integer i,j, n11, nloop, nn
      integer brng,method,seed,n, ierr, size, rank
      integer(kind=8) :: nskip, nmax
      call mpi_init(ierr)

      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

      n = 1
      s = 0.0
      a = 5.0
      sigma  = 2.0

      nmax = 1e8

!-----------------------------------------------------------------------
      nn = 2 ! (1): all at once. >1: on the fly
!----------------------------------------------------------------------






      nloop = 0


      if(nn>1)then
         nloop=nmax
         nn = 1
      else
         nloop=1
         nn = nmax
      endif


      allocate(r(nn))

      method=VSL_RNG_METHOD_GAUSSIAN_ICDF
      seed=777
      brng = VSL_BRNG_WH+rank

!     ***** Initializing *****
      errcode=vslnewstream( stream, brng,  seed )

      t1 = 0.
      t2 = 0.
      t1 = mpi_wtime()

!     ***** Generating *****
      do i = 1, nloop
          errcode=vdrnggaussian( method, stream, nn, r, a, sigma )
!         s = s + sum(r)
      end do

      t2= mpi_wtime()

!      s = s / 10000.0

      print*, "time: ", t2-t1
      call mpi_barrier(MPI_COMM_WORLD,ierr)
!     ***** Deinitialize *****
      errcode=vsldeletestream( stream )




end program

best

Ali

Hi Ali,

Alina_E_Intel — Tue, 10 Sep 2019 09:16:38 GMT

Hi Ali,

Per my analysis of your testcase, n is expected to be number of threads and nn – number of random variates to be generated by each thread.

The following lines use nn as the number of threads, and declare n but do not use it:

      n = 2
      nn = 2 ! (1): all at once. >1: on the fly

As result of the following lines

      if(nn>1)then
         nloop=nmax
         nn = 1

number of iterations nloop will be nmax, and only one random number will be generated in the thread:

!     ***** Generating *****
      do i = 1, nloop
          errcode=vdrnggaussian( method, stream, nn, r, a, sigma )
!         s = s + sum(r)
      end do

We do not recommend using Intel MKL RNGs for vector lengths nn less than few hundred, thus I suggest to modify the code as shown below:

      if(n>1)then
         nloop=n
         nn = nmax/n

In this case each of n threads would generate nmax/n random numbers, what is expected to improve the performance of your application.

Additionally, depending on the application we recommend generating a sequence of random numbers of size nmax/n in blocks of the fixed size (for example, 512 or so, actual block size is defined using a set of quick performance experiments on your CPU).

In this case if the generation of the random numbers is immediately followed by their postprocessing, you would see extra performance benefit due to improved data locality

Please, let me know, if my interpretation of your test case is correct and share respective results of your performance experiments with us

Best regards,

Alina

Hello Alina,

AThar2 — Wed, 11 Sep 2019 11:12:29 GMT

Hello Alina,

Thanks for the reply.

Let me just go through your assumptions on my code.

Is not the number of threads. I am using mpi to run in parallel only, hence, the only to discern between one thread/core/mpi_rank and another is

rank

This is why I am doing

      brng = VSL_BRNG_WH+rank

The idea is that I would like to use say,

nmax= 1e8

random numbers. The question is whether I generate those numbers in the beginning or on the fly of simulation. So if say I have the following

do i = 1,1e8 
    ! ... Some operations 
    v(i) = (....) + r(i) 
 enddo

Now the question is whether the random numbers contained in `r` should be generated apriori to the loop, or in buckets in the loop or for each `i` I generate one random number. For example

do i = 1,1e8 
   ! ... Some operations
   errcode=vdrnggaussian( method, stream, 1, r(1), a, sigma ) v(i) = (....) + r(1) ! -
enddo

My code is just comparing the difference between generating 1e8 random numbers at once, OR generating one random number 1e8 times.

Hence I am writing this comment

!-----------------------------------------------------------------------


     nn = 2 ! (1): all at once. >1: on the fly

!----------------------------------------------------------------------

If you set nn=1, all random numbers will be generated at once and therefore I set the nloop to 1. Otherwise, if nn > 1, we do the opposite.

To your answer:

I see that the instance where we generate one random at a time is not recommended. While I should rather generate them all.

To summarise: I have two questions.

1) What if my application requires different lower and upper bounds for the random function in the same loop. For example, for the gaussian distribution I literally have a routine that runs from 1 to n where my a and sigma varies. I need to calculate those for each loop iteration.

Can you explain what you meant here:

Additionally, depending on the application we recommend generating a sequence of random numbers of size nmax/n in blocks of the fixed size (for example, 512 or so, actual block size is defined using a set of quick performance experiments on your CPU).

Thanks again

best

Ali

Alina, Please can you also

AThar2 — Wed, 11 Sep 2019 11:31:57 GMT

Alina, Please can you also let me know if it completely valid to use same stream for different random functions. For example, I am currently using both a uniform and gaussian distribution for the same application using the same stream, generator etc.

Hi Ali,

Alina_E_Intel — Thu, 12 Sep 2019 07:09:58 GMT

Hi Ali,

Thank you for providing additional explanations of the parameters of your application. Answering your questions:

What if my application requires different lower and upper bounds for the random function in the same loop. For example, for the Gaussian distribution I literally have a routine that runs from 1 to n where my a and sigma varies. I need to calculate those for each loop iteration.

Usage of Intel MKL RNGs is not optimal for vector length 1 as it results into extra overhead and does not use vectorization opportunities of hardware. To address your use case, I suggest generating blocks of random variates r from the Gaussian distribution with zero expectation and unit standard deviation and then scale them r * sigma + a

Can you explain what you meant here: Additionally, depending on the application we recommend generating a sequence of random numbers of size nmax/n in blocks of the fixed size (for example, 512 or so, actual block size is defined using a set of quick performance experiments on your CPU)

Let’s assume that each thread (or MPI process) is expected to generate visible number of random variates defined by the external parameters such as number of threads (e.g., 100) and the total amount of required of random numbers (100 000 000). One option is to uniformly split generation of the portions of the random numbers between those threads, and each thread will generate 1 000 000 variates by calling Intel MKL RNG. This however may be suboptimal when you expect postprocessing of the numbers: the numbers would be washed out from say L1 cache after the generation, and stage of the postprocessing would wait when the numbers are back. Another option is split generation of 1 000 000 variates into blocks of quite small size of say 1000 and immediately process them. You can do it in the loop of 1000 iterations. In this case, the numbers are in L1 cache, and respective cache misses on the stage of the postprocessing are minimized. The actual block size will depend on the characteristics of your CPU and problem settings

Please can you also let me know if it completely valid to use same stream for different random functions. For example, I am currently using both a uniform and gaussian distribution for the same application using the same stream, generator etc

Yes, it’s completely valid use scenario as the internal state of underlying basic random number generator which serves as the source of uniform random numbers for the distribution generators is updated after each call. We however do not recommend using the same stream to produce random numbers in different threads. In this scenario, please use respective interfaces for initialization of the generators for the use in the parallel mode.

Please, let me know, if it answers your questions and if you need more help on the use of Intel MKL RNGs from our side.

Best regards,

Alina

Hello Alina,

AThar2 — Thu, 12 Sep 2019 08:36:43 GMT

Hello Alina,

Your answers are right to the point! Very useful

Thanks very much.

best regards

Ali