Solved: Hello Victor,

victor_d_1 · ‎12-16-2016

I'm experiencing a consistent performance issue with MKL random number generators. I'm generating a uniform distribution of 500 Million floating-point or double-precision floating-point numbers in 0.0 to 1.0 range. The first time the performance in N numbers/second, and the second and subsequent times the performance is either 2N or 3N numbers/second. With this many elements being generated, this doesn't seem like a caching issue, but some sort of CPU/core affinity issue.

Has anyone else experienced this? Any ideas? I hate to miss out on this kind of performance level.

McCalpinJohn · ‎12-19-2016

If the original test case did not instantiate the pages for the 500 million element array (e.g., as was done with the memset() call), then the time required for the first MKL call will include the time for the OS to instantiate those pages. This can be a significant overhead -- the actual cost depends on the OS, the processor family, the number of processors, etc.

View solution in original post

mecej4 · ‎12-17-2016

I doubt that there is anything anomalous here.

Special tricks are often employed to increase the speed or period of RNGs. One of these tricks is to fill a table with RNs, and combine the values in the table with RNs from a different, often staggered, sequence from the same or another RNG, plus some shift and XOR operations. In the initial call to the RNG, this table has to be filled, just as a cache has to be filled before you see benefits from using a cache.

See, for example, http://www.cs.yorku.ca/~oz/marsaglia-rng.html . Marsaglia's SuprKiss32 and SuprKiss64, as offered at http://forums.silverfrost.com/viewtopic.php?t=1480 , have to prime the RNG by filling a table of over 20,000 RNs at first. Then, periodically, when the table is used up, it has to be refilled. Most of Marsaglia's development work was done decades ago, when CPUs were slow and high speed cache memory, even if available, would not have improved their performance.

There is a bit of humor in your post. If the RNG had been somewhat slower or had a shorter period, you would not have complained! We humans want our programs to be predictable and unsurprising.

victor_d_1 · ‎12-17-2016

At 500 Million floating-point numbers, running on a 4-core CPU, with 6MB combined cache, the tricks of multiple passes should not make 3X performance difference, as 500 Million elements at 4 bytes each for a total of 2GBytes of data to generate, which is over 300X bigger than the cache. Here are the real performance numbers for several different kinds of RNG algorithms:

MKL: VSL_BRNG_SFMT19937

MKL::vsRngUniform of RNG type 13,631,488 runs at 410,172,272 floats/second
MKL::vsRngUniform of RNG type 13,631,488 runs at 1,333,333,333 floats/second
MKL::vsRngUniform of RNG type 13,631,488 runs at 1,597,444,089 floats/second
MKL::vsRngUniform of RNG type 13,631,488 runs at 1,683,501,683 floats/second
MKL::vsRngUniform of RNG type 13,631,488 runs at 1,524,390,243 floats/second

MKL: VSL_BRNG_MCG31
MKL::vsRngUniform of RNG type 1,048,576 runs at 627,352,572 floats/second
MKL::vsRngUniform of RNG type 1,048,576 runs at 1,457,725,947 floats/second
MKL::vsRngUniform of RNG type 1,048,576 runs at 1,392,757,660 floats/second
MKL::vsRngUniform of RNG type 1,048,576 runs at 1,392,757,660 floats/second
MKL::vsRngUniform of RNG type 1,048,576 runs at 1,388,888,888 floats/second

MKL: VSL_BRNG_PHILOX4X32X10
MKL::vsRngUniform of RNG type 16,777,216 runs at 477,554,918 floats/second
MKL::vsRngUniform of RNG type 16,777,216 runs at 800,000,000 floats/second
MKL::vsRngUniform of RNG type 16,777,216 runs at 800,000,000 floats/second
MKL::vsRngUniform of RNG type 16,777,216 runs at 821,018,062 floats/second
MKL::vsRngUniform of RNG type 16,777,216 runs at 819,672,131 floats/second

The same array is being used between iterations. Thus, even memory alignment does not explain this level of inconsistency.

It would be sad to not have a root cause explanation for this one, as GPUs generate RNs at a much higher rate and consistently in performance iteration after iteration. 2X to almost 4X in performance is hard to overlook, as this would make MKL more competitive with GPU performance for RNG.

victor_d_1 · ‎12-17-2016

I forgot to mention that I'm running the parallel version of MKL on Windows.

victor_d_1 · ‎12-17-2016

Another clue to the issue is when I add a memset() to clear the 500 Million element array before each iteration of generating RNs, the performance becomes consistently high - even the first iteration performs well. It's as if having the last portion of the array in the cache at the starting point helps dramatically increase performance in some way.

mecej4 · ‎12-17-2016

Have you tried running Intel Amplifier or another profiling tool to locate the hotspots?

It would also be useful to have particulars on your hardware+OS and instructions from you on how to build and run a test case.

victor_d_1 · ‎12-17-2016

Not sure what information Intel Amplifier would provide since the issue is inside a single call to the MKL library

"returnCode = vsRngUniform(VSL_RNG_METHOD_UNIFORMBITS32_STD, stream, NumRandomValues, pRandArray_32f, 0.0f, 1.0f);"

and I don't have the source to the MKL library (is MKL open source?).

My setup is an ASUS laptop, running Windows 10, VisualStudio 2015, Intel Core i5-6300HQ quad-core CPU, NVidia 950M GPU, 8 GBytes of system memory.

It's weird that clearing the memory array before calling the above function for RNs of floating-point values, this seems to help performance substantially, but not in the case of the double-precision floating point equivalent function

"vdRngUniform(VSL_RNG_METHOD_UNIFORMBITS64_STD, stream, NumRandomValues, pRandArray_64f, 0.0, 1.0);"

McCalpinJohn · ‎12-19-2016

If the original test case did not instantiate the pages for the 500 million element array (e.g., as was done with the memset() call), then the time required for the first MKL call will include the time for the OS to instantiate those pages. This can be a significant overhead -- the actual cost depends on the OS, the processor family, the number of processors, etc.

victor_d_1 · ‎12-19-2016

Thank you John for the perfect explanation of the performance phenomena! The universe makes sense once again.

This really helped to provide a more fair performance comparison between variety of random number generators (on the blog page below):

https://duvanenko.tech.blog/2016/12/07/random-number-generator/?iframe=true&theme_preview=true

McCalpinJohn · ‎12-21-2016

The GPU performance is pretty impressive here....

It would be helpful to change the title of the Comparison Chart to make it clear that the results are for the MT19937 Mersenne Twister for each of the four generation methods.

I am a bit surprised that the MKL 32-bit float results are so much faster than the 64-bit float results, given that both of these appear to be based on the same underlying 32-bit integer PRNG (e.g., https://software.intel.com/en-us/node/590405).

It can be helpful to note the execution time as well as the rate -- this makes it a little bit easier to think about the relative costs of generating the pseudo-random numbers and transferring them between the GPU and system system memory.

For example, the generation rate of 4.7 Gfloats/sec for MT19937 corresponds to an execution time of 0.213 ns/float. For a PCIe gen3 x16 interconnect, unidirectional sustained data transfer rates max out at about 14 GB/s, so floats will take about (4 Bytes/(14 GB/s)) = 0.286 ns/float to transfer back to system memory. (I don't know if the graphics GPUs use full-width PCIe interfaces -- the time would double to 0.571 ns/float with a PCIe gen 3 x8 interconnect.) With the x16 interface the sustained rate including transfer time would be 1 float/(0.213ns + 0.286ns) = 2.00 Gfloats/sec, while with the x8 interface the sustained rate would be about 1 float/(0.213ns + 0.571ns) = 1.28 Gfloats/sec. The latter number is about 15% slower than using the CPU alone. (These estimates assume no overlap between PRNG generation and data transfer. A double-buffering scheme could probably be used to overlap much/most of the execution time, leaving performance limited primarily by transfer rates back to system memory.)

Andrey_N_Intel · ‎12-22-2016

Hello,

Below are considerations and recommendations related to use of Intel MKL RNGs in the applications

- many Monte Carlo applications are massively parallel, and generation of the random numbers is followed by their application specific post-processing.This observation gives the opportunity to apply different types of the optimizations to the code such as parallelization and vectorization. One such optimization option is to generate in parallel relatively small arrays of random numbers and immediately run post-processing while the numbers are in L1 cache.This approach excludes the need in the intermediate transfer of the random numbers to/from the memory

- From this perspective, generation of 500M numbers using one call to Intel MKL generator may not represent such Monte Carlo applications

- Intel MKL RNGs were designed keeping in mind simple ideas described above. Most of the RNGs (with few exclusions) in the library are serial. The library supports service capabilities such as SkipAhead, Leapfrog or BRNG family (such as >6K MT2203 BRNGs) that allow parallelizing the application on the user level. In addition to other advantages, this fact highlights flexibility in the choice of the parallelization technology. Intel MKL RNG based application can rely on OpneMP*, Intel TBB or native threads. It would be sufficient to have the local small array in each thread to store results of random number generation. Once the generation is done, thread can immediately apply post-processing. In some/many cases the post-processing can be vectorized using capabilities of Intel compiler or features such as Vector Math available in Intel MKL. After generation and processing of the current block of random numbers the thread generates the next array of numbers etc.

- Pseudo-code shows the ideas above

nCores = 24;

nSamples = 500000000;
nSamplesPerCore = nSamples / nCores;
blockSize = 1024; // can be adjusted for specific architecture and app
#pragma omp parallel for
for ( k=0; k< nCores; k++ )
{
VSLStreamStatePtr stream;
vslNewStream( &stream, VSL_BRNG_MT2203+k, seed );
float r[blockSize];
nIterations = nSamplesPerCore / blockSize;
for ( j = 0; j < nIterations; j++ )
{
     vsRngUniform( VSL_RNG_METHOD_UNIFORM_STD, stream, blockSize, r, 0.0f, 1.0f);
     // post-process r
    ...
}
   vslDeleteStream ( &stream );
}
...
- Of course, proper use of Intel MKL RNGs in Monte Carlo applications is defined by multiple factors such as period of the generator, reproducibility of the final result between serial and parallel versions of the code, algorithmic flow, etc and can require a different approach to the optimization

Hope this helps

Thanks, Andrey

mecej4 · ‎12-22-2016

MODERATOR:

Please correct spelling of "anomoly" -> "anomaly" in title, so future searches will find this thread properly.

victor_d_1 · ‎01-31-2017

Hello Andrey,

Thank you very much for the detailed explanation of the RNG facilities in MKL, their design and support for multi-threading. Your pseudo-code was also very useful. I used it as a starting point and have parallel version going. It's scaling well, giving a linear speed up when running in-cache, as expected. I haven't tried to grow the array to be in system memory yet, to see how well it will scale in that case. I am running into issue with it as only VSL_BRNG_MT2203 and VSL_BRNG_WH algorithms seem to support it and all others produce the same error of:

Intel MKL ERROR: Parameter 2 was incorrect on entry to vsRngUniform.

Do other algorithms require a different methodology for parallel implementations?

For parallel implementation VSL_BRNG_MT2203 is performing well at over 7.6 GigaFloats/sec, which compares well with the GPU random number generation shown in

https://duvanenko.tech.blog/2017/01/29/gpu-random-number-generators/

I plan to post these benchmark results once I get all of the MKL algorithms working in parallel that support parallelism.

I also updated the benchmarks of all MKL sequential RNG algorithms in a table that shows performance in each level of cache and system memory. Plus, I upgraded my laptop to have two channels of DDR4 memory for increased system memory performance.

https://duvanenko.tech.blog/2016/12/07/random-number-generator/

One of my personal use of RNG generators may be a bit unique (and may be not) as I use them to test sequential and parallel sorting algorithms, where a very large array of random numbers of different data types is generated, and then sorted. Here are some fun results of this

https://duvanenko.tech.blog/2016/12/07/sorting-algorithms/

At the bottom of this blog page is a link to an interactive version of the graph that's fun to explore and compare individual algorithms.

Anyhow, in this particular use of RNGs, the entire array and at times with billions of elements needs to be generated quickly, and is actually the limiting factor in the speed of testing. There are probably other applications with similar needs, where MKL provides a fabulous and huge level of acceleration over the standard C++ random number generation facilities. MKLs team efforts are truly appreciated by those of us that care about performance and capabilities!

Thank you,

-Victor

Andrey_N_Intel · ‎02-01-2017

Hello Victor,

Thanks for sharing the details. Depending on basic random number generator available in Intel MKL, different approaches to parallelization are supported. Say, MT2203 is a family of BRNGs, while MCG59 supports LeapFrog and SkipAhead() techniques. Please, have a look at the section in the Notes for Intel MKL Vector Statistics that discusses this topic in details, https://software.intel.com/en-us/node/590367. The document also explains which technique is supported by each generator, see for example, https://software.intel.com/en-us/node/590403

Additionally, it would make sense to have a look at performance data we publish for serial version of the distribution generators available at https://software.intel.com/sites/products/documentation/doclib/mkl/vs/vs_performance_data.htm

Can you please provide the short test example that demonstrates that "MKL ERROR"? It would help us to investigate and fix it, if any

Thanks,

Andrey

victor_d_1 · ‎02-02-2017

Hello Andrey,

Thank you for the links. I was hoping you would provide some useful pointers - and you did just that! I've seen the performance page before and will add it to my blog, along with correlation with my measurements.

Here is the parallel code that is having issues. It's based on your pseudo-code above. It only works for two RNG algorithms I mentioned above, and produces the same error for the rest of them.

int mklRandomFloatParallel_forIntel(int NumRandomValues, unsigned int seed, int RngType, unsigned int numTimes)
{
int nCores = 4;

   int nSamples = NumRandomValues;
   int nSamplesPerCore = nSamples / nCores;
   const int blockSize = 1024;                   // can be adjusted for specific architecture and app

#pragma omp parallel for
   for (int k = 0; k < nCores; k++)
   {
       double average = 0.0;
       VSLStreamStatePtr stream;
       //vslNewStream(&stream, VSL_BRNG_MT2203 + k, seed);       // original
       vslNewStream(&stream, RngType + k, seed);
       float r[blockSize];
       int nIterations = nSamplesPerCore / blockSize;
       int returnCode;

       for (int j = 0; j < nIterations; j++)
       {
           returnCode = vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, blockSize, r, 0.0f, 1.0f);
       }

       std::wstringstream msg;
       for (int j = 0; j < blockSize; j++) {
           average += r;
       }
       average /= blockSize;
       msg << "Sample mean of distribution = " << average << endl;
       wcout << msg.str();

       vslDeleteStream(&stream);
   }
   return 0;
}

Here is the function call that produces the error:

mklRandomFloatParallel_forIntel(NumRandomValues, seed, VSL_BRNG_MT19937, NumTimes);

Intel MKL ERROR: Parameter 2 was incorrect on entry to vslNewStream.
Intel MKL ERROR: Parameter 2 was incorrect on entry to vslNewStream.
Intel MKL ERROR: Parameter 2 was incorrect on entry to vslNewStream.

with one out of 4 tasks completing successfully.

Let me know what you find out,

-Victor

Andrey_N_Intel · ‎02-02-2017

Hi Victor,

The code above is functional for two BRNGs only, MT2203 and WH that represent the family of basic random number generators and support indexing scheme above. The other BRNGs that support parallel simulations provide either SkipAhead or Leapfrog techniques. Please, have a look at Intel MKL example vslskipaheadstream.c that gives the idea how to apply SkipAhead(). See also https://software.intel.com/en-us/node/521867 that explains the technique in more details. Please, remember to check which technique is supported by the specific BRNG. Let me know, if you need additional help here.

Also, as aside comment. When you compute nIterations as nSamplesPerCore / blockSize, you will need properly process the "tail", nTail = nIterations * blockSize - nSamplesPerCore and generate respective number of samples given nTail <> 0. For simplicity you can always choose nSamplesPerCore and blockSize such that nTail = 0

Thanks,

Andrey

victor_d_1 · ‎02-20-2017

Hi Andrey,

Thank you for the pointers to the MKL RNG skip ahead and leap frog support - these are quite powerful and make parallel versions possible. Here are some performance results for Float and Double RNG on a quad-core processor with dual-channels of DDR4. I'm seeing some strange performance behavior with the two Mersenne Twister algorithms as you'll see in the charts, but only for double. Thank you for pointing out the need to handle the tail condition.

https://duvanenko.tech.blog/2017/02/17/cpu-parallel-random-number-generator/

Also, bandwidth of about 14 GBytes/sec seems to be the limit for system memory for these routines, while my laptop is supposed to support 34 Gbytes/sec (theoretical peak). Do you know what causes the RNG algorithms to top out at 14 GBytes/sec instead?

Thank you,

-Victor

Andrey_N_Intel · ‎02-21-2017

Hi Victor,

thanks for sharing the details. It makes sense to provide extra technical details in your blog such as version of Intel MKL library (including linkage details etc), performance measurement methodology, number of random variates to generate, etc, in addition to performance data. It would help those readers who are interested in the results to repeat the measurements on their computers

For your observation related to scalability of single precision and double precision MT19937. Skip-ahead() technique for MT19937 and SFMT19937 is known to bring visible performance benefit when number of elements to skip is very big, say, order of millions. Recently, we improved the performance for MT19937 when nskip parameter is quite small, say, several hundred or so. Having said this, I have few questions: what Intel MKL version did you use, what was nskip parameter for MT19937, and did you include SkipAhead() into timing? Anyway, given all parameters but the precision of the performance experiment are identical we might expect similar scalability results for both double and single version of the generator.

For your second bandwidth related question, to confirm that 14 GBytes/sec makes sense for your system, the additional analysis may be necessary. Do you have chance to run your performance experiments an a server to get the idea about bandwidth for more powerful HW configuration?

Thanks, Andrey

Performance anomaly with Random Number Generators...