BoxMuller Random Deviate differences between processors?

mattwatt · ‎06-23-2011

Hello,

I'm generating random deviates using the Intel MKL and have observed differences in the results when using a Xeon when compared to a Core i7. Has anyone else observed this? Is the level of error observed expected? I don't have any differences when using the uniform or ICDF transformations, only the BoxMuller.

With a Xeon, I get1.682149939433206

With an i7, I get 1.682149939381822

I'm using Windows Server 2008 with Visual Studio 2010.

The code I'm using:

#include

#include "mkl_vsl.h"

int main (){

double s;

double r[2];

VSLStreamStatePtr stream;

/* Initializing */

s = 0.0;

vslNewStream( &stream, VSL_BRNG_MT2203, 0x345);

/* Generating */

vdRngGaussian( VSL_RNG_METHOD_GAUSSIAN_BOXMULLER, stream, 2, r, 0.0, 1.0 );

s = r[0];

/* Deleting the stream */

vslDeleteStream( &stream );

/* Printing results */

printf( "Random deviate 1: %1.15f\\n", s );

system("pause");

return 0;

}

Thanks,

Matt

mecej4 · ‎06-23-2011

The two values that you showed agree to 32 bits in the mantissa. Therefore, if the normally distributed RNs were derived from a 32-bit integer RNG, there is no difference to speak about. You can search the documentation to see the details of the specific RNG used (VSL_BRNG_MT2203).

mattwatt · ‎06-23-2011

I see what you're saying, but I still find it intriguing.

Based on your comment, I used the vsRngGaussian function to again compare results. What I found is that the results are identical between the two processors, even beyond the mantissa (ie: both show a result of 1.682149887084961). Does this suggest that the distribution function is generating numbers consistently, but the conversion from single to double via the vdRngGaussian function is introducing error?

mecej4 · ‎06-23-2011

>the results are identical between the two processors, even beyond the mantissa

Sorry, you are overlooking the fact that there is a conversion from single- or double-precision IEEE binary to decimal when you ask for printout as a decimal string. Only numbers that are equal to a 23-bit integer multiplied by a power of 2 are exactly representable in IEEE-float. Therefore, there is a distinct possibility that the differences you see are attributable to differences in the printf float->decimal conversion routines.

>but the conversion from single to double via the vdRngGaussian

I don't know that there is such a conversion, nor do I know the internals of MKL. Here is an example code fragment from one Box-Muller code. Note that both log and sqrt are involved. Therefore, CPUs with different FPUs can give slightly different results, if the compiler flags allow such deviations from the IEEE standard.

[cpp] do {
         x1 = 2.0 * ranf() - 1.0;
         x2 = 2.0 * ranf() - 1.0;
         w = x1 * x1 + x2 * x2;
 } while ( w >= 1.0 );

 w = sqrt( (-2.0 * log( w ) ) / w );
[/cpp]

Andrey_N_Intel · ‎06-24-2011

Hello Matt,
Intel MKL Random Number Generators are developed using the calls to the optimized Basic Random Number Generators (BRNGs) and Vector Math Functionswhich are usedfor transformation of BRNG outputs.
We optimize Vector Math functionstoget the good performance of the routinesfor each CPU generation.Vector Math functionswork in 3 accuracy modes: High Accuracy, Low Accuracy, and Enhanced Performance. The accuracy is CPU specific, and you can see slightly different results returned by the same function between different CPUs. Nevertheless, the differences remain within specific error bounds, and each function meets accuracy requirements oneach CPU generation.
You can find more details about accuracy modes in Intel MKL Manual, Chapter "Vector Math Functions" and the document "VML Accuracy and Performance Data" available at http://software.intel.com/sites/products/documentation/hpc/mkl/vml/vmldata.htm
Hope this explains your observations.

Wouldyou please clarify why you care about the slight difference in RNG outputs on different CPUs?

Thanks,
Andrey

Sergey_M_Intel2 · ‎06-24-2011

Hi Matt,

Yes, this is expected behavior. MKL functions, including RNG, are optimized for different processor generations. At run time MKL detects the processor it is running on and selects the best implementation for that processor. These implementations might be algorithmically different but still the accuracy is controlled in such a way to have negligible effect on the application.

In your case the results are identical to 10 decimal digits. This is more than sufficient in Monte Carlo where statistical error dominates. Also such deviations do not statistically worsen the generated numbers. All VSL RNGs are extensively tested by the battery of statistical tests.

As to why you don't observe this on ICDF - this is just your luck :-). For certain inputs it is possible that ICDF also shows minor deviations in results on different CPUs.

Feel free to ask more questions and I will gladly answer those,
Sergey

mattwatt · ‎06-24-2011

Hi Andrey, Sergey,

I'm interested in this problem because I work for a firm in the financial services sector and we're planning on using the Azure cloud to perform most of our calculations. The discrepancy I observed was simply between two sets of developer workstations (mine, a Xeon, and everyone else's, the i7's). This came up after a unit test failed. I'm also exploring a more significant difference between processors (1e-5) with an optimization problem.

We will be using Monte Carlo simulations to drive many of our calculations. Given that we are a vendor, we would like to ensure that our clients will receive the same result when the same computation is performed. While the discrepancy may be small, any lack of consistency is problematic for us. A portfolio worth $500 million would be worth 40 billion yen, where the current level of error could show a difference in the final digit or two, depending on the processor. Yes, it's small, but it could raise eyebrows.

I looked at the document you referenced, and the default behavior is High Accuracy, meaning there should beerror less than 1.0 ulp (unit-in-the-last-place). This tolerance is met, given the use of a 32-bit random stream of integers, which presumably defines the number of significant digits for the rest of the problem...

Initially, I was under the impression that the generation method,VSL_RNG_METHOD_GAUSSIAN_BOXMULLER, was single precision, but the generator was double precision. This didn't make sense. I assume there is confusion over the naming convention--VSL in this case meaning Vector Statistics Library, while vslNewStream (for example) does in fact denote single precision. The only 32-bit operation comes from the stream itself, no?

Ultimately, it seems that when working with variants of processors, it would be best to use the single precision function (vsRngGaussian, etc...). Given the differences observed, and the certainty that an input to the double precision function is itself single precision (there are no double precision streams, are there?), is the double precision function handicapped by typecasting? Is there any point to having the double precision functions?

Again, thanks for your responses,

Matt

mattwatt · ‎06-24-2011

One further note...

When I generate a set of 2 uniform random numbers and compute the Box Muller transformation myself, the results agree between the two processors.

Sample code...

#include

#define _USE_MATH_DEFINES

#include

#include "mkl_vsl.h"

#include "mkl_vml_functions.h"

int main ()

{

double n[2];

VSLStreamStatePtr stream;

vslNewStream( &stream, VSL_BRNG_MT2203, 0x345);

vdRngUniform( VSL_RNG_METHOD_UNIFORM_STD_ACCURATE, stream, 2, n, 0.0, 1.0 );

vslDeleteStream( &stream );

double u1 = n[0];

double u2 = n[1];

double x = sqrt(-2*log(u1))*sin(2*M_PI*u2);

printf( "Box Muller Manual 1: %1.15f\n", x );

system("pause");

return 0;

}

< 1.682149949310261="">

Sergey_M_Intel2 · ‎06-27-2011

HiMatt,

Thank you for additional detail.

If you call vdRngGaussian then all underlying floating-point arithmetic is double precision, not single precision. There is no confusion here. vslNewStream (and many other service routines) is precision independent - it initializes underlying BRNG (state and a few other parameters) which is integer sequence generator. All precision-dependent RNGs in VSL start with prefix "vs" for single precision and "vd" for double precision. So you have full control over VSL precision. Double precision does not map to single precision in any of VSL RNGs.

Unlike VML you do not have control over VSL accuracy. Yes, some VSL RNGs use VML kernels underneath but those accuracyis notcontrolled by VML mode. There isno <1 ulp guarantee for VSL RNGs. Perhaps that was confusing partin Andrey's response. VSL RNG accuracy is relaxed but still more than enough for Monte Carlo simulations.

To summarize, the differences you see likely comes from different code paths for Xeon and i7. Unerlying integer BRNG sequence is identical for Xeon and i7, but the floating-point Box-Muller transformation (double precision if you call vdRngGaussian)implementations may be slightly different for Xeon and i7 to gain maximum performance on each of these processors.

Changing vdRngGaussian to vsRngGaussian will not solve thebit-to-bit difference. Sorry. None of VSLRNGguarantees bitwise identical result across CPU generations. The only resolution with guaranteed result is running the same code path on both Xeon and i7. Unfortunately there are no user controls to fix the code path in current MKL product.

I will bring your issue to our technical support team so that it is properly tracked. We will consider options how we can address your needs.

Thank you for you time,
Sergey

Sergey_M_Intel2 · ‎06-27-2011

HiMatt,

Additional comment on your code example illustrating Box-Mullerimplementation in C (avoiding call to vdRngGaussian).

Unfortunately there is still no guarantee that the result is bit-to-bit identical across CPUs. Modern optimizing compilers (such as Intel C++/Fortran compilers) can do code versioning for different data layouts (aligned/not aligned) or CPUs. In other words extra action is required on the user side to ensure guaranteed cross-CPU consistency. Compiler math library functions may have different code paths for different CPUs, so there is no guarantee that this line provides bitwise identical output:
double x = sqrt(-2*log(u1))*sin(2*M_PI*u2);

To guarantee cross-CPU consistency of math library functions with Intel compilers you should compile your application with the switch -fimf-arch-consistency=value[:funclist]. You need also control data layouts in memory so that the same code path is used for such layout.

You can find additional details on this topic by following this link http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/copts/common_options/option_fimf_arch_consistency.htm

Please be noted that setting -fimf-arch-consistency=true will have performance impact.

I hope that this helps. We'll see how we can address your needs within MKL functions also.
Regards,
Sergey

mattwatt · ‎06-28-2011

Thanks Sergey. Let me know if you need to know anything more about our specific implementation or usage.

--Matt