I have a client that is migrating a large C++ software base from 32- to 64-bit code in MS Visual Studio. One of the problems they are having is that the 32- and 64-bit versions of the C library exp() function produce results that differ by 1ulp for some operands, and this is causing regression tests to fail. One potential solution I am considering is to use Intel MKL instead of the Microsoft library. So I have a few questions:
1. Do the 32-bit and 64-bit builds of MKL produce identical results for exp() and other transcendental functions, for all operands, assuming that SSE2 is enabled for our 32-bit code?
2. Although the client has mostly Intel hardware, I believe they have a few AMD Opteron-based server farms. Does MKL work on Opterons? If so, are there any performance penalties if MKL is used in place of the Microsoft library?
3. Is there any way of getting the Microsoft .NET framework to use MKL? I assume it may have the same 32/64-bit differences, although I haven't tested that yet.
4. What other benefits might my client gain by switching to MKL?
Thanks in advance - dc42
If I understand the question you have posed, you would have to call an MKL exp function explicitly (possibly by macro substitution) to use MKL as a solution. MKL doesn't provide scalar math functions, although you could treat a scalar call as a vector of length 1. You would choose the highest-accuracy version; fast vector exponentials normally permit more than 1 ULP of variation.
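For illustration, a scalar wrapper along these lines should work (a minimal sketch using the VML functions vdExp and vmlSetMode; check the MKL documentation for the exact header and link line for your version):

    #include <mkl.h>
    #include <stdio.h>

    /* Scalar wrapper around MKL's vector exponential, treating the
       operand as a vector of length 1 as suggested above. */
    double mkl_exp(double x)
    {
        double y;
        vdExp(1, &x, &y);    /* vector exp, n = 1 */
        return y;
    }

    int main(void)
    {
        /* Request the high-accuracy (HA) VML mode; the faster LA and
           EP modes permit larger ULP errors. */
        vmlSetMode(VML_HA);
        printf("%.17g\n", mkl_exp(0.0066998698389698335));
        return 0;
    }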
You might consider whether a trial of Intel C++ (which includes MKL) would determine whether there is a better solution. Intel C++ has its own library of math functions and supports the option /Qimf-arch-consistency:true, which requests math-function versions that give the same result on AMD and Intel platforms (optionally restricted to a list of functions). You would also want to set an /arch option, such as /arch:SSE3, that works on all your platforms. You may also need the option /fp:source (the Windows spelling of -fp-model source), which disables vectorization optimizations where data alignment could produce differences of 1 ULP.
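By way of illustration, a hypothetical command line combining those options might be (option spellings vary between compiler versions, so check the documentation for yours):

    icl /arch:SSE3 /Qimf-arch-consistency:true /fp:source mysource.cpp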
MKL libraries (like compiled C++ code) work as unmanaged code under .NET.
Tim, thanks for your reply; that was very helpful. Compatibility between results on AMD and Intel processors is not a big issue for us; we've found that such differences are rare and we have worked around them. Compatibility between 32- and 64-bit results matters much more. Based on your reply, the Intel compiler might help us, but MKL itself would not.
Sergey, we haven't tried the evaluation version of MKL yet. It would take at least two days to rebuild the code base using MKL and run the tests, and the purpose of my question was to determine whether that would be a worthwhile exercise. I didn't call _set_SSE2_enable because, according to Microsoft's documentation, I don't need to: it is the default on SSE2-enabled processors. I did check with the debugger that execution follows the SSE2 code path inside the exp() function. Disassembling the exp() functions revealed that the Microsoft 64-bit and 32-bit SSE2 implementations use different algorithms.
Hi Sergey, in case you still want to verify this: I am using Visual Studio 2010 under Windows 7 64-bit. Some of the operand values for which the Microsoft implementations of exp() return different results in the 32-bit SSE2 and 64-bit builds are:
-0.083296271058701077
0.0066998698389698335
0.0066998698389698344
0.0066998698389698352
Hi illyapolak, is _Clexp the one that gets called if you have SSE2 code generation and intrinsic functions enabled?
Sergey touched on a good point. With the Microsoft compiler, you need to set /arch:SSE2 for the 32-bit compiler to generate SSE2 code, as the x64 compiler does by default. In VS2012 you also have the option to generate AVX code with /arch:AVX. I don't know the details of how these options affect the Microsoft math libraries, but you could expect numerical differences, particularly with float data types, between default 32-bit non-SSE code and 64-bit SSE2 code, even if the math functions were identical.
The Intel compilers changed the IA-32 default to /arch:SSE2 a few years ago, so that the IA-32 and Intel 64 defaults match. Only the 32-bit compilers offer the choice of x87 code, which is the default for Microsoft CL and is invoked by /arch:IA32 for Intel ICL. The latter option may give you the same math functions that CL uses by default.
I do not know when the compiler chooses to call the CI version of the exp function. By looking at the msvcrt DLL I can see that the two versions have different entry points and different ordinals. The two implementations have different prologues, but the main calculation block is the same. Looking at the code, I can see that some kind of Horner scheme is used.
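For anyone unfamiliar with the term: a Horner-scheme exp() typically reduces the argument by a multiple of ln 2 and then evaluates a polynomial with nested multiply-adds. The following is only a generic sketch of that structure; the reduction and the (Taylor, not minimax) coefficients are illustrative, not what msvcrt actually uses.

    #include <math.h>
    #include <stdio.h>

    /* Illustrative range reduction + Horner polynomial for exp(x).
       Real library code uses carefully derived minimax coefficients
       and a two-part (high/low) reduction; this only shows the shape. */
    double exp_sketch(double x)
    {
        const double LN2 = 0.69314718055994530942;
        int k = (int)floor(x / LN2 + 0.5);   /* x = k*ln2 + r */
        double r = x - k * LN2;              /* |r| <= ~ln2/2 */

        /* Degree-5 Taylor polynomial for exp(r) in Horner form */
        double p = 1.0 + r * (1.0 + r * (1.0/2 + r * (1.0/6 +
                       r * (1.0/24 + r * (1.0/120)))));

        return ldexp(p, k);                  /* scale by 2^k */
    }

    int main(void)
    {
        printf("%.17g\n%.17g\n", exp_sketch(1.0), exp(1.0));
        return 0;
    }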
Hi Tim, we're using VS2010 and we're already setting /arch:SSE2 in an attempt to get the 32- and 64-bit results to match as closely as possible. I have verified with the debugger that the SSE2 code path for the 32-bit exp() is executed.
Hi illyapolak, yes, each of those 3 operands differs from the next in the series by only 1 ulp. All of them produce the same value for exp() within a given build, but the 32-bit SSE2 and 64-bit results differ by 1 ulp. The problem we have is not that some of the results are more than 0.5 ulp out; it's the lack of reproducibility between the 32- and 64-bit builds of the applications, which means that test results have to be inspected manually.
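If it helps to automate that inspection, a small helper along these lines (my own sketch, assuming both results are finite doubles of the same sign) can report the ULP distance between the two builds' results:

    #include <stdint.h>
    #include <string.h>

    /* ULP distance between two finite doubles of the same sign.
       For positive IEEE-754 doubles the integer interpretation of
       the bit pattern is monotonic, so adjacent doubles differ by 1. */
    int64_t ulp_distance(double a, double b)
    {
        int64_t ia, ib;
        memcpy(&ia, &a, sizeof ia);
        memcpy(&ib, &b, sizeof ib);
        return ia > ib ? ia - ib : ib - ia;
    }

A test harness could then flag any case where ulp_distance(result32, result64) != 0 instead of inspecting the output by hand.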
>>>I assume it may have the same 32/64-bit differences, although I haven't tested that yet.>>>
I think you can expect the same differences, because the .NET runtime will probably call the C runtime functions.
>>>The problem we have is not that some of the results are more than 0.5 ulp out; it's the lack of reproducibility between the 32- and 64-bit builds of the applications, which means that test results have to be inspected manually>>>
I think this may be a silly question, but can you identify the input arguments for which your regression tests fail and provide precalculated results to the ported application?
>>Can you post the code mentioned above?<<
You mean the disassembly? That will have to wait until I am back in the office.
>>I think this may be a silly question, but can you identify the input arguments for which your regression tests fail and provide precalculated results to the ported application?<<
That would mean using different regression test data for the 32- and 64-bit builds. We want to use the same data for both builds, partly because it saves time, partly as a check that we haven't done anything wrong in porting the app to 64 bits.
>>Here are binary representations of these numbers:
0.0066998698389698335 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101
0.0066998698389698344 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101
0.0066998698389698352 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101<<
That may be true in single-precision (float) maths, but we're using double precision. Expressed as doubles, those 3 numbers are different. More precisely, they are decimal approximations of 3 adjacent binary double-precision (64-bit) IEEE floating-point numbers in a series generated using the _nextafter function (see http://msdn.microsoft.com/en-us/library/h0dff77w(v=vs.100).aspx). I'll post the test code when I am back in the office.
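In the meantime, the generation is essentially this (a minimal reconstruction, not the actual test code):

    #include <math.h>
    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        /* Walk through adjacent doubles starting from one of the
           failing operands quoted earlier and print exp() of each;
           comparing this output between the 32- and 64-bit builds
           shows the 1-ulp discrepancies. */
        double x = 0.0066998698389698335;
        int i;
        for (i = 0; i < 3; ++i)
        {
            printf("x = %.17g  exp(x) = %.17g\n", x, exp(x));
            x = _nextafter(x, DBL_MAX);  /* next representable double up */
        }
        return 0;
    }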
>>>0.0066998698389698335 -> Binary = 0x3BDB8A95 = 00111011 11011011 10001010 10010101>>>
A decimal value with 17 significant digits exceeds the precision of a single-precision float, so it should be represented by a double-precision floating-point value.
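A quick way to confirm this is to print the raw bit patterns both ways (my own throwaway sketch): as floats all three literals should collapse to the quoted 0x3BDB8A95, while as doubles they should occupy three adjacent bit patterns.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    static void show(double d)
    {
        float f = (float)d;              /* rounds to single precision */
        uint32_t fb;
        uint64_t db;
        memcpy(&fb, &f, sizeof fb);
        memcpy(&db, &d, sizeof db);
        printf("float: 0x%08X   double: 0x%016llX\n",
               fb, (unsigned long long)db);
    }

    int main(void)
    {
        show(0.0066998698389698335);
        show(0.0066998698389698344);
        show(0.0066998698389698352);
        return 0;
    }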