Quad precision floating point arithmetic with SSE/AVX? - Page 2

Mikalai_K_Intel · ‎05-26-2011

Does the latest x86 architecture offer native support for quad precision (QP) floating-point (FP) arithmetic?
If not, can QP be emulated on XMM and YMM registers with small overhead (< 2X slowdown) compared to double precision FP arithmetic?
Thanks,
Nick

Bernard · ‎07-18-2012

Our current solution under investigation is a software implementation of a 128-bit decimal type based on IEEE 754-2008. The performance so far is 1-2 orders of magnitude slower than the corresponding 64-bit data types currently used.

Without native hardware acceleration nothing can be done to improve the performance, i.e. to eliminate the store/load overhead needed to represent a 128-bit number implemented in software.

TimP · ‎07-18-2012

The quad precision binary floating point types implemented by software in the popular compilers should be faster than decimal types of similar size. If register data locality is required, to avoid memory bandwidth limitations, hardware or firmware instruction implementation should have an advantage. This would take several years of evidence of application level value and hardware development.

SHIH_K_Intel · ‎07-18-2012

Although I can't speak to the state of the art with respect to 128-bit binary FP software libraries, I can share some insights about performance improvements of IEEE 754-2008 BID floating-point math that are readily feasible without 128-bit FP hardware support.

If you look at how these FP numbers are encoded, whether in hardware or software solutions, they must deal with

1. Bit-fields extractions
2. Special case pruning of INF/NANs
3. Binary arithmetic operations on large bit strings to achieve the precision provided in the width of the mantissa
4. Normalization of mantissa/exponents and required rounding operations
5. Fix-up and packing bit fields back into IEEE encoding.
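As a rough illustration of steps 1, 2 and 5 for the BID128 case, here is a minimal C sketch. It assumes the IEEE 754-2008 decimal128 binary-integer layout: one sign bit, a 14-bit exponent with bias 6176 directly after the sign in the common case, and a 113-bit coefficient split across two qwords. The names `bid128_bits`, `bid128_fields`, `bid128_unpack` and `bid128_pack` are made up for this example and are not the Intel BID library API. Non-canonical coefficients in the common layout and NaN payloads are not handled.

```c
#include <stdint.h>

/* Hypothetical container: two qwords, w[1] holds the high half. */
typedef struct { uint64_t w[2]; } bid128_bits;

typedef struct {
    int      sign;        /* 0 or 1 */
    int      is_special;  /* 1 for Inf or NaN */
    int      exponent;    /* unbiased decimal exponent */
    uint64_t coeff_hi;    /* upper 49 bits of the coefficient */
    uint64_t coeff_lo;    /* lower 64 bits of the coefficient */
} bid128_fields;

#define BID128_EXP_BIAS   6176
#define BID128_MASK_SPEC  UINT64_C(0x7800000000000000)
#define BID128_MASK_STEER UINT64_C(0x6000000000000000)
#define BID128_MASK_COEFF UINT64_C(0x0001FFFFFFFFFFFF)

/* Steps 1 and 2: extract the bit fields and prune Inf/NaN early. */
static bid128_fields bid128_unpack(bid128_bits x)
{
    bid128_fields f = { 0, 0, 0, 0, 0 };
    uint64_t hi = x.w[1];

    f.sign = (int)(hi >> 63);

    /* Inf (11110) and NaN (11111) both have the four bits after the sign set. */
    if ((hi & BID128_MASK_SPEC) == BID128_MASK_SPEC) {
        f.is_special = 1;
        return f;
    }

    /* Steering bits 11 select the alternate layout. For decimal128 its
       coefficient always exceeds 10^34 - 1, so the value is read as zero. */
    if ((hi & BID128_MASK_STEER) == BID128_MASK_STEER) {
        f.exponent = (int)((hi >> 47) & 0x3FFF) - BID128_EXP_BIAS;
        return f;
    }

    f.exponent = (int)((hi >> 49) & 0x3FFF) - BID128_EXP_BIAS;
    f.coeff_hi = hi & BID128_MASK_COEFF;
    f.coeff_lo = x.w[0];
    return f;
}

/* Step 5: pack finite fields back into the encoding.
   Assumes -6176 <= exponent <= 6111 and a coefficient below 10^34. */
static bid128_bits bid128_pack(int sign, int exponent,
                               uint64_t coeff_hi, uint64_t coeff_lo)
{
    bid128_bits x;
    x.w[0] = coeff_lo;
    x.w[1] = ((uint64_t)(sign != 0) << 63)
           | ((uint64_t)(exponent + BID128_EXP_BIAS) << 49)
           | (coeff_hi & BID128_MASK_COEFF);
    return x;
}
```

Steps 3 and 4 (the wide integer arithmetic and the rounding) sit between these two calls and are where most of the cycles go.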

Whether it's hardware or software, these result-dependency chains are hard to get around. I would venture to suggest any expectation of not more than an order of magnitude slow down, when committing to use quad-precision, is unrealistic.

In my experience with Intel's Binary Integer Decimal (BID) FP library, basic arithmetic operations do take more than 10x longer than the hardware-accelerated 64-bit binary FP encoding. Bear in mind that the cycle counts of the arithmetic operations are value-dependent: the most cycle-consuming BID128_add case can take 180ish cycles, while 80ish is more typical. On the other hand, the most cycle-consuming case of BID128_mul can take more than 400 cycles, while mid 200s is more typical.

Without 128-bit FP hardware, there are several places where algorithmic changes, the existing ISA and the application architecture can deliver a large speedup.

Take BID128_mul for example: multi-precision arithmetic using the existing ISA's 64-bit MUL instruction, which produces a 128-bit result, will have the strongest impact on accelerating BID128_mul. Additional gains will follow from the MULX instruction that will be on the market in 2013.
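To make the multi-precision step concrete, here is a minimal sketch of a 128 x 128 -> 256-bit schoolbook multiply built from four 64 x 64 -> 128-bit products. It is not the library's code. It assumes a GCC or Clang style compiler that provides `unsigned __int128`, and the names `u256`, `mul64` and `mul128` are invented for the example.

```c
#include <stdint.h>

/* 256-bit result as four qwords, w[0] is the least significant. */
typedef struct { uint64_t w[4]; } u256;

/* 64 x 64 -> 128-bit product. On x86-64 a compiler typically lowers this
   to a single MUL (or MULX when BMI2 code generation is enabled). */
static void mul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}

/* Schoolbook 128 x 128 -> 256-bit multiply from four partial products. */
static u256 mul128(uint64_t a_hi, uint64_t a_lo, uint64_t b_hi, uint64_t b_lo)
{
    uint64_t ll_hi, ll_lo, lh_hi, lh_lo, hl_hi, hl_lo, hh_hi, hh_lo;
    unsigned __int128 t;
    u256 r;

    mul64(a_lo, b_lo, &ll_hi, &ll_lo);
    mul64(a_lo, b_hi, &lh_hi, &lh_lo);
    mul64(a_hi, b_lo, &hl_hi, &hl_lo);
    mul64(a_hi, b_hi, &hh_hi, &hh_lo);

    r.w[0] = ll_lo;

    /* Column for bits 64..127, carry kept in the upper half of t. */
    t = (unsigned __int128)ll_hi + lh_lo + hl_lo;
    r.w[1] = (uint64_t)t;

    /* Column for bits 128..191 plus the carry from the previous column. */
    t = (t >> 64) + lh_hi + hl_hi + hh_lo;
    r.w[2] = (uint64_t)t;

    /* Top column cannot overflow because the product is below 2^256. */
    r.w[3] = hh_hi + (uint64_t)(t >> 64);
    return r;
}
```

The four `mul64` calls are independent of each other, so an out-of-order core can overlap them; the serial part is the carry chain that combines the columns.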

The chore of initial bit field extraction and special case pruning can be handled using existing SSE instructions, so that the lengthy multi-precision math code can start sooner.

The net result is that, on a Sandy Bridge based processor, the most cycle-consuming BID128_mul case would take less than 200 cycles and typical cases are well below 100 cycles, without 128-bit FP hardware and with no new ISA.

It may be feasible in an application to handle the special-case input ranges at a different stage, rather than relying on an arithmetic library function to implement the defensive pre-processing stage, as a typical software library needs to do.

SergeyKostrov · ‎07-18-2012

Quoting Shih Kuo (Intel)

Although I can't speak to the state of the art with respect to 128-bit binary FP software libraries, I can share some insights about
performance improvements of IEEE 754-2008 BID floating-point math that are readily feasible without 128-bit FP hardware support.

If you look at how these FP numbers are encoded, whether in hardware or software solutions, they must deal with

1. Bit-fields extractions
2. Special case pruning of INF/NANs
3. Binary arithmetic operations on large bit strings to achieve the precision provided in the width of the mantissa
4. Normalization of mantissa/exponents and required rounding operations
5. Fix-up and packing bit fields back into IEEE encoding...

Well, could you try to explain this to a person who does accounting in some company that is in the Currency Exchange
business? Or have a chat with somebody who does accounting at Intel. I'll be very glad to hear a response
from that person.

Intel Software Engineers, please try to look Out-Of-The-Box.

Nick has a real-life problem related to "Rounding of a Financial Transaction". Even if Nick's company spends
many millions of dollars on FPGAs, or on a 128-bit/256-bit/etc. precision library, it won't fix a really simple problem.
Take a look at:

http://www.irishwebmasterforum.com/coding-help/5997-accounting-for-rounding-errors.html

Please tell me how "magical" 128-bit precision hardware will solve that problem?

Some big companies, like SAP, have very flexible rules on how to do rounding. Take a look:

http://help.sap.com/saphelp_rc10/helpdata/en/18/8b8a3a068ada7fe10000000a114084/content.htm

Or, take a look at:

http://blog.acrossecurity.com/2012/01/is-your-online-bank-vulnerable-to.html
http://docs.oracle.com/cd/A60725_05/html/comnls/us/mrc/currco01.htm

Thanks in advance for your time!

Best regards,
Sergey

jimdempseyatthecove · ‎07-20-2012

Sergey's suggestion of a normalization factor is good.
Another route to consider is to choose a monetary quantum and perform all calculations in units of that quantum.

A quantum would be defined as an indivisible unit of money. An example choice would be $1.0e-8.

This is approximately 1.0e-3 VND. 64 bits could then handle amounts up to ~+/-$35,000 trillion US.
Round-off error of ~1/1000th of 1 VND might be acceptable. Even 1 million carefully crafted transactions couldn't skew the result by more than 1 cent.

Periodically the quantum could be deflated (the normalization factor Sergey was talking about). At some date a value for 1 quantum is chosen and defined as having a normalization factor of 0. Then at some future date (assuming inflation) you could declare that we are now using the "new" quantum, with a normalization factor of 1 with respect to the first-generation quantum.
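A minimal sketch of the quantum idea in C, under stated assumptions: amounts are signed 64-bit counts of a 1.0e-8 dollar quantum, additions are plain integer additions, and rounding happens only where an amount is scaled by a ratio. The names `quanta_t`, `QUANTA_PER_DOLLAR` and `quanta_scale` are hypothetical, `__int128` is a GCC/Clang extension used for the intermediate product, and round-half-away-from-zero is just one possible rule; a real system would plug in whatever rounding rule its accounting context prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* One quantum is 1.0e-8 dollars, so one dollar is 10^8 quanta. */
#define QUANTA_PER_DOLLAR INT64_C(100000000)

typedef int64_t quanta_t;

/* Scale an amount by num/den with a single, explicit rounding step
   (round half away from zero). The 128-bit intermediate keeps the
   product amount * num from overflowing before the division. */
static quanta_t quanta_scale(quanta_t amount, int64_t num, int64_t den)
{
    assert(den > 0);
    __int128 p    = (__int128)amount * num;
    __int128 half = den / 2;
    __int128 q    = (p >= 0) ? (p + half) / den
                             : -((-p + half) / den);
    return (quanta_t)q;   /* caller must ensure the result fits in 64 bits */
}
```

For example, `quanta_scale(QUANTA_PER_DOLLAR, 2, 3)` gives 66666667 quanta for two thirds of a dollar, and the rounding error of that one step is below half a quantum.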

Jim Dempsey

SergeyKostrov · ‎07-26-2012

Quoting Shih Kuo (Intel)

...In my experiences with Intel's Binary Integer Decimal (BID) FP library...

Could you provide a test-case or link(s) to binaries / sources of the Intel BID FP library? Thanks.

Best regards,
Sergey

TimP · ‎07-27-2012

Intel decimal library is included in the gcc source distribution.

SergeyKostrov · ‎07-27-2012

Quoting TimP (Intel)

Intel decimal library is included in the gcc source distribution.

Thank you, Tim! Are there plans to include the Intel BID library in the Intel C++ compiler for Windows?

TimP · ‎07-28-2012

The claims for the netlib version of the library imply you are entitled to try it yourself.

SergeyKostrov · ‎07-28-2012

Quoting TimP (Intel)

The claims for the netlib version of the library imply you are entitled to try it yourself.

This is exactly what I was looking for. Thank you, Tim!

Now I need to schedule some time for R&D and I will compare the Intel BID library with the Borland BCD Number library.

Best regards,
Sergey

yuriisig · ‎07-29-2012

Dekker's method for doubled-single extended precision gives an accuracy of 128 bits. There is no faster algorithm for that accuracy.
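For readers who have not met the technique, here is a minimal sketch of Dekker-style doubled-precision arithmetic. It is written with plain `double` pairs, which give roughly 106 bits of significand; the 128-bit figure presumably refers to pairing 64-bit-significand extended values, which this sketch does not attempt. The names `dd_t`, `two_sum`, `split`, `two_prod`, `dd_add` and `dd_mul` are illustrative only. It assumes IEEE 754 round-to-nearest double arithmetic with no value-changing optimizations (no `-ffast-math`) and no overflow or underflow in the intermediate products.

```c
/* A doubled-precision value represented as the unevaluated sum hi + lo. */
typedef struct { double hi, lo; } dd_t;

/* Error-free sum (Knuth): s.hi + s.lo equals a + b exactly. */
static dd_t two_sum(double a, double b)
{
    double s  = a + b;
    double bb = s - a;
    double e  = (a - (s - bb)) + (b - bb);
    return (dd_t){ s, e };
}

/* Veltkamp/Dekker split: a == hi + lo, each half short enough that
   products of halves are exact in double precision. */
static void split(double a, double *hi, double *lo)
{
    double t = 134217729.0 * a;   /* 2^27 + 1 */
    *hi = t - (t - a);
    *lo = a - *hi;
}

/* Error-free product (Dekker): p.hi + p.lo equals a * b exactly. */
static dd_t two_prod(double a, double b)
{
    double ah, al, bh, bl;
    double p = a * b;
    split(a, &ah, &al);
    split(b, &bh, &bl);
    double e = ((ah * bh - p) + ah * bl + al * bh) + al * bl;
    return (dd_t){ p, e };
}

/* Doubled-precision addition (simple variant, then renormalize). */
static dd_t dd_add(dd_t x, dd_t y)
{
    dd_t s = two_sum(x.hi, y.hi);
    s.lo += x.lo + y.lo;
    return two_sum(s.hi, s.lo);
}

/* Doubled-precision multiplication (cross terms added to the error). */
static dd_t dd_mul(dd_t x, dd_t y)
{
    dd_t p = two_prod(x.hi, y.hi);
    p.lo += x.hi * y.lo + x.lo * y.hi;
    return two_sum(p.hi, p.lo);
}
```

Everything here runs on ordinary double-precision hardware, which is why this approach is attractive when no native quad type exists; the price is a dozen or more floating-point operations per doubled-precision operation.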

SergeyKostrov · ‎07-29-2012

Quoting yuriisig

Dekker's method for doubled-single extended gives accuracy of 128 bits...

Yurii, I can't find any references for that method on the Internet (Google search was used). Could you provide me
with internet links or docs, please? Thanks in advance.

PS: I've found this http://en.wikipedia.org/wiki/Dekker's_algorithm but it is a different one, for concurrent programming.

yuriisig · ‎07-29-2012

See the book: Handbook of Floating-Point Arithmetic. The Intel Fortran compiler works with quad precision, but it is very slow.

SHIH_K_Intel · ‎07-30-2012

Hi

I can sketch the test approach of my study, which was geared towards uncovering opportunities of vectorization and native ISA performance headroom that were not exploited. This is different from typical usage of library users. But some parts may be of interest to you.

As background, GCC supports its own data type, _Decimal128, which maps to BID128 when built for the x64 architecture. GCC's native language support for the _Decimal128 extension on x64 is essentially wrapped on top of the Intel BID library with some of the flexibility trimmed out. The Intel BID library is released in source form and can be built for Linux/Windows using common compilers to run on x86, x64 and IPF. The flexibility provided by the API of the Intel BID library includes: passing by value or by reference, explicit rounding behavior control, exception reporting, endianness etc.

Internal to the Intel BID library, 128-bit and higher precision data are represented as arrays of qwords. For testing the throughput of basic arithmetic operations, one of the tasks is to generate test bit patterns to characterize cycle behavior. Different considerations come into play when considering Bid128_mul vs. Bid128_add.

Studying the source code of the Intel BID library to understand its algorithmic and implementation aspects was quite a task, even with my scope limited to one arithmetic operation at a time.

I was not interested in the flexibility of exception reporting, nor parameter passing choices, and I chose only to focus on round-nearest behavior as a proxy to capture the necessary algorithmic requirements.

So, I made some simplifying choices: (a) extricate a proxy implementation of the target BID arithmetic library function that retains the functional, algorithmic and performance characteristics of the original library function, (b) build a calibration harness to correlate the actual performance of the extricated proxy implementation with off-the-shelf BID128 performance, (c) make sure my test evaluation and vectorized POC run on both Windows and Linux.

The simplest calibration test code simply uses GCC's _Decimal128 data type extension and the standard operators '*' and '+' provided by that extension. But passing by value not only creates a portability problem; implicit data type conversion to/from _Decimal128 will also invoke additional BID conversion routines, adding overhead.

So, my proxy of the Bid128 arithmetic source implementation adopts the pass-by-reference API of the Intel BID library and uses the same data layout of arrays of qwords in little-endian order on x64.

For Bid128_mul performance evaluations, the primary knob affecting cycles is the dynamic range of the mantissas of the two input values. The Bid128 encoding provides a maximum encodable mantissa of 34 decimal digits using 113 bits within the 128-bit container.
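The 34-digit / 113-bit relationship can be checked with a few lines of C. This is only a sketch: it relies on the `unsigned __int128` extension of GCC and Clang, and the helper names (`pow10_u128`, `bit_length`, `bid128_coeff_in_range`) are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* 10^n as a 128-bit integer; 10^38 is the largest power that fits. */
static u128 pow10_u128(unsigned n)
{
    assert(n <= 38);
    u128 p = 1;
    while (n--)
        p *= 10;
    return p;
}

/* Number of bits needed to represent v (0 for v == 0). */
static int bit_length(u128 v)
{
    int n = 0;
    while (v) {
        n++;
        v >>= 1;
    }
    return n;
}

/* A coefficient given as two qwords is within the 34-digit range
   when it is below 10^34. */
static int bid128_coeff_in_range(uint64_t hi, uint64_t lo)
{
    u128 c = ((u128)hi << 64) | lo;
    return c < pow10_u128(34);
}
```

`bit_length(pow10_u128(34) - 1)` evaluates to 113, so the largest 34-digit coefficient needs exactly 113 bits, while some 113-bit patterns (for example all ones) lie above 10^34 - 1 and fall outside the 34-digit range.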

Hence the basic flow of testing Bid128_mul looks like

extern void _my_Bid128_pack(__BID_UINT128 *pV128, int sign, __UINT64 qw_ho, __UINT64 qw_lo, int exp);

void Test_BID128_MUL(/* knob parameters for random pattern generation */)
{
    int sign1, exp1, sign2, exp2;
    __UINT64 man_hi1, man_lo1, man_hi2, man_lo2;
    __BID_UINT128 a128, b128;
    __BID_UINT128 *pA = &a128, *pB = &b128;

    /* generate desired mantissa bit patterns, exponent values, signs */

    _my_Bid128_pack(&a128, sign1, man_hi1, man_lo1, exp1);
    _my_Bid128_pack(&b128, sign2, man_hi2, man_lo2, exp2);

#ifdef _TARGET_GCC_LINUX_EVAL_
    _Decimal128 ref, result, *a = (_Decimal128 *) pA, *b = (_Decimal128 *) pB;

    ref = (*a) * (*b);                      // linked to gcc provided library code
    result = _proxy_BID128_MUL(pA, pB);     // link to locally compiled proxy code

    // compare ref against result, exit if different
    // measure throughput of either ref = (*a) * (*b);
    // result = _my_poc_BID128_MUL(pA, pB); // link to local vectorized poc
    // measure throughput of _proxy_BID128_MUL, my poc
#else
    __BID_UINT128 ref, result;

    ref = _proxy_BID128_MUL(pA, pB);        // link to locally compiled proxy code
    result = _my_poc_BID128_MUL(pA, pB);    // link to locally compiled vectorized poc code

    // compare ref against result, exit if different
    // measure throughput of _proxy_BID128_MUL, my poc
#endif
}

SergeyKostrov · ‎07-30-2012

Quoting yuriisig

See the book: Handbook of Floating-Point Arithmetic...

Thank you! Just downloaded...

yuriisig · ‎07-30-2012

Quoting yuriisig

The Intel Fortran compiler works with quad precision, but it is very slow.

For example, matrix multiplication is almost 300 times slower than dgemm in Intel MKL! That is an absolutely unacceptable result.

SergeyKostrov · ‎08-01-2012

Thank you, Shih!

Quoting Shih Kuo (Intel)

...Internal to Intel BID library, 128-bit and higher precision data...

Could you clarify the statement, please? Does it mean that 256-bit or 512-bit precisions are supported as well?

Best regards,
Sergey

akirkeby · ‎08-02-2012

Hi,

Thanks for your comments. Various scaling approaches are indeed sensible in many cases. However, to clarify our challenge: the problem is exacerbated when dealing with low unit value currencies, but we encounter the same challenges with clients working in USD and EUR. The fundamental issue is not something that can easily be solved through a normalisation factor, not least because the right factor depends on the specific context and would thus require application developer involvement rather than being transparent to the developer.

The problem is fundamentally that we run out of significant digits. The alternative, to lose some precision, is not deemed acceptable. In investment accounting specific rounding rules must be applied at specific points, and early loss of precision can impact final results visibly.

Thanks,
Anders

SHIH_K_Intel · ‎08-06-2012

Since the mantissa of each input value can reach 113 bits in dynamic range, the intermediate result of the multi-precision multiply of the two input mantissas needs an even larger container than 128 bits, as large as 256 bits.

Furthermore, the intermediate product of the two input mantissas needs to be normalized, in conjunction with the desired rounding behavior, to fit the encoding precision of 34 decimal digits defined by the IEEE 754 DFP spec. That is usually done by Montgomery reduction.

So the 256-bit container holding the intermediate product is multiplied again by a large-enough constant to perform a 256-bit integer division, producing a quotient with at least 113 bits of precision in the extreme case.
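The idea of dividing by multiplying with a large-enough constant can be shown at a much smaller scale. The sketch below divides a 64-bit integer by 10 using a precomputed reciprocal and a shift; it is plain reciprocal multiplication as commonly emitted by compilers, not code or constants taken from the BID library, which works on 256-bit values and powers of ten up to 10^34. The function names are invented, and `unsigned __int128` is a GCC/Clang extension.

```c
#include <stdint.h>

/* Quotient x / 10 without a divide instruction.
   0xCCCCCCCCCCCCCCCD is ceil(2^67 / 10); multiplying by it and shifting
   right by 67 yields the exact floor quotient for every 64-bit x. */
static uint64_t div10(uint64_t x)
{
    unsigned __int128 p = (unsigned __int128)x * UINT64_C(0xCCCCCCCCCCCCCCCD);
    return (uint64_t)(p >> 67);
}

/* Remainder recovered from the quotient; a rounding step would inspect
   this digit to decide whether to bump the quotient. */
static unsigned rem10(uint64_t x)
{
    return (unsigned)(x - 10 * div10(x));
}
```

The wide multiply produces a result twice the width of the operand, which is the same reason the 256-bit intermediate product in the post has to be multiplied into an even wider value before the quotient is extracted.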