Intel® ISA Extensions

Quad precision floating point arithmetic with SSE/AVX?

Mikalai_K_Intel
Employee
Does the latest x86 architecture offer native support for quad-precision (QP) floating-point (FP) arithmetic?
If not, can QP be emulated on XMM and YMM registers with small overhead (< 2X slowdown) compared to double-precision FP arithmetic?
Thanks,
Nick

Bernard
Valued Contributor I

Our current solution under investigation is a software implementation of a 128-bit decimal type based on IEEE 754-2008. The performance so far is 1-2 orders of magnitude slower than the corresponding 64-bit data types currently used.

Without native hardware acceleration, nothing can be done to improve the performance, i.e., to eliminate the store/load overhead needed to represent a 128-bit number implemented in software.

TimP
Honored Contributor III
The quad precision binary floating point types implemented by software in the popular compilers should be faster than decimal types of similar size. If register data locality is required, to avoid memory bandwidth limitations, hardware or firmware instruction implementation should have an advantage. This would take several years of evidence of application level value and hardware development.

SHIH_K_Intel
Employee

Although I can't speak to the state of the art with respect to 128-bit binary FP software libraries, I can share some insights about performance improvements of IEEE 754-2008 BID floating-point math that are readily feasible without 128-bit FP hardware support.

If you look at how these FP numbers are encoded, whether in a hardware or a software solution, an implementation must deal with:

1. Bit-fields extractions
2. Special case pruning of INF/NANs
3. Binary arithmetic operations on large bit strings to achieve the precision provided in the width of the mantissa
4. Normalization of mantissa/exponents and required rounding operations
5. Fix-up and packing bit fields back into IEEE encoding.
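
Steps 1, 2 and 5 above can be sketched for the BID128 format roughly as follows. This is only an illustration based on my reading of the IEEE 754-2008 BID encoding (1 sign bit, a 14-bit exponent with bias 6176, a 113-bit binary-integer coefficient). The type and function names are hypothetical and are not the Intel BID library's API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container: two qwords, low qword first (little-endian order). */
typedef struct { uint64_t lo, hi; } bid128_bits;

typedef enum { BID_FINITE, BID_INF, BID_NAN } bid_class;

typedef struct {
    bid_class cls;
    int sign;            /* 1 = negative */
    int exponent;        /* unbiased decimal exponent */
    uint64_t coeff_hi;   /* upper 49 bits of the 113-bit coefficient */
    uint64_t coeff_lo;   /* lower 64 bits of the coefficient */
} bid128_fields;

#define BID128_BIAS      6176
#define BID128_NAN_MASK  0x7c00000000000000ULL
#define BID128_INF_BITS  0x7800000000000000ULL
#define BID128_STEERING  0x6000000000000000ULL
#define BID128_COEFF_HI  0x0001ffffffffffffULL

/* Steps 1-2: extract bit fields and prune Inf/NaN before any arithmetic. */
bid128_fields bid128_unpack(bid128_bits x)
{
    bid128_fields f = { BID_FINITE, (int)(x.hi >> 63), 0, 0, 0 };

    if ((x.hi & BID128_NAN_MASK) == BID128_NAN_MASK) { f.cls = BID_NAN; return f; }
    if ((x.hi & BID128_NAN_MASK) == BID128_INF_BITS) { f.cls = BID_INF; return f; }
    if ((x.hi & BID128_STEERING) == BID128_STEERING) {
        /* Large-coefficient form: in 128 bits the coefficient would exceed
           10^34 - 1, so it is non-canonical and treated as zero here. */
        f.exponent = (int)((x.hi >> 47) & 0x3fff) - BID128_BIAS;
        return f;
    }
    f.exponent = (int)((x.hi >> 49) & 0x3fff) - BID128_BIAS;
    f.coeff_hi = x.hi & BID128_COEFF_HI;
    f.coeff_lo = x.lo;
    return f;
}

/* Step 5: pack a finite value back (coefficient assumed below 10^34). */
bid128_bits bid128_pack(int sign, uint64_t coeff_hi, uint64_t coeff_lo, int exponent)
{
    bid128_bits x;
    int biased = exponent + BID128_BIAS;

    assert(biased >= 0 && biased <= 12287);   /* valid BID128 exponent range */
    x.lo = coeff_lo;
    x.hi = ((uint64_t)(sign & 1) << 63)
         | ((uint64_t)biased << 49)
         | (coeff_hi & BID128_COEFF_HI);
    return x;
}
```

Steps 3 and 4 (the wide integer arithmetic and the rounding) sit between these two functions and are where most of the cycles go.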

Whether it's hardware or software, these result-dependency chains are hard to get around. I would venture to suggest any expectation of not more than an order of magnitude slow down, when committing to use quad-precision, is unrealistic.

In my experience with Intel's Binary Integer Decimal (BID) FP library, basic arithmetic operations do take more than 10x longer than hardware-accelerated 64-bit binary FP. Bear in mind that the cycle counts of the arithmetic operations are value-dependent: the most cycle-consuming BID128_add case can take around 180 cycles, while around 80 is more typical. The most cycle-consuming case of BID128_mul can take more than 400 cycles, while the mid-200s is more typical.

Without 128-bit FP hardware, there are several places where algorithmic changes, the existing ISA, and application architecture can deliver large speedups.

Take BID128_mul, for example. Multi-precision arithmetic built on the existing ISA's 64-bit MUL instruction, which produces a 128-bit result, has the strongest impact on accelerating BID128_mul. Additional gains will follow from the MULX instruction, which will be on the market in 2013.
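
To illustrate the multi-precision step, here is a minimal sketch of a 128 x 128 -> 256-bit schoolbook multiply built from four 64 x 64 -> 128-bit products. It assumes the `unsigned __int128` extension of GCC/Clang, which compilers typically lower to the 64-bit MUL (or MULX) instruction on x64. The function name is hypothetical and this is not the library's code.

```c
#include <stdint.h>

/* unsigned __int128 is a GCC/Clang extension, assumed available here. */
typedef unsigned __int128 u128;

/* r[0..3] = a[0..1] * b[0..1], least-significant qword first. */
void mul128x128(const uint64_t a[2], const uint64_t b[2], uint64_t r[4])
{
    /* Four partial products, each 64 x 64 -> 128 bits. */
    u128 p00 = (u128)a[0] * b[0];
    u128 p01 = (u128)a[0] * b[1];
    u128 p10 = (u128)a[1] * b[0];
    u128 p11 = (u128)a[1] * b[1];

    /* Sum the columns; each sum is at most a few times 2^64, so it fits in 128 bits. */
    u128 mid = (p00 >> 64) + (uint64_t)p01 + (uint64_t)p10;
    u128 top = (mid >> 64) + (p01 >> 64) + (p10 >> 64) + (uint64_t)p11;

    r[0] = (uint64_t)p00;
    r[1] = (uint64_t)mid;
    r[2] = (uint64_t)top;
    r[3] = (uint64_t)((top >> 64) + (p11 >> 64));
}
```

For two 113-bit mantissas the product occupies at most 226 bits, so the top qword is only partly used.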

The chore of initial bit field extraction and special case pruning can be handled using existing SSE instructions, so that the lengthy multi-precision math code can start sooner.

The net result is that on a Sandy Bridge-based processor, the most cycle-consuming BID128_mul case takes less than 200 cycles and typical cases are well below 100 cycles, without 128-bit FP hardware and with no new ISA.

It may also be feasible for an application to handle the special-case input ranges at a different stage, rather than relying on an arithmetic library function to implement the defensive pre-processing that a typical software library needs to do.

SergeyKostrov
Valued Contributor II

Although I can't speak to the state of the art with respect to 128-bit binary FP software libraries, I can share some insights about performance improvements of IEEE 754-2008 BID floating-point math that are readily feasible without 128-bit FP hardware support.

If you look at how these FP numbers are encoded, whether in a hardware or a software solution, an implementation must deal with:

1. Bit-fields extractions
2. Special case pruning of INF/NANs
3. Binary arithmetic operations on large bit strings to achieve the precision provided in the width of the mantissa
4. Normalization of mantissa/exponents and required rounding operations
5. Fix-up and packing bit fields back into IEEE encoding...


Well, could you try to explain this to a person who does accounting in a company in the currency exchange business? Or have a chat with somebody who does accounting at Intel. I'll be very glad to hear a response from that person.

Intel Software Engineers, please try to think outside the box.

Nick has a real-life problem related to the rounding of financial transactions. Even if Nick's company spends many millions of dollars on FPGAs or on a 128-bit/256-bit/etc. precision library, it won't fix a really simple problem. Take a look at:

http://www.irishwebmasterforum.com/coding-help/5997-accounting-for-rounding-errors.html

Please tell me how "magical" 128-bit precision hardware will solve that problem.

Some big companies, like SAP, have very flexible rules on how to do rounding. Take a look:

http://help.sap.com/saphelp_rc10/helpdata/en/18/8b8a3a068ada7fe10000000a114084/content.htm

Or, take a look at:

http://blog.acrossecurity.com/2012/01/is-your-online-bank-vulnerable-to.html
http://docs.oracle.com/cd/A60725_05/html/comnls/us/mrc/currco01.htm

Thanks in advance for your time!

Best regards,
Sergey

jimdempseyatthecove
Honored Contributor III
Sergey's suggestion of a normalization factor is good.
Another route to consider is to choose a monetary quantum and perform all calculations in units of that quantum.

A quantum would be defined as an indivisible unit of money. An example choice would be $1.0e-8.

This is approximately 1.0e-3 VND. A signed 64-bit integer count of such quanta could then handle amounts up to about ±$92 billion (2^63 × $1.0e-8).
A round-off error of ~1/1000th of 1 VND might be acceptable. Even 1 million carefully crafted transactions couldn't skew the result by more than 1 cent.
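
A minimal sketch of the fixed-quantum idea, assuming a signed 64-bit integer count and a quantum of $1.0e-8 (so 100,000,000 quanta per dollar). All names are hypothetical. Under these assumptions the ceiling is INT64_MAX / 10^8, roughly $92 billion, so a coarser quantum or a wider integer would be needed for larger totals.

```c
#include <stdint.h>

/* Assumption: one quantum = $1.0e-8, so one dollar = 100,000,000 quanta. */
#define QUANTA_PER_DOLLAR INT64_C(100000000)

typedef int64_t quanta_t;

/* Convert whole dollars plus cents into quanta (exact integer arithmetic). */
quanta_t quanta_from_cents(int64_t dollars, int cents)
{
    return dollars * QUANTA_PER_DOLLAR + (int64_t)cents * (QUANTA_PER_DOLLAR / 100);
}

/* Add two amounts; returns 0 and leaves *sum untouched on overflow. */
int quanta_add(quanta_t a, quanta_t b, quanta_t *sum)
{
    if ((b > 0 && a > INT64_MAX - b) || (b < 0 && a < INT64_MIN - b))
        return 0;
    *sum = a + b;
    return 1;
}

/* Largest whole-dollar amount representable with this quantum. */
int64_t quanta_max_dollars(void)
{
    return INT64_MAX / QUANTA_PER_DOLLAR;
}
```

Additions and subtractions are exact in this scheme; rounding only enters where an amount is multiplied by a rate or divided, which is where the application's rounding rule has to be applied explicitly.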

Periodically the quantum could be deflated (the normalization factor Sergey was talking about). At some date, a value for 1 quantum is chosen and defined as having a normalization factor of 0. Then at some future date (assuming inflation), you could declare that you are now using the "new" quantum, with a normalization factor of 1 with respect to the first-generation quantum.

Jim Dempsey

SergeyKostrov
Valued Contributor II
...In my experiences with Intel's Binary Integer Decimal (BID) FP library...


Could you provide a test-case or link(s) to binaries / sources of the Intel BID FP library? Thanks.

Best regards,
Sergey

TimP
Honored Contributor III
Intel decimal library is included in the gcc source distribution.

SergeyKostrov
Valued Contributor II
Quoting TimP (Intel)
Intel decimal library is included in the gcc source distribution.


Thank you, Tim! Are there plans to include the Intel BID library in the Intel C++ compiler for Windows?

TimP
Honored Contributor III
The claims for the netlib version of the library imply you are entitled to try it yourself.

SergeyKostrov
Valued Contributor II
Quoting TimP (Intel)
The claims for the netlib version of the library imply you are entitled to try it yourself.


This is exactly what I was looking for. Thank you, Tim!

Now I need to schedule some time for R&D, and I will compare the Intel BID library with the Borland BCD Number library.

Best regards,
Sergey

yuriisig
Beginner

Dekker's method applied to pairs of extended-precision numbers (doubled extended arithmetic) gives an accuracy of 128 bits. There is no faster algorithm for that accuracy.
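
As a rough illustration of the idea behind Dekker's method, the sketch below represents a value as an unevaluated sum of two doubles (about 106 significand bits). Pairing two x87 extended-precision values in the same way would give roughly 128 bits. The names are hypothetical, and the code assumes IEEE 754 double arithmetic in round-to-nearest mode, compiled without -ffast-math and without excess precision.

```c
/* Value = hi + lo, with |lo| at most half an ulp of hi. */
typedef struct { double hi, lo; } dd_t;

/* TwoSum (Knuth): s.hi + s.lo == a + b exactly. */
dd_t two_sum(double a, double b)
{
    double s = a + b;
    double bb = s - a;
    double e = (a - (s - bb)) + (b - bb);
    return (dd_t){ s, e };
}

/* Veltkamp split: a == *hi + *lo, each half fitting in 26 significand bits. */
void split(double a, double *hi, double *lo)
{
    double t = 134217729.0 * a;   /* 2^27 + 1 */
    *hi = t - (t - a);
    *lo = a - *hi;
}

/* Dekker's product: p.hi + p.lo == a * b exactly (barring overflow/underflow). */
dd_t two_prod(double a, double b)
{
    double ah, al, bh, bl;
    double p = a * b;
    split(a, &ah, &al);
    split(b, &bh, &bl);
    double e = ((ah * bh - p) + ah * bl + al * bh) + al * bl;
    return (dd_t){ p, e };
}

/* Double-double addition (simple variant, then renormalize). */
dd_t dd_add(dd_t x, dd_t y)
{
    dd_t s = two_sum(x.hi, y.hi);
    return two_sum(s.hi, s.lo + (x.lo + y.lo));
}

/* Double-double multiplication (drops the lo * lo term). */
dd_t dd_mul(dd_t x, dd_t y)
{
    dd_t p = two_prod(x.hi, y.hi);
    return two_sum(p.hi, p.lo + (x.hi * y.lo + x.lo * y.hi));
}
```

Each double-double operation costs on the order of ten to twenty ordinary FP operations, all of which stay in registers and vectorize well, which is why this approach is attractive when no native quad type exists.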

SergeyKostrov
Valued Contributor II
Quoting yuriisig
Dekker's method for doubled-single extended gives accuracy of 128 bits...


Yurii, I can't find any references for that method on the Internet (I used Google search). Could you provide me with Internet links or docs, please? Thanks in advance.

PS: I've found this: http://en.wikipedia.org/wiki/Dekker's_algorithm but it is a different one, for concurrent programming.

yuriisig
Beginner

See the book: Handbook of Floating-Point Arithmetic. The Intel Fortran compiler works with quad precision, but it is very slow.

SHIH_K_Intel
Employee

Hi

I can sketch the test approach of my study, which was geared towards uncovering opportunities of vectorization and native ISA performance headroom that were not exploited. This is different from typical usage of library users. But some parts may be of interest to you.

As a background, GCC supports its own data type, _Decimal128, which maps to BID128 when built for x64 architecture. GCC's native language support for _Decimal128 extension on x64 architecture is essentially wrapped on top of Intel BID library with some of the flexibility trimmed out. The Intel BID library is released in source form that can be built for Linux/Windows using common compilers to run on x86, x64 and IPF. Some of the flexibility provided by API of the Intel BID library include: passing by value or reference, explicit rounding behavior control, exception reporting, endianness etc.

Internal to the Intel BID library, 128-bit and higher-precision data are represented as arrays of qwords. For testing the throughput of basic arithmetic operations, one of the tasks is to generate test bit patterns that characterize the cycle behavior. Different considerations come into play for Bid128_mul vs. Bid128_add.

Studying the source code of the Intel BID library to understand its algorithmic and implementation aspects was quite a task, even with my scope limited to one arithmetic operation at a time.

I was not interested in the flexibility of exception reporting or in the parameter-passing choices, and I chose to focus only on round-to-nearest behavior as a proxy to capture the necessary algorithmic requirements.

So, I made some simplifying choices: (a) extract a proxy implementation of the target BID arithmetic library function that retains the functional, algorithmic and performance characteristics of the original; (b) build a calibration harness to correlate the actual performance of the extracted proxy implementation with off-the-shelf BID128 performance; (c) make my test evaluation and vectorized proof of concept (POC) run on both Windows and Linux.

The simplest calibration test code just uses GCC's _Decimal128 data type extension and the standard operators '*' and '+' provided by that extension. But passing by value not only creates a portability problem; implicit data type conversion to/from _Decimal128 will also invoke additional BID conversion routines, adding overhead.

So, my proxy of the Bid128 arithmetic source implementation adopts the pass-by-reference API of the Intel BID library and uses the same data layout: arrays of qwords in little-endian order on x64.

For Bid128_mul performance evaluations, the primary knob affecting cycles is the dynamic range of the mantissas of the two input values. The Bid128 encoding provides a maximum encodable mantissa of 34 decimal digits, using 113 bits within the 128-bit container.
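
The 34-digit / 113-bit relationship can be checked with a few lines of integer arithmetic, and the same helpers are one way to turn a chosen decimal digit count into the two mantissa qwords of a test pattern. This is a sketch with hypothetical names, assuming the `unsigned __int128` extension of GCC/Clang; it is not taken from my test harness.

```c
#include <stdint.h>

typedef unsigned __int128 u128;  /* GCC/Clang extension, assumed available */

/* 10^n as a 128-bit integer; valid for 0 <= n <= 38. */
u128 pow10_u128(int n)
{
    u128 p = 1;
    while (n-- > 0)
        p *= 10;
    return p;
}

/* Number of significant bits in v (0 for v == 0). */
int bit_length_u128(u128 v)
{
    int bits = 0;
    while (v != 0) {
        v >>= 1;
        bits++;
    }
    return bits;
}

/* Split a coefficient into the (high, low) qword pair used for test patterns. */
void coeff_to_qwords(u128 c, uint64_t *hi, uint64_t *lo)
{
    *hi = (uint64_t)(c >> 64);
    *lo = (uint64_t)c;
}
```

For example, `bit_length_u128(pow10_u128(34) - 1)` evaluates to 113, and sweeping the digit count from 1 to 34 gives mantissa ranges from a few bits up to the full width.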

Hence the basic flow of testing Bid128_mul looks like

extern void _my_Bid128_pack(__BID_UINT128 *pV128, int sign, __UINT64 qw_ho, __UINT64 qw_lo, int exp);

void Test_BID128_MUL(/* knob parameters for random pattern generation */)
{
    int sign1, exp1, sign2, exp2;
    __UINT64 man_hi1, man_lo1, man_hi2, man_lo2;
    __BID_UINT128 a128, b128;
    __BID_UINT128 *pA = &a128, *pB = &b128;

    /* generate desired mantissa bit patterns, exponent values, signs */

    _my_Bid128_pack(&a128, sign1, man_hi1, man_lo1, exp1);
    _my_Bid128_pack(&b128, sign2, man_hi2, man_lo2, exp2);

#ifdef _TARGET_GCC_LINUX_EVAL_
    _Decimal128 ref, *a = (_Decimal128 *)pA, *b = (_Decimal128 *)pB;
    __BID_UINT128 result;   /* assumes the proxy returns __BID_UINT128, as in the #else branch */

    ref = (*a) * (*b);                      /* linked to gcc-provided library code */
    result = _proxy_BID128_MUL(pA, pB);     /* linked to locally compiled proxy code */

    /* compare ref against result (bit for bit), exit if different */
    /* measure throughput of ref = (*a) * (*b) */
    /* result = _my_poc_BID128_MUL(pA, pB);    linked to local vectorized POC */
    /* measure throughput of _proxy_BID128_MUL and of the POC */
#else
    __BID_UINT128 ref, result;

    ref = _proxy_BID128_MUL(pA, pB);        /* linked to locally compiled proxy code */
    result = _my_poc_BID128_MUL(pA, pB);    /* linked to locally compiled vectorized POC code */

    /* compare ref against result, exit if different */
    /* measure throughput of _proxy_BID128_MUL and of the POC */
#endif
}

SergeyKostrov
Valued Contributor II
Quoting yuriisig
See the book: Handbook of Floating-Point Arithmetic...


Thank you! Just downloaded...

yuriisig
Beginner
Quoting yuriisig
The Intel Fortran compiler works with quad precision, but it is very slow.

For example, matrix multiplication is almost 300 times slower than dgemm in Intel MKL! That is an absolutely unacceptable result.

SergeyKostrov
Valued Contributor II
Thank you, Shih!

Quoting Shih Kuo (Intel)
...Internal to Intel BID library, 128-bit and higher precision data...

Could you clarify the statement, please? Does it mean that 256-bit or 512-bit precisions are supported as well?

Best regards,
Sergey

akirkeby
Beginner
Hi,

Thanks for your comments. Various scaling approaches are indeed sensible in many cases. However, to clarify our challenge: the problem is exacerbated when dealing with low-unit-value currencies, but we encounter the same challenges with clients working in USD and EUR. The fundamental issue is not something that can easily be solved through a normalisation factor, not least because the right factor depends on the specific context and would thus require application developer involvement rather than being transparent to the developer.

The problem is fundamentally that we run out of significant digits. The alternative, to lose some precision, is not deemed acceptable. In investment accounting, specific rounding rules must be applied at specific points, and early loss of precision can visibly impact final results.
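
To make "running out of significant digits" concrete: a binary64 double carries a 53-bit significand, roughly 15 to 17 decimal digits, so an amount held as an integer number of smallest units stops being exact above 2^53. A minimal sketch, with a hypothetical function name:

```c
#include <stdint.h>

/* Returns 1 if the integer amount survives a round trip through double. */
int survives_double(int64_t amount)
{
    return (int64_t)(double)amount == amount;
}
```

With this check, 9007199254740992 (2^53) survives but 9007199254740993 does not. A 64-bit decimal type offers 16 digits and a 128-bit decimal type 34, which is why the wider type is attractive once intermediate results carry many fractional digits.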

Thanks,
Anders

SHIH_K_Intel
Employee

Since the mantissas of the input values can reach 113 bits in dynamic range, the intermediate result of the multi-precision multiply of the two input mantissas needs a container larger than 128 bits, as large as 256 bits.

Furthermore, the intermediate product of the two input mantissas needs to be normalized, in conjunction with the desired rounding behavior, to fit the encoding precision of 34 decimal digits defined by the IEEE 754 DFP spec. That is usually done by Montgomery reduction.

So the 256-bit container of the intermediate product is multiplied again by a large-enough constant to perform the 256-bit integer division, producing a quotient with at least 113 bits of precision in the extreme case.
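
A scaled-down illustration of dividing by multiplying with a large-enough constant: the sketch below divides a 64-bit integer by 10 using one widening multiply and a shift. The library does the analogous thing on 256-bit products with precomputed constants for powers of ten; this 64-bit version is only meant to show the principle, its function name is hypothetical, and it assumes the `unsigned __int128` extension of GCC/Clang.

```c
#include <stdint.h>

typedef unsigned __int128 u128;  /* GCC/Clang extension, assumed available */

/* n / 10 without a divide: multiply by ceil(2^67 / 10), then shift right by 67. */
uint64_t div10_by_reciprocal(uint64_t n)
{
    const uint64_t recip = UINT64_C(0xCCCCCCCCCCCCCCCD);
    return (uint64_t)(((u128)n * recip) >> 67);
}
```

The constant has to carry a few more bits than the dividend so that the truncated quotient is exact for every input; that is the "large-enough" requirement, and it is why the 256-bit case needs such wide multiplies.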
