Intel® ISA Extensions

Quad precision floating point arithmetic with SSE/AVX?

Mikalai_K_Intel
Employee
Does the latest x86 architecture offer native support for quad-precision (QP) floating-point (FP) arithmetic?
If not, can QP be emulated on XMM and YMM registers with small overhead (< 2x slowdown) compared to double-precision FP arithmetic?
Thanks,
Nick
TimP
Honored Contributor III
No. The purpose of YMM register support is to increase SIMD parallelism for IEEE 32- and 64-bit data types; thus, the performance of those data types receives a further boost. Software quad-precision floating point should speed up slightly, but it is still implemented in scalar integer arithmetic.
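For illustration, here is a minimal sketch of that software path, assuming GCC with libquadmath and its __float128 type; every operation compiles to a scalar library call rather than an SSE/AVX instruction:

    /* Minimal sketch of software quad precision, assuming GCC with
     * libquadmath.  Build with: gcc qp.c -lquadmath */
    #include <quadmath.h>
    #include <stdio.h>

    int main(void)
    {
        __float128 as_double = 0.1;                      /* 53-bit C literal */
        __float128 as_quad   = strtoflt128("0.1", NULL); /* 113-bit parse    */

        char buf[64];
        quadmath_snprintf(buf, sizeof buf, "%.33Qf", as_double);
        printf("0.1 via double literal: %s\n", buf);  /* shows binary error */
        quadmath_snprintf(buf, sizeof buf, "%.33Qf", as_quad);
        printf("0.1 parsed as quad:     %s\n", buf);
        return 0;
    }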
Mikalai_K_Intel
Employee
Tim,
Did you mean a "boost over the double-precision floating point" instead of a "boost over the quad-precision floating point"?
Nick
TimP
Honored Contributor III
See if I made myself any clearer when I edited the post. I meant to say that the native types get more performance enhancement than the software quad precision floating point.
Thanks,
akirkeby
Beginner
Tim,

are there any plans to incorporate native hardware support for quad precision, at least into Xeon processors? The performance of the pure software implementation is generally too slow for our purposes - mostly large, regular financial summation tasks. If not, is there a forum of sorts where one can register interest in hardware support for quad precision in Xeon processors?

Thanks,
Anders
TimP
Honored Contributor III
I'm not aware of any serious move toward hardware quad precision support, or even of any marketing analysis of the extent of financial use of quad precision. I imagine this would require working through an Intel customer support account.
As this would be a long-term project (years), I hope you are working with the current implementations of parallelism.
SHIH_K_Intel
Employee

Hi Anders
If by quad-precision you have 128-bit Binary Integer Decimal in mind, or as a candidate for consideration (BID encoding can deal with rounding and precision-propagation issues better than binary FP encoding), Intel's DFP library is a great place to start.
You might want to contact the leader of the Intel DFP library; he may be able to brief you on the future release plan for that library.
http://software.intel.com/en-us/articles/intel-decimal-floating-point-math-library/?wapkw=decimal+floating+point
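As a quick taste of what 128-bit decimal buys you, here is a minimal sketch using GCC's _Decimal128 extension (not the Intel DFP library's own API, but on x86 it lowers to the same BID software routines):

    /* Sketch of IEEE 754-2008 decimal arithmetic via GCC's _Decimal128.
     * glibc printf cannot format decimal floats, so the total is rounded
     * to double only for display.  Build with: gcc dec.c */
    #include <stdio.h>

    int main(void)
    {
        _Decimal128 price = 0.10DL;   /* exactly 0.10; no binary rounding */
        _Decimal128 total = 0;

        for (int i = 0; i < 1000000; ++i)   /* 10^6 postings of 0.10 */
            total += price;

        /* total is exactly 100000.00 in decimal; the same loop on
           binary doubles drifts in the final digits. */
        printf("total ~= %.10f\n", (double)total);
        return 0;
    }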

You can also contact me offline to explore potential performance headroom on second- and third-generation Intel Core processors or Intel Xeon E3 and E5 processors.

Shihjong

SergeyKostrov
Valued Contributor II
Quoting akirkeby
...The performance of the pure software implementation is generally too slow for our purposes - mostly large,
regular financial summation tasks...

Could you explain why you need 128-bit precision in that case?

Rounding problems create real trouble in the case of exchange operations, and it would be interesting to understand
what your problem is.

Best regards,
Sergey
Bernard
Valued Contributor I

Could you explain why you need 128-bit precision in that case?

Sometimes it could be useful, when you have to trade speed of execution against precision and you do not want an arbitrary-precision implementation, which is slower than hardware registers.
For example, Pi is a transcendental number with infinite precision; wider FP registers would let range-reduction algorithms map large arguments more accurately onto the suitable range for sine calculation.
TimP
Honored Contributor III
Quoting iliyapolak

Could you explain why you need 128-bit precision in that case?

Sometimes it could be useful, when you have to trade speed of execution against precision and you do not want an arbitrary-precision implementation, which is slower than hardware registers.
For example, Pi is a transcendental number with infinite precision; wider FP registers would let range-reduction algorithms map large arguments more accurately onto the suitable range for sine calculation.

This range reduction has been a subject of extensive research, and practical solutions have been implemented which don't rely on extra hardware precision. In any case, to justify the investment in higher precision, a corresponding math function library is required, among other things, and that in turn requires a still-higher-precision algorithm for range reduction.
You could find plenty of references on the limitations of simply relying on extra precision for range reduction, as the x87 firmware does.
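A concrete illustration of the trap (a sketch; note that 2.0*M_PI is itself only a 53-bit approximation of 2*pi, which is exactly the problem):

    /* Why naive range reduction fails for large arguments: the quotient
     * x / (2*pi) has an ulp of ~0.25 at x = 1e16, so the reduced argument
     * is essentially noise.  A good libm reduces with >100 bits of pi.
     * Build with: gcc rr.c -lm */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 1.0e16;
        double twopi = 2.0 * M_PI;                 /* only 53-bit 2*pi */
        double naive = x - floor(x / twopi) * twopi;

        printf("naive reduced argument: %.17g\n", naive);
        printf("sin via naive:          %.17g\n", sin(naive));
        printf("sin via libm:           %.17g\n", sin(x));
        return 0;
    }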
Bernard
Valued Contributor I
So you think that there is no need for quad-precision FP hardware registers to speed up increased-precision calculation. Arbitrary precision (up to a point) could also benefit, albeit partly, from higher-precision hardware registers. I think it won't beat memory-array-based arbitrary precision in terms of the precision needed to represent some numbers very accurately, but in some cases it would speed up the calculation.
TimP
Honored Contributor III
I didn't say there is no need for quad precision. All widely used Fortran compilers have it, for example, with a software implementation. The performance deficiency of current quad precision is due as much to lack of vectorizability as to the lack of a single-instruction hardware implementation.
My point was that no matter how much hardware precision you have, you still need a higher precision range reduction algorithm to support trig functions on your new high precision.
If the market demand were seen, no doubt someone would study the feasibility of vector quad precision on future 256- and 512-bit register platforms.
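For what it's worth, there is a vectorizable middle ground: "double-double" arithmetic, an unevaluated sum of two doubles giving roughly a 106-bit significand (not a true IEEE binary128, and with no wider exponent range). A sketch of the classic error-free TwoSum building block; since it is made of plain adds and subtracts, it maps across SSE/AVX lanes, but it must not be compiled with -ffast-math, which would optimize the correction away:

    /* Knuth's TwoSum: recovers the rounding error of a double addition. */
    #include <stdio.h>

    typedef struct { double hi, lo; } dd;

    static dd two_sum(double a, double b)
    {
        dd r;
        r.hi = a + b;
        double bb = r.hi - a;                 /* part of b absorbed      */
        r.lo = (a - (r.hi - bb)) + (b - bb);  /* rounding error, exactly */
        return r;                             /* hi + lo == a + b        */
    }

    int main(void)
    {
        dd s = two_sum(1.0e16, 1.0);  /* plain double rounds the 1 away */
        printf("hi = %.17g  lo = %.17g\n", s.hi, s.lo);  /* lo == 1    */
        return 0;
    }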
yuriisig
Beginner

The fastest algorithms use IEEE 754.

Bernard
Valued Contributor I

I didn't say there is no need for quad precision. All widely used Fortran compilers have it, for example, with a software implementation. The performance deficiency of current quad precision is due as much to lack of vectorizability as to the lack of a single-instruction hardware

I agree with you on this. We must also ask for what purpose the hardware and ISA should be modified to implement quad precision or more. I suppose there are not many mainstream math or engineering applications that need to calculate quad-precision values of transcendental functions. And for those esoteric applications, or highly sophisticated math packages (Mathematica, Matlab) that calculate trig functions with arbitrary precision, the memory-array model will be the best implementation, albeit at the price of execution speed.

>>you still need a higher precision range reduction algorithm to support trig functions on your new high precision

It is a catch-22 situation.
SergeyKostrov
Valued Contributor II
Quoting TimP (Intel)
I didn't say there is no need for quad precision...

The Borland C++ compiler v5.x includes a BCD Number Library that allows working with numbers of up to 5,000 digits. The question is:

Should I wait for hardware support of 256-bit or 512-bit precision if some workaround could be used?

Also, having worked in the financial industry for many years, I can say that accuracy of calculations is more important than speed.
Bernard
Valued Contributor I

The Borland C++ compiler v5.x includes a BCD Number Library that allows working with numbers of up to 5,000 digits.

Java also has two arbitrary-precision classes: BigInteger and BigDecimal. But they are unintuitive to work with, because numerical primitives like float or int are represented by objects and simple arithmetic operations are done on those objects; you pay a large overhead in the memory needed to store them, and calculation is very slow, even hundreds of times slower than arithmetic on primitive types.
The question is what kind of application, besides some esoteric pure-math software that calculates Pi to thousands of digits and sophisticated math packages like Mathematica, needs such precision.
TimP
Honored Contributor III
The GNU multiple-precision libraries (gmp, mpc, mpfr) are used in the GNU compilers. These libraries are presumably more efficient than high-precision decimal libraries, such as GNU libbid, which are favored in many monetary applications. libbid was designed to support compilation targeting either a firmware or the software-library decimal implementation, including the decimal-mode support for C. These libraries are sufficiently important that they get some consideration in CPU hardware design, and they are unlikely to be replaced by complete firmware/hardware implementations.
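For reference, a minimal sketch of mpfr use at binary128-like precision, assuming the MPFR C API (build with: gcc m.c -lmpfr -lgmp):

    /* GNU MPFR at a 113-bit significand, as in IEEE binary128. */
    #include <mpfr.h>
    #include <stdio.h>

    int main(void)
    {
        mpfr_t a, b;
        mpfr_init2(a, 113);
        mpfr_init2(b, 113);

        mpfr_set_ui(a, 1, MPFR_RNDN);
        mpfr_div_ui(a, a, 3, MPFR_RNDN);   /* a = 1/3 to 113 bits */
        mpfr_const_pi(b, MPFR_RNDN);       /* b = pi to 113 bits  */

        mpfr_printf("1/3 = %.33Rf\npi  = %.33Rf\n", a, b);

        mpfr_clear(a);
        mpfr_clear(b);
        return 0;
    }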
sirrida
Beginner
If speed really matters you should not program in Java.
In e.g. C++ you have structs/classes without necessarily needing heap space, and you have operator overloading. Also, some C(++) compilers allow for 128-bit integers (e.g. gcc: __int128). Intel's C compiler also knows about a kind of 128-bit float which is emulated quite efficiently.
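For example, a sketch of the __int128 route for exact money totals (a gcc/ICC extension, not ISO C):

    /* Exact summation with 128-bit integers: money kept as a count of
     * the smallest currency unit, with ~38 decimal digits of headroom.
     * There is no portable printf for __int128, so split for display. */
    #include <stdio.h>

    int main(void)
    {
        long long postings[] = { 1050, -325, 999999999999LL };  /* cents */
        __int128 total = 0;

        for (int i = 0; i < 3; ++i)
            total += postings[i];

        long long units = (long long)(total / 100);
        int cents = (int)(total % 100);
        printf("total = %lld.%02d\n", units, cents < 0 ? -cents : cents);
        return 0;
    }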
Bernard
Valued Contributor I
Now I am porting my special-functions library from Java to C++. As my tests have shown, native code fully optimized by the Intel compiler is two or even three times faster than the same code written in Java. In my previous post I wrote about the arbitrary-precision Java classes and the problems of using objects to perform simple arithmetic operations on arbitrary-precision numbers. I have not tested the C++ implementations of arbitrary-precision classes, but if they are likewise based on objects representing primitive types, with the same counterintuitive approach to simple arithmetic on objects, they will also be very slow; maybe not as slow as Java, but not as fast as hardware-backed arithmetic.
akirkeby
Beginner
Thanks all for your comments so far. I've been away for a bit, so let me try to answer all comments and questions and add context in one big post:

The business problem is that 14-15 significant digits is not enough to retain sufficient precision for amounts of money in bookkeeping applications where large transaction volumes are totaled up on a regular basis. The problem is particularly apparent when adding low-unit-value currencies such as VND or IDR.
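For anyone wanting to reproduce the effect, here is the failure mode in miniature (illustrative numbers, not our production data):

    /* 0.01 has no exact binary double representation, and the rounding
     * error of each addition compounds across a large posting volume. */
    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;
        for (long i = 0; i < 100000000L; ++i)  /* 10^8 postings of 0.01 */
            sum += 0.01;
        /* prints a value close to, but not exactly, 1000000.00 */
        printf("expected 1000000.00, got %.8f\n", sum);
        return 0;
    }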

The numbers in play are typically not integers. For simple summation they could be shifted a few digits, but that would only solve some of the applications and thus adds to the overall complexity.

Our current solution under investigation is a software implementation of a 128-bit decimal type based on IEEE 754-2008. Its performance so far is 1-2 orders of magnitude slower than that of the corresponding 64-bit data types currently used.

Since our software is deployed in Windows environments, the only alternative to a software implementation I currently see is the FPGA route. But that's not particularly attractive, as FPGA hardware would have to be installed in bulk on servers in outsourced data centres, at substantial cost.

I'm aware that asking for 128-bit precision support at the CPU level is a long-term request. However, with the current performance penalty we see from the software implementation, it is clear that while it may work in limited areas for a while, it will never be something we or our clients will be happy with.

Thanks

SergeyKostrov
Valued Contributor II
Quoting akirkeby
...The problem is particularly apparent when adding low-unit-value currencies such as VND or IDR.

The numbers in play are typically not integers. For simple summation they could be shifted a few digits, but that would
only solve some of the applications and thus adds to the overall complexity...

[SergeyK] VND is the Vietnamese Dong, and the current exchange rate is about 4.8*10^-5 USD (0.0000479846).

Believe me, the currency-exchange problem was solved years ago (since the first PCs appeared at banks)
by introducing a normalization factor (or CurrencyUnit), and in the case of VND it has to be equal to 10^5.

Another way is to do calculations in a base currency, usually USD ($100 USD = 100 / 0.0000479846 VND).
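A toy sketch of that normalization, using the VND rate quoted above (the figures are illustrative only):

    /* Amounts stay as 64-bit integers in currency-specific units;
     * conversion to a base currency (USD) happens only at the edges. */
    #include <stdio.h>

    int main(void)
    {
        const long long vnd_unit = 100000LL;      /* CurrencyUnit = 10^5 */
        const double vnd_to_usd  = 0.0000479846;  /* rate from the post  */

        long long total_vnd  = 2500000000LL;            /* raw VND sum   */
        long long normalized = total_vnd / vnd_unit;    /* 10^5-VND units */

        printf("total: %lld VND = %lld x 10^5 VND ~= %.2f USD\n",
               total_vnd, normalized, total_vnd * vnd_to_usd);
        return 0;
    }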

I would like to repeat that something is conceptually wrong with the way summations
are done in your software. It could also be related to an inefficient database design.

...I currently see is the FPGA route. But that's not particularly attractive, as FPGA hardware would have to be installed
in bulk on servers in outsourced data centres, at substantial cost...

[SergeyK] That "FPGA solution" is clearly not the best one, and I make that statement because I worked
as a C++ software developer in the financial industry for more than 8 years and
was involved in the design and implementation of several financial systems (two of them were
certified at the national bank of a country).


Best regards,
Sergey
