Intel® ISA Extensions

Hardware acceleration of Special Functions.

Bernard
Valued Contributor I
9,037 Views
Hi!
I would like to ask Intel's employees on this forum: why have Intel CPU architects never implemented in hardware some of the more "popular" special functions, like 'GAMMA', 'BETA' and the various 'BESSEL' functions of integer order? All these functions could have been exposed as x87 ISA instructions.
0 Kudos
70 Replies
Bernard
Valued Contributor I
4,219 Views
Is anyone interested in this question?
0 Kudos
sirrida
Beginner
4,219 Views
I am not affiliated with Intel but I strongly suspect that the silicon for that is much better spent elsewhere.
To be honest: How often did you need such functions?
0 Kudos
Bernard
Valued Contributor I
4,219 Views

How often did you need such functions

I liked your answer :) I know that the trigonometric functions (fsin, fcos) are more useful than the special functions I mentioned, but there are various applications that could benefit from a hardware implementation of such functions.
For example, Bessel functions are used in signal processing and in wave propagation.
Gamma functions are used in statistics, for example in the gamma distribution.
These functions can be approximated by a polynomial fit with pre-calculated coefficients, which is straightforward to implement with SSE when less than 80-bit precision is sufficient. I suppose that the CPU designers, being aware of such functions and of the possibility of approximating them accurately in software, simply decided not to implement them in hardware.
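
A minimal sketch of the "polynomial with pre-calculated coefficients" idea, written in Python for readability rather than SSE. The names `horner` and `sin_poly` are hypothetical, and the coefficients are plain Taylor coefficients of sin(x) used as a stand-in: a production library would more likely use minimax-fitted coefficients, which are not given in this thread.

```python
import math

# Coefficients of sin(x)/x as a polynomial in t = x*x, highest degree first.
# ASSUMPTION: Taylor coefficients stand in for a real minimax fit.
SIN_COEFFS = [
    -1.0 / 39916800.0,  # t^5  (x^11 term of sin)
    1.0 / 362880.0,     # t^4
    -1.0 / 5040.0,      # t^3
    1.0 / 120.0,        # t^2
    -1.0 / 6.0,         # t^1
    1.0,                # t^0
]

def horner(coeffs, t):
    """Evaluate a polynomial with Horner's scheme (one multiply-add per coefficient)."""
    acc = 0.0
    for c in coeffs:
        acc = acc * t + c
    return acc

def sin_poly(x):
    """Approximate sin(x) for x already reduced to roughly [-pi/4, pi/4]."""
    return x * horner(SIN_COEFFS, x * x)

if __name__ == "__main__":
    for x in (0.0, 0.25, 0.5, math.pi / 4):
        print(x, sin_poly(x), math.sin(x))
```

Each loop iteration is a single multiply-add, so the same structure maps directly onto packed SSE/AVX arithmetic applied to several inputs at once.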
0 Kudos
bronxzv
New Contributor II
4,219 Views
one key limitation of the legacy x87 transcendental functions is that they are scalar, for high performance code one will use SSEn or AVX software implementations because he benefits from a vectorized implementation (i.e. higher throughput)
0 Kudos
Bernard
Valued Contributor I
4,219 Views

one key limitation of the legacy x87 transcendental functions is that they are scalar, for high performance code one will use SSEn or AVX software implementations because he benefits from a vectorized implementation (i.e. higher throughput)

As our tests have shown, a highly optimized SSE-based sine() function is almost as fast as x87 fsin. But the comparison was made against fsin, which probably implements range reduction in hardware.
I think that Intel could have implemented transcendental functions in microcode with the help of SSE technology. I mean creating an SSE instruction, implemented in microcode, that takes single-precision or double-precision values as input and returns their sines.
0 Kudos
bronxzv
New Contributor II
4,219 Views

But the comparison was made against fsin, which probably implements range reduction in hardware


obviously any complete implementation will include a proper range reduction, that's the case for the x87 instructions and high performance vectorized implementations such as the MKL Vector Mathematical Functions Library [1] vsCos, vsSin, vsSincos, vsAcos, etc.

[1] http://software.intel.com/sites/products/documentation/hpc/mkl/vml/vmldata.htm
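
For readers unfamiliar with the term, here is a minimal sketch of range reduction for sine. It is an illustration only, not how fsin or MKL do it: the function name `sin_reduced` is hypothetical, the standard-library `math.sin`/`math.cos` stand in for short polynomial kernels on the reduced interval, and the naive subtraction of `k * pi/2` loses accuracy for large arguments (real implementations use extended-precision reduction schemes for that).

```python
import math

HALF_PI = math.pi / 2.0

def sin_reduced(x):
    """sin(x) via reduction to r in about [-pi/4, pi/4], then a quadrant switch."""
    k = round(x / HALF_PI)      # nearest multiple of pi/2
    r = x - k * HALF_PI         # reduced argument (naive: fine for moderate |x| only)
    quadrant = k % 4            # Python's % is non-negative for a positive modulus
    # ASSUMPTION: math.sin / math.cos stand in for polynomial kernels on r.
    if quadrant == 0:
        return math.sin(r)
    if quadrant == 1:
        return math.cos(r)
    if quadrant == 2:
        return -math.sin(r)
    return -math.cos(r)

if __name__ == "__main__":
    for x in (-10.0, -1.0, 0.3, 2.0, 7.5, 50.0):
        print(x, sin_reduced(x), math.sin(x))
```

The kernels only ever see a small argument, which is what lets a short polynomial reach full precision.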
0 Kudos
sirrida
Beginner
4,219 Views
> As our tests have shown highly optimized SSE - based sine() function is almost as fast as x87 fsin.
Exactly this is the reason for not wasting silicon, especially when such a function is only very seldom needed.

> I think that Intel could have implemented in microcode...
Of course they could have - but why?
0 Kudos
Bernard
Valued Contributor I
4,219 Views

obviously any complete implementation will include a proper range reduction, that's the case for the x87 instructions and high performance vectorized implementations such as MKL vsCos, vsSin, vsSincos, vsAcos,

Yes, that's true. Did you test the MKL transcendentals?
Take, for example, a function like Gamma, which is not periodic, although it grows very fast. I think Intel could have implemented it in microcode as an SSE or AVX instruction; it could have been even faster when coded as a minimax polynomial approximation (eliminating the dependency on exp and pow).
0 Kudos
Bernard
Valued Contributor I
4,219 Views

Exactly this is the reason for not wasting silicon, especially when such a function is only very seldom needed

I have to agree with you :) You are speaking from a practical point of view. If Intel were building a custom DSP tailored to Bessel-function applications, the engineers would probably have had to implement those functions in hardware. But in the case of Intel CPUs, where such exotic functions can be efficiently approximated with simpler SSE/AVX instructions, they did not spend silicon on this.
@sirrida
>>Of course they could have - but why
For this question I have no answer. As you said, Intel engineers probably used the silicon for more important things, for example branch-prediction logic.
0 Kudos
bronxzv
New Contributor II
4,219 Views

Yes, that's true. Did you test the MKL transcendentals?

no, I don't, but you can find very detailed performance data here:
http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html

as stated by sirrida I don't think it will be a good idea to waste chip area and/or validation budget (read: potential delays) for such specialized things in hardware, forthcoming FMA in Haswell will probably provide a strong boost to all polynomial/rational based approximations, and gather instructions will help table-based methods, it's the right way forward IMHO: powerful new general purpose instructions that let us speed up a lot of special cases

also the more functions you add the more you open the door to someone asking for yet another one, with a hardware-based solution you talk about a 3+ years turnaround for just a new function and with software it's more like 3 weeks
0 Kudos
sirrida
Beginner
4,219 Views
Talking of validation, we don't need another fdiv bug...
0 Kudos
Bernard
Valued Contributor I
4,219 Views

no, I don't, but you can find very detailed performance data here:
http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html

Thank you very much for posting this link. You made my day :)
As I can see, the VML gamma function is also slow: 123 clocks per value. I suppose that they did not eliminate the dependency on library calls. Which static library implements VML? If I knew, I could disassemble that library and try to understand their implementation.

>>as stated by sirrida I don't think it will be a good idea to waste chip area and/or validation budget (read: potential delays) for such specialized things in hardware, forthcoming FMA in Haswell will probably provide a strong boost to all polynomial/rational based approximations, and gather instructions will help table-based methods, that's the right way forward IMHO: powerful new general purpose instructions that let us speed up a lot of special cases

Yes, very true. The ability to approximate such functions with the help of SSE/AVX instructions is the argument against a hardware implementation of special functions.
0 Kudos
bronxzv
New Contributor II
4,219 Views

Talking of validation, we don't need another fdiv bug...

indeed, I was actually thinking of it when mentioning validation, it would be really bad to delay or recall new CPUs due to a hard-to-catch microcode bug for an instruction used by 0.0001 % of the code base
0 Kudos
Bernard
Valued Contributor I
4,219 Views
@bronxzv
I would like to ask you: in which static library is VML implemented? I am searching this directory on my computer: C:\Program Files\Intel\ComposerXE-2011\mkl\lib\ia32
There are many .lib files.
0 Kudos
bronxzv
New Contributor II
4,219 Views
I have actually bought this product and accepted its license (excerpt below)

"
3. LICENSE RESTRICTIONS:
[...]
B. You may NOT: [...] (v) reverse engineer, decompile, or disassemble the Materials;
"
0 Kudos
Bernard
Valued Contributor I
4,219 Views

reverse engineer, decompile, or disassemble the Materials

Sorry, I did not know this.
0 Kudos
Bernard
Valued Contributor I
4,219 Views
@bronxzv

The MKL tgamma result for 1000 randomly chosen double values is 123 cycles, very close to my results. I wonder which approximation they used.
They were also able to achieve an accuracy of 0.5 ulp even on the problematic range [0.0001, 1.0]; maybe they used the Lanczos approximation?
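
For reference, a minimal sketch of what a Lanczos-based gamma looks like. This is not MKL's implementation, which is not known in this thread. It uses the widely published coefficient set for g = 7 with 9 terms, plus the reflection formula for arguments below 0.5; the function name `gamma_lanczos` is hypothetical. Note that this form still depends on pow and exp, unlike the pure polynomial approach discussed earlier in the thread.

```python
import math

# Widely published Lanczos coefficients for g = 7, n = 9.
LANCZOS_G = 7.0
LANCZOS_COEFFS = [
    0.99999999999980993,
    676.5203681218851,
    -1259.1392167224028,
    771.32342877765313,
    -176.61502916214059,
    12.507343278686905,
    -0.13857109526572012,
    9.9843695780195716e-6,
    1.5056327351493116e-7,
]

def gamma_lanczos(x):
    """Gamma(x) for real x (not zero or a negative integer) via Lanczos."""
    if x < 0.5:
        # Reflection formula: Gamma(x) * Gamma(1 - x) = pi / sin(pi * x)
        return math.pi / (math.sin(math.pi * x) * gamma_lanczos(1.0 - x))
    x -= 1.0
    series = LANCZOS_COEFFS[0]
    for i in range(1, len(LANCZOS_COEFFS)):
        series += LANCZOS_COEFFS[i] / (x + i)
    t = x + LANCZOS_G + 0.5
    return math.sqrt(2.0 * math.pi) * t ** (x + 0.5) * math.exp(-t) * series

if __name__ == "__main__":
    for x in (0.0001, 0.3, 0.5, 1.0, 5.0):
        print(x, gamma_lanczos(x), math.gamma(x))
```

Comparing against `math.gamma` on a few points in [0.0001, 1.0] gives a quick sanity check, although it says nothing about whether the 0.5 ulp figure quoted for MKL is reachable this way.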
0 Kudos
bronxzv
New Contributor II
4,219 Views
as I confessed the other day I have zero experience with Gamma functions, these look pretty much like strange animals to me

the poster with the best knowledge of MKL here is TimP AFAIK
0 Kudos
Bernard
Valued Contributor I
4,219 Views
I will post this question on the MKL forum, but I am afraid that nobody will answer a question about the implementation of an algorithm.
0 Kudos
SergeyKostrov
Valued Contributor II
4,091 Views
Quoting iliyapolak
...I would like to ask Intel's employees on this forum. Why Intel CPU architects have never implemented
in hardware some of the more "popular" Special Functions like 'GAMMA', 'BETA' and various 'BESSEL'
functions of an integer order. All these functions could have been accessed by x87 ISA instructions.


I don't think that Intel will add such a set of instructions. These functions are "Special" and they are not "Fundamental".
Intel clearly made a statement: "Use SSE or AVX to achieve the best possible performance..."

Usually, big or small companies need to balance:

What do markets demand?
What do competitors do?
What do some customers want or expect?

And what is going on now? There is a growing demand for more powerful and energy-efficient CPUs to run
bigger (!) versions of different "mobile" and "desktop" OSs.

Iliya, you mentioned a couple of times that some function calculates a result in ~120 clock cycles.
Let's put that on the LEFT side of some "Magic Scale". Let's assume that some big company added hardware
support for that special function to its CPU, and that it gets the same result in ~60 clock cycles. We put that
on the RIGHT side of our "Magic Scale". But that is not everything: a cost, something like $500,000,000 USD,
needs to be added on the RIGHT side as well. This is because the company needs to complete R&D, testing,
verification and various production-related tasks, and salaries must be paid throughout that time.

Personally, I would be glad to see a hardware accelerated matrix multiplication for matrices with
sizes up to 1,024x1,024.

Best regards,
Sergey

0 Kudos