I would like to ask Intel's employees on this forum: why have Intel CPU architects never implemented in hardware some of the more "popular" special functions, such as Gamma, Beta and the various Bessel functions of integer order? All of these functions could have been exposed as x87 ISA instructions.

To be honest: How often did you need such functions?

> How often did you need such functions

I liked your answer :) I know that the trigonometric functions (fsin, fcos) are more useful than the special functions I mentioned, but there are various applications that could benefit from a hardware implementation of such functions.

For example, Bessel functions are used in signal processing and in wave propagation.

The Gamma function is used in statistics, for example in the gamma distribution.

These functions can be approximated by a polynomial fit with pre-calculated coefficients, which is straightforward to implement with SSE when high precision (but less than 80-bit) is needed. I suppose that the CPU designers, being aware of such functions and of the possibility of approximating them accurately in software, simply decided not to implement them in hardware.
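To make the "polynomial fit with pre-calculated coefficients" idea concrete, here is a minimal sketch in Python rather than SSE intrinsics. It evaluates sin(x) on a small interval with Horner's scheme. The coefficients are plain Taylor coefficients, which is an assumption made for illustration; a tuned library would more likely use minimax-fitted coefficients. The names `SIN_COEFFS`, `horner` and `sin_poly` are hypothetical.

```python
import math

# Coefficients of sin(x)/x as a polynomial in t = x*x.
# Assumption: Taylor coefficients, chosen only for illustration;
# minimax-fitted coefficients would give a smaller maximum error.
SIN_COEFFS = [
    1.0,
    -1.0 / 6.0,
    1.0 / 120.0,
    -1.0 / 5040.0,
    1.0 / 362880.0,
    -1.0 / 39916800.0,
]


def horner(coeffs, t):
    """Evaluate coeffs[0] + coeffs[1]*t + coeffs[2]*t**2 + ... by Horner's scheme."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * t + c  # one multiply-add per coefficient
    return acc


def sin_poly(x):
    """Approximate sin(x) for |x| <= pi/4 as x * P(x*x)."""
    return x * horner(SIN_COEFFS, x * x)
```

Each loop iteration is a single multiply-add, so the same structure maps directly onto packed SSE/AVX multiply and add instructions applied to several arguments at once.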

> one key limitation of the legacy x87 transcendental functions is that they are scalar, for high performance code one will use SSEn or AVX software implementations because he benefits from a vectorized implementation (i.e. higher throughput)

As our tests have shown, a highly optimized SSE-based sine() function is almost as fast as x87 fsin. But the comparison was made against fsin, which probably implements range reduction in hardware.

I think that Intel could have implemented transcendental functions in microcode with the help of SSE technology. I mean creating an SSE instruction, implemented in microcode, that takes single-precision or double-precision values as input and returns the sine of those values.

> But the comparison was made against fsin, which probably implements range reduction in hardware

Obviously any complete implementation will include proper range reduction. That is the case for the x87 instructions and for high-performance vectorized implementations such as the MKL Vector Mathematical Functions Library [1] (vsCos, vsSin, vsSincos, vsAcos, etc.).

[1] http://software.intel.com/sites/products/documentation/hpc/mkl/vml/vmldata.htm
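For readers unfamiliar with the term, range reduction can be sketched as follows. This is a naive one-step reduction in Python, written only to illustrate the idea; it is not how fsin or MKL do it, and it loses accuracy for large |x| (production libraries use more careful schemes such as Cody-Waite or Payne-Hanek reduction). The names `reduce_arg` and `sin_reduced` are hypothetical, and `math.sin`/`math.cos` on the reduced argument stand in for the short polynomial kernels a real implementation would use.

```python
import math

HALF_PI = math.pi / 2


def reduce_arg(x):
    """Split x into (q, r) with x ~= k*(pi/2) + r, |r| <= pi/4, q = k mod 4.

    Naive reduction: fine for moderate |x|, inaccurate for very large |x|.
    """
    k = round(x / HALF_PI)
    r = x - k * HALF_PI
    return k % 4, r


def sin_reduced(x):
    """sin(x) computed from the reduced argument and the quadrant index."""
    q, r = reduce_arg(x)
    if q == 0:
        return math.sin(r)   # stand-in for a sine kernel on [-pi/4, pi/4]
    if q == 1:
        return math.cos(r)   # stand-in for a cosine kernel on [-pi/4, pi/4]
    if q == 2:
        return -math.sin(r)
    return -math.cos(r)
```

The point is that the kernels only ever see arguments in [-pi/4, pi/4], which is what makes a short polynomial accurate enough.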

This is exactly the reason for not wasting silicon, especially when such a function is needed only very seldom.

> I think that Intel could have implemented in microcode...

Of course they could have - but why?

> obviously any complete implementation will include a proper range reduction, that's the case for the x87 instructions and high performance vectorized implementations such as MKL vsCos, vsSin, vsSincos, vsAcos,

Yes, that's true. Did you test the MKL transcendentals?

Take, for example, a function like Gamma, which is not periodic, although it grows very fast. I think Intel could have implemented it in microcode as an SSE or AVX instruction; it could be even faster when coded as a minimax polynomial approximation (eliminating the dependency on exp and pow).

> Exactly this is the reason for not wasting silicon, especially when such a function is only very seldom needed

I have to agree with you :) You are speaking from a practical point of view. If Intel were creating a custom DSP processor tailored to applications of Bessel functions, its engineers would probably have been required to implement them in hardware. But in the case of Intel CPUs, where such exotic functions can be approximated efficiently with simpler SSE/AVX instructions, they did not spend silicon on this.

**@sirrida**

>>Of course they could have - but why

I have no answer to this question. As you said, Intel's engineers probably used the silicon for more important things, for example branch-prediction logic.

> Yes thats true.Did you test MKL transcendentals?

No, I didn't, but you can find very detailed performance data here:

http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html

As stated by sirrida, I don't think it would be a good idea to spend chip area and/or validation budget (read: potential delays) on such specialized things in hardware. The forthcoming FMA in Haswell will probably provide a strong boost to all polynomial/rational approximations, and gather instructions will help table-based methods. It's the right way forward IMHO: powerful new general-purpose instructions that let us speed up a lot of special cases.

Also, the more functions you add, the more you open the door to someone asking for yet another one. With a hardware-based solution you are talking about a 3+ year turnaround for just one new function; with software it's more like 3 weeks.

> no, I don't, but you can find very detailed performance data here:
> http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/_performanceall.html

Thank you very much for posting this link. You made my day :)

As I have already seen, the VML gamma function is also slow: 123 clocks per value. I suppose that they did not eliminate the dependency on library calls.

**Which static library implements VML?** If I knew this, I would be able to disassemble the library and try to understand their implementation.

>>as stated by sirrida I don't think it will be a good idea to waste chip area and/or validation budget (read: potential delays) for such specialized things in hardware, forthcoming FMA in Haswell will probably provide a strong boost to all polynomial/rational based approximations, and gather instructions will help table-based methods, that's the right way forward IMHO: powerful new general purpose instructions that let us speed up a lot of special cases

Yes, very true. The ability to approximate such functions with the help of SSE/AVX instructions is the argument against implementing special functions in hardware.

> Talking of validation, we don't need another fdiv bug...

Indeed, I was actually thinking of it when mentioning validation. It would be really bad to delay or recall new CPUs due to a hard-to-catch microcode bug in an instruction used by 0.0001 % of the code base.

**@bronxzv**

I would like to ask you: in which static library is VML implemented? I am searching this directory on my computer: C:\Program Files\Intel\ComposerXE-2011\mkl\lib\ia32

There are many .lib files.

"

3. LICENSE RESTRICTIONS:

[...]

B. You may NOT: [...] (v) reverse engineer, decompile, or disassemble the Materials;

"

> reverse engineer, decompile, or disassemble the Materials

Sorry, I did not know this.

The MKL tgamma result for 1000 randomly chosen double values is 123 cycles, very close to my results. It would be interesting to know which approximation they used.

They were also able to achieve an accuracy of 0.5 ulp even on the problematic range [0.0001, 1.0]; maybe they used the Lanczos approximation?
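For reference, a Lanczos approximation of Gamma can be sketched in a few lines of Python. This is only an illustration of the technique, not a claim about what MKL actually does. It uses the commonly published coefficient set for g = 7 with nine terms, plus the reflection formula for arguments below 0.5. The name `lanczos_gamma` is hypothetical, and the sketch does not handle the poles at zero and the negative integers.

```python
import math

# Commonly published Lanczos coefficients for g = 7, n = 9.
LANCZOS_G = 7
LANCZOS_COEFFS = [
    0.99999999999980993,
    676.5203681218851,
    -1259.1392167224028,
    771.32342877765313,
    -176.61502916214059,
    12.507343278686905,
    -0.13857109526572012,
    9.9843695780195716e-6,
    1.5056327351493116e-7,
]


def lanczos_gamma(z):
    """Approximate Gamma(z) for real z (poles at 0, -1, -2, ... not handled)."""
    if z < 0.5:
        # Reflection formula: Gamma(z) * Gamma(1 - z) = pi / sin(pi * z)
        return math.pi / (math.sin(math.pi * z) * lanczos_gamma(1.0 - z))
    z -= 1.0
    series = LANCZOS_COEFFS[0]
    for i in range(1, len(LANCZOS_COEFFS)):
        series += LANCZOS_COEFFS[i] / (z + i)
    t = z + LANCZOS_G + 0.5
    # Note the dependency on pow and exp mentioned earlier in the thread.
    return math.sqrt(2.0 * math.pi) * t ** (z + 0.5) * math.exp(-t) * series
```

This form still calls pow and exp once per value, which is the dependency a pure minimax polynomial on a restricted interval would avoid.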

The poster with the best knowledge of MKL here is TimP, AFAIK.

> ...I would like to ask Intel's employees on this forum. Why Intel CPU architects have never implemented in hardware some of the more "popular" Special Functions like 'GAMMA', 'BETA' and various 'BESSEL' functions of an integer order. All these functions could have been accessed by x87 ISA instructions.

I don't think that Intel will add such a set of instructions. These functions are "special"; they are not "fundamental".

Intel has made a clear statement: "Use SSE or AVX to achieve the best possible performance..."

Usually companies, big or small, need to balance:

What do markets demand?

What do competitors do?

What do some customers want or expect?

And what is going on now? There is a growing demand for more powerful and energy-efficient CPUs to run bigger (!) versions of various "mobile" and "desktop" OSs.

Iliya, you mentioned a couple of times that some function calculates a result in ~120 clock cycles. Let's put that on the LEFT side of a "Magic Scale". Let's assume that some big company added hardware support for that special function to its CPU, which allows it to get the same result in ~60 clock cycles. We put that on the RIGHT side of our "Magic Scale". But that is not everything: a cost, something like $500,000,000 USD, needs to be added on the RIGHT side as well. This is because the company needs to complete R&D, testing, verification and various production-related tasks, and during this time salaries must be paid.

Personally, I would be glad to see hardware-accelerated matrix multiplication for matrices with sizes up to 1,024x1,024.

Best regards,

Sergey
