Intel® ISA Extensions

Hardware acceleration of Special Functions.

Bernard
Hi!
I would like to ask Intel's employees on this forum: why have Intel CPU architects never implemented in hardware some of the more "popular" special functions, such as 'GAMMA', 'BETA' and the various 'BESSEL' functions of integer order? All of these functions could have been exposed as x87 ISA instructions.

Bernard

>>on a quad core Sandy Bridge it will be something like 30M-60M 100 samples polygons per second (i.e. 1000x more) with a dumb Z-buffer algorithm (CPU only using all cores and fully vectorized AVX code)

It is simply amazing how much CPU processing power has increased over a period of 20 years.

>>here is an example with 49M polygons with per sample normal interpolation and reflection mapping

Looking at the surfaces in the Robots demo, what interpolation was used to calculate the samples?
Was it bicubic interpolation? It is costly, but it can give significantly smoother surface colour transitions.

bronxzv

>>Looking at the surfaces in the Robots demo, what interpolation was used to calculate the samples?
>>Was it bicubic interpolation


normals are bilinearly interpolated in world space and the reflection map is bilinearly interpolated in texture space; bicubic interpolation is useful mostly for texture magnification but would be overkill here IMHO
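For readers unfamiliar with the term, here is a minimal sketch of bilinear interpolation on a small 2D grid. It is an illustration only, not code from the demo; the names `lerp` and `sample_bilinear` and the list-of-rows texture layout are assumptions made for this example.

```python
import math

def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t

def sample_bilinear(texture, x, y):
    """Bilinearly sample a 2D grid (list of rows) at continuous texel coordinates."""
    height, width = len(texture), len(texture[0])
    # Clamp the coordinates to the grid so border lookups stay valid.
    x = min(max(x, 0.0), width - 1)
    y = min(max(y, 0.0), height - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    u, v = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = lerp(texture[y0][x0], texture[y0][x1], u)
    bottom = lerp(texture[y1][x0], texture[y1][x1], u)
    return lerp(top, bottom, v)
```

Each lookup reads 4 texels, whereas a bicubic kernel reads 16, which is the cost difference behind the "overkill" remark.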

Bernard

>>normals are bilinearly interpolated in world space and the reflection map is bilinearly interpolated in texture space, bicubic interpolation is useful mostly for texture magnification but will be overkill here IMHO

Still, the demo programmers were able to achieve smooth transitions. Maybe the use of bilinear interpolation is compensated for by the high level of detail and high-frequency sampling?

SergeyKostrov
Quoting iliyapolak

yes, in other words no FPU

...
Not so efficient for the real-time applications based on heavy usage of fp instructions.


The 29K family microcontrollers were designed for embedded systems such as laser printers, scanners and X terminals, and
these microcontrollers don't have an FPU in order to reduce the cost of system integration. Even though FP instructions
on these microcontrollers cause "lightweight" traps, only ~3 clock cycles are needed to complete a vector fetch.


bronxzv

>>Still, the demo programmers were able to achieve smooth transitions. Maybe the use of bilinear interpolation is compensated for by the high level of detail and high-frequency sampling?

yes, bilinear interpolation is fine for reflection maps since there is generally no magnification but instead a very high sampling frequency in texture space (due to the wild variation of normal directions); the sampling scheme is thus paramount for good quality, and this example uses adaptive stochastic antialiasing when you don't move the mouse
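As a rough illustration of adaptive stochastic antialiasing, the sketch below takes jittered samples inside one pixel and keeps sampling until the estimated error of the mean is small. It is not the demo's algorithm: the function name, the stopping rule (standard error of the mean below `tolerance`) and all default values are assumptions chosen for this example.

```python
import random

def adaptive_stochastic_pixel(shade, px, py, rng, min_samples=4,
                              max_samples=64, tolerance=0.01):
    """Average shade(x, y) over pixel (px, py) with an adaptive sample count.

    rng is any object with a random() method returning floats in [0, 1).
    Returns (pixel_value, number_of_samples_used).
    """
    samples = []
    while len(samples) < max_samples:
        # Stochastic part: a random position inside the pixel.
        x = px + rng.random()
        y = py + rng.random()
        samples.append(shade(x, y))
        n = len(samples)
        # Adaptive part: after each batch, stop once the samples agree.
        if n >= min_samples and n % min_samples == 0:
            mean = sum(samples) / n
            variance = sum((s - mean) ** 2 for s in samples) / (n - 1)
            if variance / n <= tolerance ** 2:
                break
    return sum(samples) / len(samples), len(samples)

# A flat region converges after the first batch of samples.
flat_value, flat_count = adaptive_stochastic_pixel(
    lambda x, y: 0.5, 3, 7, random.Random(1))
```

Flat regions stop after the first batch, while pixels that straddle an edge or a rapidly varying reflection keep sampling up to `max_samples`, so the effort is spent where the signal has high frequencies.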

bronxzv

>>integration. Even if FP-instructions on these microcontrollers cause "lightweight" Traps only ~3 clock cycles are needed to complete a vector fetch.

what was a "vector fetch" on such an ancient purely scalar chip?

btw, do you know how many cycles were required for emulating basic fp instructions like FADD and FMUL? FMUL was particularly slow due to the lack of an integer multiplier AFAIK

Bernard

>>adaptive stochastic antialiasing in this example when you don't move the mouse

Adaptive stochastic antialiasing is very good at minimizing computational cost and memory bandwidth, but at the cost of an irregular sampling pattern introduced by the random (stochastic) sampling.
What sampling filter is used in the Robots demo?
Is it a simple box filter or a sinc filter?

Bernard

>>29K family microcontrollers designed for embedded systems, like laser printers, scanners, X terminals and

For floating-point-intensive applications the better option is to use Analog Devices SHARC processors.
But even these DSPs do not have special functions implemented directly in hardware.
I think we can conclude that no general-purpose DSP implements such functions in hardware and that microprocessors use software libraries instead.

bronxzv

>>What is the sampling filter used in the Robots demo?
>>Is this simple box filter or sinc filter?

the reconstruction filter is a box in this example, it's generally the best filter for low resolution raster images since other filters such as Gaussian and raised cosine lead to too much blurring and 2-3 lobe Lanczos to too much ringing (note that we have these alternate reconstruction filters available with a user-selectable filter radius)

NB: sinc is a theoretical filter, not something you can use in practice with a real-world FIR filter kernel
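To make the box versus Gaussian comparison concrete, here is a small sketch of reconstructing one pixel value from scattered samples as a weighted average. The sample layout (offsets from the pixel centre in pixel units), the function names and the default radius and sigma are assumptions for illustration, not the demo's actual filters.

```python
import math

def box_weight(dx, dy, radius=0.5):
    """Box filter: every sample inside the square footprint counts equally."""
    return 1.0 if abs(dx) <= radius and abs(dy) <= radius else 0.0

def gaussian_weight(dx, dy, sigma=0.5):
    """Gaussian filter: weight falls off smoothly with distance from the centre."""
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

def reconstruct(samples, weight):
    """Weighted average of (dx, dy, value) samples around a pixel centre."""
    total = sum(weight(dx, dy) for dx, dy, _ in samples)
    if total == 0.0:
        return 0.0
    return sum(weight(dx, dy) * value for dx, dy, value in samples) / total
```

With a radius of 0.5 the box covers exactly the pixel area and weights all samples inside it equally, while a Gaussian emphasises samples near the centre and also picks up samples from neighbouring pixels unless its sigma is made small.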

Bernard
A Gaussian filter with random sampling could lead to better results than a box filter. The falloff of the Gaussian curve reproduces subtle changes in brightness (colour) better, more like the varying fields of colour in the real world. The problem with the stochastic approach is choosing the blur radius properly: because the sampling is random, the radius cannot take into account the jagged pixels (the areas where aliasing occurs).

bronxzv

>>Gaussian filter in random sampling could lead to the better results than box filter.

at low sampling frequency yes (though I'd prefer raised cosine or Lanczos 2 over Gaussian), but with adaptive sampling you have a high sampling frequency in high frequency signal areas, thus the local radius of your reconstruction filter must be very small to avoid excessive blurring, and in this case you will miss the samples farthest from the pixel center within the pixel area

pixels on your screen are rectangular areas after all, so a box reconstruction filter makes more sense in practice (with supersampling) than some theoretical texts may let you think when reasoning about only discrete reconstruction samples

anyway this is user selectable, and, since you have to ask, the default look is probably not that bad

Bernard
@bronxzv A slightly off-topic question. In one of your posts you mentioned that the Kribi project has 500 timers based on the rdtsc instruction. Is it possible to post the proper usage of the rdtsc instruction in those timers? It could be posted as a code template. Best regards, Iliya

bronxzv

>>@bronxzv A slightly off-topic question. In one of your posts you mentioned that the Kribi project has 500 timers based on the rdtsc instruction. Is it possible to post the proper usage of the rdtsc instruction in those timers? It could be posted as a code template. Best regards, Iliya

IIRC this was mentioned in a private post

unfortunately this is a closed source framework so I'm not allowed to post source code from it

the key advantage is that it makes it easy to install nested stopwatches in our source code with a simple (single line) notation, then after profile runs it reports the detailed number of cycles and % in a nicely formatted report, for example with indentation for inner timings

for the actual usage of RDTSC it's simply following the advice from the paper I posted the other day, nothing special or innovative there
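Since the framework itself is closed source, the following is only a guess at the general shape of such a facility: nested stopwatches installed with a single line, plus an indented report with percentages. Everything here is hypothetical (the `Profiler` class, its method names and the report format), and because Python cannot issue RDTSC directly, `time.perf_counter_ns` stands in for the cycle counter.

```python
import time
from contextlib import contextmanager

class Profiler:
    """Hypothetical nested-stopwatch profiler (not the Kribi implementation)."""

    def __init__(self, clock=time.perf_counter_ns):
        self.clock = clock      # stand-in for an RDTSC read
        self.stack = []         # names of the currently open stopwatches
        self.totals = {}        # path tuple -> [ticks, calls]

    @contextmanager
    def stopwatch(self, name):
        """Single-line notation: with profiler.stopwatch("name"): ..."""
        self.stack.append(name)
        path = tuple(self.stack)
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            entry = self.totals.setdefault(path, [0, 0])
            entry[0] += elapsed
            entry[1] += 1
            self.stack.pop()

    def report(self):
        """Return one line per stopwatch, indented by nesting depth."""
        root = sum(t for p, (t, _) in self.totals.items() if len(p) == 1) or 1
        lines = []
        for path in sorted(self.totals):  # parents sort before their children
            ticks, calls = self.totals[path]
            indent = "  " * (len(path) - 1)
            pct = 100.0 * ticks / root
            lines.append(f"{indent}{path[-1]}: {ticks} ticks ({pct:.1f}%), calls={calls}")
        return "\n".join(lines)
```

Wrapping a frame loop in `with profiler.stopwatch("frame"):` and its inner stages in further stopwatches yields a report where each inner timing appears indented under its parent, with its share of the total.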

SergeyKostrov
Quoting bronxzv

>>integration. Even if FP-instructions on these microcontrollers cause "lightweight" Traps only ~3 clock cycles are needed to complete a vector fetch.

>>what was a "vector fetch" on such an ancient purely scalar chip?..

In other words, every time an interrupt or trap occurs, the address of a handler routine has to be obtained
from a 256-entry vector table.
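The lookup described above can be sketched as a 256-entry table of handler references indexed by the trap number. This is a simplified model, not 29K code: the class, the handler signature and the vector number used in the usage note are assumptions for illustration.

```python
VECTOR_COUNT = 256  # table size mentioned in the post

def default_handler(vector):
    """Fallback used for vectors with no installed routine."""
    return f"unhandled trap {vector}"

class VectorTable:
    """Simplified model of an interrupt/trap vector table."""

    def __init__(self):
        # One entry per vector; each entry refers to a handler routine.
        self.entries = [default_handler] * VECTOR_COUNT

    def install(self, vector, handler):
        if not 0 <= vector < VECTOR_COUNT:
            raise ValueError("vector number out of range")
        self.entries[vector] = handler

    def dispatch(self, vector):
        # The "vector fetch": a single indexed read of the handler address.
        handler = self.entries[vector]
        # The cost of running the handler is separate from the fetch.
        return handler(vector)
```

For example, `table.install(22, emulate_fmul)` followed by `table.dispatch(22)` would route a trap to an emulation routine, where both the number 22 and `emulate_fmul` are made up for this example. The fetch is only the table read; the time spent inside the handler is a separate cost.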

>>...do you know how many cycles were required for emulating basic fp instructions like FADD and FMUL?..

No. I just checked a 29K family User's Manual and did not find any technical details regarding the "number of cycles to execute an instruction".

Best regards,
Sergey

Bernard

>>In other words, every time an interrupt or trap occurs, the address of a handler routine has to be obtained
>>from a 256-entry vector table

Why is it called a "vector table"? It is simply a data structure that holds scalar values, not vector values.

bronxzv

>>In other words, every time an interrupt or trap occurs, the address of a handler routine has to be obtained
>>from a 256-entry vector table.

so it's only (part of?) the time to branch to the trap handler, it tells us nothing about the speed of the actual routine

>>No. I just checked a 29K family User's Manual and I have not found any technical details regarding "number of cycles to execute an instruction".

it was probably something like 50-100 cycles for FP32 FMUL and FADD (based on my past experience writing FP emulation routines)
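To give a feel for where those cycles go, here is a simplified sketch of an FP32 multiply done with integer operations only, using a shift-and-add loop in place of a hardware integer multiplier. It is not the 29K emulation routine: it assumes normal, finite inputs and a normal result (no zero, denormal, infinity, NaN, overflow or underflow handling), and all function names are made up for this example.

```python
import struct

def float_to_bits(x):
    """Return the IEEE 754 binary32 bit pattern of x as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits):
    """Return the float whose binary32 bit pattern is bits."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def shift_add_mul(x, y):
    """Integer multiply without a multiplier: one add per set bit of y."""
    acc = 0
    while y:
        if y & 1:
            acc += x
        x <<= 1
        y >>= 1
    return acc

def soft_fmul(a_bits, b_bits):
    """Multiply two binary32 values (normal inputs and result only)."""
    sign = (a_bits ^ b_bits) & 0x80000000
    exp = ((a_bits >> 23) & 0xFF) + ((b_bits >> 23) & 0xFF) - 127
    # Restore the implicit leading 1 to get 24-bit significands.
    man_a = (a_bits & 0x7FFFFF) | 0x800000
    man_b = (b_bits & 0x7FFFFF) | 0x800000
    product = shift_add_mul(man_a, man_b)   # 47 or 48 significant bits
    shift = 23
    if product & (1 << 47):                 # product of significands >= 2.0
        shift = 24
        exp += 1
    mant = product >> shift
    remainder = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    # Round to nearest, ties to even.
    if remainder > half or (remainder == half and (mant & 1)):
        mant += 1
        if mant == 1 << 24:                 # rounding carried out of 24 bits
            mant >>= 1
            exp += 1
    return sign | (exp << 23) | (mant & 0x7FFFFF)
```

The significand loop alone runs up to 24 iterations, each with a test, an add and two shifts, which is consistent with FMUL being the slow case on a chip without an integer multiplier.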

bronxzv
http://en.wikipedia.org/wiki/Interrupt_vector

Bernard

>>http://en.wikipedia.org/wiki/Interrupt_vector

Yes, I know this.
My question was slightly different. The members of the IDT (IVT in DOS) are addresses, i.e. each is a single binary number representing an address in memory. Judging by the definition of a vector, each IDT entry should have been composed of several values (addresses), but this is not the case.
I do not know why Intel decided to call it a vector.

bronxzv

>>My question was slightly different. The members of the IDT (IVT in DOS) are addresses, i.e. each is a single binary number representing an address in memory. Judging by the definition of a vector, each IDT entry should have been composed of several values (addresses), but this is not the case.

ah I see what you mean, I have no idea why it's called a vector; I'd consider the whole table a vector but not each individual address as you said, unlike the common usage

Bernard
It could be a vector when you consider the whole IDT.
Every IDT "vector" points to an 8-byte descriptor, which itself could be represented as a vector composed of various fields.

SergeyKostrov
Quoting iliyapolak

>>In other words, every time an interrupt or trap occurs, the address of a handler routine has to be obtained
>>from a 256-entry vector table

>>Why it is called "vector table"...

I think AMD uses the term "Vector Table" because Intel calls a similar structure an "Interrupt Descriptor Table".
It looks like this is a "War of Terms", and the same applies to Oracle and Informix, etc.