Hardware acceleration of Special Functions. - Page 3

Bernard · ‎06-30-2012

Hi!
I would like to ask Intel's employees on this forum: why have Intel CPU architects never implemented in hardware some of the more "popular" special functions, such as gamma, beta, and the various Bessel functions of integer order? All of these functions could have been exposed through x87 ISA instructions.

Bernard · ‎07-10-2012

on a quad core Sandy Bridge it will be something like 30M-60M 100 samples polygons per second (i.e. 1000x more) with a dumb Z-buffer algorithm (CPU only using all cores and fully vectorized AVX code)

It is simply amazing how much CPU processing power has increased over a period of 20 years.

>>here is an example with 49M polygons with per sample normal interpolation and reflection mapping

Looking at the Robots surface, what interpolation was used to calculate the samples?
Was it bicubic interpolation? It is costly, but it can give significantly smoother surface colour transitions.

bronxzv · ‎07-10-2012

Looking at the Robots surface, what interpolation was used to calculate the samples?
Was it bicubic interpolation?

normals are bilinearly interpolated in world space and the reflection map is bilinearly interpolated in texture space; bicubic interpolation is useful mostly for texture magnification but will be overkill here IMHO
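
For readers who want the term spelled out, here is a minimal sketch of bilinear interpolation over a small 2D grid of scalar samples. It is an illustration only, not code from the demo: the function name `bilinear` and the list-of-rows `grid` layout are assumptions, and a real renderer would interpolate normal vectors or texels rather than single floats.

```python
import math

def bilinear(grid, x, y):
    """Bilinearly interpolate `grid` (a list of rows) at continuous (x, y)."""
    height, width = len(grid), len(grid[0])
    # Clamp the lookup position to the valid sample range.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0  # fractional position inside the cell
    # Two linear interpolations along x, then one along y.
    top = grid[y0][x0] * (1.0 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1.0 - fx) + grid[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy
```

Only the four nearest samples are read per lookup, which is why it is so much cheaper than bicubic interpolation (16 samples per lookup).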

Bernard · ‎07-10-2012

normals are bilinearly interpolated in world space and the reflection map is bilinearly interpolated in texture space; bicubic interpolation is useful mostly for texture magnification but will be overkill here IMHO

Still, the demo programmers were able to achieve smooth transitions. Maybe the use of bilinear interpolation is compensated for by the high level of detail and high-frequency sampling?

SergeyKostrov · ‎07-10-2012

Quoting iliyapolak

yes, in other words no FPU

...
Not so efficient for real-time applications that rely heavily on FP instructions.

The 29K family microcontrollers were designed for embedded systems such as laser printers, scanners, and X terminals, and
these microcontrollers don't have an FPU, in order to reduce the cost of system integration. Even if FP instructions
on these microcontrollers cause "lightweight" traps, only ~3 clock cycles are needed to complete a vector fetch.

bronxzv · ‎07-10-2012

Still, the demo programmers were able to achieve smooth transitions. Maybe the use of bilinear interpolation is compensated for by the high level of detail and high-frequency sampling?

yes bilinear interpolation is fine for reflection maps since there is generally no magnification but a very high frequency sampling in texture space instead (due to the wild variation of normal directions); the sampling scheme is thus paramount for good quality: adaptive stochastic antialiasing in this example when you don't move the mouse
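
The idea behind adaptive stochastic sampling can be sketched in one dimension as follows. This is a hedged illustration, not the demo's actual scheme: the function name `adaptive_sample`, the contrast test, and the sample counts (`base`, `extra`, `threshold`) are all assumptions chosen for the example.

```python
import random

def adaptive_sample(signal, x0, x1, rng, base=4, extra=28, threshold=0.1):
    """Estimate the mean of `signal` over [x0, x1) with jittered samples.

    A few stratified random samples are taken first; more are added only
    when those samples disagree, i.e. where the signal varies quickly.
    Returns (estimated mean, number of samples used).
    """
    def jittered(count):
        step = (x1 - x0) / count
        # One random position inside each of `count` equal strata.
        return [signal(x0 + (i + rng.random()) * step) for i in range(count)]

    values = jittered(base)
    if max(values) - min(values) > threshold:  # high local contrast
        values += jittered(extra)
    return sum(values) / len(values), len(values)
```

Flat regions cost only `base` samples, while regions containing an edge get `base + extra`, which is where the savings in computation and bandwidth come from.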

bronxzv · ‎07-10-2012

integration. Even if FP-instructions on these microcontrollers cause "lightweight" Traps only ~3 clock cycles are needed to complete a vector fetch.

what was a "vector fetch" on such an ancient purely scalar chip?

btw, do you know how many cycles were required for emulating basic fp instructions like FADD and FMUL? FMUL was particularly slow due to the lack of integer multiplier AFAIK

Bernard · ‎07-11-2012

adaptive stochastic antialiasing in this example when you don't move the mouse

Adaptive stochastic antialiasing is very good at minimizing computational cost and memory bandwidth, but at the cost of the irregular sampling pattern introduced by the random (stochastic) sampling.
What is the sampling filter used in the Robots demo?
Is it a simple box filter or a sinc filter?

Bernard · ‎07-11-2012

29K family microcontrollers designed for embedded systems, like laser printers, scanners, X terminals and

For intensive floating-point applications, a better option is to use Analog Devices SHARC processors.
But even these DSPs do not have special functions implemented directly in hardware.
I think we can conclude that no general-purpose DSP implements such functions in hardware; these processors use software libraries instead.
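
As an illustration of the software-library route, the sketch below evaluates the three function families named at the start of the thread using only Python's standard `math` module. The helper names `beta` and `bessel_j` are assumptions made for this example, and the truncated power series for the Bessel function is a textbook formula that is adequate only for small arguments; it is not how any vendor library is known to implement it.

```python
import math

def beta(a, b):
    """Beta function B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b).

    Computed through log-gamma so that large arguments do not overflow.
    """
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def bessel_j(n, x, terms=30):
    """Bessel function of the first kind J_n(x) for integer order n >= 0.

    Uses the power series sum_k (-1)^k / (k! (n+k)!) * (x/2)^(2k+n),
    truncated after `terms` terms (suitable for small |x| only).
    """
    total = 0.0
    for k in range(terms):
        coeff = (-1) ** k / (math.factorial(k) * math.factorial(n + k))
        total += coeff * (x / 2.0) ** (2 * k + n)
    return total
```

For example, `math.gamma(5)` gives 24.0 (that is, 4!), and `beta(2, 3)` is approximately 1/12.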

bronxzv · ‎07-11-2012

What is the sampling filter used in the Robots demo?
Is it a simple box filter or a sinc filter?

the reconstruction filter is a box in this example, it's generally the best filter for low resolution raster images since other filters such as Gaussian and raised cosine lead to too much blurring and 2-3 lobes Lanczos too much ringing (note that we have these alternate reconstruction filters available with a user selectable filter radius)

NB: sinc is a theoretical filter (its support is infinite), not something you can use in practice with a real-world FIR filter kernel
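
To make the comparison concrete, here is a minimal 1D sketch of the filter kernels mentioned above and of a weighted reconstruction from scattered samples. The function names, the choice `sigma = radius / 2` for the Gaussian, and the `reconstruct` helper are assumptions for illustration; the closed-source renderer discussed here may define them differently.

```python
import math

def box(d, radius):
    """Box filter: every sample within `radius` counts equally."""
    return 1.0 if abs(d) <= radius else 0.0

def gaussian(d, radius):
    """Truncated Gaussian; sigma = radius / 2 is an arbitrary choice."""
    if abs(d) > radius:
        return 0.0
    sigma = radius / 2.0
    return math.exp(-0.5 * (d / sigma) ** 2)

def raised_cosine(d, radius):
    """Raised cosine: 1 at the center, falling smoothly to 0 at `radius`."""
    if abs(d) > radius:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * d / radius))

def lanczos(d, lobes):
    """Lanczos kernel sinc(d) * sinc(d / lobes), zero outside |d| < lobes."""
    if d == 0.0:
        return 1.0
    if abs(d) >= lobes:
        return 0.0
    x = math.pi * d
    return lobes * math.sin(x) * math.sin(x / lobes) / (x * x)

def reconstruct(samples, center, weight):
    """Weighted average of (position, value) samples around `center`."""
    total_weight = 0.0
    total = 0.0
    for position, value in samples:
        w = weight(position - center)
        total_weight += w
        total += w * value
    return total / total_weight if total_weight != 0.0 else 0.0
```

The Lanczos kernel takes negative values between its lobes (for instance at `d = 1.5` with `lobes = 2`), which is the source of the ringing mentioned above. A pixel value with a box of radius 0.5 would be obtained with `reconstruct(samples, 0.0, lambda d: box(d, 0.5))`.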

Bernard · ‎07-11-2012

A Gaussian filter with random sampling could lead to better results than a box filter. The Gaussian falloff follows subtle changes in brightness (colour) more closely, more like real-world varying fields of colour. The problem with the stochastic approach will be choosing the blur radius properly, since it cannot take into account jagged pixels (areas where aliasing occurs) because of the randomness of the sampling.

bronxzv · ‎07-11-2012

A Gaussian filter with random sampling could lead to better results than a box filter.

at low sampling frequency yes (though I'd prefer raised cosine or Lanczos 2 over Gaussian), but with adaptive sampling you have a high sampling frequency in high frequency signal areas, thus the local radius of your reconstruction filter must be very small to avoid excessive blurring, and in this case you will miss the samples farthest from the pixel center within the pixel area

pixels on your screen are rectangular areas after all, so a box reconstruction filter makes more sense in practice (with supersampling) than some theoretical texts may let you think when reasoning about only discrete reconstruction samples

anyway this is user selectable, and, since you have to ask, the default look is probably not that bad

Bernard · ‎07-11-2012

@bronxzv Slightly off-topic question. In one of your posts you mentioned that the Kribi project has 500 timers based on the rdtsc instruction. Is it possible to post the proper usage of the rdtsc instruction in those timers? It could be posted as a code template. Best regards, Iliya

bronxzv · ‎07-11-2012

@bronxzv Slightly off-topic question. In one of your posts you mentioned that the Kribi project has 500 timers based on the rdtsc instruction. Is it possible to post the proper usage of the rdtsc instruction in those timers? It could be posted as a code template. Best regards, Iliya

IIRC this was mentioned in a private post

unfortunately this is a closed source framework so I'm not allowed to post source code from it

the key advantage is that it makes it easy to install nested stopwatches in our source code with a simple (single line) notation, then after profile runs it reports detailed number of cycles and % in a nicely formatted report, for example with indentation for inner timings

for the actual usage of RDTSC it's simply following the advice from the paper I posted the other day, nothing special or innovative there
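
Since the original framework is closed source, the following is only a sketch of the nested-stopwatch idea described above, not a reconstruction of it. It makes several assumptions: the names `Profiler`, `timer`, and `report` are invented for the example, and `time.perf_counter_ns` stands in for RDTSC, which cannot be issued directly from Python, so the reported unit is nanoseconds ("ticks") rather than CPU cycles.

```python
import time
from contextlib import contextmanager

class Profiler:
    """Nested stopwatches with a one-line `with profiler.timer(name):` notation."""

    def __init__(self, clock=time.perf_counter_ns):
        self.clock = clock   # tick source; a cycle counter in native code
        self.records = []    # [name, nesting depth, elapsed ticks], in start order
        self.depth = 0

    @contextmanager
    def timer(self, name):
        record = [name, self.depth, 0]
        self.records.append(record)
        self.depth += 1
        start = self.clock()
        try:
            yield
        finally:
            record[2] = self.clock() - start
            self.depth -= 1

    def report(self):
        """One line per timer, indented by nesting depth, with % of top-level time."""
        total = sum(ticks for _, depth, ticks in self.records if depth == 0) or 1
        lines = []
        for name, depth, ticks in self.records:
            percent = 100.0 * ticks / total
            lines.append(f"{'  ' * depth}{name}: {ticks} ticks ({percent:.1f}%)")
        return "\n".join(lines)
```

Usage would look like `with profiler.timer("frame"):` wrapping an inner `with profiler.timer("shade"):` block, followed by `print(profiler.report())`. The clock is injectable so that the tick source can be swapped without touching the timing logic.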

SergeyKostrov · ‎07-11-2012

Quoting bronxzv

integration. Even if FP-instructions on these microcontrollers cause "lightweight" Traps only ~3 clock cycles are needed to complete a vector fetch.

what was a "vector fetch" on such an ancient purely scalar chip?..

In other words, every time an interrupt or trap occurs, the address of some routine has to be obtained
from a 256-entry vector table.
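
A vector table in this sense can be sketched as a fixed-size array of handler references indexed by the interrupt or trap number. The model below is a software analogy only; the names `install` and `dispatch` and the vector number 42 in the usage note are arbitrary assumptions, not values from the 29K manual.

```python
VECTOR_COUNT = 256

def default_handler(vector):
    """Fallback routine for vectors with no installed handler."""
    return f"unhandled vector {vector}"

# One entry per vector number; each entry refers to a handler routine.
vector_table = [default_handler] * VECTOR_COUNT

def install(vector, handler):
    """Point a table entry at a new handler routine."""
    vector_table[vector] = handler

def dispatch(vector):
    """Model of the "vector fetch": one indexed load, then a jump to the handler."""
    handler = vector_table[vector]
    return handler(vector)
```

For example, after `install(42, lambda v: "emulate FMUL")`, calling `dispatch(42)` runs the emulation routine. The indexed load is the cheap part; the time spent inside the handler is separate.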

>>...do you know how many cycles were required for emulating basic fp instructions like FADD and FMUL?..

No. I just checked a 29K family User's Manual and I have not found any technical details regarding "number of cycles to execute an instruction".

Best regards,
Sergey

Bernard · ‎07-11-2012

In other words, every time an interrupt or trap occurs, the address of some routine has to be obtained
from a 256-entry vector table

Why is it called a "vector table"? It is simply a data structure that holds scalar values, not vector values.

bronxzv · ‎07-12-2012

In other words, every time an interrupt or trap occurs, the address of some routine has to be obtained
from a 256-entry vector table.

so it's only (part of?) the time to branch to the trap handler, it tells us nothing about the speed of the actual routine

No. I just checked a 29K family User's Manual and I have not found any technical details regarding "number of cycles to execute an instruction".

it was probably something like 50-100 cycles for FP32 FMUL and FADD (based on my past experience writing FP emulation routines)
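
To show why emulation costs that many instructions, here is a hedged sketch of an FP32 multiply done purely with integer operations on the IEEE 754 bit patterns. It is not 29K code and makes simplifying assumptions: only normal operands and zero are handled (no subnormals, infinities, NaNs, overflow, or underflow), rounding is round-to-nearest-even, and the names `f32_bits`, `bits_f32`, and `soft_fmul` are invented for the example.

```python
import struct

def f32_bits(x):
    """Bit pattern of x rounded to IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(bits):
    """Float value of a 32-bit IEEE 754 single-precision pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def soft_fmul(a_bits, b_bits):
    """Multiply two FP32 bit patterns using integer arithmetic only."""
    sign = (a_bits ^ b_bits) & 0x80000000
    if (a_bits & 0x7FFFFFFF) == 0 or (b_bits & 0x7FFFFFFF) == 0:
        return sign  # a zero operand gives a signed zero
    # Unpack biased exponents and restore the implicit leading 1 bit.
    exponent = ((a_bits >> 23) & 0xFF) + ((b_bits >> 23) & 0xFF) - 127
    man_a = (a_bits & 0x7FFFFF) | 0x800000
    man_b = (b_bits & 0x7FFFFF) | 0x800000
    product = man_a * man_b  # 24 x 24 bits -> 47 or 48 significant bits
    # Normalize: keep 24 significant bits, bump the exponent if needed.
    if product & (1 << 47):
        shift = 24
        exponent += 1
    else:
        shift = 23
    mantissa = product >> shift
    remainder = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    # Round to nearest, ties to even.
    if remainder > half or (remainder == half and (mantissa & 1)):
        mantissa += 1
        if mantissa == 1 << 24:  # rounding carried out of the mantissa
            mantissa >>= 1
            exponent += 1
    return sign | (exponent << 23) | (mantissa & 0x7FFFFF)
```

For example, `bits_f32(soft_fmul(f32_bits(1.5), f32_bits(2.5)))` gives 3.75. On a chip without an integer multiplier, the single `man_a * man_b` line above would itself expand into a shift-and-add loop, which is consistent with FMUL being the slow case.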

bronxzv · ‎07-12-2012

http://en.wikipedia.org/wiki/Interrupt_vector

Bernard · ‎07-12-2012

http://en.wikipedia.org/wiki/Interrupt_vector

Yes I know this.
My question was slightly different. Members of the IDT (IVT in DOS) are addresses, i.e. a single binary number representing an address in memory. Judging by the definition of a vector, each IDT entry should have been composed of several values (addresses), but this is not the case.
I do not know why Intel decided to call it a vector.

bronxzv · ‎07-12-2012

My question was slightly different. Members of the IDT (IVT in DOS) are addresses, i.e. a single binary number representing an address in memory. Judging by the definition of a vector, each IDT entry should have been composed of several values (addresses), but this is not the case.

ah I see what you mean, I have no idea why it's called a vector, I'd consider the whole table as a vector but not each individual address as you said, unlike the common usage

Bernard · ‎07-12-2012

It could be a vector when you consider the whole IDT.
Every IDT "vector" points to an 8-byte descriptor, which itself could be represented as a vector composed of various fields.

SergeyKostrov · ‎07-12-2012

Quoting iliyapolak

In other words, every time an interrupt or trap occurs, the address of some routine has to be obtained
from a 256-entry vector table
Why is it called a "vector table"...

I think AMD uses the term "Vector Table" because Intel calls a similar structure an "Interrupt Descriptor Table".
It looks like this is a "War of Terms", and the same applies to Oracle and Informix, etc.