DSPBA Floating Point Function Command Line Generation

CmdPolyEval

cmdPolyEval is a command-line tool that generates math functions. It offers an extended library of floating-point functions, together with a restricted set of fixed-point functions. It can be found in quartus\dspba\backend\<platform>.

Type cmdPolyEval to get the latest version of the information below.

Example:

cmdPolyEval.exe -correctRounding -target StratixV -speedgrade 2 -frequency 250 -name myModel FPDiv 8 23 0

Where:

-correctRounding

Produces a correctly rounded result (IEEE-754 compliant) if the operator supports it. If this option is not passed, the result is faithfully rounded.

(Faithful rounding uses slightly fewer resources.)

-target StratixV -speedgrade 2 -frequency 250

Targets the StratixV device, speed grade 2, with pipelining to achieve 250 MHz. These options should be passed in this order.

-name myModel

This option should come before the function to be generated.

FPDiv 8 23 0

Generates a floating-point divider with 8 exponent bits and 23 mantissa bits, using version 0, the polynomial approximation (see FPDiv under Supported components below).
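
When the tool is driven from a script, the ordering constraints above can be captured in one place. The following is a minimal sketch in Python, not part of cmdPolyEval: the helper name build_cmd and its parameters are hypothetical, and it only assembles the argument list without running the tool.

```python
def build_cmd(function, params, name, target, speedgrade, frequency,
              correct_rounding=False, exe="cmdPolyEval"):
    """Assemble a cmdPolyEval argument list in the order described above."""
    args = [exe]
    if correct_rounding:
        args.append("-correctRounding")
    # -target, -speedgrade and -frequency are passed in this order.
    args += ["-target", target, "-speedgrade", str(speedgrade),
             "-frequency", str(frequency)]
    # -name comes before the function to be generated.
    args += ["-name", name]
    # The function name and its numeric parameters go last.
    args += [function] + [str(p) for p in params]
    return args


cmd = build_cmd("FPDiv", [8, 23, 0], name="myModel", target="StratixV",
                speedgrade=2, frequency=250, correct_rounding=True)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run(cmd), assuming the executable is on the search path.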

Supported options

Arithmetic:

-correctRounding

-faithfulRounding

-error u

u is the acceptable error in ulps (units in the last place); correctRounding corresponds to 0.5, faithfulRounding to 1.0

-errors N u1 u2 ... uN

If the component has N output ports, an individual error bound can be specified for each port.

This influences the test-bench generation process.

It is the user's responsibility to set N correctly so that it matches the number of output ports.

Deployment:

-target <Device>

e.g. ArriaV, CycloneIVGX, StratixIV

-speedgrade S

-frequency MHz

-pipelining type

type = 0 -> combinatorial

| 1 -> subcycle DAG-based

| 2 -> subcycle DAG-based gen2

| 3 -> large granularity (no subcycle - old way)

-name N

-enable (generates global enable signal)

-wrapper

add input and output registers

-qnan

-noTruncMult

disables the use of truncated multipliers

-fuseCoefTables

with piecewise polynomial approximation, a number of tables are created, one for each coefficient.

When this option is passed, the tables are stitched together width-wise before mapping them to memory blocks.

It allows reducing the number of blocks used, at the expense of some synchronization logic.

For instance, for a degree-2 polynomial with coefficient widths of 11, 18 and 27 bits, separate tables use 3 M20K blocks, whereas the fused table is 11+18+27 = 56 bits wide and uses 2 M20K blocks.
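
The block counts in this example can be reproduced with a short calculation. This is only a sketch under stated assumptions: each table is shallow enough to fit the widest M20K configuration, taken here to be 512 words by 40 bits, so the number of blocks depends only on the table width. The function name blocks_needed is hypothetical and does not reflect how cmdPolyEval actually maps tables.

```python
import math

# Assumption: tables fit the widest M20K mode (512 words x 40 bits).
M20K_MAX_WIDTH = 40


def blocks_needed(width_bits, max_width=M20K_MAX_WIDTH):
    """Number of memory blocks needed side by side for one table."""
    return math.ceil(width_bits / max_width)


widths = [11, 18, 27]                      # coefficient widths from the text
separate = sum(blocks_needed(w) for w in widths)
fused = blocks_needed(sum(widths))         # one 56-bit-wide table
print(separate, fused)                     # 3 2
```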

cmdPolyEval now defaults to creating flat file structures. This means it generates files into the current directory. This mode has been enhanced in order to omit the safe_path files. With it compiled in release mode, the only file now generated is the top level .vhd file.

Testing:

-testbench N

/-> a testbench of N test cases can be generated.

/-> if no testbench type (for example, -randomTests) is selected, no vectors will actually be generated.

-randomTests

/-> runs the test bench with N test vectors (N is an input to -testbench)

-expRange eMin eMax

/-> the stimuli are generated in the eMin eMax range for floating-point inputs

-positiveStimuli

/-> the stimuli are positive

-negativeStimuli

/-> the stimuli are negative

-specialCaseTests

/-> the floating-point special values are automatically generated including zeros, inf, NaN

-cancellationTests

/-> test addition for cancellations

-nearInfinityTests

/-> tests close to the max FP number

-nearZeroTests

/-> tests close to the min FP number

-handTesting

/-> runs the associated hand built test vectors

-piTesting n w

/-> runs tests around the k*(pi/2) regions, where k=[0,n-1]. 1024 values are tested around each value

/-> the number of values around each multiple is 2*w

/-> this is useful for stressing trigonometric functions

-noChanValid

/-> no channel and valid data are generated in the stimuli and response files

-noFileGenerate

Do not write out any files

-printMachineReadable

Print information such as latency in the following machine readable format:

@@start

@field1_name field1_value@

@field2_name field2_value@

@@end
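
A script that consumes this output might parse it as follows. This is a minimal sketch based only on the format shown above; the function name parse_machine_readable is hypothetical, and the field name latency in the sample text is an assumed example rather than documented output.

```python
import re


def parse_machine_readable(text):
    """Collect '@name value@' pairs found between '@@start' and '@@end'."""
    fields = {}
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "@@start":
            inside = True
        elif line == "@@end":
            inside = False
        elif inside:
            match = re.fullmatch(r"@(\S+)\s+(.*)@", line)
            if match:
                fields[match.group(1)] = match.group(2)
    return fields


# Hypothetical sample; real field names come from the tool.
sample = "some log line\n@@start\n@latency 17@\n@@end\n"
print(parse_machine_readable(sample))
```

Blank lines and any text outside the start and end markers are ignored, so the tool's full console output can be passed in unfiltered.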

-allTests

An example to generate the test-bench:

cmdPolyEval.exe -pipelining 1 -correctRounding -target StratixIV -frequency 250 -name myModel FPDiv 8 23 0 -testbench 1000 -randomTests

In order to run the test-bench, you first need to set

set QUARTUS_ROOTDIR_OVERRIDE=%QUARTUS_ROOTDIR%

then you need to make sure that in the generated modelName_atb.do file

quietly set compile(altera) 0

quietly set compile(altera_mf) 0

quietly set compile(lpm) 0

quietly set compile(wysiwyg) 0

is changed to

quietly set compile(altera) 1

quietly set compile(altera_mf) 1

quietly set compile(lpm) 1

quietly set compile(wysiwyg) 1

next, you can run your test:

vsim -do modelName/modelName_atb.do
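
The edit to the .do file described above can also be scripted. The sketch below assumes the four lines appear exactly in the form shown; the function name enable_library_compile is hypothetical. It operates on the file contents as a string, so reading and writing the file is left to the caller.

```python
import re

# Matches the four 'quietly set compile(<lib>) 0' lines listed above.
_PATTERN = re.compile(
    r"^([ \t]*quietly set compile\((?:altera|altera_mf|lpm|wysiwyg)\)[ \t]+)0[ \t]*$",
    re.MULTILINE,
)


def enable_library_compile(do_text):
    """Return do_text with the four library compile flags switched to 1."""
    return _PATTERN.sub(r"\g<1>1", do_text)


before = (
    "quietly set compile(altera) 0\n"
    "quietly set compile(altera_mf) 0\n"
    "quietly set compile(lpm) 0\n"
    "quietly set compile(wysiwyg) 0\n"
)
print(enable_library_compile(before))
```

Lines that set other variables, or that are already set to 1, are left unchanged.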

Supported components:

Floating-Point:

Basic:

FPAdd wE wF

FPAddExpert wE wF tieBreaksToEven architecture degradeAccuracy

tieBreaksToEven = 1 (IEEE-754 RNE, works only with arch = 0)

| 0 (IEEE-754 RNA)

architecture = 0 single-path low resources

| 1 dual-path low latency

degradeAccuracy {0|1} 2's complement is 1's complement

FPAddN wE wF

FPSubExpert wE wF tieBreaksToEven architecture degradeAccuracy

tieBreaksToEven = 1 (IEEE-754 RNE, works only with arch = 0)

| 0 (IEEE-754 RNA)

architecture = 0 single-path low resources

| 1 dual-path low latency

degradeAccuracy {0|1} 2's complement is 1's complement

FPAddSub wE wF

FPAddSubExpert wE wF tieBreaksToEven architecture degradeAccuracy

tieBreaksToEven = 1 (IEEE-754 RNE, works only with arch = 0)

| 0 (IEEE-754 RNA)

architecture = 0 single-path low resources

| 1 dual-path low latency

degradeAccuracy {0|1} 2's complement is 1's complement

FPFusedAddSub wE wF

FPMul wE wF

FPMulExpert wEA wFA wEB wFB wER wFR ieeeTieBreakRule

for correctRounding, the sticky bits are not computed. Rnd=1

FPConstMul wE wF constant

FPAcc wE wF lsbA msbA maxMSBX

FPSqrt wE wF

FPDivSqrt wE wF

FPRecipSqrt wE wF

FPCbrt wE wF

FPDiv wE wF version

version = 0 -> polynomial approximation

version = 1 -> polynomial approximation + Newton-Raphson (DP only)

version = 2 -> Newton-Raphson (NYA)

FPInverse wE wF

FPFloor wE wF

FPCeil wE wF

FPRound wE wF

FPRint wE wF

FPFrac wE wF

FPMod wE wF

FPDim wE wF

FPAbs wE wF

FPMin wE wF

FPMax wE wF

FPMinAbs wE wF

FPMaxAbs wE wF

FPMinMaxFused wE wF

FPMinMaxAbsFused wE wF

FPCompare wE wF type

type: -2=LT -1=LE 0=EQ 1=GE 2=GT 3=NEQ

FPCompareFused wE wF

(select line will select among LT, LE, EQ, GE, GT)

Exp, Log and Power:

FPLn wE wF

FPLn1px wE wF

implements ln(1+x)

FPLog10 wE wF

FPLog2 wE wF

FPExp wE wF

FPExpFPC wE wF

FPExpM1 wE wF

FPExp2 wE wF

FPExp10 wE wF

FPPowr wE wF

Trigonometrics with argument reduction:

FPSinX wE wF

FPCosX wE wF

FPSinCosX wE wF

FPTanX wE wF

FPCotX wE wF

Inverse trigonometric functions:

FPArcsinX wE wF

FPArcsinPi wE wF

FPArccosX wE wF

FPArccosPi wE wF

FPArctanX wE wF

FPArctanPi wE wF

FPArctan2 wE wF

Trigonometrics of pi*x:

FPSinPiX wE wF

FPCosPiX wE wF

FPTanPiX wE wF

FPCotPiX wE wF

Trigonometrics misc:

FPHypot wE wF

FPRangeReduction wE wF

Macro Operators:

FPFusedHorner wE wF r d a_{0} a_{1} ... a_{d}

FPFusedHornerExpert wE wF r g pOut pIn maxInExp maxCS d a_{0} a_{1} ... a_{d}

maxCS maximum cancellation size

g number of guard bits in tables

pOut res is positive (avoids final 2's complement)

pIn input is positive (avoids initial 2's complement)

maxInExp (-8/-9 typically)

FPFusedHornerMulti wE wF r d m a_{0} a_{1} ... a_{d}

m polynomials will be implemented using mults and adds

(d+1)*m coefficient values are read

r {0|1} restricted range x<=1

FPFusedMultiFunction wE wF

builds a multifunction block with Min/Max/MinMag/MaxMag

<=/</==/>=/>/!=/Saturate/Mux3:1

Fixed-Point:

FXPSin precIn precOut

FXPTruncMult precInX precInY precOut

FXPTruncMultSigned precInX precInY precOut

FXPTruncMultSignedUnsinged precInX precInY precOut

FXPConstMult precInX constant precOut

FXPConstMultSigned precInX constant precOut

FXPFusedMultiFunction w

builds a multifunction block with: and, or, xor, nand, nor,

xnor, inv, bitrev, EQ, NE, GE, LT, min, max, neg, abs,

redAnd, redOr, mux3:1, bitextract

FXPDivUI w

unsigned integer divider, w-bit inputs, 2w-bit output

FXPDivU wX fX wY fY wR fR

unsigned fixed-point divider, input and output formats provided

Conversion:

FXPToFP w f s wE wF

FPToFXP wE wF w f s

FPToFXPExpert wE wF w f s r

r = 1 for rounding to nearest, r = 0 for truncation

FPToFXPFused wE wF w f s

a dynamic input line selects between truncation = 0 and

round to nearest integer = 1

FPToFP wEIn wFIn wEOut wFOut

Generated Code

The component is generated as if it were a DspBuilderAdvanced primitive subsystem, so it has several DspBuilderAdvanced ports that are not strictly necessary when the component is used on its own.

H/W ports

xIn_v : this is the DspBuilderAdvanced valid input signal. It simply passes through with a delay of 17 cycles: if this signal goes high when you put your first data through, the output valid goes high 17 cycles later, when the result comes out. If you don't need it, it is safe to remove it together with the corresponding output and delay registers.

xIn_c : this is the DspBuilderAdvanced channel input signal. Likewise, it is safe to remove along with its delay (implemented in a memory plus registers) and output. It is there to help you keep track of channelized data flowing through the divider.

xIn_0 : first data input (recent components name the first data port a)

xIn_1 : second data input, if it exists (will probably be called b)

xOut_v : this is the DspBuilderAdvanced valid output signal (see notes on xIn_v above)

xOut_c : this is the DspBuilderAdvanced channel output signal (see notes on xIn_c above)

xOut_0 : result (probably set to q)

clk : clock signal

areset : asynchronous clear

bus_clk : this is a DspBuilderAdvanced bus clock port for when the design contains Avalon-MM slave interfaces. Here it’s not connected to anything, so can be removed.

h_areset : this is a DspBuilderAdvanced bus reset port for when the design contains Avalon-MM slave interfaces. Here it’s not connected to anything, so can be removed.
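
The behaviour of the valid and channel ports can be pictured as a plain delay line running alongside the datapath. The following Python sketch is a behavioural illustration only, not the generated VHDL: the function name delay_line is hypothetical, and the latency of 17 is taken from the xIn_v description above.

```python
from collections import deque


def delay_line(samples, latency, fill=0):
    """Delay a sample stream by 'latency' cycles (latency must be >= 1)."""
    pipe = deque([fill] * latency, maxlen=latency)
    out = []
    for sample in samples:
        out.append(pipe[0])   # oldest value leaves the pipeline
        pipe.append(sample)   # new value enters; maxlen drops the oldest
    return out


valid_in = [1] + [0] * 20            # valid pulses high with the first data
valid_out = delay_line(valid_in, 17)
print(valid_out.index(1))            # 17
```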

Running on Windows

Using a Windows installation of ACDS 12.1(+) with DspBuilder, you should go to the folder

…\quartus\dspba\Blocksets\BaseBlocks\windows64

where you will find the executable CmdPolyEval.exe

Running on 64-bit Linux

Using a Linux installation of ACDS 12.1(+) with DspBuilder, you should go to the folder

$QUARTUS_ROOTDIR/dspba/Blocksets/BaseBlocks/linux64

where you will find the executable cmdPolyEval. You must invoke cmdPolyEval in a directory where you have write permission, and you must ensure that LD_LIBRARY_PATH includes the directories containing the shared libraries used by cmdPolyEval. The following Bash command ensures that LD_LIBRARY_PATH includes the required directories:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$QUARTUS_ROOTDIR/dspba/Blocksets/BaseBlocks/linux64:$QUARTUS_ROOTDIR/linux64

Example

Using the command-line tool:

> cmdPolyEval

should bring up the help, which lists the cores that can be generated and gives some details on the generation options. To generate a component together with its test vectors, you will execute something like:

> cmdPolyEval.exe -subcycle -target CycloneIVE -frequency 150 FPMin 8 23 -testbench 100

which will generate the topModel/ folder in the current folder containing the VHDL code for the components and 100 test-vectors. In this example we are generating the floating point Minimum function. For the test-vectors, you should be able to find:

topModel_xIn.stm

topModel_xOut.stm

The first file contains the input stimuli and the second one contains the output response.

One line of the input stimuli file looks like this for FPMin:

Valid_Line Channel_Line X Y

1 00000000 01000000100000000000000000000000 11000000100000000000000000000000

Here 01000000100000000000000000000000 11000000100000000000000000000000 are the input stimuli for this test-vector, in binary, IEEE-754 single-precision notation. For example, X here is interpreted as

0 10000001 00000000000000000000000

S EXP FRAC
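
To read a stimulus value as a number, the bit string can be split into its fields and converted. A minimal sketch in Python for the single-precision case (8 exponent bits, 23 fraction bits); the function name decode_single is hypothetical.

```python
import struct


def decode_single(bits):
    """Split a 32-character bit string into sign, exponent and fraction,
    and return them together with the value as a Python float."""
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of 32 binary digits")
    sign, exponent, fraction = bits[0], bits[1:9], bits[9:]
    value = struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]
    return sign, exponent, fraction, value


x = decode_single("01000000100000000000000000000000")
y = decode_single("11000000100000000000000000000000")
print(x[3], y[3])   # 4.0 -4.0
```

For the FPMin example, X is 4.0 and Y is -4.0, so the expected minimum is -4.0.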

The second file, topModel_xOut.stm, contains the outputs corresponding to the inputs. If the operator is pipelined with a latency of m cycles, the first m lines of the file are all zeros, corresponding to the time needed for the first set of inputs to reach the output. The first output line of interest has a leading one:

Valid_Line Channel_Line R_low R_high

1 00000000 11000000100000000000000000000000 11000000100000000000000000000000

Once the valid line has been detected, the first two fields (valid and channel) can be ignored. The next two fields (identical for this operator, but differing by one unit in the last place for faithfully rounded functions) represent the expected output for the corresponding test case in the input file. Essentially, the operator passes the test-vector if its output equals either of these two values. This is as close as possible to obtaining the test-vectors without going through the Simulink interface (that is, without using a license).
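
The pass criterion described above can be expressed in a few lines. This is a sketch under stated assumptions: each response line holds four whitespace-separated fields (valid, channel, R_low, R_high), as in the FPMin example, and the function names parse_response_line and passes are hypothetical.

```python
def parse_response_line(line):
    """Split one response line into (valid, channel, r_low, r_high)."""
    valid, channel, r_low, r_high = line.split()
    return valid == "1", channel, r_low, r_high


def passes(dut_output, line):
    """True if dut_output matches either accepted result on a valid line.
    Lines whose valid field is not 1 carry no result and are skipped."""
    valid, _channel, r_low, r_high = parse_response_line(line)
    if not valid:
        return True
    return dut_output in (r_low, r_high)


line = "1 00000000 11000000100000000000000000000000 11000000100000000000000000000000"
print(passes("11000000100000000000000000000000", line))   # True
```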

Note: when generating, you may also restrict the exponent range,

cmdPolyEval.exe -pipelining 1 -target CycloneIVE -frequency 150 FPMin 8 23 -testbench 100 -expRange -5 5 -randomTests

(generates inputs roughly in the range (-63,63), with the closest to zero being 2^-5 * 1.000000XXX). Also, in order to generate positive stimuli only, you can use

cmdPolyEval.exe -pipelining 1 -target CycloneIVE -frequency 150 FPMin 8 23 -testbench 100 -expRange -5 5 -positiveStimuli -randomTests
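
The magnitude range implied by -expRange -5 5 can be estimated with a short calculation. This sketch assumes that eMin and eMax are unbiased exponents and that stimuli are normalized values of the form 1.f * 2^e with 23 fraction bits; the function name magnitude_range is hypothetical.

```python
def magnitude_range(e_min, e_max, w_f=23):
    """Smallest and largest magnitude of normalized values 1.f * 2**e
    with e in [e_min, e_max] and w_f fraction bits."""
    smallest = 2.0 ** e_min                          # 1.000... * 2**e_min
    largest = (2.0 - 2.0 ** -w_f) * 2.0 ** e_max     # 1.111... * 2**e_max
    return smallest, largest


low, high = magnitude_range(-5, 5)
print(low, high)
```

Under these assumptions the smallest magnitude is 2^-5 = 0.03125 and the largest is just below 64, in line with the rough bound quoted above.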