Complex floating-point sequential logic in Verilog

Altera_Forum · ‎07-05-2010

Hi,

I'm trying to write a synthesizable 3D rasterizer in Verilog/SystemVerilog. The rasterizer right now is not really a 3D rasterizer: it just receives six 32-bit floats for vertex position (vertA_pos_x, vertA_pos_y, vertB_pos_x, vertB_pos_y, vertC_pos_x, vertC_pos_y) and nine 8-bit integers for vertex coloring (vertA_color_r, vertA_color_g, vertA_color_b, vertB_color_r, vertB_color_g, vertB_color_b, vertC_color_r, vertC_color_g, vertC_color_b).

Positions range from 0.0f to 1.0f, with 0.0f representing the top/left side of the screen, 0.5f the middle of it and 1.0f the bottom/right side.

The raster work would be to, first, count how many raster lines are required. Given that the framebuffer height is 240 pixels, vertex A is the top vertex, B is the bottom-left one, C is the bottom-right one and X is the bottommost vertex (either B or C; this has to be calculated), the number of raster lines is given by (vertX_pos_y - vertA_pos_y) / 240.

This part of the rasterization process is complex enough to expose my doubts, so I'll stop explaining how I would proceed here.

Now what I want to know is how to implement such "complex" logic in Verilog (it is "complex" because it is sequential and takes more than one clock cycle, which is not exactly the most pleasant kind of thing to design with a hardware description language).

The floating-point operation megafunctions that come with Quartus all require more than one clock cycle to finish, so, to implement "simple" calculations like (vertX_pos_y - vertA_pos_y) / 240, I'm assuming a fairly boring-to-write and error-prone state machine is necessary.

My biggest expectation is that someone will tell me I don't need that, but if that's not the case, I still would like to know how people generally design things like these.

Also notice that I'm very new to Verilog and hardware design in general, so I'm sorry if I say something stupid. Ideas?

Altera_Forum · ‎07-05-2010

Some of the FP functions (at least the simpler ones such as add/sub and mult) are fully pipelined. This means that you can provide them with a new pair of operands each clock cycle and you'll get a new result each clock cycle, with a given delay.

This is, I think, not true for more complex functions such as division (but don't take my word for it).

A calculation such as "(vertX_pos_y - vertA_pos_y) / 240" could easily be pipelined by replacing "/ 240" with "* 0.0041666667" (that is, multiplying by 1/240).

Depending on your needs, pipelining may be feasible and may help simplify your state machine.

Otherwise, you are correct: you need a state machine.
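To make the pipelined alternative a bit more concrete, here is a rough behavioral model in Python (not Verilog, and not the megafunctions themselves). It chains a subtract stage and a multiply-by-1/240 stage, and uses None as a stand-in for "valid = 0". The latencies of 7 and 5 cycles, and the names DelayLine and run, are assumptions made up for this sketch.

```python
from collections import deque

SUB_LATENCY = 7          # assumed subtractor latency, in clock cycles
MUL_LATENCY = 5          # assumed multiplier latency, in clock cycles
INV_240 = 1.0 / 240.0    # constant operand replacing the "/ 240"


class DelayLine:
    """Models a fully pipelined unit as a shift register `depth` stages long."""

    def __init__(self, depth):
        self.regs = deque([None] * depth, maxlen=depth)

    def clock(self, value):
        """Advance one cycle: return the oldest entry, shift in the new one."""
        out = self.regs[0]
        self.regs.append(value)  # maxlen drops the oldest entry automatically
        return out


def run(inputs):
    """Feed one (y_bottom, y_top) pair per cycle; None means no valid input.

    Returns a list of (cycle, result) for every cycle with a valid output.
    """
    sub = DelayLine(SUB_LATENCY)
    mul = DelayLine(MUL_LATENCY)
    results = []
    # Extra idle cycles at the end let the pipeline drain.
    stream = list(inputs) + [None] * (SUB_LATENCY + MUL_LATENCY)
    for cycle, item in enumerate(stream):
        diff_in = None if item is None else item[0] - item[1]
        diff_out = sub.clock(diff_in)
        prod_in = None if diff_out is None else diff_out * INV_240
        prod_out = mul.clock(prod_in)
        if prod_out is not None:
            results.append((cycle, prod_out))
    return results


if __name__ == "__main__":
    for cycle, value in run([(0.75, 0.25), (1.0, 0.0)]):
        print(cycle, value)
```

The point of the sketch is that no state machine tracks the calculation: the "valid" marker simply travels alongside the data, and each result comes out SUB_LATENCY + MUL_LATENCY cycles after its operands went in.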

On another note, have you considered fixed point instead of floating point?

It would require fewer resources and fewer cycles, which could also simplify your state machine.
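As a small illustration of the fixed-point idea, the following Python sketch models the same calculation with integers only. The formats are assumptions chosen for the example: positions use 16 fractional bits (so 1.0 is 65536) and the constant 1/240 is stored with 24 fractional bits. The helper names are hypothetical.

```python
FRAC_BITS = 16          # assumed fractional bits for positions (1.0 -> 65536)
RECIP_FRAC_BITS = 24    # assumed fractional bits for the 1/240 constant

# 1/240 as an integer constant with 24 fractional bits.
RECIP_240_Q24 = round((1 << RECIP_FRAC_BITS) / 240)


def to_fixed(x):
    """Convert a position in 0.0..1.0 to the assumed fixed-point format."""
    return round(x * (1 << FRAC_BITS))


def to_float(q):
    """Convert a fixed-point value back to a Python float, for checking only."""
    return q / (1 << FRAC_BITS)


def span_over_240(y_bottom_q, y_top_q):
    """(y_bottom - y_top) / 240 using one subtract, one multiply and one shift."""
    diff = y_bottom_q - y_top_q
    # The product carries 16 + 24 fractional bits; the shift drops 24 of them.
    return (diff * RECIP_240_Q24) >> RECIP_FRAC_BITS


if __name__ == "__main__":
    q = span_over_240(to_fixed(0.75), to_fixed(0.25))
    print(q, to_float(q), 0.5 / 240)
```

In hardware, the subtract, the integer multiply and the shift (which is just wiring) would typically need far fewer cycles than the floating-point equivalents, at the cost of a result that is only accurate to about one step of the chosen format.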

Altera_Forum · ‎07-05-2010

Pleased to meet you, rbugalho, and thanks for your answer.

I'm using Altera Quartus II 9.11, Web Edition version. All of the FP megafunctions in my library seem to be non-pipelined (is this what they call it when a module requires more than one clock to do a given task?).

They all ask me to specify "the output latency in clock cycles", like this ALTFP_MULT screenshot shows (it is the same for ALTFP_ADD_SUB and ALTFP_DIV):

img121*imageshack*us/img121/5323/altfpmult.png (replace asterisks by dots)

Am I mistaken to assume that these megafunctions require more than one clock cycle to finish processing?

Btw, LPM_ADD_SUB left me wondering if we aren't using mistaken nomenclature:

img689*imageshack*us/img689/6064/lpmaddsub.png (replace asterisks by dots)

Doesn't "being pipelined" mean "immediately processed", without clock latency (i.e. only obvious digital logic delay)?

Regarding fixed-point math, it's certainly something I'll take into consideration. I'm just curious about all this FP stuff right now.

-- edit --

I was wondering about your suggestion of replacing the division with an equivalent multiplication, and came to the conclusion that it should be very, very easy for the analyser/synthesizer to do that for you.

As long as it's a division by a constant value, can't it always infer an equivalent, faster multiplication? If so, then why isn't this already done?

-- edit --

Uh, never mind. I guess Verilog wouldn't know the registers are floating-point, and so that couldn't be done. It would simply do an integer division in that case.. right? Or is there an elegant way around this?
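A quick way to see why a plain integer division on the register contents would not help: the Python sketch below takes the IEEE 754 single-precision bit pattern of 1.0, divides it by 240 as if it were an unsigned integer, and reinterprets the quotient as a float. This only models what integer division on a 32-bit vector would compute; the function names are made up for the example.

```python
import struct


def float_to_bits(x):
    """Return the 32-bit IEEE 754 single-precision pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits):
    """Reinterpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def integer_divide_on_float_bits(x, divisor=240):
    """Divide the raw bit pattern as an unsigned integer, then view it as a float."""
    return bits_to_float(float_to_bits(x) // divisor)


if __name__ == "__main__":
    print(hex(float_to_bits(1.0)))            # raw pattern of 1.0
    print(integer_divide_on_float_bits(1.0))  # meaningless as a float result
    print(1.0 / 240)                          # what was actually wanted
```

The quotient lands in the denormal range (a value vastly smaller than 1/240), because the exponent and mantissa fields were divided as one integer.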

Cheers,

n2

(ps.: [To alteraforum.com] Your non-link & non-image posting policy for newbies and limited number of images permitted per post ****es anyone off entirely; I strongly suggest you remove such limitations)

Altera_Forum · ‎07-05-2010

Hi,

by "fully pipelined", I mean that they take more than one cycle to calculate a result but they can produce results at a rate of 1 per cycle.

For example, with a fully pipelined multiplier with a 5-cycle latency you can do this:

cycle 0: input a = x0; input b = y0; output = x
cycle 1: input a = x1; input b = y1; output = x
cycle 2: input a = x2; input b = y2; output = x
...
cycle 5: input a = x5; input b = y5; output = x0 * y0
cycle 6: input a = x6; input b = y6; output = x1 * y1
cycle 7: input a = x7; input b = y7; output = x2 * y2
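The same timing can be reproduced with a tiny behavioral model in Python. It is only a sketch of the "one new input per cycle, result 5 cycles later" behaviour, not of the megafunction's internals; the class name and the use of None for "not yet valid" are assumptions of the example.

```python
from collections import deque


class PipelinedMultiplier:
    """Accepts one operand pair per cycle; each product appears `latency` cycles later."""

    def __init__(self, latency=5):
        # One slot per pipeline stage; None marks a stage holding no valid data.
        self.stages = deque([None] * latency, maxlen=latency)

    def clock(self, a, b):
        """Advance one clock cycle and return the output visible in this cycle."""
        out = self.stages[0]
        self.stages.append(a * b)  # maxlen drops the oldest stage automatically
        return out


def trace(xs, ys, latency=5):
    """Return the output seen in each cycle when xs[i], ys[i] are fed in cycle i."""
    mult = PipelinedMultiplier(latency)
    return [mult.clock(x, y) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6, 7, 8]
    ys = [11, 12, 13, 14, 15, 16, 17, 18]
    for cycle, out in enumerate(trace(xs, ys)):
        print(f"cycle {cycle}: a = {xs[cycle]}, b = {ys[cycle]}, output = {out}")
```

For the first five cycles the output is None, and from cycle 5 onwards a new product comes out every cycle.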

And as you guessed, no: unlike software compilers, and for a number of reasons, the tools will not replace a division by a constant with a multiplication.

On one hand, these are IEEE 754 modules and in that context, division and multiplication by the reciprocal are not 100% equivalent.
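The non-equivalence is easy to check numerically. The Python sketch below emulates single-precision rounding with the struct module and compares x / 240 against x * (1/240), where 1/240 is itself rounded to single precision first. The input values 240.0 and 720.0 are picked purely for illustration, and the function names are made up.

```python
import struct


def f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# The constant a designer would feed to the multiplier: 1/240 rounded to single.
RECIP_240 = f32(1.0 / 240.0)


def div_by_240(x):
    """Single-precision x / 240, rounded once."""
    return f32(f32(x) / 240.0)


def mul_by_recip(x):
    """Single-precision x * (rounded 1/240): the constant's rounding error is baked in."""
    return f32(f32(x) * RECIP_240)


def count_mismatches(values):
    """Count the inputs for which the two approaches give different results."""
    return sum(1 for v in values if div_by_240(v) != mul_by_recip(v))


if __name__ == "__main__":
    for x in (240.0, 720.0):
        print(x, div_by_240(x), mul_by_recip(x))
    print(count_mismatches(float(i) for i in range(1, 1001)))
```

For 240.0 both approaches give exactly 1.0, but for 720.0 the division gives exactly 3.0 while the multiplication lands one unit in the last place above it. Whether a last-bit difference like that matters is up to the design, which is why the choice is left to the human.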

On the other hand, there are differences in terms of sequential behaviour.

A 32-bit FP divider requires, IIRC, 20 cycles to produce a result. The tools could, at best, replace it with a multiplication that takes 20 cycles.

The human, on the other hand, can replace it with a multiplication that takes just 5 cycles.

Altera_Forum · ‎07-06-2010

Ah, that makes a lot of sense, both pipeline and constant optimization explanations. I'm gonna try doing something useful with these pipelined megafunctions later.

As a side note, I just noticed that such pipelining would be an ideal way to deal with the massive rasterization a GPU has to work with.

I was thinking of duplicating the rasterization modules so that they could all work on different pixels at the same time, but since they all obviously take more than one clock cycle to complete processing each pixel, pipelining them could drastically improve performance.
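A back-of-the-envelope comparison of the two ideas, as a Python sketch. It counts the cycle in which the last result appears, with the first input fed in cycle 0. The 5-cycle latency, the four duplicated units and the 320-pixel width are assumptions for illustration; only the 240-line height comes from the description above.

```python
import math


def cycles_sequential(n_items, latency):
    """One non-pipelined unit: a new item starts only when the previous one is done."""
    return n_items * latency


def cycles_replicated(n_items, latency, units):
    """Several non-pipelined units working in parallel on different items."""
    return math.ceil(n_items / units) * latency


def cycles_pipelined(n_items, latency):
    """One fully pipelined unit: one new item per cycle, results `latency` cycles later."""
    return (n_items - 1) + latency if n_items > 0 else 0


if __name__ == "__main__":
    pixels = 320 * 240   # 320 is an assumed width
    latency = 5          # assumed per-pixel latency
    print("sequential:", cycles_sequential(pixels, latency))
    print("4 units:   ", cycles_replicated(pixels, latency, 4))
    print("pipelined: ", cycles_pipelined(pixels, latency))
```

With these assumed numbers, a single pipelined unit finishes sooner than four duplicated non-pipelined ones, and the two approaches can also be combined.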

Is this an achievable feat for a Verilog newbie? In any case, I would be grateful if you could tell me about good literature to study this.

You're being very instructive and friendly, rbugalho;

thanks a lot,

n2

Altera_Forum · ‎07-06-2010

I know next to nothing about graphics, so I don't have much of a clue about the complexity of your undertaking.

That said, the simpler FP megafunctions are already fully pipelined, whether you want it or not. All you have to do is take advantage of that. And in some ways, it should lead to a "simpler" design with less complex state machines.

I'm not sure about more complex ones, such as division and multiply+accumulate.

Quite honestly, I don't actually remember any literature that deals with pipelining and resource scheduling. Although I'm pretty sure there oughta be a few books that do, this is, pretty much, digital design 201 and I learnt most of it from teachers' notes.