
How to pipeline a calculation that mixes multiplication and division?

Altera_Forum
Honored Contributor II
2,419 Views

I want to calculate the following function:

f = (x * y + a) / z + b;

There are five input variables, and they may all change on every clock. If I synthesize it as combinational logic, the performance is very low.

So a good way is to implement it as a pipeline.

 

Suppose each multiplication or division can be done in one clock. Then it can be pipelined as in the following Verilog code:

always @(posedge clk)
begin
    // operand delay registers, so each operand arrives together with its partner
    a1 <= a;

    b1 <= b;
    b2 <= b1;
    b3 <= b2;

    z1 <= z;
    z2 <= z1;

    pipe0 <= x*y;         // valid 1 clock after the inputs
    pipe1 <= pipe0 + a1;  // valid 2 clocks after the inputs
    pipe2 <= pipe1/z2;    // valid 3 clocks after the inputs
    f     <= pipe2 + b3;  // valid 4 clocks after the inputs
end

 

But if I use LPM_DIVIDE or LPM_MULT, the multiplication or division may not finish in one clock. I am not sure how many clocks the LPM functions take, so I don't know how many clocks of delay I need to keep a, b, z and the pipeline stages synchronized.

Please give me some suggestions on pipelining such mixed operations.
9 Replies
Altera_Forum
Honored Contributor II
906 Views

The solution seems correct if the multiplier and divider are used without additional pipelining (no clock provided to LPM_MULT or LPM_DIVIDE). I don't know whether it already meets your performance requirements. You also didn't say which hardware you are using. If a family with hardware multipliers is involved, I would expect that no pipeline stage is needed between the multiplier and the first addition (but it does no harm, of course). In contrast, the divider is more likely to need additional pipeline stages.

 

You should be able to find the necessary pipeline depth for each partial operation, at your intended clock speed and word width, by preliminary tests, and then assemble the complete function with matching delays for the operands involved. It could also be written as a parameterizable design with arrays and for loops, but it's probably simpler the way you did it.
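
As a rough sketch of assembling the complete function with matching operand delays, the Verilog below makes the operator latencies parameters. The module name f_pipe, the parameters W, MUL_LAT and DIV_LAT, and the behavioral register chains standing in for LPM_MULT and LPM_DIVIDE are assumptions for illustration, not part of the original design.

```verilog
// Sketch only: f = (x*y + a)/z + b with assumed operator latencies.
// The register chains prod[] and quot[] stand in for pipelined
// LPM_MULT / LPM_DIVIDE instances with lpm_pipeline = MUL_LAT / DIV_LAT.
module f_pipe #(
    parameter W       = 8,  // operand width (assumption)
    parameter MUL_LAT = 2,  // clocks from x,y to the product (assumption, >= 1)
    parameter DIV_LAT = 3   // clocks from numerator,z to the quotient (assumption, >= 1)
) (
    input  wire           clk,
    input  wire [W-1:0]   x, y, a, z, b,
    output reg  [2*W+1:0] f
);
    // Each operand is delayed until the stage that consumes it.
    localparam A_DLY = MUL_LAT;                // a joins after the multiply
    localparam Z_DLY = MUL_LAT + 1;            // z joins after the first add
    localparam B_DLY = MUL_LAT + 1 + DIV_LAT;  // b joins after the divide

    reg [W-1:0]   a_d  [1:A_DLY];
    reg [W-1:0]   z_d  [1:Z_DLY];
    reg [W-1:0]   b_d  [1:B_DLY];
    reg [2*W-1:0] prod [1:MUL_LAT];
    reg [2*W:0]   sum;
    reg [2*W:0]   quot [1:DIV_LAT];
    integer i;

    always @(posedge clk) begin
        // Operand delay lines
        a_d[1] <= a;
        for (i = 2; i <= A_DLY; i = i + 1) a_d[i] <= a_d[i-1];
        z_d[1] <= z;
        for (i = 2; i <= Z_DLY; i = i + 1) z_d[i] <= z_d[i-1];
        b_d[1] <= b;
        for (i = 2; i <= B_DLY; i = i + 1) b_d[i] <= b_d[i-1];

        // Multiplier stand-in: result valid MUL_LAT clocks after x, y
        prod[1] <= x * y;
        for (i = 2; i <= MUL_LAT; i = i + 1) prod[i] <= prod[i-1];

        // First addition: one more clock
        sum <= prod[MUL_LAT] + a_d[A_DLY];

        // Divider stand-in: result valid DIV_LAT clocks after sum, z
        quot[1] <= sum / z_d[Z_DLY];
        for (i = 2; i <= DIV_LAT; i = i + 1) quot[i] <= quot[i-1];

        // Final addition: f is valid MUL_LAT + DIV_LAT + 2 clocks after the inputs
        f <= quot[DIV_LAT] + b_d[B_DLY];
    end
endmodule
```

If the measured latency of an operator changes, only the corresponding parameter changes; the operand delays follow from the localparams.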

 

If you are in doubt about the actual pipeline delay of particular MegaFunctions, the Quartus Simulator (or ModelSim) can easily clarify it with a numerical test.
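
Such a numerical test could look roughly like the testbench below. It is a minimal sketch: dut_model is a hypothetical stand-in (an 8x4 multiplier followed by LAT register stages) that would be replaced by the real megafunction instance, and the marker operands 13 and 11 are arbitrary.

```verilog
`timescale 1ns/1ps

// Hypothetical stand-in for the block under test: an 8x4 multiplier
// with LAT register stages. Replace with the real LPM instance.
module dut_model #(parameter LAT = 3) (
    input  wire        clk,
    input  wire [7:0]  a,
    input  wire [3:0]  b,
    output wire [11:0] p
);
    reg [11:0] stage [0:LAT-1];
    integer i;
    always @(posedge clk) begin
        stage[0] <= a * b;
        for (i = 1; i < LAT; i = i + 1)
            stage[i] <= stage[i-1];
    end
    assign p = stage[LAT-1];
endmodule

module latency_tb;
    reg        clk = 0;
    reg [7:0]  a   = 0;
    reg [3:0]  b   = 0;
    wire [11:0] p;
    integer cycles;

    dut_model #(.LAT(3)) dut (.clk(clk), .a(a), .b(b), .p(p));

    always #5 clk = ~clk;

    initial begin
        repeat (5) @(posedge clk);  // flush the pipeline with zeros
        @(negedge clk);             // change inputs away from the active edge
        a = 8'd13; b = 4'd11;       // marker operands: 13 * 11 = 143
        cycles = 0;
        // Count rising edges until the marker result appears (give up after 100).
        while (p !== 12'd143 && cycles < 100) begin
            @(posedge clk);
            #1 cycles = cycles + 1; // sample just after the edge
        end
        $display("latency = %0d clocks", cycles);
        $finish;
    end
endmodule
```

For the stand-in with LAT = 3 this should report a latency of 3 clocks; with a real megafunction the reported number is the delay to use for the companion operands.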
Altera_Forum
Honored Contributor II
906 Views

I show an example in the attachment.

I compute y = k*(x-b) in the "Y_BOX_8" module. As you know, a multiplier or divider built from logic elements is slow, so you need a FIFO whose output is wider than its input in front of the multiply. For example, if your 'x' input is 16 bits wide at 100 MHz, you need a FIFO with a 16-bit input and a 16*8 = 128-bit output, and the read clock of this FIFO should be 100 MHz/8 = 12.5 MHz or higher. At 12.5 MHz the multiply is sure to finish. The only drawback is that you need 8 multipliers.

My English is poor; if you can't understand what I said, I am sorry. Send me an e-mail and I will try my best to explain.

Good luck
Altera_Forum
Honored Contributor II
905 Views

I don't see how your example relates to the question. It uses different clocks and is likely to cause an unclear timing situation, while sotusotu had a straightforward synchronous design, as it should be. For longer delays (if needed at all), RAM-based FIFOs could be used instead of registers, but with a single clock if at all possible.

 

Furthermore, I think your delay examples do not correspond to the speed of typical present-day FPGAs; they rather apply to devices from the last millennium. As said, hardware multipliers should be used where available.
Altera_Forum
Honored Contributor II
906 Views

Thanks FvM and nobody1234. But I am still confused.

My question has two important issues:

1) Can one multiplication or division be done in one clock?

My computation contains tens of multiplications and divisions, so my pipeline idea is to cut the long formula into single operations, i.e. to put a pipeline register after each operation rather than pipelining inside an operation.

So I'd like to use parallel multipliers and dividers and set the LPM_MULT or LPM_DIVIDE pipeline depth to 0.

For example, to compute f=(x[7:0]*y[3:0])/z[4:0], the simple Verilog code is:

 

module test(...);
    //io declaration...

    // Pipeline registers, one after each operation.
    // NOTE: z is not delayed here, so the divider sees z one clock ahead of pipeline0.
    always @(posedge clk)
    begin
        pipeline0 <= xy;
        pipeline1 <= xyz;
        f <= pipeline1;
    end

    // Multiplier, no internal pipeline stages (lpm_pipeline = 0)
    lpm_mult lpm_mult_component (
        .dataa (x[7:0]),
        .datab (y[3:0]),
        .clock (clk),
        .result (xy),
        .aclr (1'b0),
        .clken (1'b1),
        .sum (1'b0));
    defparam
        lpm_mult_component.lpm_hint = "MAXIMIZE_SPEED=5",
        lpm_mult_component.lpm_pipeline = 0,
        lpm_mult_component.lpm_representation = "UNSIGNED",
        lpm_mult_component.lpm_type = "LPM_MULT",
        lpm_mult_component.lpm_widtha = 8,
        lpm_mult_component.lpm_widthb = 4,
        lpm_mult_component.lpm_widthp = 12;

    // Divider, no internal pipeline stages (lpm_pipeline = 0)
    lpm_divide divider1 (
        .denom (z[4:0]),
        .clock (clk),
        .numer (pipeline0[11:0]),
        .quotient (xyz),
        .remain (),
        .aclr (1'b0),
        .clken (1'b1));
    defparam
        divider1.lpm_drepresentation = "UNSIGNED",
        divider1.lpm_hint = "LPM_REMAINDERPOSITIVE=TRUE",
        divider1.lpm_nrepresentation = "UNSIGNED",
        divider1.lpm_pipeline = 0,
        divider1.lpm_type = "LPM_DIVIDE",
        divider1.lpm_widthd = 5,
        divider1.lpm_widthn = 12;
endmodule

Because of differences in operand width and FPGA complexity, the two pipeline steps above will have different calculation latencies.

Some operations may give a stable output within one clock, and some only after two clocks. Assume I do tens of multiplications or divisions:

if the latency of a single step exceeds one clock, my pipeline idea is useless.

How can I overcome this issue?

 

2) Input alignment in the pipeline.

In each operation such as x*y or x/y, x and y must arrive at the same clock edge. Assume I build a 20-level pipeline: must I then delay some variables by 20 clocks for the last pipeline stage?

Is there a good way to delay one variable by 20 clocks, like:

a_1 <= a;
a_2 <= a_1;
...
a_20 <= a_19;

Otherwise I must define a lot of intermediate registers for the pipeline.

As for nobody1234's idea, I think it's good for a single operation, but it is not efficient for tens of multipliers or dividers. It costs too many FPGA resources.
Altera_Forum
Honored Contributor II
906 Views

As for your first question: yes, it is possible for LPM_MULT and LPM_DIVIDE to yield results in one cycle. In fact, you can control the latency of the megafunctions. To do so, simply change the following lines:

lpm_mult_component.lpm_pipeline = 0,
divider1.lpm_pipeline = 0,

Right now the pipeline level is set to 0, meaning that the result is computed combinationally within the same cycle. If you want the result one cycle later, simply set it to 1. Of course, there is a trade-off between pipeline level and the area of the multiplier/divider.

 

As for your second question: I suppose you can instantiate a shift register (with just 1 tap, and you can specify how many cycles of delay you want between input and output, in your case 20). Otherwise, I suppose you can use a FIFO and just control when you write to and read from it.
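
A single-tap shift register of this kind can also be written directly in Verilog. The sketch below is a hypothetical helper (the name delay_line and its parameters are not from the thread); depending on tool settings, a synthesizer may map a long chain like this onto memory blocks instead of individual registers.

```verilog
// Hypothetical helper: delays din by DEPTH clocks (DEPTH >= 1).
module delay_line #(
    parameter WIDTH = 8,
    parameter DEPTH = 20
) (
    input  wire             clk,
    input  wire [WIDTH-1:0] din,
    output wire [WIDTH-1:0] dout
);
    reg [WIDTH-1:0] taps [0:DEPTH-1];
    integer i;

    always @(posedge clk) begin
        taps[0] <= din;                  // first stage samples the input
        for (i = 1; i < DEPTH; i = i + 1)
            taps[i] <= taps[i-1];        // every other stage copies its predecessor
    end

    assign dout = taps[DEPTH-1];         // din as it was DEPTH clocks ago
endmodule
```

An instance such as delay_line #(.WIDTH(8), .DEPTH(20)) a_dly (.clk(clk), .din(a), .dout(a_20)); would replace the twenty hand-written assignments a_1 ... a_20.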

 

Hope this helps...
Altera_Forum
Honored Contributor II
906 Views

1) Can one multiplication or division be done in one clock?

Multiplication and division are not register-based operations; they are combinational logic. The clock you use just clocks the result of the operation into a register. The following code has been validated with the clock running at 10 MHz.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

ENTITY Y_BOX_cell IS
    PORT (clk: in std_logic;
          k: in std_logic_vector (15 downto 0);
          b: in std_logic_vector (15 downto 0);
          x: in std_logic_vector (15 downto 0);
          result: out std_logic_vector (15 DOWNTO 0));
END Y_BOX_cell;

ARCHITECTURE rtl OF Y_BOX_cell IS
    signal b_int,x_int,x_int1 : signed (15 downto 0);
    signal k_int : signed (15 downto 0);
    signal pdt_int : signed (31 downto 0);
BEGIN

    Process(clk)
    Begin
        if clk'event and clk='1' then
            -- register the inputs
            k_int <= signed (k);
            b_int <= signed (b);
            x_int <= signed (x);
            -- x - b, then k * (x - b), one register stage each
            x_int1 <= x_int - b_int;
            pdt_int <= k_int * x_int1;
            -- keep the sign bit and bits 27..13 of the product
            result <= std_logic_vector(pdt_int(31) & pdt_int(27 downto 13));
        end if;
    end process;
END rtl;
Altera_Forum
Honored Contributor II
906 Views

Thanks all. wronghorizon's suggestion is very good. I have another question.

How does the pipeline work in LPM_DIVIDE? LPM_DIVIDE can only be synthesized into combinational logic.

Assume the result (x/y) needs 1.5 clocks and I set lpm_divide.lpm_pipeline = 3.

Does that mean lpm_divide just shifts the result through registers for 3 clocks, or that it samples the result at each clock edge for 3 clocks?

If it's a shift register, I think the result at the last pipeline clock may not be correct, because every unstable intermediate result gets shifted along as well.

But if it samples the result for 3 clocks and gets the right result at the last clock edge, I cannot imagine how the LPM_DIVIDE pipeline is implemented.

Furthermore, I found something strange in the LPM_DIVIDE synthesis results.

For example, computing f = (x[7:0]/y[3:0]+a[7:0])/z[3:0] with my pipeline idea on an EP2S60F1020C4 gives the following synthesis results versus pipeline stages:

lpm_pipeline  ALUTs     ALMs     registers  memory bits  Actual Fmax
0             127(13)   74(18)   32(32)     0            >500 MHz
1             126(12)   88(28)   77(44)     0            168 MHz
3             130(9)    97(20)   121(34)    20           199 MHz
5             126(8)    114(23)  176(43)    36           264 MHz
10            154(49)   152(10)  259(15)    140          329 MHz

 

With lpm_pipeline=0, why are both the resource usage and the frequency the best?
Altera_Forum
Honored Contributor II
906 Views

I basically don't understand your concern regarding pipeline operation. If the timing constraints are met, you can expect a correct result at the output on every clock cycle. The result is delayed by the specified number of clocks, that's the whole story.

 

There are no unstable results at the output, except for glitches outside the time window in which the results are captured by the succeeding logic.

 

Your Fmax analysis shows that the divider could operate in one clock cycle. Resource usage is always minimal for a non-pipelined solution; timing may be different. I have no particular explanation why the pipeline=1 solution is so bad, but actually I don't need an explanation, and I guess you don't either.

 

One point to consider: a non-pipelined divider may borrow part of its logic delay budget from the source and target registers, if they have plenty of timing margin. So the results probably depend on the complete datapath structure.
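
To picture how registers inside a divider can give a fixed delay with one new result per clock, here is a generic textbook-style sketch: a restoring divider unrolled into one stage per quotient bit, with a register after each stage. This is an assumption for illustration only; the thread does not establish how LPM_DIVIDE is built internally, and the module name pipe_div and its parameters are hypothetical.

```verilog
// Illustration only, not Altera's implementation: unsigned restoring divider,
// one quotient bit per stage, a register after every stage.
// Latency is N + 1 clocks; a new numer/denom pair is accepted every clock.
module pipe_div #(
    parameter N = 12,  // numerator and quotient width (N >= 2)
    parameter D = 5    // denominator and remainder width
) (
    input  wire         clk,
    input  wire [N-1:0] numer,
    input  wire [D-1:0] denom,     // must be nonzero for a meaningful result
    output wire [N-1:0] quotient,
    output wire [D-1:0] remain
);
    // Stage i holds the state after i quotient bits have been resolved.
    reg [N-1:0] q [0:N];  // unprocessed numerator bits on the left, quotient bits on the right
    reg [D-1:0] r [0:N];  // partial remainder, always smaller than the denominator
    reg [D-1:0] d [0:N];  // denominator travels along with its data
    reg [D:0]   t;        // temporary: remainder with the next numerator bit shifted in
    integer i;

    always @(posedge clk) begin
        q[0] <= numer;
        r[0] <= {D{1'b0}};
        d[0] <= denom;
        for (i = 1; i <= N; i = i + 1) begin
            t = {r[i-1], q[i-1][N-1]};
            if (t >= {1'b0, d[i-1]}) begin
                r[i] <= t - {1'b0, d[i-1]};     // subtract succeeds: quotient bit 1
                q[i] <= {q[i-1][N-2:0], 1'b1};
            end else begin
                r[i] <= t[D-1:0];               // restore: quotient bit 0
                q[i] <= {q[i-1][N-2:0], 1'b0};
            end
            d[i] <= d[i-1];
        end
    end

    assign quotient = q[N];
    assign remain   = r[N];
endmodule
```

Each register only ever captures the settled output of one short stage, so no unstable intermediate value is passed on as long as every stage meets timing. A shallower pipeline would simply place registers after every few stages instead of after each one.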
Altera_Forum
Honored Contributor II
906 Views

I am also interested in how the pipeline works in LPM_DIVIDE:

 

--- Quote Start ---  

Assume the result (x/y) need 1.5 clocks and I set the lpm_divide.lpm_pipeline = 3, 

Does it mean the lpm_divide just shift register the result with 3 clocks, 

or sample the result at each clock edge for 3 clocks? 

--- Quote End ---  

 

No answer yet. The Altera PDFs do not provide enough information.

 

 

--- Quote Start ---  

But if it's sampling the result for 3 clocks and get the right result at the last clock edge, i cannot imagine how does the LPM_DIVIDE pipeline realize. 

 

--- Quote End ---  

 

You can "sample" every 3 clock cycles with a D flip-flop with enable and a very small state machine.

 

In my design, I use the pipelined LPM_DIVIDE megafunction provided by Quartus.

I arrange for the numerator to change only every 3 clock cycles.

To do that, I (I mean Quartus :) ) build a minimal state machine.
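
A minimal sketch of this enable-based sampling follows. It is not the poster's actual design: the module name div_every_n, the PERIOD parameter, the valid output and the plain / operator standing in for the divider megafunction are all assumptions, and the multicycle timing exception the divider path would need in the timing analyzer is not shown.

```verilog
// Sketch only: operands change every PERIOD clocks and the result is
// captured with a register enable, so the divider has PERIOD clocks to settle.
module div_every_n #(
    parameter PERIOD = 3   // clocks allowed per division (assumption, 1..256)
) (
    input  wire        clk,
    input  wire [11:0] numer_in,
    input  wire [4:0]  denom_in,
    output reg  [11:0] quotient,
    output reg         valid       // high for one clock after quotient updates
);
    reg [7:0]  phase   = 8'd0;     // tiny state machine: counts 0 .. PERIOD-1
    reg [11:0] numer_r = 12'd0;
    reg [4:0]  denom_r = 5'd1;

    wire tick = (phase == PERIOD - 1);

    // Stand-in for the divider; replace with the megafunction instance.
    wire [11:0] q_comb = numer_r / denom_r;

    always @(posedge clk) begin
        phase <= tick ? 8'd0 : phase + 8'd1;
        valid <= tick;
        if (tick) begin
            quotient <= q_comb;    // result of the operands loaded PERIOD clocks ago
            numer_r  <= numer_in;  // launch the next operands
            denom_r  <= denom_in;
        end
    end
endmodule
```

The price is throughput: this accepts one new operand pair every PERIOD clocks, whereas the original question needs new inputs on every clock.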

 

But I am still interested in how the pipeline works in LPM_DIVIDE.

 

Another point: the Quartus synthesizer uses embedded multipliers if they are available in the target chip.

 

Excuse my English, I'm French.