Delaying output of a module to close timing

Altera_Forum · ‎06-01-2016

I've created a module where two numbers are multiplied and the output is provided in a number of clock cycles. However, the output of the module (pdt1_out) is failing timing in that the setup slack at the output is -9.450 ns.

I don't need the output until at least 10 ns (or later) after assignment of the inputs. Is there a way to delay the output so that timing can be successfully closed? Here is a short snippet of Verilog code showing how I've attempted to pipeline the output (two clock cycles).

However, there are still issues with closing timing and the setup slack remains negative. TimeQuest is measuring the time taken for the output pdt1_out, whereas I don't need the output until pdt1 is updated. This can be pipelined further with more registers.

// pipelined multiplier: 64-bit in, 64-bit out
reg  [63:0] am1 = 0;
reg  [63:0] bm1 = 0;
reg  [63:0] pdt1 = 64'd0;
reg  [63:0] pdt1_out0 = 64'd0;
wire [63:0] pdt1_out;
unsigned_mult64 unsigned_mult64
(   .a(am1),
    .b(bm1),
    .clk(mult_clk),  // run this on a slower clock
    .out(pdt1_out)
);
// pipeline (2 clock cycles)?
always @(posedge clk) begin
    pdt1_out0 <= pdt1_out;
    pdt1      <= pdt1_out0;
end

Altera_Forum · ‎06-01-2016

Your timing problem is likely inside the multiplier, so external registers won't help. Add more pipeline stages inside the multiplier.

By the way, a 64 x 64-bit multiply produces a 128-bit result. Have you accounted for that? I also advise using one clock.

Altera_Forum · ‎06-01-2016

Thanks for your response, kaz; this is much appreciated.

I will try using the same clock for the unsigned_mult64 module. I had switched to a slower clock in an attempt to close timing.

I am actually trying to emulate multiplication using 64-bit integers (as on a desktop computer), so I would like the output to also be 64 bit.

Here is the code that I am using for the pipelined multiplier. How would I add more pipeline stages inside the multiplier? What I am trying to do here is to infer a megafunction with a pipelined multiplication.

Is this the right way to do so or does the output have to be 128 bits?

Here is the code with timing issues:

module unsigned_mult64 (a, b, clk, out);
// pipelined multiplier (4 cycles)
output [63:0] out;
input clk;
input signed [63:0] a;
input signed [63:0] b;
reg signed [63:0] a_reg0;
reg signed [63:0] a_reg1;
reg signed [63:0] a_reg2;
reg signed [63:0] b_reg0;
reg signed [63:0] b_reg1;
reg signed [63:0] b_reg2;
reg signed [63:0] out;
wire signed [63:0] mult_out;
assign mult_out = a_reg2 * b_reg2;
always @ (posedge clk)
begin
    // levels + 1 = pipeline = 4
    a_reg0 <= a;
    a_reg1 <= a_reg0;
    a_reg2 <= a_reg1;

    b_reg0 <= b;
    b_reg1 <= b_reg0;
    b_reg2 <= b_reg1;

    out <= mult_out;
end
endmodule

Altera_Forum · ‎06-01-2016

You need to use the MegaWizard to get a multiplier that can be internally pipelined. The multiplier itself is the main timing bottleneck.

External registers are OK (one stage for inputs, one for outputs), but any back-to-back registers will not help timing.
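
If you would rather stay with plain HDL, one way to get registers inside the multiply is to split it by hand into 32 x 32 partial products and register each step. The following is only a sketch under assumptions: the module name mult64_lo_pipelined and its internal signal names are made up for illustration, it keeps only the low 64 bits of the product, and whether it maps well onto DSP blocks or closes timing depends on the device and clock.

```verilog
// Hypothetical sketch (not from the thread): low 64 bits of an unsigned
// 64 x 64 multiply, built from 32 x 32 partial products.
// Latency is 4 clock cycles from a/b to out.
module mult64_lo_pipelined (
    input  wire        clk,
    input  wire [63:0] a,
    input  wire [63:0] b,
    output reg  [63:0] out
);
    reg [63:0] a_r, b_r;      // stage 1: input registers
    reg [63:0] p_ll;          // stage 2: a_lo * b_lo, full 64-bit product
    reg [31:0] p_lh, p_hl;    // stage 2: low 32 bits of the cross products
    reg [63:0] p_ll_d;        // stage 3: p_ll delayed to stay aligned
    reg [31:0] cross;         // stage 3: sum of cross terms, modulo 2^32

    always @(posedge clk) begin
        // Stage 1: register the inputs.
        a_r <= a;
        b_r <= b;

        // Stage 2: partial products. a_hi * b_hi is not needed because
        // it only affects bits 64 and above.
        p_ll <= a_r[31:0]  * b_r[31:0];
        p_lh <= a_r[31:0]  * b_r[63:32];
        p_hl <= a_r[63:32] * b_r[31:0];

        // Stage 3: add the cross terms; only their low 32 bits survive
        // the shift by 32 in the next stage.
        p_ll_d <= p_ll;
        cross  <= p_lh + p_hl;

        // Stage 4: combine. The result is (a * b) modulo 2^64.
        out <= p_ll_d + {cross, 32'b0};
    end
endmodule
```

Each stage now contains at most one 32 x 32 multiply or one 64-bit add, instead of a full 64 x 64 multiply between two registers.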

A 64 x 64-bit multiply produces a 128-bit result. If you want only 64 bits, take the 64 LSBs, provided the MSBs do not carry any part of the result.

Alternatively, use the 64 MSBs, but that amounts to dividing the result by 2^64.
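
The two choices can be sketched as follows. The module and signal names here are hypothetical, and the block is purely combinational, so it illustrates the bit selection only and not a timing fix.

```verilog
// Hypothetical sketch: full 128-bit unsigned product and the two
// possible 64-bit slices of it.
module mult64_select (
    input  wire [63:0] a,
    input  wire [63:0] b,
    output wire [63:0] p_lsb,        // low half: (a * b) modulo 2^64
    output wire [63:0] p_msb,        // high half: (a * b) divided by 2^64
    output wire        msb_nonzero   // 1 when the low half alone loses information
);
    // The 128-bit target makes the multiply evaluate at full width.
    wire [127:0] full = a * b;

    assign p_lsb       = full[63:0];
    assign p_msb       = full[127:64];
    assign msb_nonzero = |full[127:64];
endmodule
```

Taking p_lsb gives the same wrap-around result as an unsigned 64-bit multiply in C, which is the desktop-style behaviour asked about earlier in the thread.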

Altera_Forum · ‎06-01-2016

Thanks, kaz. My intention was to have Quartus automatically infer a pipeline multiplier from the HDL code, but I will try the megawizard instance first. I will reply to this thread when I've managed to pipeline the multiplier using the IP.

Altera_Forum · ‎06-01-2016

Have you thought about using the multicycle path construct in TimeQuest to tell the optimizer that your logic block between the two registers takes more than one cycle to compute? In your case you would set the multicycle multiplier to a value of '2' (assuming a 10ns clock and your multiplier takes 20ns minus register setup/output times to compute).

See: https://www.altera.com/support/support-resources/design-examples/design-software/timequest/tq-multicycle-path.html

and

https://www.altera.com/support/support-resources/design-examples/design-software/timequest/exm-tq-sdc-exceptions.html

You need constraints something like:

set_multicycle_path -from [get_keepers {a_reg2[*] b_reg2[*]}] -to [get_keepers {out[*]}] -setup 2

set_multicycle_path -from [get_keepers {a_reg2[*] b_reg2[*]}] -to [get_keepers {out[*]}] -hold 1

where the FROM and TO operands need to be expanded to the full path to the appropriate register instances (including module instance name).

Setup 2 / Hold 1 indicates the data takes two clocks (20ns) to stabilize on changing, and goes invalid immediately after changing (0ns).

The other pipelining (via _reg0 and _reg1) is basically irrelevant and could be removed. It would only be necessary for data alignment.

Altera_Forum · ‎06-02-2016

kaz and ak6dn:

Thank you so much for your help; this is much appreciated!

1. To resolve the timing issues, I had to create an instance of the multiplier using the megawizard IP. The inputs were both 64 bit, and the output was 128 bits. The IP was set to have a pipeline of 4 clock cycles.

2. The multicycle path constraint works well to remove some timing issues. The description given by ak6dn is good and I've learned much from this.
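
For anyone finding this later, a generated multiplier of this kind is roughly equivalent to instantiating an LPM multiplier directly. This is a sketch that assumes the LPM_MULT megafunction (the exact core is not named above); the wrapper name mult64_ip and the instance name u_mult are hypothetical.

```verilog
// Hypothetical wrapper around the LPM_MULT megafunction:
// unsigned 64 x 64 -> 128 bits with 4 pipeline stages.
module mult64_ip (
    input  wire         clk,
    input  wire [63:0]  a,
    input  wire [63:0]  b,
    output wire [127:0] p
);
    lpm_mult #(
        .lpm_widtha(64),                  // width of dataa
        .lpm_widthb(64),                  // width of datab
        .lpm_widthp(128),                 // width of result
        .lpm_representation("UNSIGNED"),
        .lpm_pipeline(4)                  // latency in clock cycles
    ) u_mult (
        .clock(clk),
        .dataa(a),
        .datab(b),
        .result(p)
    );
endmodule
```

With the pipeline inside the core, the tools are free to place registers within the multiplier structure rather than only at its inputs and outputs.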

Apparently the recommended HDL coding style (page 12-4, https://www.altera.com/content/dam/altera-www/global/en_us/pdfs/literature/hb/qts/qts_qii5v1.pdf) for an inferred multiplier does not work well in all situations (particularly for a high clock frequency) and cannot be easily extended beyond 2 clock cycles.

So once again, thank you. The timing issue is resolved.

By the way, the regs within the unsigned_mult64 module should be unsigned rather than signed.

Altera_Forum · ‎06-02-2016

Glad it is sorted out. Yes, inference lacks certain internal features, and most old-style designers do not like inference, unlike newcomers who want (so-called) portable code.

Regarding multicycle, you need to take care in defining your case. If the output stream is delayed relative to the input stream but the data rate equals the clock rate, then multicycle does not apply, however long the output is delayed. If the input rate, and hence the output rate, is a regular fraction of the clock rate, then yes, but even then you need to control the phase of the sampling clock, for example by using a clock enable.
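
A minimal sketch of the clock-enable case, assuming new data arrives on every second clock. The module name mult64_multicycle and the signal en are hypothetical; the register names a_reg2, b_reg2 and out are chosen to match the constraints given earlier in the thread.

```verilog
// Hypothetical sketch: a 2-cycle path made safe by a clock enable.
module mult64_multicycle (
    input  wire         clk,
    input  wire [63:0]  a,
    input  wire [63:0]  b,
    output reg  [127:0] out
);
    reg        en = 1'b0;       // high on every second clock edge
    reg [63:0] a_reg2, b_reg2;  // launch registers
    always @(posedge clk) begin
        en <= ~en;
        if (en) begin
            // Launch and capture registers update only on enabled edges,
            // so the operands are held for two clk periods and the
            // multiplier has that long to settle before out samples it.
            a_reg2 <= a;
            b_reg2 <= b;
            out    <= a_reg2 * b_reg2;
        end
    end
endmodule
```

Only with the enable in place does the hardware match what the setup 2 / hold 1 constraints tell TimeQuest. Without it, the registers change on every clock and the relaxed constraint would hide a real timing failure.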

Altera_Forum · ‎06-02-2016

Thanks, kaz. Yes, that sums up nicely what I've also found regarding the use of multicycle paths for this particular design.