Re: "Simple" delay chain mystery

Altera_Forum · ‎04-24-2008

I have cascaded 5 AND2 gates, and in a timing simulation I compare the outputs of all of the gates against each other. Each AND2 gate has its uncascaded input connected to a pin, and each output is brought out to a pin, in an attempt to keep the gates from being optimized out. All outputs transition at the same time in the timing simulation, however. I got the following explanation from Altera support:

"The output at DUMOUT5 should appear at the same time as POR. As mentioned in the previous note, you can see after compilation, actually each output is similarly connected to one MCELL. Even in the bdf design, the output DUMOUT5 is output from several and gates, QuartusII merges them to one MCELL after compilation. So the delays from input POWER_ON_RESET_EXT to output DUMOUT5 and POR are exactly same."

"I attached two screen copies here. One is the Technology Map Viewer. You can see there is one MCELL between input POWER_ON_RESET_EXT and output DUMOUT5, the same as from POWER_ON_RESET_EXT to output POR. The five AND gates to DUMOUT5 are merged into one MCELL, not using five cells. The other screen copy is the tpd timing report for the two paths. The tpd values from input POWER_ON_RESET_EXT to output DUMOUT5, and from input POWER_ON_RESET_EXT to output POR, are the same."

---end quote of support

Are the chain's outputs being deliberately delayed (given varying delays) by Quartus to transition at the same time? If not, how can they all transition at the same time (remember, this is NOT a functional simulation)? I chose AND2 gates so that Quartus could not do something extra clever, as it might with inverters, and say "you just have a chain of inverters, so I'll pick the signals off the input or output of a single inverter." The unknown inputs to the AND2 gates require that there be 5 distinct gates. Whether or not these gates are in a single MCELL, as support points out, mustn't there be a propagation delay from one gate to the next? What am I missing here, oh wise Altera forum gurus? Thanks.

Altera_Forum · ‎04-24-2008

Hello,

I don't know the details of your design, but this could happen depending on the choice of device family and how the logic was mapped. The Stratix II and III have 6-input lookup tables, so the design can be mapped like this:

out1 = in1 & in2

out2 = in1 & in2 & in3

out3 = in1 & in2 & in3 & in4

out4 = in1 & in2 & in3 & in4 & in5

out5 = in1 & in2 & in3 & in4 & in5 & in6

What does your design look like in the Technology Map Viewer?
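
As a rough illustration of why the flat mapping above is legal, here is a small Python sketch (not anything Quartus runs; the function names are made up for this example). It checks all 64 input combinations and confirms that the cascaded form and the flattened form give identical outputs, so a mapper is free to pick either one:

```python
from itertools import product

def cascaded_outputs(bits):
    """Chain of 2-input ANDs: each stage ANDs the previous result with one new input."""
    acc = bits[0] & bits[1]
    outs = [acc]
    for b in bits[2:]:
        acc = acc & b
        outs.append(acc)
    return outs

def flat_outputs(bits):
    """Each output computed on its own as one wide AND of the primary inputs."""
    outs = []
    for n in range(2, len(bits) + 1):
        value = 1
        for b in bits[:n]:
            value &= b
        outs.append(value)
    return outs

def forms_are_equivalent(num_inputs=6):
    """Exhaustively compare both forms over every input combination."""
    return all(
        cascaded_outputs(bits) == flat_outputs(bits)
        for bits in product((0, 1), repeat=num_inputs)
    )

print(forms_are_equivalent())  # True
```

Because the two forms are logically identical, only the timing differs: in the flat form every output is one lookup away from the pins.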

Altera_Forum · ‎04-24-2008

Obviously, routing a combinational signal to a pin has no influence how other expressions, using the same combinational signal in your logic are actually compiled. I assume that a delay chain can be generated in Quartus by using synthesis attributes, but I have never tried, because I don't use delay chains in designs.

I know that a friend's MaxPlus ACEX design, which uses delay chains to achieve a particular timing (ACEX has no PLLs), was effectively uncompilable with Quartus. I wouldn't expect Quartus to offer additional means of enabling delays that are normally unwanted and that the compiler removes so impressively.

Altera_Forum · ‎04-24-2008

The solution appears to be to put LCELLs in for delays and then "turn off" optimization, according to tech support, although exactly which optimizations to turn off was not specified...sheesh, thanks, support, for the complete answer.

Anyway, I changed Ignore LCELL Buffers from AUTO to OFF, and that preserves the LCELL delay. Yippee. I don't know if support changed any other optimization parameters.

I appreciate your comments, gee and FvM. It looks to me as if gee's design would have synthesized parallel logic and use more gates than a cascade...but it wouldn't have produced a delay, which is exactly his point -- although it would surprise me if the compiler would go to the trouble of synthesizing parallel logic just to deprive me of my delay!

I don't think I agree with FvM that "routing a combinational signal to a pin has no influence how other expressions, using the same combinational signal in your logic are actually compiled." If an intermediate result is brought out of the chip, it cannot be optimized out by the compiler...and preserving that intermediate step must produce a delay, which is what I wanted. Delay chains are necessary, I believe, in a PLD or other limited-resource device where we don't have the luxury of synchronizing every input to a high-speed clock.

Altera_Forum · ‎04-24-2008

--- Quote Start ---

If an intermediate result is brought out of the chip, it cannot be optimized out by the compiler...and preserving that intermediate step must produce a delay, which is what I wanted.

--- Quote End ---

Well, of course the intermediate result can't be ignored in this case, but that is guaranteed only for the expression brought to the pin. The compiler doesn't necessarily reuse it for another expression that depends on the intermediate result. It will most likely bypass it if it can implement that expression directly from the inputs with the same LUT count. Apart from my assumptions, the empirical results also apparently show that the intermediate step is not required.

P.S.: Here is a simple example of how a delay chain can be defined in Quartus. I didn't check it in hardware, but according to the Technology Map Viewer and the timing simulator, it operates as such. However, I don't want to suggest this technique for real designs.

library ieee;
use ieee.std_logic_1164.all;
entity chain is
  port 
  (
    inp : in  std_logic;
    outp: out std_logic
  );
end entity;
architecture rtl of chain is
signal wire1: std_logic;
signal wire2: std_logic;
signal wire3: std_logic;
signal wire4: std_logic;
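-- syn_keep asks synthesis to preserve each wire as its own node
-- instead of collapsing the chain into a single connection
attribute syn_keep: boolean;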
attribute syn_keep of wire1: signal is true;
attribute syn_keep of wire2: signal is true;
attribute syn_keep of wire3: signal is true;
attribute syn_keep of wire4: signal is true;
begin
  wire1 <= inp;
  wire2 <= wire1;
  wire3 <= wire2;
  wire4 <= wire3;
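  -- XOR of the delayed and undelayed input: if the chain delay is kept,
  -- outp should pulse briefly on each edge of inp
  outp <= wire4 xor inp;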
end;

Altera_Forum · ‎04-25-2008

The compiler is not a magician, however; it must separately calculate the "intermediate" combinatorial results if the design brings those results out to pins. The only way I can see the compiler getting around a chain (and the associated delay) is the parallel solution that gee showed. I don't have the expertise (or the time at the moment to acquire it) to inspect the compiler's two approaches at the gate level, but I would bet that the optimized approach has parallel paths that use redundant gate structures. I don't see how else it can achieve simultaneous outputs. Once again, it's a tradeoff between area and speed. Perhaps I'm missing something, though!

Altera_Forum · ‎04-25-2008

I expect the compiler generally to start with a parallel solution for each logic term. It may reuse existing terms if this reduces LUT usage and doesn't conflict with timing constraints or general optimization rules. Differences in routing effort may also be a reason to prefer a parallel solution in some cases.
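
A toy count makes the LUT-usage argument concrete. The Python sketch below assumes each wide AND is built as a tree of k-input LUTs with no sharing between outputs, and counts one delay unit per LUT level. Both are simplifications, and the function names are hypothetical; this is not how Quartus actually costs a mapping.

```python
from math import ceil

NUM_OUTPUTS = 5  # out1..out5, each the AND of 2..6 primary inputs

def luts_for_wide_and(num_inputs, lut_size):
    """k-input LUTs needed for one AND tree; each LUT removes (k - 1) signals."""
    return ceil((num_inputs - 1) / (lut_size - 1))

def levels_for_wide_and(num_inputs, lut_size):
    """LUT levels on the longest path of that AND tree."""
    levels, signals = 0, num_inputs
    while signals > 1:
        signals = ceil(signals / lut_size)
        levels += 1
    return levels

def chain_cost():
    """(LUT count, worst-case levels) for five cascaded 2-input ANDs."""
    return NUM_OUTPUTS, NUM_OUTPUTS

def parallel_cost(lut_size):
    """(LUT count, worst-case levels) when every output gets its own AND tree."""
    widths = range(2, NUM_OUTPUTS + 2)
    luts = sum(luts_for_wide_and(n, lut_size) for n in widths)
    levels = max(levels_for_wide_and(n, lut_size) for n in widths)
    return luts, levels

print(chain_cost())      # (5, 5)
print(parallel_cost(6))  # (5, 1)
print(parallel_cost(4))  # (7, 2)
```

Under these assumptions, 6-input LUTs give the parallel form the same five LUTs as the chain but only one level, so there is nothing to gain from reusing terms. With 4-input LUTs the parallel form costs seven LUTs for two levels, which is where reuse could start to pay off.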

Altera_Forum · ‎04-26-2008

TO_BE_DONE

Altera_Forum · ‎04-26-2008

The Quartus project referenced in my previous post is attached in cascaded_ands.zip.

The code that goes with my previous post:

library ieee;
use ieee.std_logic_1164.all;
entity cascaded_ands is
  port 
  (
    in_a1 : in  std_logic;
    in_a2 : in  std_logic;
    in_a3 : in  std_logic;
    in_a4 : in  std_logic;
    in_a5 : in  std_logic;
    in_a6 : in  std_logic;
    in_b1 : in  std_logic;
    in_b2 : in  std_logic;
    in_b3 : in  std_logic;
    in_b4 : in  std_logic;
    in_b5 : in  std_logic;
    in_b6 : in  std_logic;
    out_1and_withkeep_pin:     out std_logic;
    out_2ands_withkeep_pin:    out std_logic;
    out_3ands_withkeep_pin:    out std_logic;
    out_4ands_withkeep_pin:    out std_logic;
    out_5ands_withkeep_pin:    out std_logic;
    out_1and_withoutkeep_pin:  out std_logic;
    out_2ands_withoutkeep_pin: out std_logic;
    out_3ands_withoutkeep_pin: out std_logic;
    out_4ands_withoutkeep_pin: out std_logic;
    out_5ands_withoutkeep_pin: out std_logic
  );
end entity;
architecture rtl of cascaded_ands is
signal out_1and_withkeep:     std_logic;
signal out_2ands_withkeep:    std_logic;
signal out_3ands_withkeep:    std_logic;
signal out_4ands_withkeep:    std_logic;
signal out_5ands_withkeep:    std_logic;
signal out_1and_withoutkeep:  std_logic;
signal out_2ands_withoutkeep: std_logic;
signal out_3ands_withoutkeep: std_logic;
signal out_4ands_withoutkeep: std_logic;
signal out_5ands_withoutkeep: std_logic;
attribute keep: boolean;
attribute keep of out_1and_withkeep:  signal is true;
attribute keep of out_2ands_withkeep: signal is true;
attribute keep of out_3ands_withkeep: signal is true;
attribute keep of out_4ands_withkeep: signal is true;
attribute keep of out_5ands_withkeep: signal is true;
begin
  out_1and_withkeep  <= in_a1 and in_a2;
  out_2ands_withkeep <= in_a3 and out_1and_withkeep;
  out_3ands_withkeep <= in_a4 and out_2ands_withkeep;
  out_4ands_withkeep <= in_a5 and out_3ands_withkeep;
  out_5ands_withkeep <= in_a6 and out_4ands_withkeep;
  out_1and_withoutkeep  <= in_b1 and in_b2;
  out_2ands_withoutkeep <= in_b3 and out_1and_withoutkeep;
  out_3ands_withoutkeep <= in_b4 and out_2ands_withoutkeep;
  out_4ands_withoutkeep <= in_b5 and out_3ands_withoutkeep;
  out_5ands_withoutkeep <= in_b6 and out_4ands_withoutkeep;
  out_1and_withkeep_pin  <= out_1and_withkeep;
  out_2ands_withkeep_pin <= out_2ands_withkeep;
  out_3ands_withkeep_pin <= out_3ands_withkeep;
  out_4ands_withkeep_pin <= out_4ands_withkeep;
  out_5ands_withkeep_pin <= out_5ands_withkeep;
  out_1and_withoutkeep_pin  <= out_1and_withoutkeep;
  out_2ands_withoutkeep_pin <= out_2ands_withoutkeep;
  out_3ands_withoutkeep_pin <= out_3ands_withoutkeep;
  out_4ands_withoutkeep_pin <= out_4ands_withoutkeep;
  out_5ands_withoutkeep_pin <= out_5ands_withoutkeep;
end rtl;

Altera_Forum · ‎04-26-2008

Thank you for the illustrative examples. An interesting point is the deterministic rule behind the apparently arbitrarily cascaded LUT structure in the Stratix III case without speed optimization.