DSP Builder Advanced Error Messages

Pipelining and Scheduling

Warning: Negative IO correction on block <design>/<subsystem>/ChannelOut. Simulink will not match hardware

See 4.5.2 in DSP Builder Advanced Blockset - Flow Control, Design Style and Floating Point .

Note that in some cases this accounting for the latency may give different results to hardware – in particular a) in the case of optimizing away parallel registering that just delays all signals by the same amount and b) in the case of multiple inputs and outputs where an input is scheduled after an output.

In these cases a warning is given:

“Warning: Negative IO correction on block <design>/<subsystem>/ChannelOut. Simulink will not match hardware.”

Optimizing away parallel registering

Suppose a primitive subsystem design had a line of registers on each input to output path (see the example below). These registers do not alter the function of what the primitive subsystem does – only alter when the output comes out – i.e. the latency.

Advanced Blockset seeks to optimize the hardware it produces, while minimizing the latency required to achieve the same functionality and still achieving the desired fmax. SampleDelays that specify relative differences in latency on paths are therefore important, but cuts of SampleDelays that specify identical latency on all paths are effectively ignored.

8/8c/FAQ3a.PNG ( FAQ3a.PNG - click here to view image )

As such, the registers in the above design which delay all paths by 1 cycle would be optimized away – as they can be removed without changing the functionality of the subsystem (only its latency).

In this case the generated hardware contains no registers – and has zero latency – but the Simulink model of the subsystem as a whole will have a latency of 1 from the Sample Delays. The ChannelOut block cannot correct to simulate the actual latency of the hardware produced by adding negative latency and displays the message:

“Warning: Negative IO correction on block <design>/<subsystem>/ChannelOut. Simulink will not match hardware.”

If you want the latency to be higher than the optimal (lowest latency) solution the way to do this is to use the constraints on the SynthesisInfo block:

6/62/FAQ3b.PNG ( FAQ3b.PNG - click here to view image )

Here the latency for this subsystem is constrained to be greater or equal to 1. Now on generation the vertical line of pipelining is preserved in hardware such that simulation and hardware both have latency 1 and there is no error message. (Note; the SampleDelays aren’t needed at all in this case - it is sufficient to use the constraint alone to add latency).

Here is another example.

9/98/FAQ3c.PNG ( FAQ3c.PNG - click here to view image )

Remember that adding vertical cuts of delays across all paths is not the recommended way of adding latency – if that is really what is desired. The way to do this would be to constrain the latency to the higher value using the SynthesisInfo block.

DSP Builder Advanced solves for the minimal pipelining delay required to attain the target fmax while maintaining the relative delay differences implied by the user-inserted SampleDelay blocks.

Adding vertical cuts of extra SampleDelays across all signals does not change the relative latency of the signals so does not alter the optimization problem, or the hardware that would be produced.

In the design shown there are no relative differences in the latencies between the paths (each has 10), so the optimization is free to remove them, and the hardware produced for this design will be the same as if no SampleDelays were specified. In this example, to be sure of attaining the desired fmax the multiplier needs a latency of 4 clock cycles (that is 4 pipelining registers), and this will be balanced by delays inserted on the other paths (the valid and the channel signal) to maintain no relative latency difference.

The final solution will therefore be that the HDL will have a multiplier with 4 pipelining stages, to meet timing on the data path, and 4 registers on each of the valid and channel paths, such that the outputs maintain their relative synchronization.

So to simulate this cycle accurately in Simulink the tool would have to delay the signal by a total of 4 clock cycles across the subsystem. However, the Simulink schematic design already has SampleDelays that are delaying the signal by 10 clock cycles – and the tool cannot correct for negative delays. It can’t reset these Sample Delays in simulation, or jump the simulation forward 6 cycles (applying a negative correction) in the ChannelOut to compensate.

What you will get now is zero latency correction applied in the ChannelOut block, and the warning given;

“Warning: Negative IO correction on block <design>/<subsystem>/ChannelOut. Simulink will not match hardware.”

The above designs break two ‘good design rules’ that should be followed when designing with Advanced Blockset

a) Don’t try to pipeline the design yourself – this is what the tool does for you using mathematical linear programming techniques.

b) Don’t try to synchronize the output of different parallel subsystems using explicit delays. Use FIFOs, as this will give a more device-portable, fmax target independent design.

Multiple Scheduled Outputs

0/07/FAQ3d.PNG ( FAQ3d.PNG - click here to view image )

Consider the above model. Here there is a single set of inputs, scheduled to be input on the same clock cycle. Any pipelining inserted ensures these signals stay in sync. For the particular fmax chosen in this example the multipliers require a latency of 3 (that is 3 pipelining stages). Parallel signals are pipelined by the same amount to keep the signals in sync. Thus on ‘ChannelOut’ there are three cycles of latency added to all signals through this I/O block and scheduled to be output together. For the ‘ChanelOut1’ signals however, the latency of the path is 6 clock cycles (3 for each multiplier), so the output here has a latency of 6 – appearing three cycles after the outputs from ‘ChannelOut’.

The latencies are modelled by delaying the signals through ‘ChannelOut’ by 3 cycles, and through ‘ChannelOut1’ by 6 clock cycles.

Multiple Scheduled Inputs & Outputs

Things are a little more complex in the case of multiple inputs and multiple outputs. Models such as demo_back_pressure have more than one input block and more than one output block in the single primitive subsystem.

Models with multiple input blocks imply that in hardware the data entering the system will not be synchronised between the different points of entry.

f/fc/FAQ3e.PNG ( FAQ3e.PNG - click here to view image )

Simulink is unaware of the 1 cycle latency needed by the adders. This has to be worked out in the scheduler and corrected for afterwards during simulation using simulation delays in the input/output blocks A, B, C, X and Y.

This model implies the following constraints:

A + X = 1

A + Y = 0

B + X = 2

B + Y = 1

C + X = 1

C + Y = 1

2/23/FAQ3f.PNG ( FAQ3f.PNG - click here to view image )

This system of equations has no solution! However, this situation never actually occurs because of the scheduling constraints that have been enforced over the primitive subsystem. The scheduler actually inserts an extra sample delay between Add2 and ChannelOut X which makes the whole system solvable.

Note that there are some configurations of input and output blocks that, if accounting for latency at the inputs and outputs, would require latency correction of negative size, implying discarding the first few samples. Consider:

6/66/FAQ3g.PNG ( FAQ3g.PNG - click here to view image )

Assume that multipliers have a delay need 3 times that of adders. Any solution for latency correction requires a negative sized buffer on at least one of the input/output blocks. A bigger example of this is demo_back_pressure.

Discarding the first few samples is not possible in the discrete simulation that Simulink uses, however this can still be done for the stimulus files. When this occurs a warning message is given to make the user aware that their Simulink model will not behave exactly as the hardware does.

Failed to distribute memory in your design

f/f4/FAQ5.PNG ( FAQ5.PNG - click here to view image )

The tools automatically inserts pipelining required to meet the chosen clock speed, based on internal timing models. For example to run an 18bit multiplier at 400MHz, our timing models may suggest that 3 registering stages are required across the multiplier block.

Suppose you have a feedback loop where a latency of 5 clock cycles is specified (for example if your data is for 5 channels in sequence). We can satisfy both criteria: pipelining for fmax and re-circulating the data in 5 clock cycles. Rather than adding the 5 cycles of delay specified by the SampleDelay as 5 new registers, 3 of those delays will be formed from the delay required across the multiplier and only 2 will be implemented as external registers. The delay has been ‘distributed’ around the loop. More complex, multiply nested loops require more complex delay redistributions – but all of this is solved using standard mathematical linear programming techniques.

Now suppose you have a feedback loop where a latency of 1 clock cycle is specified. The two criteria: pipeline to achieve fmax (implying a latency of at least 3 clock cycles) and the loop criteria (re-circulate the data in 1 clock cycle), cannot simultaneously be satisfied, no matter where we distribute the 1 cycle of delay specified. In this case an error is given:

Failed to distribute memory in your design. Found insufficient delay attempting to satisfy fMax requirement for [subsystem]. Failed to satisfy the following latency constraints: (ParallelPathPair 0): Mult<3> SampleDelay (2 cycles deficient)

What this says is that the design as specified is imposing a restriction on the pipelining that is required to meet the clock speed requirement, such that both cannot be satisfied simultaneously.

It may be possible in some cases to re-implement the algorithm to avoid loops, or to run the designs faster but push the data through at the same rate. (for example, if running at 100MHz with a new data sample every clock cycle, instead run at 300MHz and have a new data sample every 3 clock cycles [DSPBA optimizations and timing characterization is currently targeted at high clock rates]. Folding (manual or automatic) can then be used to reduce hardware resources elsewhere in the design if clock rate > sample rate.

**Failed to determine latency of variable SampleDelay <name> in post schedule. Check the equivalence group.**

For some designs - for example FIFOs are used in feedback loops - the user may want to allow the tool to set the minimum feedback latency round the loop, subject perhaps to some equivalence to other loops to ensure synchronization. This is accomplished via the "Minimum Delay" and Equivalence Group" options on the ModelPrim SampleDelay block. As part of the scheduling optimization process the tool will seek to insert the minimum number of registers required to run the loop at the desired clock frequency, while maintaining synchronization with other loops in the specified 'Equivalence Group'. Under some circumstances it could be that there is no common loop delay can be found that meets this criterion; however this is unlikely to occur if Equivalence Groups are set sensibly.

**Failed to manage delay in Synthesis Block <name> extra info: <message>**

This error can occur when a Latency Constraint is specified which attempts to restrict the pipelining depth to less than that required to achieve the specified clock frequency Currently the Latency Constraint feature of the Synthesis Info block can only be used to add latency above that required for pipelining to achieve the designs fmax. If you want to attempt to reduce latency of a particular subsystem try locally lowering the target clock frequency driving the pipelining calculations by using a LocalThreshold block with a negative Clock Margin.