Re: Upsample/Downsample timing constraining

Altera_Forum · ‎10-03-2012

Dear All (especially KaZ and Rysc),

I have designed a multi-rate DSP system that works well in RTL simulation. Now I'm in the process of making it work after synthesis. The application is similar to a time-based radar system.

I have already constrained all the system clocks, inputs and outputs, but timing is still far off (about -100 ns slack) as reported by TQ.

I am now constraining multi-cycle paths due to registers that delay control signals.

The majority of the system runs at 20 MHz and there is an upsample -> RAM -> FIR filter -> downsample data path running at 60 MHz.

My question is:

How do I constrain the clocks at the upsample (3x up) and downsample (3x down) operations?

I am currently using generic cross-clock constraints but I am not sure if I am doing it right. Both clocks are generated by a PLL. A sample SDC file follows.


#  clk=20Mhz   |  LAUNCH _|-----|_____|-----
#  clk=60Mhz  |  LATCH   _|-|_|-|_|-|_|-|_|-|_|-|_
#  UPSAMPLE
set_multicycle_path -from }] -to }] -setup -end 4
set_multicycle_path -from }] -to }] -hold  -end 3
#  DOWNSAMPLE
set_multicycle_path -from }] -to }] -setup -start 4
set_multicycle_path -from }] -to }] -hold  -start 3

Thanks much for any help!

Altera_Forum · ‎10-03-2012

It sounds odd to me that you can't pass timing at 60 MHz, which is a low clock rate for most modern devices. It is also hard to advise on constraints when your design is unknown. A slack of -100 ns is astronomically bad. You had better describe your clocking scheme to the forum carefully. One last thing: if you use DSP Builder then you shouldn't need to worry about SDC, as it takes over from you.

Altera_Forum · ‎10-03-2012

Thanks for answering Mr. Kaz,

I will try to clarify the design as much as possible but let me know if additional information is needed.

I am targeting an EP3C120F780. I am using the Simulink coder instead of DSP Builder; it generates generic, target-independent RTL code that is not well optimized.

The design has 28 LVDS inputs running at 120 MHz (20 Msps with a 12-bit deserialization ratio). The deserializer module is manually implemented as per FvM's suggestions in forum posts. One PLL generates all the clocks for this part of the system.

Data from all channels is scaled with a multi-channel multiplier and then up-sampled (the output equals the input once every N rising edges); in my case N is 3. The output of the up-sampler is: sample(1), 0, 0, sample(2), 0, 0, sample(3) and so on.

Each channel has a RAM whose addresses are calculated on the fly at 20 MHz (clk[2]). The addresses change every 3 clock edges, matching the up-sampling process.

At every rising edge of the 60 MHz clock each RAM produces an output; all the data is summed into one output channel, which goes into a 60-tap FIR filter. The down-sampler then keeps one sample out of every 3.

The addresses need a lot of calculations that have not been pipelined yet.

In the case of the up-sampler, the launch clock is 20 MHz and the latch clock is 60 MHz. In the case of the down-sampler, the launch clock is 60 MHz and the latch clock is 20 MHz.

I hope my system is easier to understand now. If not, please let me know how I can present a better question to the forum.

Thanks again

Altera_Forum · ‎10-03-2012

Dear,

My approach to tackling such a problem would be: run the whole design at 60 (or even 120) MHz and use a clock enable to get an effective data rate of 20 or 60 MHz. This will make all the (hard) cross-clock-domain timing issues disappear. Functional simulation will automatically match the real-world system.
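For a single-clock design of this kind the SDC file can become very short. A minimal sketch, assuming the PLL is fed from a port hypothetically named `clk_ref` with an assumed 20 MHz reference (neither is stated in this thread):

```tcl
# Hypothetical reference clock port and period; adjust to the real board.
create_clock -name clk_ref -period 50.000 [get_ports {clk_ref}]

# Let TimeQuest create the generated clocks on the PLL outputs.
derive_pll_clocks

# Apply the device's clock uncertainty to all clock transfers.
derive_clock_uncertainty
```

With only one PLL output clocking the fabric, every register-to-register path is analyzed against that one period by default; the clock enables change the function, not the default timing analysis.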

If you want to stick with the multi-clock design, all clocks have to come from the same oscillator. The required settings are explained in the Quartus II handbook, chapter 7: http://www.altera.com/literature/hb/qts/qts_qii53018.pdf page 7-47
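For two in-phase PLL clocks with a 3:1 ratio, the pattern usually quoted for this kind of transfer is a setup multicycle equal to the ratio and a hold multicycle one less, counted in cycles of the faster clock (`-end` when the fast clock latches, `-start` when it launches). A sketch, assuming hypothetical clock names `clk20` and `clk60`, and assuming the receiving logic really only uses the data on the fast edge that coincides with the slow edge:

```tcl
# Hypothetical clock names; use the names reported after derive_pll_clocks.
# Slow launch (20 MHz) to fast latch (60 MHz): count in latch (end) clock cycles.
set_multicycle_path -from [get_clocks {clk20}] -to [get_clocks {clk60}] -setup -end 3
set_multicycle_path -from [get_clocks {clk20}] -to [get_clocks {clk60}] -hold  -end 2

# Fast launch (60 MHz) to slow latch (20 MHz): count in launch (start) clock cycles.
set_multicycle_path -from [get_clocks {clk60}] -to [get_clocks {clk20}] -setup -start 3
set_multicycle_path -from [get_clocks {clk60}] -to [get_clocks {clk20}] -hold  -start 2
```

Under these assumptions the setup relationship becomes one full 20 MHz period (50 ns) and the hold relationship returns to 0 ns. If the 60 MHz logic samples the data on every edge, the default single-cycle analysis is the correct one and no multicycle should be applied.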

About the huge 100ns negative slack: did you pipeline the system?

Altera_Forum · ‎10-04-2012

Clock enable is indeed my choice too, but since your two clocks are in phase (and related, from the same PLL) your approach should be OK. I am not clear about the multicycle issue, and it is one thing that I don't want to play with unless absolutely sure. Your timing violations seem too severe to be fixed with multicycles alone.

A few things I can suggest:

- Identify which paths are failing.

- Your addressing must be pipelined if it has long paths.

- Add fabric registers to your multiplier outputs (apart from the dedicated block registers). I find this very useful to shorten routing.

- Adders should be pipelined internally and at their outputs.
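The first suggestion can be done from the TimeQuest Tcl console. A sketch, assuming the timing netlist and SDC file are already loaded and using a hypothetical clock name `clk60`; the panel names are arbitrary:

```tcl
# List the 20 worst setup paths latched by the (hypothetically named) 60 MHz clock.
report_timing -setup -to_clock {clk60} -npaths 20 \
    -detail full_path -panel_name {Worst setup paths to clk60}

# Same for hold, to catch problems introduced by multicycle settings.
report_timing -hold -to_clock {clk60} -npaths 20 \
    -detail full_path -panel_name {Worst hold paths to clk60}
```

The `full_path` detail level lists each cell and interconnect delay along the data path, which helps to see whether a long arithmetic chain (address calculation, adder tree) dominates the slack.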

As a side note: I am not clear about one functional issue. Your upsampler inserts two zeros after each sample, so the data rate becomes 60 Msps, yet you write it to RAM at 20 Msps. Maybe you have initialized your RAM to all zeros and then just write the data, so that the RAM implements the zero insertion rather than the upsampler?

Altera_Forum · ‎10-04-2012

Thank you both for helping out!

I have not pipelined it yet, because I wanted to investigate how much pipelining I needed, although I am starting to think that was a bad idea! Is there any rule of thumb for pipelining? I guess the longer the data path, the more pipeline registers I need, but I am open to your advice... What I am trying to ask is: how can I anticipate at design time how many pipeline stages I may need? At the moment I am only finding out late in the design cycle...

Following the recommendations, I will go back to the single-clock design. I am also starting to pipeline the address calculator and the FIR filter.

Johannes,

What would be the benefit of running the design at 120 MHz instead of 60 MHz? I understand that this may reduce area using resource sharing... is there anything else?

Kaz,

Regarding your advice

--- Quote Start ---

- Identify which paths are failing.

- Your addressing must be pipelined if it has long paths.

- Add fabric registers to your multiplier outputs (apart from the dedicated block registers). I find this very useful to shorten routing.

- Adders should be pipelined internally and at their outputs.

--- Quote End ---

- I am running timing reports to check the failing paths, although sometimes it is not easy to see where the problem really is.

- I will pipeline the addressing engine. It has products, sums and a sqrt, so I made a mistake by not pipelining it. However, I had hoped to run the design without pipelining as the speed was not that high (60 MHz).

- I am sorry, but how do I add fabric registers? Do you mean editing the design in Chip Planner?

- Most of the adders are being inferred, but I may need to use synthesis attributes to tell the tool to pipeline the adders.

Regarding the side note,

The address engine has a sample-and-hold circuit that holds the address for 3 cycles, so the effective address rate is also 60 MHz, matching the RAM data port. I am sorry I did not mention that before. Thank you very much for taking the time to read my post so carefully!

Thanks both of you again and please let me know of any other thoughts!

Altera_Forum · ‎10-04-2012

Pipelining has no rule of thumb but I prefer to add registers every two or so logic levels for a high speed design.

You cannot, I believe, pipeline the internal logic of inferred adders or multipliers.

Adders are made from fabric logic. Multipliers hopefully from dedicated blocks.

So you had better instantiate adders with good pipelining.

To add registers to the multiplier outputs, you can do that directly in your code or schematic, i.e. don't take the multiplier outputs to logic directly but through one register stage.

If your address changes every 3 clocks then you have a useful candidate for a multicycle of 3 on the address registers.
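As a sketch of such a constraint, assuming the address pipeline is clocked at 60 MHz, that both the source and destination registers are updated only once every 3 clocks by the same enable, and that they match the hypothetical name patterns `*addr_calc*` and `*addr_hold*`:

```tcl
# Hypothetical register name patterns; replace with the real hierarchy names.
# Allow 3 clock periods for the address computation to settle...
set_multicycle_path -from [get_registers {*addr_calc*}] \
    -to [get_registers {*addr_hold*}] -setup -end 3

# ...and move the hold check back to the launch edge.
set_multicycle_path -from [get_registers {*addr_calc*}] \
    -to [get_registers {*addr_hold*}] -hold -end 2
```

This only holds if the source registers also change at most once every 3 clocks, in step with the destination enable; otherwise the relaxed paths can hide real violations.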

Altera_Forum · ‎10-04-2012

--- Quote Start ---

Johannes,

What would be the benefit of running the design at 120 MHz instead of 60 MHz? I understand that this may reduce area using resource sharing... is there anything else?

--- Quote End ---

For the number of clock domains in your design there is a simple rule of thumb: the fewer, the easier! As your input clock is 120 MHz you can run the whole design at that speed with clock enables for the 60 and 20 MHz parts. The downside is that you then overconstrain the 20 and 60 MHz parts. It is a design tradeoff.

I would advise you to get some feeling for FPGAs first: try a simple adder, multiplier and MAC, and see how each is implemented, how many cells it takes, what the maximum speed is, etc. Take extreme care if you start with divisions, modulo operations or sqrt; they are very inefficient to implement...

Good luck